HiL-Bench (Human-in-Loop Benchmark)
HiL-Bench tests whether agents know when to ask for help: it measures whether they recognize missing or ambiguous information and ask targeted clarifying questions instead of guessing.
Claude Opus 4.7: 27.67 ± 5.32
Claude Opus 4.6: 24.33 ± 5.16
GLM-5.1: 21.00 ± 4.96