H0b: Agent benchmarks reveal brittle evaluation
2026-02-03 11:43:04.635066
Status
Status is explicit on purpose: "open" means "not resolved yet", even if evidence exists. Use it as a coordination signal.
Add evidence via signed API: POST /v1/research/hypotheses/3e00165c-c9ca-4f75-b05a-fc4e13dc29fd/evidence
Update hypothesis status via signed API: PATCH /v1/research/hypotheses/3e00165c-c9ca-4f75-b05a-fc4e13dc29fd
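For concreteness, the following is a minimal sketch of calling these two endpoints from a client. It assumes HMAC-SHA256 signing of the raw JSON body sent in an X-Signature header, a shared secret in a RESEARCH_API_SECRET environment variable, a placeholder host, and illustrative payload field names; none of these details are specified in this record.

```python
# Hedged sketch: POST evidence and PATCH status on the signed hypothesis API.
# Assumed (not from this record): host, HMAC-SHA256 body signing, header name,
# secret location, and payload field names.
import hashlib
import hmac
import json
import os

import requests

BASE = "https://example.invalid/v1/research/hypotheses"   # hypothetical host
HYPOTHESIS_ID = "3e00165c-c9ca-4f75-b05a-fc4e13dc29fd"
SECRET = os.environ["RESEARCH_API_SECRET"].encode()        # hypothetical secret


def signed_headers(body: bytes) -> dict:
    """Sign the raw request body so the server can check integrity and origin."""
    signature = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return {"Content-Type": "application/json", "X-Signature": signature}


# Add an evidence item (POST .../{id}/evidence); field names are illustrative.
evidence = json.dumps({
    "source": "WebArena (arXiv:2307.13854)",
    "claim": "Realistic web tasks expose tool and planning failures.",
    "threats_to_validity": ["prompt sensitivity", "website drift"],
}).encode()
requests.post(f"{BASE}/{HYPOTHESIS_ID}/evidence",
              data=evidence, headers=signed_headers(evidence)).raise_for_status()

# Update the hypothesis status (PATCH .../{id}); keep "open" until resolved.
patch = json.dumps({"status": "open"}).encode()
requests.patch(f"{BASE}/{HYPOTHESIS_ID}",
               data=patch, headers=signed_headers(patch)).raise_for_status()
```

Signing the exact bytes that are sent avoids canonicalization mismatches between client and server.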
Statement
Benchmarks for LLM agents in realistic interactive environments show brittleness and hidden failure modes; evaluation must be conservative and evidence-backed.
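One way to make "conservative" operational is to report a lower confidence bound on task success over repeated runs rather than the raw mean, so small samples and flaky episodes do not inflate the headline number. The sketch below uses a one-sided Wilson lower bound with made-up task results; it is an illustration, not a prescribed ASAR metric.

```python
# Hedged sketch: conservative success-rate reporting via a Wilson lower bound.
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.645) -> float:
    """One-sided ~95% Wilson lower bound on a binomial success rate."""
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = p_hat + z ** 2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2))
    return (center - margin) / denom


# Hypothetical (task, successes, trials) counts over seeded re-runs of an agent.
results = [("book-flight", 7, 10), ("edit-wiki", 3, 10), ("file-issue", 9, 10)]
for task, ok, n in results:
    print(f"{task}: mean={ok / n:.2f}  conservative={wilson_lower_bound(ok, n):.2f}")
```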
Evidence
- WebArena (arXiv:2307.13854): a realistic web environment for autonomous agents; supports the claim that web-scale tasks are hard and evaluation needs care.
  - Evidence that realistic environments expose tool and planning failures.
  - Caveat: agent performance is sensitive to prompting, tool wrappers, and website changes.
- AgentBench (arXiv:2308.03688): an agent benchmark suite; useful as evidence that systematic agent evaluation exists, but it also highlights brittleness across tasks and setups.
  - Evidence for the need for conservative, well-scoped evaluation.
  - Implication for ASAR: treat benchmarks as evidence items with explicit threats-to-validity (see the sketch after this list).
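As a rough illustration of that implication, each benchmark result could be stored as an evidence item that carries its threats-to-validity explicitly, so downstream status decisions can discount it. The field names below are assumptions, not an actual ASAR schema.

```python
# Hedged sketch: a benchmark result as an evidence item with explicit
# threats-to-validity. Field names are illustrative, not a real schema.
from dataclasses import dataclass, field


@dataclass
class EvidenceItem:
    source: str                              # e.g. paper title / arXiv id
    claim: str                               # what this result is taken to show
    supports_hypothesis: bool
    threats_to_validity: list[str] = field(default_factory=list)

    def is_actionable(self) -> bool:
        """Conservative rule: only weigh evidence whose threats are recorded."""
        return bool(self.threats_to_validity)


agentbench = EvidenceItem(
    source="AgentBench (arXiv:2308.03688)",
    claim="Agent performance is brittle across tasks and setups.",
    supports_hypothesis=True,
    threats_to_validity=["prompt sensitivity", "tool-wrapper differences"],
)
assert agentbench.is_actionable()
```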
Citations
- Zhou et al., "WebArena: A Realistic Web Environment for Building Autonomous Agents", arXiv:2307.13854.
- Liu et al., "AgentBench: Evaluating LLMs as Agents", arXiv:2308.03688.