CORE-Bench Extends Agent Benchmarks Past Accuracy Saturation

AI · zbrandco

A preprint posted to arXiv in June 2026 introduces CORE-Bench v1.1, an updated computational reproducibility benchmark that extends agent benchmarking past accuracy saturation to measure efficiency, reliability, and human-agent collaboration for production AI systems. The work pairs the updated benchmark with a new out-of-distribution task suite to capture these understudied metrics (CORE-Bench v1.1 preprint).

The paper's central claim challenges the standard industry practice of retiring a benchmark once model accuracy hits a performance ceiling. Instead, the research team demonstrates that even after accuracy saturation, CORE-Bench v1.1 remains useful for measuring six understudied agent traits: construct validity (including shortcut exploitation), out-of-distribution generalizability, efficiency, reliability, the relative performance of the base model versus its surrounding scaffold, and uplift from human-agent collaboration (CORE-Bench v1.1 preprint).

Expanded CORE-Bench v1.1 Advances Agent Benchmarking Beyond Accuracy-Only Metrics

When CORE-Bench Hard was first released, agents regularly failed on basic computational reproducibility tasks. This high baseline failure rate made it impossible to spot subtler failure modes, such as agents taking shortcuts to inflate scores without completing full reproduction workflows (CORE-Bench v1.1 preprint).

The updated v1.1 release, paired with the new CORE-Bench out-of-distribution task suite, is designed to surface these construct validity threats, which only emerge as agent capabilities improve. Before this update, top-performing agents on the original CORE-Bench Hard regularly scored near-perfect accuracy on core task sets. This led many AI development teams to write off the benchmark as useless for future evaluation, since it no longer discriminated between leading models (CORE-Bench v1.1 preprint).

Saturated Accuracy Scores Mask Meaningful Performance Gaps

In testing, the research team found that top agents now hit near-perfect accuracy on core CORE-Bench Hard tasks. Even so, the expanded v1.1 suite produces meaningful, discriminative signals for four understudied performance traits: efficiency, reliability, base model performance, and scaffold performance. These metrics are invisible when only accuracy is measured (CORE-Bench v1.1 preprint).
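To illustrate how an accuracy-only score can hide a reliability gap, here is a minimal sketch in Python. The two agents, the task names, the run outcomes, and both metric definitions (`mean_accuracy`, `all_runs_pass_rate`) are assumptions made up for this example; the article does not describe how the preprint computes reliability.

```python
# Hypothetical outcomes of 4 repeated runs per task (True = task reproduced).
# These numbers are illustrative only and do not come from the preprint.
runs_agent_a = {
    "task1": [True, True, True, True],
    "task2": [True, True, True, True],
    "task3": [True, True, True, True],
    "task4": [False, False, False, False],
}
runs_agent_b = {
    "task1": [True, True, True, False],
    "task2": [True, True, False, True],
    "task3": [True, False, True, True],
    "task4": [False, True, True, True],
}


def mean_accuracy(runs):
    """Average success rate over all tasks and runs (the usual headline score)."""
    per_task = [sum(outcomes) / len(outcomes) for outcomes in runs.values()]
    return sum(per_task) / len(per_task)


def all_runs_pass_rate(runs):
    """Share of tasks solved on every single run (one possible reliability measure)."""
    return sum(all(outcomes) for outcomes in runs.values()) / len(runs)


for name, runs in [("agent_a", runs_agent_a), ("agent_b", runs_agent_b)]:
    print(name, mean_accuracy(runs), all_runs_pass_rate(runs))
```

Both hypothetical agents score the same mean accuracy, yet only one of them solves any task consistently across runs, which is the kind of difference an accuracy-only leaderboard cannot show.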

The researchers also isolated the relative contribution of the base model versus the surrounding scaffold (the prompting, tooling, and workflow layers built around the model) by running identical tasks with and without standard agent scaffolds. This comparison cannot be captured with accuracy-only benchmarking, making the expanded suite a unique tool for evaluating production agent systems (CORE-Bench v1.1 preprint).
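The with-and-without-scaffold comparison can be sketched as a simple ablation table. This is a rough illustration, not the preprint's actual analysis: the `AblationResult` class, the model names, and the task counts below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationResult:
    """Tasks solved by one model on the same task set, with and without a scaffold."""
    model: str
    total_tasks: int
    solved_bare: int        # base model alone (hypothetical count)
    solved_scaffolded: int  # same model inside an agent scaffold (hypothetical count)

    @property
    def scaffold_uplift(self) -> int:
        """Extra tasks solved that are attributable to the scaffold."""
        return self.solved_scaffolded - self.solved_bare

    @property
    def base_model_share(self) -> float:
        """Fraction of the scaffolded score the bare model already achieves."""
        return self.solved_bare / self.solved_scaffolded


# Illustrative data: both systems reach the same final score.
results = [
    AblationResult("model-x", total_tasks=20, solved_bare=8, solved_scaffolded=18),
    AblationResult("model-y", total_tasks=20, solved_bare=14, solved_scaffolded=18),
]

for r in results:
    print(r.model, r.scaffold_uplift, round(r.base_model_share, 2))
```

In this made-up data the two systems tie on final accuracy, but one leans far more heavily on its scaffold than the other, which is the distinction the ablation is meant to expose.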

Human-Agent Collaboration Cuts Reproducibility Task Time In Half

The paper also includes results from a small-scale randomized experiment measuring human-agent collaboration on real-world computational reproducibility tasks. The team found a statistically significant ~2x speedup for paired human-agent teams compared to humans working alone on these tasks (CORE-Bench v1.1 preprint).
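For readers unfamiliar with how a speedup claim like this is typically checked, the sketch below computes a speedup ratio and a one-sided exact permutation test. The completion times are invented for illustration, and the permutation test is a stand-in chosen for this example; the article does not say which statistical test the preprint used.

```python
from itertools import combinations

# Hypothetical task completion times in minutes (not data from the preprint).
solo_minutes = [120, 100, 140, 110, 130]   # humans working alone
paired_minutes = [60, 55, 70, 50, 65]      # human-agent teams


def mean(values):
    return sum(values) / len(values)


def speedup(solo, paired):
    """Ratio of mean solo time to mean paired time (2.0 means twice as fast)."""
    return mean(solo) / mean(paired)


def exact_permutation_p_value(solo, paired):
    """One-sided exact permutation test on the difference in mean times.

    Enumerates every way to relabel the pooled times into a 'solo' group of
    the original size and counts relabelings whose mean difference is at
    least as large as the observed one.
    """
    pooled = solo + paired
    total = sum(pooled)
    n_solo, n_paired = len(solo), len(paired)
    observed = mean(solo) - mean(paired)
    extreme = 0
    count = 0
    for indices in combinations(range(len(pooled)), n_solo):
        group_sum = sum(pooled[i] for i in indices)
        diff = group_sum / n_solo - (total - group_sum) / n_paired
        count += 1
        if diff >= observed:
            extreme += 1
    return extreme / count


print(speedup(solo_minutes, paired_minutes))
print(exact_permutation_p_value(solo_minutes, paired_minutes))
```

With samples this small, exhaustive enumeration is cheap (252 relabelings here); larger experiments would usually sample random relabelings instead.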

Last updated: Jun 28, 2026.
Aira

Founding Editor and Publisher of ZBrandCo, covering artificial intelligence, open-source software, and the developer tools people actually use. Signal over hype: every story starts from a primary source and explains why it matters. ZBrandCo runs no paid reviews and no affiliate links. Tips and corrections: editorial@zbrandco.com.