A preprint posted to arXiv in June 2026 introduces CORE-Bench v1.1, an updated computational reproducibility benchmark that extends agent benchmarking past accuracy saturation to measure efficiency, reliability, and human-agent collaboration performance for production AI systems (CORE-Bench v1.1 preprint). The authors pair the updated benchmark with a new out-of-distribution task suite to capture these understudied performance metrics (CORE-Bench v1.1 preprint).
The core finding challenges the common industry practice of retiring benchmarks once model accuracy reaches a ceiling (CORE-Bench v1.1 preprint). Instead, the research team demonstrates that even after accuracy saturation, CORE-Bench v1.1 remains useful for measuring six understudied agent traits: construct validity (including shortcut exploitation), out-of-distribution generalizability, efficiency, reliability, the relative performance of the base model versus its surrounding scaffold, and uplift from human-agent collaboration (CORE-Bench v1.1 preprint).
Expanded CORE-Bench v1.1 Advances Agent Benchmarking Beyond Accuracy-Only Metrics
When the original CORE-Bench Hard was first released, low-performing agents regularly failed on basic computational reproducibility tasks (CORE-Bench v1.1 preprint). This high baseline failure rate made it impossible to spot subtler failure modes, such as agents taking shortcuts to inflate scores without completing full reproduction workflows (CORE-Bench v1.1 preprint).
The updated v1.1 release, paired with the new CORE-Bench out-of-distribution task suite, is designed to surface the construct validity threats that only emerge as agent capabilities improve (CORE-Bench v1.1 preprint). Prior to this update, top-performing agents on the original CORE-Bench Hard regularly scored near-perfect accuracy on core task sets (CORE-Bench v1.1 preprint). This led many AI development teams to write off the benchmark as useless for future evaluation, because it no longer discriminated between leading models (CORE-Bench v1.1 preprint).
Saturated Accuracy Scores Mask Meaningful Performance Gaps
In testing, the research team found that top agents now reach near-perfect accuracy on core CORE-Bench Hard tasks (CORE-Bench v1.1 preprint). Even so, the expanded v1.1 suite produces meaningful, discriminative signals for four understudied performance traits: efficiency, reliability, base model performance, and scaffold performance (CORE-Bench v1.1 preprint). None of these metrics is visible when only accuracy is measured (CORE-Bench v1.1 preprint).
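To make the distinction concrete, the sketch below shows how two agents with identical accuracy can still differ on reliability and efficiency. It is an illustration only: the `Run` record, the `summarize` function, the "consistency" definition (share of tasks solved on every repeated attempt), and the cost-per-success figure are assumptions chosen for this example, not the metrics defined in the preprint.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One attempt by one agent on one task (hypothetical record layout)."""
    agent: str
    task: str
    success: bool
    cost_usd: float


def summarize(runs):
    """Return accuracy, consistency, and cost per success for each agent."""
    grouped = defaultdict(lambda: defaultdict(list))
    for run in runs:
        grouped[run.agent][run.task].append(run)

    report = {}
    for agent, tasks in grouped.items():
        all_runs = [r for attempts in tasks.values() for r in attempts]
        successes = sum(r.success for r in all_runs)
        # A task counts as reliably solved only if every attempt succeeded.
        solved_every_time = sum(
            all(r.success for r in attempts) for attempts in tasks.values()
        )
        total_cost = sum(r.cost_usd for r in all_runs)
        report[agent] = {
            "accuracy": successes / len(all_runs),
            "consistency": solved_every_time / len(tasks),
            "cost_per_success": total_cost / successes if successes else float("inf"),
        }
    return report


# Made-up runs: both agents succeed on half of their attempts.
runs = [
    Run("A", "t1", True, 0.5), Run("A", "t1", True, 0.5),
    Run("A", "t2", False, 0.5), Run("A", "t2", False, 0.5),
    Run("B", "t1", True, 0.25), Run("B", "t1", False, 0.25),
    Run("B", "t2", True, 0.25), Run("B", "t2", False, 0.25),
]
for agent, stats in summarize(runs).items():
    print(agent, stats)
```

In this toy data, agent A solves one task every time and the other never, while agent B solves each task only intermittently at a lower cost per attempt. An accuracy-only leaderboard would rank the two as equal.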
The researchers also isolated the relative contribution of the base model versus the surrounding scaffold (the prompting, tooling, and workflow layers built around the model) by running identical tasks with and without standard agent scaffolds (CORE-Bench v1.1 preprint). This contribution cannot be measured with accuracy-only benchmarking, making the expanded suite a unique tool for production agent system evaluation (CORE-Bench v1.1 preprint).
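A with-and-without comparison of this kind can be sketched as a paired ablation over the same task set. The function name `scaffold_uplift`, the per-task boolean results, and the task identifiers below are hypothetical; the preprint's own procedure and scoring may differ.

```python
def scaffold_uplift(bare, scaffolded):
    """Compare per-task outcomes for the same model with and without a scaffold.

    Both arguments map a task id to True (solved) or False (not solved).
    Only tasks present in both mappings are compared.
    """
    shared = sorted(bare.keys() & scaffolded.keys())
    if not shared:
        raise ValueError("no tasks in common between the two conditions")

    bare_acc = sum(bare[t] for t in shared) / len(shared)
    scaffolded_acc = sum(scaffolded[t] for t in shared) / len(shared)
    return {
        "bare_accuracy": bare_acc,
        "scaffolded_accuracy": scaffolded_acc,
        "uplift": scaffolded_acc - bare_acc,
        # Tasks solved only with the scaffold, and tasks the scaffold broke.
        "gained": [t for t in shared if scaffolded[t] and not bare[t]],
        "lost": [t for t in shared if bare[t] and not scaffolded[t]],
    }


# Made-up outcomes for four tasks.
bare = {"t1": True, "t2": False, "t3": False, "t4": True}
scaffolded = {"t1": True, "t2": True, "t3": False, "t4": True}
print(scaffold_uplift(bare, scaffolded))
```

Reporting the gained and lost task lists alongside the aggregate uplift helps separate a scaffold that adds capability from one that merely shifts which tasks succeed.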
Human-Agent Collaboration Cuts Reproducibility Task Time In Half
The paper also includes results from a small-scale randomized experiment measuring human-agent collaboration on real-world computational reproducibility tasks (CORE-Bench v1.1 preprint). The team found a statistically significant speedup of roughly 2x for paired human-agent teams compared to humans working alone on these tasks (CORE-Bench v1.1 preprint).
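For readers who want to run a similar comparison on their own data, the sketch below computes a speedup ratio and a one-sided exact permutation p-value for two small groups. The completion times are invented for illustration and are not results from the preprint, and the permutation test is one reasonable choice for small samples, not necessarily the analysis the authors used.

```python
from itertools import combinations
from statistics import mean


def speedup(solo_minutes, paired_minutes):
    """Ratio of mean solo completion time to mean paired completion time."""
    return mean(solo_minutes) / mean(paired_minutes)


def exact_permutation_p(solo_minutes, paired_minutes):
    """One-sided exact permutation test on the difference in means.

    Enumerates every way of relabeling the pooled times and returns the
    share of relabelings whose difference is at least the observed one.
    Practical only for small groups, since the count grows combinatorially.
    """
    pooled = list(solo_minutes) + list(paired_minutes)
    n_solo = len(solo_minutes)
    observed = mean(solo_minutes) - mean(paired_minutes)

    at_least_as_extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_solo):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Small tolerance guards against floating-point rounding in the means.
        if mean(group_a) - mean(group_b) >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Invented completion times in minutes, four participants per arm.
solo = [60.0, 70.0, 80.0, 90.0]
paired = [30.0, 35.0, 40.0, 45.0]
print("speedup:", speedup(solo, paired))
print("p-value:", exact_permutation_p(solo, paired))
```

With groups this small, the smallest achievable p-value is bounded by the number of possible relabelings, which is one reason small-scale experiments need to be interpreted cautiously.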
