Hugging Face has published a new open benchmark for evaluating how open models perform on your own tooling. The framework measures the efficiency of the full agentic workflow rather than raw task accuracy alone, addressing a widespread gap in standard AI evaluation practice.
Unlike most existing benchmarks, which verify only whether an agent completes a task correctly, this harness tracks four concrete, measurable efficiency metrics. It can be adapted to any tool that is operable from the command line, not just the Hugging Face transformers library.
All test runs execute as individual Hugging Face Jobs on identical hardware to eliminate cross-environment variability, and the full open-source harness is available in the company’s public repository. 1
The benchmark tracks four specific efficiency metrics to quantify agent performance beyond binary pass/fail accuracy: total token usage, end-to-end task latency, number of failed execution attempts, and total compute cost incurred during the workflow.
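As a rough illustration of how these four metrics might be recorded next to the pass/fail outcome, here is a minimal Python sketch. The `RunMetrics` class, its field names, and the `summarize` helper are hypothetical and are not taken from the published harness; the unit of compute cost is deliberately left unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical record of one agent run (names are illustrative only)."""
    total_tokens: int        # total token usage
    latency_seconds: float   # end-to-end task latency
    failed_attempts: int     # number of failed execution attempts
    compute_cost: float      # total compute cost; unit is an assumption
    passed: bool             # binary task outcome


def summarize(runs: list[RunMetrics]) -> dict[str, float]:
    """Report the pass rate and the mean of each metric over passing runs."""
    passed = [r for r in runs if r.passed]
    if not passed:
        # No passing runs (or no runs at all): there is nothing to average.
        return {"pass_rate": 0.0}
    n = len(passed)
    return {
        "pass_rate": n / len(runs),
        "mean_tokens": sum(r.total_tokens for r in passed) / n,
        "mean_latency_seconds": sum(r.latency_seconds for r in passed) / n,
        "mean_failed_attempts": sum(r.failed_attempts for r in passed) / n,
        "mean_compute_cost": sum(r.compute_cost for r in passed) / n,
    }
```

Averaging over passing runs only is one possible design choice: it keeps a cheap but failed run from making a configuration look efficient.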
To avoid skewed comparative results from nested context, the test suite uses three non-nested tiers of tool access: the bare tier, which provides only the base pip-installed library with no additional context; the clone tier, which adds the full source code of the tool to the agent’s working directory; and the skill tier, which provides only curated documentation and task-specific examples, with no access to the full source tree.
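The tier layout can be pictured as a small lookup table. The sketch below is an assumption-laden illustration, not the harness’s configuration format: the `Tier` and `TierContext` names are hypothetical, and it assumes that the clone tier does not also receive the curated documentation, which is how "non-nested" is read here.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    BARE = "bare"
    CLONE = "clone"
    SKILL = "skill"


@dataclass(frozen=True)
class TierContext:
    """What the agent gets on top of the pip-installed library (hypothetical)."""
    source_tree: bool    # full source code in the working directory
    curated_docs: bool   # curated documentation and task-specific examples


# Assumed mapping: clone and skill each add a different kind of context,
# so neither one is a superset of the other.
TIER_CONTEXT = {
    Tier.BARE: TierContext(source_tree=False, curated_docs=False),
    Tier.CLONE: TierContext(source_tree=True, curated_docs=False),
    Tier.SKILL: TierContext(source_tree=False, curated_docs=True),
}


def extra_context(tier: Tier) -> list[str]:
    """List the additional context a tier provides beyond the base library."""
    ctx = TIER_CONTEXT[tier]
    extras = []
    if ctx.source_tree:
        extras.append("source tree")
    if ctx.curated_docs:
        extras.append("curated docs")
    return extras
```

Under this reading, any difference between clone and skill results reflects the kind of context supplied rather than simply the amount.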
Hugging Face’s engineering team found that some models perform better on the clone tier than on the skill tier, depending on how well they parse raw code compared with structured documentation. All test runs are fanned out across the Hugging Face Jobs platform and run in parallel on identical hardware, which removes the environment variability that would otherwise make cross-model comparisons unreliable. 1
The benchmark’s initial test suite focuses on deterministic, output-verifiable tasks including text classification, image captioning, and audio transcription, which allow for exact match evaluation and eliminate the variability of model-as-judge scoring systems.
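Exact-match scoring of this kind can be sketched in a few lines of Python. The function names are hypothetical, and the normalization step (collapsing whitespace and ignoring case) is an assumption; the harness may compare outputs more strictly or normalize them differently.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case (an assumed normalization)."""
    return " ".join(text.split()).casefold()


def exact_match(predicted: str, expected: str) -> bool:
    """Return True only when the normalized strings are identical."""
    return normalize(predicted) == normalize(expected)


def pass_rate(pairs: Iterable[tuple[str, str]]) -> float:
    """Fraction of (predicted, expected) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(exact_match(p, e) for p, e in pairs) / len(pairs)
```

Because the comparison is deterministic, rerunning the same outputs always yields the same score, which is the property that model-as-judge scoring lacks.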
Early tests of this curated task set found that agents using a task-specific transformers CLI used 1.3–1.8x fewer tokens than agents with only the base pip-installed library available, with peak efficiency gains of 6x for simple text classification tasks.
These results were informed by a prior redesign of the Hugging Face hf CLI for agent use, which optimized the tool’s command structure and documentation for agent parsing. That redesign delivered 1.3–1.8x lower token usage for most tasks, with gains of up to 6x for simple operations. Hugging Face built the new benchmark to test whether those efficiency gains would generalize to the far more complex transformers library, which has a larger API surface and a broader set of use cases. 1
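To make the "Nx fewer tokens" figures concrete, the following sketch converts a pair of token counts into a reduction factor and into the share of tokens saved. It assumes the factor is the ratio of baseline tokens to improved tokens, which is the usual reading of such figures; the helper names are hypothetical.

```python
def reduction_factor(baseline_tokens: int, improved_tokens: int) -> float:
    """Ratio of baseline to improved token usage (assumed definition)."""
    if baseline_tokens <= 0 or improved_tokens <= 0:
        raise ValueError("token counts must be positive")
    return baseline_tokens / improved_tokens


def percent_saved(factor: float) -> float:
    """Share of baseline tokens saved for a given reduction factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (1.0 - 1.0 / factor) * 100.0
```

Under this definition, a 1.3x reduction saves about 23% of the baseline tokens, 1.8x saves about 44%, and 6x saves about 83%.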
This shift in how agents interact with software creates new performance requirements for library maintainers that standard benchmarks do not capture. A clunky API or stale documentation that causes only minor friction for human developers can send agents down longer, more expensive execution paths, or lead them to bypass the library entirely and rewrite logic from scratch, adding hidden cost to production agent deployments.
The benchmark is built to help maintainers quantify how specific design choices affect agent performance, so they can prioritize changes that reduce friction in agent workflows. Most existing benchmarks, by contrast, measure only final task accuracy and ignore the cost and friction of the agent’s end-to-end workflow.
Efficiency gains measured by this benchmark translate directly to lower compute costs and faster response times for production agent deployments, making it a practical tool for teams optimizing libraries for AI agent use. 1
