NVIDIA Blackwell Tops Agentic AI Benchmark at 20x Hopper

NVIDIA’s Blackwell-based GB300 NVL72 platform tops the first industry agentic AI benchmark, AgentPerf, delivering 20x more concurrent agents per megawatt than Hopper-based HGX H200 systems, per Artificial Analysis’s inaugural test results.

What Is AgentPerf and Why Does It Matter?

Artificial Analysis has published the first industry benchmark purpose-built for agentic AI. The AgentPerf suite measures how many concurrent agents a system sustains while each agent chains dozens to hundreds of LLM calls, tool invocations, and growing context windows. This workload profile bears little resemblance to single-request throughput numbers that have dominated inference marketing.

In the inaugural round, the NVIDIA GB300 NVL72 platform posts the highest scores. It runs up to 20× more agents per megawatt than the Hopper-based HGX H200 system at both the 20 and 60 tokens-per-second-per-agent service levels (source: NVIDIA, "NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark").

AgentPerf drives the frontier mixture-of-experts (MoE) model DeepSeek V4 Pro through realistic agent loops that include tool compilation, execution, and retrieval. The benchmark reports two service-level objectives: 20 tokens per second per agent (responsive) and 60 tokens per second per agent (high-throughput).

The headline metric is concurrent agents sustained per megawatt. Power normalization matters because agent fleets run continuously in production, not in short benchmark bursts.
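To make the power-normalized metric concrete, here is a minimal Python sketch of the arithmetic. The function names and every input number are hypothetical placeholders chosen for illustration; they are not AgentPerf results or the benchmark's own methodology.

```python
def agents_per_megawatt(concurrent_agents: int, power_draw_kw: float) -> float:
    """Concurrent agents sustained per megawatt of system power draw."""
    if power_draw_kw <= 0:
        raise ValueError("power_draw_kw must be positive")
    # Multiply by 1000 to convert the kW denominator into MW.
    return concurrent_agents * 1000.0 / power_draw_kw


def efficiency_ratio(candidate: float, baseline: float) -> float:
    """How many times more agents per megawatt the candidate sustains."""
    return candidate / baseline


# Hypothetical inputs, NOT measured values from any benchmark.
system_a = agents_per_megawatt(concurrent_agents=2400, power_draw_kw=120.0)
system_b = agents_per_megawatt(concurrent_agents=50, power_draw_kw=50.0)
print(f"A: {system_a:.0f} agents/MW, B: {system_b:.0f} agents/MW")
print(f"ratio: {efficiency_ratio(system_a, system_b):.1f}x")
```

Because the denominator is power rather than GPU count, a system can score higher either by sustaining more agents or by drawing less power for the same fleet.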

Standard benchmarks treat each request as an independent sprint. Agentic workloads are a relay race where a single user goal spawns a chain of model calls, code execution, and web searches. Each step passes expanding context to the next.

The multiplicative complexity of agentic workloads stresses memory bandwidth, inter-GPU communication, and scheduler efficiency in ways single-turn benchmarks never reveal. A single agent may invoke 50 or more model calls across planning, tool use, and generation phases.
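A rough way to see why chained calls are costlier than independent requests is to total the prompt tokens an agent resends as its context grows. The sketch below assumes a simple linear growth model with no prefix or KV-cache reuse between calls; the function name and token counts are illustrative assumptions, not figures from AgentPerf.

```python
def total_prompt_tokens(steps: int, base_context: int, growth_per_step: int) -> int:
    """Prompt tokens processed across a chain of model calls.

    Assumes every call resends the full accumulated context and that the
    context grows by a fixed number of tokens per step (a simplification).
    """
    return sum(base_context + i * growth_per_step for i in range(steps))


# Hypothetical agent: 2,000-token starting context, 500 new tokens per step.
single_call = total_prompt_tokens(steps=1, base_context=2000, growth_per_step=500)
fifty_calls = total_prompt_tokens(steps=50, base_context=2000, growth_per_step=500)

print(single_call)                # 2000
print(fifty_calls)                # 712500
print(fifty_calls / single_call)  # 356.25
```

Under these assumptions, 50 chained calls process about 356 times the prompt tokens of a single call, not 50 times, because the per-call context keeps growing. Caching schemes reduce this in practice, which is part of why serving-stack design matters for agent workloads.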

At scale, hundreds of agents sit at different pipeline stages simultaneously. This creates a scheduling problem that favors systems with disaggregated prefill/decode and high-bandwidth fabric. AgentPerf captures this by reporting sustained concurrency under power constraints, a direct proxy for operational cost.

GB300 NVL72: Rack-Scale Codesign Shows Up in the Numbers

The GB300 NVL72 connects 72 Blackwell GPUs into a single NVLink domain within one rack. This lets a model the size of DeepSeek V4 Pro distribute expert shards across the full fabric without PCIe bottlenecks. Three stack-level optimizations translate that topology into the 20× efficiency gap:

Layer | Mechanism | Agentic Impact
Interconnect | 72-GPU NVLink domain, 1.8 TB/s all-to-all | MoE expert routing stays on-fabric; no cross-rack hops
Kernel Runtime | CUDA kernels overlap communication and compute | Expert-to-expert handoff latency absorbed, not added
Serving Engine | TensorRT-LLM decouples prefill from decode | Input processing and token generation scale independently

TensorRT-LLM’s disaggregated prefill/decode is particularly relevant for agent workloads. Prompt-heavy planning phases (prefill) and long token-generation tails (decode) schedule on separate GPU pools.

This design raises utilization when hundreds of agents are at different pipeline stages simultaneously.
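The idea of separate prefill and decode pools can be illustrated with a toy router. This is a conceptual sketch only: the class names are hypothetical, it is not TensorRT-LLM's API, and it omits the KV-cache transfer between pools that a real disaggregated server has to perform.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A group of workers tracked by outstanding tokens per worker."""
    name: str
    load: list[int]

    def assign(self, tokens: int) -> int:
        # Pick the least-loaded worker; ties go to the lowest index.
        worker = min(range(len(self.load)), key=self.load.__getitem__)
        self.load[worker] += tokens
        return worker


@dataclass
class DisaggregatedRouter:
    """Routes prompt processing and token generation to separate pools."""
    prefill: Pool
    decode: Pool

    def route(self, prompt_tokens: int, output_tokens: int) -> tuple[int, int]:
        prefill_worker = self.prefill.assign(prompt_tokens)
        decode_worker = self.decode.assign(output_tokens)
        return prefill_worker, decode_worker


# Pool sizes are independent, so each phase can be scaled on its own.
router = DisaggregatedRouter(Pool("prefill", [0, 0]), Pool("decode", [0, 0, 0]))
placements = [
    router.route(prompt_tokens=8000, output_tokens=200),   # prompt-heavy planning step
    router.route(prompt_tokens=500, output_tokens=2000),   # long generation tail
    router.route(prompt_tokens=1000, output_tokens=100),
]
print(placements)  # [(0, 0), (1, 1), (1, 2)]
```

In this toy model a prompt-heavy planning step never queues behind a long generation on the same worker, which is the utilization benefit the paragraph above describes.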

The Ecosystem Context: Agents Need More Than Fast Iron

Benchmark wins do not deploy themselves. The same week AgentPerf results dropped, OpenAI launched its Partner Network with a $150M investment targeting 300,000 certified consultants by year-end (source: OpenAI, "Introducing the OpenAI Partner Network"). The program pairs frontier models with systems-integration, workflow-redesign, and change-management expertise.

Separately, OpenAI Academy added three practitioner courses: AI Foundations, Applied AI Foundations, and Agents and Workflows. The courses were built in partnership with BCG, Accenture, and BBVA to turn prompting into repeatable, auditable workflows (source: OpenAI, "New OpenAI Academy courses for the next era of work").

These parallel moves reinforce what the benchmark implies: agentic value accrues at the system-integration layer, not the model layer alone.

Practical Takeaways for Builders and Operators

  • Capacity planning: Size clusters for sustained concurrent agents per megawatt, not peak single-request throughput. AgentPerf’s power-normalized metric maps directly to operational expenditure (OpEx) for 24/7 agent fleets.
  • Model selection: Mixture-of-experts (MoE) architectures including DeepSeek V4 Pro, Llama 4, and Nemotron 3 Ultra exploit NVL72’s 1.8 TB/s all-to-all NVLink fabric. Dense models will not show the same scaling advantage on the platform.
  • Software stack: TensorRT-LLM’s disaggregated prefill/decode serving and CUDA graph capture are now required for agent workloads. Verify your inference server supports both features before deployment.
  • Observability: Instrument per-agent latency distributions (p50/p99) across planning, tool-call, and generation phases. Aggregate tokens-per-second metrics hide the tail latency that breaks user trust in production agent flows.
  • Procurement: Evaluate rack-scale systems including NVL72, AMD MI300X/MI325X platforms, and Intel Gaudi 3 clusters on agents-per-megawatt efficiency rather than GPU count or FP8 TFLOPS.
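For the observability point above, a minimal sketch of per-phase p50/p99 reporting might look like the following. It uses the nearest-rank percentile definition; the function names, phase labels, and latency samples are hypothetical, and a production system would more likely use its metrics backend's histogram support.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def phase_latency_summary(samples: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group (phase, latency_ms) samples by phase and report p50 and p99."""
    by_phase: dict[str, list[float]] = defaultdict(list)
    for phase, latency_ms in samples:
        by_phase[phase].append(latency_ms)
    return {
        phase: {"p50": percentile(vals, 50), "p99": percentile(vals, 99)}
        for phase, vals in by_phase.items()
    }


# Hypothetical latency samples in milliseconds.
samples = [
    ("planning", 120.0), ("planning", 140.0), ("planning", 135.0),
    ("tool_call", 10.0), ("tool_call", 20.0), ("tool_call", 30.0), ("tool_call", 1000.0),
]
summary = phase_latency_summary(samples)
print(summary["tool_call"])  # {'p50': 20.0, 'p99': 1000.0}
```

The tool-call phase here has a modest median but a p99 of one second, the kind of outlier an aggregate tokens-per-second figure would not surface.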
Last updated: Jun 30, 2026.
Aira

Founding Editor and Publisher of ZBrandCo, covering artificial intelligence, open-source software, and the developer tools people actually use. Signal over hype: every story starts from a primary source and explains why it matters. ZBrandCo runs no paid reviews and no affiliate links. Tips and corrections: editorial@zbrandco.com.