A new training-free method called DivInit improves agentic search by reducing query redundancy in the first turn of parallel sampling, delivering average gains of five to seven points on multi-hop question-answering benchmarks across five open-weight models at matched compute.
Standard parallel sampling hits diminishing returns because models issue nearly identical first queries across rollouts, causing overlapping evidence retrieval that constrains all subsequent turns. DivInit addresses this by drawing n candidate queries from a single model call, selecting k diverse seeds, and launching them as independent trajectories—improving coverage without retraining or added latency.
The Redundancy Bottleneck in Parallel Agentic Search
Test-time scaling for agentic systems typically pulls two levers: depth, meaning more turns and tokens per trajectory, and breadth, meaning more parallel rollouts. The paper, submitted to arXiv on 15 June 2026, shows that breadth scaling via standard parallel sampling yields diminishing returns, and traces the cause to the first turn [^1].
When a model samples k independent initial queries, those queries tend to cluster semantically. The retrieval system then returns overlapping evidence passages, and every subsequent turn in every rollout conditions on that shared context. The effective diversity of the ensemble collapses before the second turn begins.
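The collapse described above can be made measurable. The sketch below is an illustrative assumption, not the paper's own metric: it takes the set of passage IDs each rollout retrieved on its first turn and reports the mean pairwise Jaccard overlap together with the number of distinct passages. The function names and passage IDs are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(retrieved: list[set]) -> float:
    """Average Jaccard similarity over every pair of rollouts.

    `retrieved` holds one set of passage IDs per rollout (hypothetical input
    format). Values near 1.0 mean the rollouts saw nearly the same evidence.
    """
    pairs = list(combinations(retrieved, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def unique_coverage(retrieved: list[set]) -> int:
    """Number of distinct passages seen across all rollouts."""
    return len(set().union(*retrieved))


# Three rollouts whose first queries retrieved mostly the same passages.
rollouts = [{"p1", "p2", "p3"}, {"p1", "p2", "p4"}, {"p1", "p2", "p3"}]
print(mean_pairwise_overlap(rollouts))  # high overlap signals redundancy
print(unique_coverage(rollouts))        # only 4 distinct passages from 9 slots
```

A high overlap combined with low coverage is the signature of the bottleneck: extra rollouts spend compute without surfacing new evidence.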
This dynamic mirrors a broader pattern in agentic evaluation: benchmarks often reward the ability to gather distinct evidence chains, not merely to reason over a single retrieved set. Tools like olmo-eval now treat multi-turn, tool-using trajectories as first-class evaluation targets, making first-turn diversity a measurable lever rather than an implementation detail [^6].
How DivInit Diversifies First-Turn Queries
DivInit intervenes only at the first turn. Instead of sampling k independent queries from k separate model calls, it draws n candidates from a single call, picks k diverse seeds from that pool using a lightweight diversity metric, and runs those k seeds as parallel trajectories. The method adds no trainable parameters, no additional model forward passes beyond the initial candidate generation, and no specialized retrieval infrastructure. Code implementing the approach is available alongside the preprint [^1].
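The select-k-from-n step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the article does not name the diversity metric, so the sketch substitutes token-level Jaccard distance with greedy max-min (farthest-point) selection. The function names `tokens`, `distance`, and `select_diverse` are hypothetical, and the n candidate queries are assumed to have already been produced by a single model call.

```python
import re


def tokens(query: str) -> frozenset:
    """Lowercased alphanumeric tokens of a query."""
    return frozenset(re.findall(r"[a-z0-9]+", query.lower()))


def distance(a: str, b: str) -> float:
    """Jaccard distance between token sets (an assumed stand-in metric)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def select_diverse(candidates: list[str], k: int) -> list[str]:
    """Greedy max-min selection of up to k seeds from the candidate pool.

    Starts from the first candidate, then repeatedly adds the candidate whose
    closest already-selected seed is farthest away.
    """
    if k <= 0 or not candidates:
        return []
    selected = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda c: min(distance(c, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# n = 4 candidates from one (assumed) model call, k = 2 seeds to launch.
candidates = [
    "who directed Inception",
    "director of Inception",
    "Inception film release year",
    "Christopher Nolan birthplace",
]
for seed in select_diverse(candidates, k=2):
    print(seed)
```

Each selected seed would then start its own trajectory. An embedding-based distance could replace the token metric without changing the selection loop.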
The design reflects a growing emphasis on training-free test-time interventions that rearrange existing compute rather than demanding more of it. A similar philosophy appears in long-context architectures like GLM-5.2, where IndexShare reuses attention indices across layers to make 1M-token contexts practical without proportional FLOP increases [^5]. In both cases, the gain comes from restructuring how the model spends its existing budget.
Benchmark Gains Across Five Models and Eight Tasks
The authors evaluate DivInit across five open-weight models and eight benchmarks, reporting consistent improvements over standard parallel sampling at matched compute. On multi-hop QA tasks, the average gain ranges from five to seven points. The paper spans 15 pages with eight figures and is currently under review at EMNLP 2026 [^1].
These results matter because multi-hop QA remains a stress test for agentic retrieval: the model must decide what to search for, retrieve it, synthesize it, and then decide what to search for next—all while avoiding the trap of re-retrieving the same evidence. DivInit’s gains suggest that the first decision—what to search for initially—carries outsized influence over the entire trajectory.
For teams building agentic workflows around test-time scaling, the implication is clear: diversity at the root of the search tree propagates further than diversity at the leaves.
Implications for Test-Time Scaling Strategies
The finding reframes how practitioners should allocate test-time compute. If the first turn dominates trajectory diversity, then investing in a richer, more diverse set of initial queries yields more marginal benefit than simply adding more rollouts seeded from the same distribution.
This holds particular weight for long-horizon coding agents, where a single missed evidence chain can derail hours of downstream work. Frameworks that evaluate long-context models on multi-hop reasoning can now treat first-turn query diversity as a controlled variable rather than a stochastic byproduct.
Bottom line: DivInit's results indicate that a single, well-structured intervention at the first turn of agentic search outperforms standard parallel sampling at matched compute, offering a 5–7 point average lift on multi-hop QA across five models without retraining, and redirecting test-time compute toward the decision that matters most.
