A new self-evolving agent framework improves legal case retrieval by iteratively learning query-rewriting rules without parameter training, outperforming human-designed baselines on the LeCaRD-v2 benchmark. The paper, "When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval," was submitted to arXiv on June 15 and accepted for ACL 2026.
The framework equips an LLM-based agent with an automatic evaluation environment to create rewriting rules, plan validation experiments over rule combinations, and eliminate ineffective rules using historical feedback. On the Chinese LeCaRD-v2 benchmark, it surpasses non-evolutionary baselines including human-crafted rules and greedy selection, with gains most pronounced when driven by a high-capacity LLM (When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval).
Self-Evolving Framework Automates Rule Discovery
Legal case retrieval has long relied on lexical matching because dense retrievers struggle with the precise terminology alignment that legal queries demand. BM25 remains a strong baseline in the domain, motivating the authors to enhance it through rule-driven query rewriting rather than replacing it with a learned dense encoder.
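To make the idea of rule-driven rewriting on top of BM25 concrete, here is a minimal sketch. It is not the paper's implementation: the tiny English corpus, the `expand_synonyms` rule, and the synonym table are all assumptions made up for illustration, and the scoring function is a plain Okapi BM25 written from the standard formula.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def expand_synonyms(query_terms, synonyms):
    """Hypothetical rewriting rule: append formal equivalents of query terms."""
    out = list(query_terms)
    for term in query_terms:
        for alt in synonyms.get(term, []):
            if alt not in out:
                out.append(alt)
    return out

# Toy corpus (assumption): three already-tokenized case summaries.
docs = [
    ["defendant", "committed", "larceny", "at", "night"],
    ["contract", "dispute", "over", "delivery"],
    ["defendant", "injured", "victim", "in", "traffic", "accident"],
]
query = ["stealing", "night"]
rewritten = expand_synonyms(query, {"stealing": ["larceny", "theft"]})

before = bm25_scores(query, docs)
after = bm25_scores(rewritten, docs)
print(rewritten, before, after)
```

The index and scoring function are untouched; only the query changes. In this toy case the rewritten query gains an exact match on "larceny", so the first document's score rises while the unrelated documents stay at zero.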
Their framework gives an LLM agent an evaluation loop: the agent proposes rewriting rules, runs controlled experiments combining those rules, measures retrieval effectiveness, and discards rules that fail to improve scores. This cycle repeats without any gradient updates to the LLM or the retriever (When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval).
The agent’s environment executes each candidate rule set against the benchmark, producing a concrete score that becomes the feedback signal for the next iteration. By treating rule discovery as an experimental science rather than a one-shot prompting task, the system avoids the brittleness of static human-authored rule lists.
The paper notes that the LLM’s ability to reason over prior experimental outcomes, essentially learning from its own ablation studies, is a primary driver of the final rule set’s quality (When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval).
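The propose, test, and eliminate cycle described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' algorithm: the LLM is replaced by a stub (`stub_propose`), the evaluation environment is a lookup table of invented scores (`TOY_SCORES`), and the elimination criterion (keep a rule only if adding it improved at least one tested combination) is one plausible choice among many.

```python
from itertools import combinations

def evolve_rules(propose, evaluate, rounds=2, max_combo=2):
    """Propose rules, test combinations, and drop rules that never help."""
    empty = frozenset()
    scores = {empty: evaluate(empty)}  # experiment log: rule set -> metric
    active = set()
    for _ in range(rounds):
        # The proposer sees the full experiment log (the "historical feedback").
        candidates = active | set(propose(scores))
        # Plan experiments: every combination of up to max_combo rules.
        for size in range(1, max_combo + 1):
            for combo in map(frozenset, combinations(sorted(candidates), size)):
                if combo not in scores:
                    scores[combo] = evaluate(combo)
        # Eliminate: keep a rule only if adding it improved some tested set.
        active = {
            rule for rule in candidates
            if any(rule in c and s > scores[c - {rule}] for c, s in scores.items())
        }
    best = max(scores, key=scores.get)
    return best, scores[best], active

# Invented scores standing in for a retrieval metric (assumption).
TOY_SCORES = {
    frozenset(): 0.50,
    frozenset({"broaden"}): 0.45,
    frozenset({"filter"}): 0.52,
    frozenset({"noise"}): 0.48,
    frozenset({"broaden", "filter"}): 0.60,
    frozenset({"broaden", "noise"}): 0.43,
    frozenset({"filter", "noise"}): 0.50,
}

def stub_propose(log):
    """Stand-in for the LLM: proposes three rules once, then nothing new."""
    return ["broaden", "filter", "noise"] if len(log) == 1 else []

best, best_score, active = evolve_rules(stub_propose, TOY_SCORES.__getitem__)
print(sorted(best), best_score, sorted(active))
```

No model weights change anywhere in the loop; the only state that evolves is the experiment log and the set of surviving rules. In a real system the stub would be an LLM call that reads the log before proposing new rules.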
LeCaRD-v2 Benchmark Shows Measurable Gains
Experiments run on LeCaRD-v2, a Chinese legal case retrieval benchmark containing thousands of query-case pairs, demonstrate that the self-evolving agent consistently outperforms two non-evolutionary baselines: a set of human-designed rewriting rules and a greedy rule-selection strategy that picks rules individually without combinatorial validation.
The margin widens when the core LLM has higher capacity, suggesting that the evolutionary loop amplifies the model’s inherent reasoning and planning skills. Exact numeric lifts are reported in the paper’s tables, which compare nDCG and recall metrics across the rule sets generated at each evolution step (When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval).
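For readers unfamiliar with the two metrics, the sketch below shows one common way to compute nDCG@k and recall@k for a single query. The linear-gain form of DCG is an assumption; the paper may use a different variant or cutoff, and the document IDs and relevance grades here are invented.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_ids, relevance, k):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, grade in relevance.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

# Invented graded judgments for one query (assumption).
relevance = {"a": 3, "b": 1, "c": 2}
perfect = ndcg_at_k(["a", "c", "b", "x"], relevance, k=3)
worse = ndcg_at_k(["x", "a"], relevance, k=2)
recall2 = recall_at_k(["a", "c", "b", "x"], relevance, k=2)
print(perfect, worse, recall2)
```

nDCG rewards placing highly relevant cases near the top, while recall only asks whether relevant cases appear at all, which is why the two are usually reported together.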
The benchmark choice is significant because LeCaRD-v2 reflects real-world legal language complexity — statutes, precedents, and fact patterns that require exact term overlap. Dense retrievers trained on general corpora often miss these low-frequency legal tokens, whereas BM25 augmented with learned rewriting rules preserves lexical precision while expanding recall. The framework’s zero-training requirement also means it can be deployed atop any existing BM25 index without re-indexing or GPU inference costs at query time.
LLM Capacity Drives Evolution Effectiveness
Ablation experiments reveal that the self-evolution mechanism depends critically on the LLM’s capacity to synthesize negative feedback. When a smaller model powers the agent, the rule set converges to a local optimum that resembles the greedy baseline.
With a high-capacity model, the agent identifies subtle rule interactions — such as a broadening rule that hurts precision unless paired with a filtering rule — and eliminates the harmful combinations in later iterations.
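A toy example shows why testing rules one at a time can miss such interactions. Everything here is an assumption for illustration: the seven-document corpus, the two rules, the any-term-overlap retrieval model, and the use of F1 as the score. It is not drawn from the paper's experiments.

```python
from itertools import combinations

# Toy corpus (assumption): document ID -> set of terms.
DOCS = {
    "d1": {"theft", "car"},
    "d2": {"theft", "night"},
    "d3": {"defendant", "contract"},
    "d4": {"property", "lease"},
    "d5": {"property", "tax"},
    "d6": {"property", "sale"},
    "d7": {"property", "damage"},
}
RELEVANT = {"d1", "d2"}
QUERY = {"defendant", "stole", "car"}

def broaden(terms):
    """Hypothetical broadening rule: add related legal terms for 'stole'."""
    return terms | {"theft", "property"} if "stole" in terms else terms

def drop_generic(terms):
    """Hypothetical filtering rule: remove terms that match too many cases."""
    return terms - {"defendant", "property"}

RULES = {"broaden": broaden, "filter": drop_generic}

def f1(rule_names):
    """Apply rules in name order, retrieve on any term overlap, return F1."""
    terms = set(QUERY)
    for name in sorted(rule_names):  # "broaden" sorts before "filter"
        terms = RULES[name](terms)
    retrieved = {doc_id for doc_id, toks in DOCS.items() if toks & terms}
    hits = len(retrieved & RELEVANT)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(retrieved), hits / len(RELEVANT)
    return 2 * precision * recall / (precision + recall)

baseline = f1([])
# Greedy selection: keep each rule only if it helps on its own.
greedy = [name for name in RULES if f1([name]) > baseline]
# Combinatorial validation: score every subset of rules.
subsets = [c for n in range(len(RULES) + 1) for c in combinations(RULES, n)]
best = max(subsets, key=f1)
print(baseline, greedy, f1(greedy), best, f1(best))
```

On its own, the broadening rule pulls in every "property" case and lowers F1 below the baseline, so the greedy strategy rejects it. Paired with the filtering rule, it recovers the second relevant case without the noise, and the combination outscores anything greedy selection can reach.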
The authors attribute this to the LLM’s “intrinsic knowledge of rule elimination,” meaning the model can predict which rule types are likely to conflict before running the experiment, reducing wasted evaluation cycles (When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval).
This finding aligns with a broader trend in agentic LLM research: long-horizon task performance scales with the model’s ability to maintain and exploit a working memory of past actions. Recent work on 1M-context coding agents shows similar dynamics, where the model’s capacity to reference earlier debugging steps determines whether it can complete multi-hour engineering tasks (GLM-5.2: Built for Long-Horizon Tasks).
The legal retrieval agent operates on a shorter horizon but faces an analogous challenge — credit assignment across a sequence of rule proposals and validations.
Implications for Legal AI and Retrieval Systems
The self-evolving approach suggests a practical path for production legal search systems: keep the battle-tested BM25 index, layer an LLM-driven rule optimizer that runs offline, and update the rule set periodically as case law evolves. Because the framework requires no labeled relevance judgments beyond the benchmark’s existing annotations, it sidesteps the expensive annotation bottleneck that blocks many dense-retrieval deployments.
Teams can also audit the final rule list — each rule is a human-readable text transformation — providing transparency that opaque dense vectors lack.
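As a rough picture of what an auditable rule list might look like, the sketch below stores each rule as plain data and applies it with a regular expression. The `RewriteRule` structure, the two example rules, and the JSON export are assumptions for illustration; the paper's actual rule format may differ.

```python
import json
import re
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RewriteRule:
    """A hypothetical human-readable rule: regex pattern plus replacement."""
    name: str
    pattern: str
    replacement: str
    rationale: str

    def apply(self, query: str) -> str:
        return re.sub(self.pattern, self.replacement, query)

# Invented example rules (assumption), each carrying its own justification.
rules = [
    RewriteRule(
        name="mask_party_name",
        pattern=r"\b(?:Mr|Ms)\. \w+",
        replacement="the defendant",
        rationale="Personal names rarely match other cases.",
    ),
    RewriteRule(
        name="add_offence_term",
        pattern=r"\bstole\b",
        replacement="stole theft",
        rationale="Case texts tend to use the offence name.",
    ),
]

def apply_all(query: str) -> str:
    """Apply every rule in list order."""
    for rule in rules:
        query = rule.apply(query)
    return query

rewritten = apply_all("Mr. Zhang stole a car")
audit_log = json.dumps([asdict(rule) for rule in rules], indent=2)
print(rewritten)
print(audit_log)
```

Because each rule is a named record with a stated rationale, a reviewer can read the exported list, disable a single entry, and rerun the evaluation to see its effect.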
Evaluation infrastructure will matter as these systems mature. Frameworks that support rapid, reproducible benchmarking across model checkpoints, such as the olmo-eval workbench released this month, enable developers to measure whether a new LLM actually improves the downstream retrieval metric before committing to a production rollout (olmo-eval: An evaluation workbench for the model development loop). The legal retrieval paper’s experimental design, which treats each evolution step as a controlled experiment, mirrors the per-checkpoint evaluation discipline that such tools encourage.
Bottom line: A self-evolving LLM agent that learns BM25 query-rewriting rules through automated experimentation beats static human-crafted rules on LeCaRD-v2, offering a zero-training, auditable upgrade path for legal search systems that hinges on the core model’s reasoning capacity.
