TL;DR: olmo-eval (AllenAI, June 2026) is an open-source evaluation workbench for the entire LLM development loop — not just final model scoring. Key differentiators: lightweight direct execution by default (containers only when needed), per-question pairwise comparison between checkpoints, and minimum detectable effect (MDE) reporting so you know if a 0.3% accuracy bump is real or noise. This guide walks through install → task definition → suite runs → checkpoint comparison.
What You’ll Learn
- Install olmo-eval and run your first benchmark in <5 minutes
- Define custom tasks (benchmarks) and group them into suites
- Run evaluations with direct execution (fast) vs. sandboxed execution (for code/tools)
- Compare two model checkpoints question-by-question with statistical rigor (MDE)
- Integrate into a continuous evaluation workflow for model development
What You Need (Prerequisites)
| Requirement | Details | Where to Get |
|---|---|---|
| Python | 3.10+ | python.org |
| GPU (optional) | CUDA for local model inference; CPU works for API-backed models | NVIDIA / cloud |
| Model access | Local (Ollama, vLLM, HF transformers) or API (OpenAI, Anthropic, etc.) | Varies |
| AllenAI API key | For OLMES benchmark datasets (free tier available) | allenai.org |
| Docker (optional) | Only for sandboxed tool-use benchmarks | docker.com |
Skill level: Intermediate — comfortable with Python, CLI, and basic LLM concepts.
Step-by-Step Instructions
Step 1: Install olmo-eval
# Create isolated environment (recommended)
python -m venv olmo-eval-env
source olmo-eval-env/bin/activate
# Install from PyPI (includes core + common benchmarks)
pip install olmo-eval
# Verify install
olmo-eval --help
Note: olmo-eval is pure Python with minimal dependencies. No Docker required unless you run code-execution benchmarks. The codebase is open-source (Apache 2.0) at github.com/allenai/olmo-eval.
Step 2: Configure Your Model Provider
olmo-eval supports multiple backends. Create a config.yaml:
# config.yaml
model:
  provider: "hf"  # or "openai", "anthropic", "ollama", "vllm"
  name: "meta-llama/Llama-3.2-3B-Instruct"
  # For API providers:
  # api_key_env: "OPENAI_API_KEY"
# Optional: Default harness settings
harness:
  default_batch_size: 16
  default_max_tokens: 512
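Before launching a long run, it can help to catch a mistyped provider or a missing API key early. The sketch below is a hypothetical preflight helper, not part of olmo-eval. It takes the config as an already-parsed dict (so no particular YAML library is assumed) and checks only the fields shown above. `check_config` and `KNOWN_PROVIDERS` are illustrative names.

```python
import os

# Providers named in the config comment above; the real list may differ.
KNOWN_PROVIDERS = {"hf", "openai", "anthropic", "ollama", "vllm"}


def check_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the basics look fine."""
    problems = []
    model = config.get("model") or {}
    provider = model.get("provider")
    if provider not in KNOWN_PROVIDERS:
        problems.append(f"unknown provider: {provider!r}")
    if not model.get("name"):
        problems.append("model.name is missing")
    # api_key_env holds the NAME of the environment variable, not the key itself.
    key_env = model.get("api_key_env")
    if key_env and not os.environ.get(key_env):
        problems.append(f"environment variable {key_env} is not set")
    return problems
```

Printing the returned list and exiting when it is non-empty is enough to fail fast.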
For local models (Ollama example):
model:
  provider: "ollama"
  name: "llama3.2:3b"
  base_url: "http://localhost:11434"
For vLLM (high-throughput local):
# Terminal 1: Start vLLM server
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8000
# Terminal 2: Configure olmo-eval (config.yaml)
model:
  provider: "vllm"
  name: "meta-llama/Llama-3.2-3B-Instruct"
  base_url: "http://localhost:8000/v1"
Step 3: Run a Built-In Benchmark (Sanity Check)
olmo-eval ships with OLMES (Open Language Model Evaluation Standard) suites — the benchmark standard published by AllenAI in 2024 (OLMES Paper):
# List available suites
olmo-eval list-suites
# Run a quick suite (MMLU subset, GSM8K, etc.)
olmo-eval run-suite olmes_core --config config.yaml --output ./results/run-001
# Output: structured JSON + summary table
Expected output (truncated):
Suite: olmes_core
├── mmlu_stem: 0.623 ± 0.018 (n=1200)
├── gsm8k: 0.714 ± 0.021 (n=1319)
├── humaneval_pass@1: 0.487 ± 0.024 (n=164)
└── bbh: 0.591 ± 0.019 (n=2300)
Overall: 0.604 ± 0.012
Runtime: 4m 23s (direct execution, no containers)
Direct execution is the default. Benchmarks that need only Q&A run as plain Python processes: fast, cheap, and free of container overhead. Per AllenAI’s announcement, this lightweight path is the primary mode; containers spin up only for tool-use benchmarks.
Step 4: Define a Custom Task (Your Benchmark)
Create my_tasks.py:
# my_tasks.py
from olmo_eval.common.formatters import ChatFormatter
from olmo_eval.common.metrics import AccuracyMetric
from olmo_eval.common.scorers import ExactMatchScorer
from olmo_eval.common.types import Instance, SamplingParams
from olmo_eval.data import DataLoader, DataSource
from olmo_eval.evals.tasks.common import Task, register
@register("my_custom_qa")
class MyCustomQA(Task):
    """Custom QA benchmark from JSONL."""

    data_source = DataSource(path="s3://my-bucket/benchmarks/custom_qa.jsonl", split="test")
    formatter = ChatFormatter()
    sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
    metrics = (AccuracyMetric(scorer=ExactMatchScorer),)

    @property
    def instances(self):
        loader = DataLoader()
        for idx, doc in enumerate(loader.load(self.config.get_data_source())):
            yield Instance(
                question=doc["question"],
                gold_answer=doc["answer"],
                metadata={"id": doc.get("id", f"custom_qa_{idx}")},
            )

# Register a few-shot variant
from olmo_eval.evals.tasks.common import register_variant
register_variant("my_custom_qa", "3shot", num_fewshot=3)
Data format (JSONL):
{"question": "What is the capital of France?", "answer": "Paris", "id": "geo_001"}
{"question": "2 + 2 = ?", "answer": "4", "id": "math_001"}
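A malformed line in the JSONL file typically surfaces only partway through a run, so a quick check before uploading can save time. The following is a minimal standard-library sketch. `validate_jsonl` and `REQUIRED_KEYS` are hypothetical names, and the required keys are simply the two fields the task above reads without a fallback.

```python
import json

# The task above reads doc["question"] and doc["answer"]; "id" has a fallback.
REQUIRED_KEYS = ("question", "answer")


def validate_jsonl(lines):
    """Yield (line_number, problem) for each line the task could not consume."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            yield number, f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(record, dict):
            yield number, "expected a JSON object"
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            yield number, f"missing keys: {', '.join(missing)}"
```

Because the function accepts any iterable of strings, an open file handle can be passed directly: `for n, problem in validate_jsonl(open("custom_qa.jsonl")): print(n, problem)`.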
Run your custom task:
olmo-eval run-task my_custom_qa --config config.yaml --output ./results/custom-001
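If accuracy on a custom task comes back unexpectedly low, answer formatting is a common culprit: a model that replies "Paris." does not match the gold answer "Paris" under a strict string comparison. Whether `ExactMatchScorer` normalizes text is not stated here, so the sketch below is only an illustration of the two behaviors, with hypothetical helper names `normalize` and `exact_match`.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def exact_match(prediction: str, gold: str, normalized: bool = True) -> bool:
    """Compare a model answer with the gold answer, optionally after normalizing."""
    if normalized:
        return normalize(prediction) == normalize(gold)
    return prediction == gold


print(exact_match("Paris.", "Paris"))                    # True
print(exact_match("Paris.", "Paris", normalized=False))  # False
print(exact_match("The capital is Paris", "Paris"))      # False
```

The last case shows that normalization does not rescue verbose answers; a prompt that asks for the answer alone, or an answer-extraction step, is still needed.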
Step 5: Run Tool-Use / Code-Execution Benchmarks (Sandboxed)
For benchmarks requiring code execution (HumanEval, MBPP, custom coding tasks):
# Enable Docker sandbox mode
olmo-eval run-suite olmes_coding --config config.yaml \
--harness-mode docker \
--output ./results/coding-001
What happens under the hood:
- olmo-eval spins up asynchronous Docker sandboxes (parallel by default)
- Each problem runs in isolation; model output → code execution → result fed back
- Capability-based routing: Docker for local, Modal for cloud (configure in harness)
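The pattern described above (many isolated executions running concurrently, each with its own result) can be sketched with `asyncio`. This is not olmo-eval's implementation: a plain child Python process stands in for the Docker sandbox and provides no security isolation, and `run_isolated`, `run_all`, and `max_parallel` are hypothetical names.

```python
import asyncio
import sys


async def run_isolated(code: str, timeout: float = 10.0) -> dict:
    """Run one snippet in its own child process and capture the outcome."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", code,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()        # stop runaway code
        await proc.wait()  # reap the process
        return {"passed": False, "stdout": "", "error": "timeout"}
    return {
        "passed": proc.returncode == 0,
        "stdout": out.decode(),
        "error": err.decode(),
    }


async def run_all(snippets, max_parallel: int = 4) -> list:
    """Run every snippet concurrently, at most max_parallel at a time."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(code):
        async with gate:
            return await run_isolated(code)

    # gather() preserves input order, so results[i] belongs to snippets[i].
    return await asyncio.gather(*(guarded(code) for code in snippets))
```

Calling `asyncio.run(run_all(["print(2 + 2)", "raise SystemExit(1)"]))` returns one result dict per snippet. The semaphore bounds how many children run at once, which mirrors why a sandbox pool needs a concurrency cap.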
Step 6: Compare Two Checkpoints (The Killer Feature)
This is where olmo-eval shines for model development — not just final scoring.
# Run same suite on checkpoint A (baseline)
olmo-eval run-suite olmes_core --config config_checkpoint_A.yaml --output ./results/ckpt-A
# Run same suite on checkpoint B (your experiment)
olmo-eval run-suite olmes_core --config config_checkpoint_B.yaml --output ./results/ckpt-B
# Pairwise comparison: question-by-question, with MDE
olmo-eval compare ./results/ckpt-A ./results/ckpt-B --output ./results/comparison
Comparison output includes:
| Metric | Checkpoint A | Checkpoint B | Delta | MDE | Significant? |
|---|---|---|---|---|---|
| mmlu_stem | 0.623 | 0.641 | +0.018 | 0.015 | ✅ Yes |
| gsm8k | 0.714 | 0.709 | -0.005 | 0.018 | ❌ No (within noise) |
| humaneval | 0.487 | 0.512 | +0.025 | 0.022 | ✅ Yes |
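Minimum Detectable Effect (MDE) is the smallest difference that can be reliably distinguished from sampling noise (AllenAI blog). If |delta| < MDE, the change is not statistically significant, even when the number looks positive. In the table above, gsm8k moves by 0.005 against an MDE of 0.018, so that change cannot be told apart from noise.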
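One common way to arrive at an MDE figure is the textbook power-analysis formula, MDE = (z for the significance level + z for the power) × standard error, applied to per-question paired differences. The sketch below follows that formula. It is an assumption about how such a number can be computed, not a description of olmo-eval's internals, and `paired_mde` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def paired_mde(a_correct, b_correct, alpha=0.05, power=0.8):
    """Return (delta, mde) for two aligned lists of 0/1 per-question scores."""
    if len(a_correct) != len(b_correct) or len(a_correct) < 2:
        raise ValueError("need two aligned score lists with at least 2 items")
    n = len(a_correct)
    # Per-question difference: +1 if only B is right, -1 if only A is right.
    diffs = [b - a for a, b in zip(a_correct, b_correct)]
    delta = sum(diffs) / n
    variance = sum((d - delta) ** 2 for d in diffs) / (n - 1)
    standard_error = math.sqrt(variance / n)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    return delta, (z_alpha + z_power) * standard_error
```

Pairing matters: questions that both checkpoints get right (or both get wrong) contribute zero to the variance, so the paired standard error is usually smaller than one computed from two independent accuracy estimates.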
Per-question breakdown (unique to olmo-eval, per AllenAI documentation):
Question 042 (MMLU stem): A=✓ B=✓ → Same
Question 043 (MMLU stem): A=✗ B=✓ → B won
Question 044 (MMLU stem): A=✓ B=✗ → A won
...
Net: B wins 127, A wins 98, Tie 975
This surfaces which specific capabilities improved/regressed — impossible with aggregate scores alone.
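A win/loss tally of this kind is straightforward to reproduce from any two sets of per-question results. The sketch below assumes each run has already been reduced to a mapping from question ID to a correct/incorrect flag. That shape is a simplification chosen for illustration, not the layout of olmo-eval's results JSON, and `tally_pairwise` is a hypothetical name.

```python
from collections import Counter


def tally_pairwise(a_results: dict, b_results: dict):
    """Count ties and wins over shared question IDs; also list the flipped IDs."""
    counts = Counter()
    flipped = {"a_won": [], "b_won": []}
    # Only questions answered by both checkpoints can be compared.
    for qid in sorted(set(a_results) & set(b_results)):
        a_ok, b_ok = a_results[qid], b_results[qid]
        if a_ok == b_ok:
            counts["tie"] += 1
        elif b_ok:
            counts["b_won"] += 1
            flipped["b_won"].append(qid)
        else:
            counts["a_won"] += 1
            flipped["a_won"].append(qid)
    return counts, flipped


a = {"q042": True, "q043": False, "q044": True}
b = {"q042": True, "q043": True, "q044": False}
counts, flipped = tally_pairwise(a, b)
print(dict(counts))      # {'tie': 1, 'b_won': 1, 'a_won': 1}
print(flipped["a_won"])  # ['q044']
```

The `a_won` list is the regression list: these are the questions worth reading by hand before accepting a checkpoint whose aggregate score went up.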
Step 7: Continuous Evaluation Workflow (Putting It Together)
Recommended development loop:
#!/bin/bash
# eval_loop.sh — run after each training checkpoint
CHECKPOINT=$1
SUITE="olmes_core"
CONFIG_BASE="config.yaml"
# 1. Generate config for this checkpoint
# (config.yaml must contain the literal string model_name_placeholder as the model name)
sed "s|model_name_placeholder|$CHECKPOINT|g" "$CONFIG_BASE" > "config_${CHECKPOINT}.yaml"
# 2. Run evaluation (direct execution, fast)
olmo-eval run-suite "$SUITE" --config "config_${CHECKPOINT}.yaml" --output "./results/${CHECKPOINT}"
# 3. Compare against previous checkpoint (if exists)
# Newest first by mtime; skip the current run and earlier compare_* directories
PREV=$(ls -1t results/ | grep -v '^compare_' | grep -vxF -- "$CHECKPOINT" | head -1)
if [ -n "$PREV" ]; then
  olmo-eval compare "./results/$PREV" "./results/${CHECKPOINT}" --output "./results/compare_${PREV}_vs_${CHECKPOINT}"
  echo "Comparison saved: ./results/compare_${PREV}_vs_${CHECKPOINT}"
fi
Usage:
./eval_loop.sh checkpoint-5000
./eval_loop.sh checkpoint-10000
./eval_loop.sh checkpoint-15000
# Each run: ~4 min (direct) → comparison → decision: continue training or pivot
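The script picks the previous run by modification time (`ls -1t`), which can select the wrong directory when an older checkpoint is re-evaluated. Choosing by training step is more robust. The sketch below assumes result directories are named `checkpoint-<step>`, as in the usage lines above; `previous_checkpoint` is a hypothetical helper.

```python
import re
import tempfile
from pathlib import Path

STEP_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def previous_checkpoint(results_dir, current):
    """Return the run with the largest step below `current`, or None."""
    current_match = STEP_PATTERN.match(current)
    if current_match is None:
        raise ValueError(f"unexpected checkpoint name: {current}")
    current_step = int(current_match.group(1))
    earlier = []
    for entry in Path(results_dir).iterdir():
        match = STEP_PATTERN.match(entry.name)
        # Non-matching names (for example compare_* outputs) are ignored.
        if entry.is_dir() and match and int(match.group(1)) < current_step:
            earlier.append((int(match.group(1)), entry.name))
    return max(earlier)[1] if earlier else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("checkpoint-5000", "checkpoint-10000", "compare_x"):
            (Path(tmp) / name).mkdir()
        print(previous_checkpoint(tmp, "checkpoint-10000"))  # checkpoint-5000
```

Comparing steps as integers also avoids the lexical-sort trap in which `checkpoint-10000` sorts before `checkpoint-5000`.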
Complete Workflow Diagram
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ New Checkpoint │────▶│ olmo-eval run │────▶│ Structured │
│ (or model ver) │ │ (direct exec) │ │ Results JSON │
└─────────────────┘ └─────────────────┘ └────────┬────────┘
│
┌─────────────────┐ │
│ olmo-eval │◀─────────────────┘
│ compare │
│ (pairwise + │
│ MDE) │
└────────┬────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Continue │ │ Pivot │ │ Deploy │
│ Training │ │ (regress)│ │ (pass) │
└──────────┘ └──────────┘ └──────────┘
Caption: olmo-eval continuous evaluation loop — Source: Original diagram based on AllenAI documentation
Troubleshooting & FAQ
| Error / Symptom | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: olmo_eval` | Not installed in active env | `pip install olmo-eval` in the correct venv |
| `No such task: my_custom_qa` | Task file not imported | `export OLMO_EVAL_TASKS=my_tasks.py` or place in `olmo_eval/tasks/` |
| Docker sandbox fails | Docker not running / permissions | sudo systemctl start docker; add user to docker group |
| Comparison shows all “within MDE” | Too few questions / high variance | Increase test set size; run multiple seeds |
| OOM on local model | Model too large for GPU | Use smaller model, enable quantization, or use API backend |
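For the "all within MDE" row, a rough estimate of how many questions a benchmark needs can be made by inverting the usual power-analysis formula. The sketch below assumes a paired comparison in which the per-question variance is approximated by the discordant rate (the share of questions on which the two checkpoints disagree). This is a standard approximation, not olmo-eval's own calculation, and `required_questions` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def required_questions(discordant_rate, target_mde, alpha=0.05, power=0.8):
    """Estimate the number of paired questions needed to reach target_mde."""
    if not 0 < discordant_rate <= 1 or target_mde <= 0:
        raise ValueError("discordant_rate must be in (0, 1] and target_mde > 0")
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # MDE = z * sqrt(rate / n)  =>  n = rate * (z / MDE)^2
    return math.ceil(discordant_rate * (z / target_mde) ** 2)
```

Because the test-set size grows with the inverse square of the target, halving the MDE takes roughly four times as many questions. That is why very small deltas are often out of reach for benchmarks with only a few hundred items.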
Q: How does olmo-eval differ from lm-eval-harness?
A: lm-eval-harness is built for final model benchmarking (reproducible, containerized, public leaderboards). olmo-eval is built for development-loop evaluation (fast direct exec, per-question comparison, MDE, changing models constantly) (AllenAI blog; OLMES paper).
Q: Can I use olmo-eval with proprietary models (GPT-4, Claude)?
A: Yes — configure provider: "openai" or "anthropic" with API key. Rate limits apply.
Q: What benchmarks are included out of the box?
A: OLMES suites: olmes_core (MMLU, GSM8K, HumanEval, BBH), olmes_coding (HumanEval, MBPP, LiveCodeBench), olmes_reasoning (GPQA, MATH, ARC). Full list: olmo-eval list-suites.
Q: Does olmo-eval support multi-turn / agent evaluations?
A: Yes — via harness scaffolding and sandbox tool-use mode. Define custom tasks with multi-step interaction logic.
Quick Checklist (Copy-Paste)
[ ] Python 3.10+ env created and activated
[ ] olmo-eval installed (`pip install olmo-eval`)
[ ] Model provider configured (local Ollama/vLLM or API)
[ ] Config YAML written (config.yaml)
[ ] Sanity check: `olmo-eval run-suite olmes_core` runs without error
[ ] Custom task defined (my_tasks.py) and registered
[ ] Custom task runs: `olmo-eval run-task my_custom_qa`
[ ] Sandbox mode tested for code benchmarks (if needed)
[ ] Two checkpoints compared: `olmo-eval compare results/A results/B`
[ ] MDE understood: |delta| < MDE = not significant
[ ] Continuous eval script integrated into training loop
Bottom Line
olmo-eval is the right tool if you’re actively developing LLMs and need fast, statistically rigorous feedback between checkpoints. Its direct-execution default makes it 10-50x faster than container-heavy alternatives for standard Q&A benchmarks, and the per-question comparison with MDE reporting separates real improvements from noise in a way aggregate scores alone cannot. If you only need final-model leaderboard scores once, lm-eval-harness remains the standard. For the development loop, olmo-eval is purpose-built.
Next step: Install it, run olmes_core on your current checkpoint, and set up the comparison script. The first run takes ~4 minutes.
Inline Sources
- AllenAI / Hugging Face Blog (primary): “olmo-eval: An Evaluation Workbench for the Model Development Loop” (June 12, 2026) — https://huggingface.co/blog/allenai/olmo-eval
- Code Repository: AllenAI GitHub — https://github.com/allenai/olmo-eval (Apache 2.0)
- OLMES Paper: “OLMES: Open Language Model Evaluation Standard” (2024) — https://arxiv.org/abs/2401.03597
