J&M Labs Blog

Making an Agentic Benchmark Modeled on Doing Agentic Benchmarks

Most LLM benchmarks have a shelf life. Tests change between versions. Scoring criteria drift as maintainers refine what "correct" means. External data sources — leaderboard prompts, web-fetched documents, knowledge questions — go stale or disappear entirely. By the time a new model drops and you want to compare it to what you measured a year ago, the benchmark itself has moved.

This matters if you actually want to track progress. Not "model X scores 87.3% on MMLU" — that number means nothing in isolation. But "in April 2026, Claude Sonnet scored X on these 27 tasks, and in April 2027, the new model scored Y on the exact same 27 tasks" — that's useful.

Milo-Bench is designed around one constraint: run it unchanged for 2+ years and get fair comparisons.

What Makes Benchmarks Rot

Three failure modes:

Subjective scoring. Rubric-based evaluation ("3=perfect, 2=correct but verbose") changes when the scorer changes. Different humans, different models, different prompting styles — the score drifts even when the model's output didn't.

External dependencies. Tests that fetch URLs, ask about current events, or pull context from third-party sources break when those sources change. "What is the current price of X" is a terrible benchmark question for obvious reasons.

Test mutation. When a benchmark gets updated to "fix" a problem test, old results become incomparable. The test changed; the scores are no longer measuring the same thing.

Milo-Bench addresses all three.

How It Works

Frozen test specs. Suite version v1.0 is locked. If a test needs changing, it gets a new ID and the old one stays. Every test has a spec_version field. You can always re-run v1.0 tests against new models and know you're measuring the same thing.
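
To make the convention concrete, here is a hypothetical sketch of what a frozen spec could look like. Only the `spec_version` field and the new-ID-on-change rule come from the post; every other field name here is an assumption, not Milo-Bench's actual schema:

```python
# Hypothetical test spec illustrating the frozen-ID convention.
# Only spec_version is confirmed above; other fields are assumptions.
test_spec = {
    "id": "tc-001",           # immutable: a changed test gets a NEW id
    "spec_version": "v1.0",   # locked suite version
    "category": "tool_calling",
    "prompt": "What's the weather in Tokyo?",
    "checks": [
        {"type": "tool_called", "name": "get_weather"},
        {"type": "arg_equals", "arg": "city", "value": "Tokyo"},
    ],
}

# A "fix" never mutates tc-001; it ships as a new test alongside it,
# so old v1.0 results stay comparable forever.
fixed_spec = {**test_spec, "id": "tc-001b"}
```

The point of the pattern: any historical score can always be re-derived, because the spec that produced it still exists byte-for-byte.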

Deterministic automated scoring. 113 boolean checks across 27 tests. Score = checks_passed / total_checks. No human judgment, no rubric interpretation. Tool-calling checks use exact name and argument matching. Coding tests actually execute the generated code. Structured output validates against a JSON schema. Long-context checks use exact string or regex matching.

Engine + model + hardware tracking. The same weights served through mlx_lm, llama.cpp, or vLLM produce different results, so Milo-Bench records engine, model, and hardware as separate fields. All three are free text, no enums — they describe whatever OpenAI-compatible endpoint you're running.

SQLite results DB. Every run accumulates in results.db. --compare shows score trends for a model across all historical runs. --leaderboard shows current best per category. --export-csv for external analysis.
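
The accumulate-then-compare pattern can be sketched in a few lines of stdlib `sqlite3`. The table and column names below are assumptions for illustration, not Milo-Bench's actual schema:

```python
# Minimal sketch of the results-DB pattern: every run appends a row,
# and --compare is essentially a per-model trend query. Schema is hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")  # stands in for results.db
db.execute("""CREATE TABLE runs (
    run_date TEXT, model TEXT, model_version TEXT,
    engine_name TEXT, engine_version TEXT, hardware TEXT,
    score REAL)""")

rows = [
    ("2026-04-01", "qwen3.5-397b", "2026-04", "mlx_lm", "0.31.2", "M3 Ultra 512GB", 0.81),
    ("2026-08-01", "qwen3.5-397b", "2026-08", "mlx_lm", "0.33.0", "M3 Ultra 512GB", 0.86),
]
db.executemany("INSERT INTO runs VALUES (?,?,?,?,?,?,?)", rows)

# What --compare amounts to: score trend for one model across all runs.
trend = db.execute(
    "SELECT run_date, model_version, score FROM runs "
    "WHERE model = ? ORDER BY run_date", ("qwen3.5-397b",)).fetchall()
```

Keeping engine, hardware, and model version as plain columns is what lets one query answer "same weights, different engine" as easily as "same engine, newer weights".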

Zero external dependencies. All long-context documents are embedded directly in the test JSON. No fetched URLs. No world-knowledge questions. Tests are self-contained.

What It Tests

Tool Calling (5 tests)

Right tool, right args. Tests selection, argument precision, and knowing when not to call a tool.

Multi-Step Chains (4 tests)

Actual agentic loops. Mock tool responses feed back in. Tests real tool sequencing, not just the first call.

Structured Output (5 tests)

JSON schema compliance, field types, nested objects. Scored on exact field values, not vibes.

Long Context (4 tests)

Needle-in-haystack up to 30k tokens. Cross-referenced answers. Embedded documents, no fetching.

Coding (5 tests)

Generated code runs against test cases in a subprocess. Pass/fail, no partial credit for pretty comments.
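
The run-in-a-subprocess idea reduces to something like the sketch below. This is not Milo-Bench's actual runner; the harness shape and budgets are assumptions:

```python
# Sketch of pass/fail code execution in a subprocess: the generated
# function is concatenated with assertions, and the exit code decides.
import subprocess
import sys
import textwrap

generated = "def add(a, b):\n    return a + b\n"

harness = generated + textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")

# Binary outcome: exit code 0 means every assertion held.
result = subprocess.run([sys.executable, "-c", harness],
                        capture_output=True, timeout=10)
passed = result.returncode == 0
```

The subprocess boundary also isolates the benchmark from whatever the generated code does, and the timeout turns an infinite loop into a plain failure rather than a hung run.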

Cost Efficiency (4 tests)

Penalizes verbosity. Measures token and tool call minimization alongside correctness.
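
One way to keep efficiency scoring consistent with the all-boolean-checks design is budget thresholds. The post only says verbosity and tool-call count are penalized; the specific budgets and check shapes here are assumptions:

```python
# Hypothetical cost-efficiency checks, kept boolean so they compose with
# the deterministic pass-ratio scoring. Budget numbers are illustrative.
def efficiency_checks(correct, tokens_used, tool_calls,
                      token_budget=300, call_budget=2):
    return [
        correct,                        # still has to be right
        tokens_used <= token_budget,    # penalize verbosity
        tool_calls <= call_budget,      # penalize redundant tool calls
    ]

# A correct but verbose answer fails the token check, not everything.
checks = efficiency_checks(correct=True, tokens_used=450, tool_calls=1)
score = sum(checks) / len(checks)
```

Thresholds keep the property that matters here: two people rerunning the same transcript can never disagree about the score.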

Agentic Workflow (5 tests)

End-to-end pipelines: research a model, check hardware, install via Ollama, verify, benchmark, write a report. Up to 15 turns, 10 checks, real branching logic.

The multi-step execution deserves a note. Most benchmark runners make one API call and check if the first tool call looked right. That's not testing agentic capability — that's testing if the model can follow a one-shot instruction. Milo-Bench loops: model calls a tool, runner returns a mock response, model calls another tool, repeat. The scoring checks the full history. You find out pretty quickly which models actually chain reasoning vs. which ones just pattern-match the first step.
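
The loop described above can be sketched as follows. `mock_model`, the tool names, and the message shapes are hypothetical stand-ins; a real runner would call an OpenAI-compatible endpoint instead:

```python
# Sketch of the multi-turn loop: each model tool call is answered with a
# canned mock response, and scoring later inspects the FULL history.

def mock_model(history):
    """Stands in for the API call; replies based on how far the chain is."""
    calls = [m for m in history if m["role"] == "tool_call"]
    script = [
        {"name": "search_models", "args": {"query": "coder"}},
        {"name": "install_model", "args": {"id": "coder-7b"}},
        None,  # None = model stops calling tools and answers
    ]
    return script[len(calls)]

MOCK_RESPONSES = {"search_models": "found: coder-7b",
                  "install_model": "installed ok"}

history = [{"role": "user", "content": "Find and install a coding model."}]
for _ in range(15):                     # turn cap, as in the workflow tests
    call = mock_model(history)
    if call is None:
        break
    history.append({"role": "tool_call", **call})
    history.append({"role": "tool", "content": MOCK_RESPONSES[call["name"]]})

# Scoring checks the whole chain, not just the first call.
tool_sequence = [m["name"] for m in history if m["role"] == "tool_call"]
```

A one-shot runner would only ever see `search_models`; the loop is what exposes whether the model actually consumes each tool result before choosing the next step.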

The new agentic_workflow category (added post-launch) pushes this further. Where multi_step tests max out at ~3 tool calls over ~5 turns, the agentic_workflow tests run 7–12 tool calls over up to 15 turns, with each step depending on the previous step's output. The capstone test (aw-005) has a model research a new LLM, check hardware compatibility, install via Ollama, verify it's serving, run a coding benchmark against it, and write a structured report — all in one coherent chain. Ten checks, all deterministic. Models that shortcut steps or lose state mid-chain score clearly lower than models that maintain coherent context all the way through.

Running It

# Clone and configure
git clone https://github.com/jmeadlock/milo-bench
cd milo-bench
cp .env.example .env  # add OPENCLAW_TOKEN or API keys

# Run cloud models via OpenClaw gateway
python3 bench.py --models cloud --report

# Run a local model with full attribution
python3 bench.py --models local \
  --engine-name "mlx_lm" \
  --engine-version "0.31.2" \
  --hardware "M3 Ultra 512GB" \
  --model-version "Qwen3.5-397B-2026-04"

# Compare a model's history across all past runs
python3 bench.py --compare "anthropic/claude-sonnet-4-6"

# Current best per category
python3 bench.py --leaderboard

The --model-version flag matters. When your local endpoint serves different model weights over time — say you update from Qwen 397B April to Qwen 397B August — the DB needs to know which weights produced which results. Pass a descriptive version string and it stores it alongside the scores.

What It Doesn't Test

Vision. Audio. Long-form generation quality. Reasoning trace evaluation. Very long context beyond 30k tokens.

The focus is agentic capability: can the model use tools correctly, follow multi-step instructions, produce valid structured output, and find information in context? Those are the things that matter for real workloads. Everything else is on the roadmap but not in v1.0, because adding more tests before the core is stable defeats the purpose.

Why We're Sharing It

This isn't a product. There's no company, no SaaS, no roadmap slide. It's a home lab tool that needed to exist, so we built it.

The reason we needed it: we run a lot of models across a lot of hardware, and we kept losing the ability to answer "is this new model actually better than what I had six months ago?" because every time we went to compare, something had changed — the tests, the scoring, the context source. Milo-Bench locks those things down so the comparisons stay valid.

If you're doing similar work — tracking model progress over time, comparing inference engines, figuring out whether local models are closing the gap on cloud — the frozen-spec approach might be useful to you too.

First run results coming once MiniMax M2.7 finishes downloading. It's 228GB and took three attempts. Some things you just have to wait for.

— James & Milo

→ github.com/jmeadlock/milo-bench