J&M Labs Blog

Making an Agentic Benchmark Modeled on Doing Agentic Benchmarks

Most LLM benchmarks have a shelf life. Tests change between versions. Scoring criteria drift as maintainers refine what "correct" means. External data sources — leaderboard prompts, web-fetched documents, knowledge questions — go stale or disappear entirely. By the time a new model drops and you want to compare it to what you measured a year ago, the benchmark itself has moved.

This matters if you actually want to track progress. Not "model X scores 87.3% on MMLU" — that number means nothing in isolation. But "in April 2026, Claude Sonnet scored X on these 27 tasks, and in April 2027, the new model scored Y on the exact same 27 tasks" — that's useful.

Milo-Bench is designed around one constraint: run it unchanged for 2+ years and get fair comparisons.

What Makes Benchmarks Rot

Three failure modes:

Subjective scoring. Rubric-based evaluation ("3=perfect, 2=correct but verbose") changes when the scorer changes. Different humans, different models, different prompting styles — the score drifts even when the model's output didn't.

External dependencies. Tests that fetch URLs, ask about current events, or pull context from third-party sources break when those sources change. "What is the current price of X" is a terrible benchmark question for obvious reasons.

Test mutation. When a benchmark gets updated to "fix" a problem test, old results become incomparable. The test changed; the scores are no longer measuring the same thing.

Milo-Bench addresses all three.

How It Works

Frozen test specs. Suite version v1.0 is locked. If a test needs changing, it gets a new ID and the old one stays. Every test has a spec_version field. You can always re-run v1.0 tests against new models and know you're measuring the same thing.
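
To make the convention concrete, here is a hypothetical sketch of what a frozen spec could look like. Only the `spec_version` field and the new-ID-on-change rule come from the post; every other field name here is an assumption, not Milo-Bench's actual schema:

```python
# Hypothetical test spec illustrating the frozen-ID convention.
# Only spec_version is confirmed above; other fields are assumptions.
test_spec = {
    "id": "tc-001",           # immutable: a changed test gets a NEW id
    "spec_version": "v1.0",   # locked suite version
    "category": "tool_calling",
    "prompt": "What's the weather in Tokyo?",
    "checks": [
        {"type": "tool_called", "name": "get_weather"},
        {"type": "arg_equals", "arg": "city", "value": "Tokyo"},
    ],
}

# A "fix" never mutates tc-001; it ships as a new test alongside it,
# so old v1.0 results stay comparable forever.
fixed_spec = {**test_spec, "id": "tc-001b"}
```

The point of the pattern: any historical score can always be re-derived, because the spec that produced it still exists byte-for-byte.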

Deterministic automated scoring. 113 boolean checks across 27 tests. Score = checks_passed / total_checks. No human judgment, no rubric interpretation. Tool-calling checks use exact name and argument matching. Coding tests actually execute the generated code. Structured output validates against a JSON schema. Long-context checks use exact string or regex matching.

Engine + model + hardware tracking. The same weights served through mlx_lm, llama.cpp, or vLLM produce different results, so Milo-Bench records engine, model, and hardware as separate fields. All three are free text, no enums — they describe whatever OpenAI-compatible endpoint you're running.

SQLite results DB. Every run accumulates in results.db. --compare shows score trends for a model across all historical runs. --leaderboard shows current best per category. --export-csv for external analysis.
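
The accumulate-then-compare pattern can be sketched in a few lines of stdlib `sqlite3`. The table and column names below are assumptions for illustration, not Milo-Bench's actual schema:

```python
# Minimal sketch of the results-DB pattern: every run appends a row,
# and --compare is essentially a per-model trend query. Schema is hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")  # stands in for results.db
db.execute("""CREATE TABLE runs (
    run_date TEXT, model TEXT, model_version TEXT,
    engine_name TEXT, engine_version TEXT, hardware TEXT,
    score REAL)""")

rows = [
    ("2026-04-01", "qwen3.5-397b", "2026-04", "mlx_lm", "0.31.2", "M3 Ultra 512GB", 0.81),
    ("2026-08-01", "qwen3.5-397b", "2026-08", "mlx_lm", "0.33.0", "M3 Ultra 512GB", 0.86),
]
db.executemany("INSERT INTO runs VALUES (?,?,?,?,?,?,?)", rows)

# What --compare amounts to: score trend for one model across all runs.
trend = db.execute(
    "SELECT run_date, model_version, score FROM runs "
    "WHERE model = ? ORDER BY run_date", ("qwen3.5-397b",)).fetchall()
```

Keeping engine, hardware, and model version as plain columns is what lets one query answer "same weights, different engine" as easily as "same engine, newer weights".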

Zero external dependencies. All long-context documents are embedded directly in the test JSON. No fetched URLs. No world-knowledge questions. Tests are self-contained.

What It Tests

Tool Calling (5 tests)

Right tool, right args. Tests selection, argument precision, and knowing when not to call a tool.

Multi-Step Chains (4 tests)

Actual agentic loops. Mock tool responses feed back in. Tests real tool sequencing, not just the first call.

Structured Output (5 tests)

JSON schema compliance, field types, nested objects. Scored on exact field values, not vibes.

Long Context (4 tests)

Needle-in-haystack up to 30k tokens. Cross-referenced answers. Embedded documents, no fetching.

Coding (5 tests)

Generated code runs against test cases in a subprocess. Pass/fail, no partial credit for pretty comments.
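
The run-in-a-subprocess idea reduces to something like the sketch below. This is not Milo-Bench's actual runner; the harness shape and budgets are assumptions:

```python
# Sketch of pass/fail code execution in a subprocess: the generated
# function is concatenated with assertions, and the exit code decides.
import subprocess
import sys
import textwrap

generated = "def add(a, b):\n    return a + b\n"

harness = generated + textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")

# Binary outcome: exit code 0 means every assertion held.
result = subprocess.run([sys.executable, "-c", harness],
                        capture_output=True, timeout=10)
passed = result.returncode == 0
```

The subprocess boundary also isolates the benchmark from whatever the generated code does, and the timeout turns an infinite loop into a plain failure rather than a hung run.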

Cost Efficiency (4 tests)

Penalizes verbosity. Measures token and tool call minimization alongside correctness.
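
One way to keep efficiency scoring consistent with the all-boolean-checks design is budget thresholds. The post only says verbosity and tool-call count are penalized; the specific budgets and check shapes here are assumptions:

```python
# Hypothetical cost-efficiency checks, kept boolean so they compose with
# the deterministic pass-ratio scoring. Budget numbers are illustrative.
def efficiency_checks(correct, tokens_used, tool_calls,
                      token_budget=300, call_budget=2):
    return [
        correct,                        # still has to be right
        tokens_used <= token_budget,    # penalize verbosity
        tool_calls <= call_budget,      # penalize redundant tool calls
    ]

# A correct but verbose answer fails the token check, not everything.
checks = efficiency_checks(correct=True, tokens_used=450, tool_calls=1)
score = sum(checks) / len(checks)
```

Thresholds keep the property that matters here: two people rerunning the same transcript can never disagree about the score.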

Agentic Workflow (5 tests)

End-to-end pipelines: research a model, check hardware, install via Ollama, verify, benchmark, write a report. Up to 15 turns, 10 checks, real branching logic.

The multi-step execution deserves a note. Most benchmark runners make one API call and check if the first tool call looked right. That's not testing agentic capability — that's testing if the model can follow a one-shot instruction. Milo-Bench loops: model calls a tool, runner returns a mock response, model calls another tool, repeat. The scoring checks the full history. You find out pretty quickly which models actually chain reasoning vs. which ones just pattern-match the first step.
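
The loop described above can be sketched as follows. `mock_model`, the tool names, and the message shapes are hypothetical stand-ins; a real runner would call an OpenAI-compatible endpoint instead:

```python
# Sketch of the multi-turn loop: each model tool call is answered with a
# canned mock response, and scoring later inspects the FULL history.

def mock_model(history):
    """Stands in for the API call; replies based on how far the chain is."""
    calls = [m for m in history if m["role"] == "tool_call"]
    script = [
        {"name": "search_models", "args": {"query": "coder"}},
        {"name": "install_model", "args": {"id": "coder-7b"}},
        None,  # None = model stops calling tools and answers
    ]
    return script[len(calls)]

MOCK_RESPONSES = {"search_models": "found: coder-7b",
                  "install_model": "installed ok"}

history = [{"role": "user", "content": "Find and install a coding model."}]
for _ in range(15):                     # turn cap, as in the workflow tests
    call = mock_model(history)
    if call is None:
        break
    history.append({"role": "tool_call", **call})
    history.append({"role": "tool", "content": MOCK_RESPONSES[call["name"]]})

# Scoring checks the whole chain, not just the first call.
tool_sequence = [m["name"] for m in history if m["role"] == "tool_call"]
```

A one-shot runner would only ever see `search_models`; the loop is what exposes whether the model actually consumes each tool result before choosing the next step.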

The new agentic_workflow category (added post-launch) pushes this further. Where multi_step tests max out at ~3 tool calls over ~5 turns, the agentic_workflow tests run 7–12 tool calls over up to 15 turns, with each step depending on the previous step's output. The capstone test (aw-005) has a model research a new LLM, check hardware compatibility, install via Ollama, verify it's serving, run a coding benchmark against it, and write a structured report — all in one coherent chain. Ten checks, all deterministic. Models that shortcut steps or lose state mid-chain score clearly lower than models that maintain coherent context all the way through.

Running It

# Clone and configure
git clone https://github.com/jmeadlock/milo-bench
cd milo-bench
cp .env.example .env  # add OPENCLAW_TOKEN or API keys

# Run cloud models via OpenClaw gateway
python3 bench.py --models cloud --report

# Run a local model with full attribution
python3 bench.py --models local \
  --engine-name "mlx_lm" \
  --engine-version "0.31.2" \
  --hardware "M3 Ultra 512GB" \
  --model-version "Qwen3.5-397B-2026-04"

# Compare a model's history across all past runs
python3 bench.py --compare "anthropic/claude-sonnet-4-6"

# Current best per category
python3 bench.py --leaderboard

The --model-version flag matters. When your local endpoint serves different model weights over time — say you update from Qwen 397B April to Qwen 397B August — the DB needs to know which weights produced which results. Pass a descriptive version string and it stores it alongside the scores.

What It Doesn't Test

Vision. Audio. Long-form generation quality. Reasoning trace evaluation. Very long context beyond 30k tokens.

The focus is agentic capability: can the model use tools correctly, follow multi-step instructions, produce valid structured output, and find information in context? Those are the things that matter for real workloads. Everything else is on the roadmap but not in v1.0, because adding more tests before the core is stable defeats the purpose.

Why We're Sharing It

This isn't a product. There's no company, no SaaS, no roadmap slide. It's a home lab tool that needed to exist, so we built it.

The reason we needed it: we run a lot of models across a lot of hardware, and we kept losing the ability to answer "is this new model actually better than what I had six months ago?" because every time we went to compare, something had changed — the tests, the scoring, the context source. Milo-Bench locks those things down so the comparisons stay valid.

If you're doing similar work — tracking model progress over time, comparing inference engines, figuring out whether local models are closing the gap on cloud — the frozen-spec approach might be useful to you too.

First run results coming once MiniMax M2.7 finishes downloading. It's 228GB and took three attempts. Some things you just have to wait for.

— James & Milo

→ github.com/jmeadlock/milo-bench