Eleven models, the same 20 prompts, deterministic scoring. The question was simple: how does a locally run 397B-parameter model compare to the top cloud models on agentic tool calling?
The answer was surprising.
| Rank | Model | Score | % | Avg Latency | Context | Where |
|---|---|---|---|---|---|---|
| 🥇 | Qwen3.5-397B-A17B-4bit | 58.0/60 | 96.7% | 2.6s | 256K | Local |
| 🥇 | Claude Opus 4.6 | 58.0/60 | 96.7% | 2.6s | 1M | Cloud API |
| 3 | Grok 4.20 | 57.0/60 | 95.0% | 2.6s | 256K | Cloud API |
| 3 | Grok 4.1 Fast | 57.0/60 | 95.0% | 3.5s | 2M | Cloud API |
| 3 | Claude Sonnet 4.6 | 57.0/60 | 95.0% | 1.9s | 1M | Cloud API |
| 6 | GPT-5.4 | 55.0/60 | 91.7% | 1.2s | 1M | Cloud API |
| 6 | Gemini 3.1 Pro | 55.0/60 | 91.7% | 3.9s | 2M | Cloud API |
| 6 | Nemotron-3-Super 120B (Ollama, DGX Spark 2) | 55.0/60 | 91.7% | 7.0s | 128K | Local |
| 9 | GLM-5.1-40B-MXFP4 (mlx-community, Mac Studio) | 54.8/60 | 91.3% | 17.4s | 128K | Local |
| 10 | GPT-5.2 | 53.7/60 | 89.4% | 1.2s | 1M | Cloud API |
| 10 | MiniMax M2.7 (baa-ai MLX) | 53.7/60 | 89.4% | 2.6s | 256K | Local |
Milo-Bench: 20 prompts across 5 categories (4 per category), 3 runs each, each prompt scored 0–3 for a 60-point maximum. All checks are deterministic: exact tool-name match, argument schema validation, multi-turn chain completion. No rubric, no human judgment. The per-category numbers below are mean per-prompt scores, so 3.00 is perfect.
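To make "deterministic" concrete, here's a minimal sketch of a scorer with that shape. The response format, the (tool_name, schema) pair structure, and the one-point-per-check split are assumptions for illustration, not the actual milo-bench code:

```python
"""Minimal sketch of a deterministic tool-call scorer (0-3 per prompt).

Assumptions, not the actual milo-bench code: the model's tool calls are
parsed into [{"name": str, "arguments": dict-or-JSON-string}, ...] and
each prompt specifies its expected calls as (tool_name, json_schema)
pairs in order. One point per deterministic check, no human judgment.
"""
import json

from jsonschema import ValidationError, validate


def score_prompt(calls: list[dict], expected: list[tuple[str, dict]]) -> int:
    score = 0

    # Check 1: exact tool-name match -- every call the model made names
    # the right tool, in the expected order.
    made = [c.get("name") for c in calls]
    want = [name for name, _ in expected]
    if calls and made == want[: len(made)]:
        score += 1

    # Check 2: argument schema validation -- every payload parses and
    # validates against its tool's JSON schema.
    try:
        if not calls:
            raise ValueError("no tool calls to validate")
        for call, (_, schema) in zip(calls, expected):
            args = call["arguments"]
            if isinstance(args, str):  # some APIs return arguments as JSON text
                args = json.loads(args)
            validate(instance=args, schema=schema)
        score += 1
    except (ValidationError, ValueError, KeyError):
        pass

    # Check 3: multi-turn chain completion -- the model made *all* the
    # expected calls, not just a correct first one.
    if len(calls) == len(expected):
        score += 1

    return score
```

Under this hypothetical split, a prompt answered with a correct, well-formed first call but no follow-up would land at 2/3, which is consistent with the multi-step averages hovering around 2.5 below.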
| Category | Qwen 397B | Opus 4.6 | Grok 4.20 | Grok 4.1 Fast | Sonnet 4.6 | GPT-5.4 | Gemini 3.1 | GPT-5.2 | M2.7 |
|---|---|---|---|---|---|---|---|---|---|
| Single tool | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 |
| Tool selection | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 2.50 | 2.75 | 2.50 | 3.00 |
| Multi-step chains | 2.50 | 2.50 | 2.50 | 2.50 | 2.50 | 2.42 | 2.00 | 2.42 | 2.00 |
| Structured output | 3.00 | 3.00 | 3.00 | 2.75 | 2.75 | 2.50 | 3.00 | 2.00 | 2.42 |
| Error recovery | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 |
Multi-step chains are the hard category; every model drops points here. The failure mode is consistent across models: the first tool call is correct, but the model never follows up with the required second call. That makes it as much a prompt-engineering and system-prompt issue as a model-capability issue.
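For context on where chains break, this is the generic shape of the multi-turn loop (a generic OpenAI-style chat-completions loop for illustration, not milo-bench's actual harness; `execute_tool` is a hypothetical helper). The chain fails when, after the first tool result comes back, the model replies in prose instead of emitting the required second call:

```python
def run_chain(client, model, messages, tools, execute_tool, max_steps=8):
    """Drive a multi-turn tool chain; returns (calls_made, final_text)."""
    calls_made = []
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tools, temperature=0
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            # The common failure in this category: a correct first call,
            # then a prose answer here instead of the required second call.
            return calls_made, msg.content
        messages.append(msg)
        for tc in msg.tool_calls:
            calls_made.append(tc)
            result = execute_tool(tc.function.name, tc.function.arguments)
            messages.append(  # feed the result back so the model can
                {             # decide on the follow-up call
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": result,
                }
            )
    return calls_made, None
```

In practice a system-prompt nudge ("keep calling tools until the task is complete") can recover some of these misses, consistent with the prompt-engineering caveat above.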
GPT-5.4's 1.2s average latency is matched only by GPT-5.2 and is significantly faster than everything else, including Sonnet. The tradeoff is accuracy: it scores 91.7%, dropping points on tool selection and multi-step chains where other models succeed. For latency-sensitive workloads that can tolerate some accuracy slippage, it's the obvious choice.
Qwen3.5-397B-A17B-4bit runs on 416GB of the Mac Studio's 512GB unified memory. Inference via mlx_lm.server 0.31.2, temperature=0, thinking mode disabled. It ties Opus 4.6 — the most capable cloud model tested — on overall score, with identical per-category results. The cost per call is $0 after hardware.
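For anyone reproducing the local setup, the rough shape is below. The model path, port, and the weather tool are placeholders, and tool-call handling in mlx_lm.server's chat endpoint may vary by version; the point is that the server speaks an OpenAI-compatible API, so one client drives local and cloud models alike:

```python
from openai import OpenAI

# mlx_lm.server exposes an OpenAI-compatible endpoint. Port and model
# path are placeholders, not the exact invocation used for the runs:
#
#   mlx_lm.server --model <path-to-qwen3.5-397b-a17b-4bit> --port 8080
#
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local",  # the server serves whatever --model it was launched with
    messages=[{"role": "user", "content": "What's the weather in Austin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    temperature=0,  # matches the benchmark's decoding settings
)
print(resp.choices[0].message.tool_calls)
```

The thinking-mode toggle used for the runs is model-specific and omitted here.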
M2.7 has roughly twice the parameters of the 397B model but scores 4.3 points lower. Always-on thinking mode is the likely culprit: it reasons through every response, including simple JSON generation, which introduces drift. The mxfp4 quant (not yet available as of this writing) may tell a different story. The baa-ai MLX weights used here also require --trust-remote-code due to a non-standard model architecture.
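If you load those weights in-process instead of through a server, the equivalent knob in mlx-lm's Python API looks roughly like this. The repo id is a placeholder, and whether the tokenizer_config flag alone suffices for this architecture is an assumption:

```python
from mlx_lm import load, generate

# Placeholder repo id, for illustration only. trust_remote_code mirrors
# the --trust-remote-code flag mentioned above, needed because the
# weights ship a non-standard model architecture.
model, tokenizer = load(
    "baa-ai/MiniMax-M2.7-4bit-MLX",
    tokenizer_config={"trust_remote_code": True},
)
print(generate(model, tokenizer, prompt="ping", max_tokens=8))
```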
This is a tool-calling benchmark, not a general capability benchmark. Models that score well here may not rank the same on coding, reasoning, or writing tasks. The 20-prompt suite is intentionally narrow — real agentic workloads are messier.
All runs, raw JSON, and scoring code are in jmeadlock/milo-bench.