
The Tool-Calling Benchmark: 11 Models, Local vs Cloud

April 12, 2026 · benchmark · local · tool-calling · Milo-Bench on GitHub

Eleven models, the same 20 prompts, deterministic scoring. The question was simple: how does a locally run 397B-parameter model compare to the top cloud models on agentic tool calling?

The answer was surprising.

Results

| Rank | Model | Score | % | Avg Latency | Context | Where |
|---|---|---|---|---|---|---|
| 🥇 1 | Qwen3.5-397B-A17B-4bit | 58.0/60 | 96.7% | 2.6s | 256K | Local |
| 🥇 1 | Claude Opus 4.6 | 58.0/60 | 96.7% | 2.6s | 1M | Cloud API |
| 3 | Grok 4.20 | 57.0/60 | 95.0% | 2.6s | 256K | Cloud API |
| 3 | Grok 4.1 Fast | 57.0/60 | 95.0% | 3.5s | 2M | Cloud API |
| 3 | Claude Sonnet 4.6 | 57.0/60 | 95.0% | 1.9s | 1M | Cloud API |
| 6 | GPT-5.4 | 55.0/60 | 91.7% | 1.2s | 1M | Cloud API |
| 6 | Gemini 3.1 Pro | 55.0/60 | 91.7% | 3.9s | 2M | Cloud API |
| 6 | Nemotron-3-Super 120B (Ollama, DGX Spark 2) | 55.0/60 | 91.7% | 7.0s | 128K | Local |
| 9 | GLM-5.1-40B-MXFP4 (mlx-community, Mac Studio) | 54.8/60 | 91.3% | 17.4s | 128K | Local |
| 10 | GPT-5.2 | 53.7/60 | 89.4% | 1.2s | 1M | Cloud API |
| 10 | MiniMax M2.7 (baa-ai MLX) | 53.7/60 | 89.4% | 2.6s | 256K | Local |

The local 397B ties Opus 4.6 and beats every other cloud model. It runs on a Mac Studio M3 Ultra at $0/call after hardware cost. Latency is comparable to cloud at 2.6s average. Nemotron-3-Super 120B on DGX Spark also clears the 90% bar at 91.7%, though at 7s/call latency — viable for async/batch workloads. GLM-5.1-40B-MXFP4 (91.3%, 17.4s avg) joins the 90%+ tier — notable for a 40B active-parameter MoE running locally, though long-context tasks drive the latency up.

What Was Tested

Milo-Bench — 20 prompts, 5 categories, 3 runs each, scored 0–3 per prompt. All checks are deterministic: exact tool name match, argument schema validation, multi-turn chain completion. No rubric, no human judgment.
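For the curious, a deterministic check of this kind fits in a few lines. This is a minimal sketch, not the actual Milo-Bench scoring code; the function name, the one-point-per-check rubric, and the example tool are all assumptions:

```python
import json

def score_response(call, expected):
    """Score one tool call 0-3 with purely deterministic checks.

    `call` and `expected` are dicts like {"name": ..., "arguments": {...}}.
    Illustrative rubric: one point per check, no partial credit, no judging.
    """
    points = 0
    # Check 1: exact tool name match
    if call.get("name") == expected["name"]:
        points += 1
    # Check 2: argument keys match the expected schema exactly
    if set(call.get("arguments", {})) == set(expected["arguments"]):
        points += 1
    # Check 3: argument values are equal after canonical JSON serialization
    try:
        if json.dumps(call.get("arguments", {}), sort_keys=True) == \
           json.dumps(expected["arguments"], sort_keys=True):
            points += 1
    except TypeError:
        pass  # non-serializable value: no point
    return points

good = {"name": "get_weather", "arguments": {"city": "Austin", "unit": "C"}}
off = {"name": "get_weather", "arguments": {"city": "Austin"}}  # missing arg
print(score_response(good, good))  # 3
print(score_response(off, good))   # 1
```

Because every check is mechanical, two runs over the same transcripts always produce the same scores, which is the point of skipping rubric grading.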

Category Breakdown

| Category | 397B | Opus 4.6 | Grok 4.20 | Grok 4.1 | Sonnet 4.6 | GPT-5.4 | Gemini 3.1 | GPT-5.2 | M2.7 |
|---|---|---|---|---|---|---|---|---|---|
| Single tool | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 |
| Tool selection | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 2.50 | 2.75 | 2.50 | 3.00 |
| Multi-step chains | 2.50 | 2.50 | 2.50 | 2.50 | 2.50 | 2.42 | 2.00 | 2.42 | 2.00 |
| Structured output | 3.00 | 3.00 | 3.00 | 2.75 | 2.75 | 2.50 | 3.00 | 2.00 | 2.42 |
| Error recovery | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 |

Multi-step chains are the hard category — every model drops points here. The failure mode is consistent across models: the first tool call is correct, but the model doesn't follow up with the required second call. This is a prompt engineering / system prompt issue as much as a model capability issue.
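The chain-completion check itself is nothing exotic: an ordered-subsequence test over the tool names in the transcript. A sketch (function and tool names are illustrative, not Milo-Bench's API):

```python
def chain_complete(calls, required):
    """Check that the tool names in `required` appear in `calls` in order.
    Other calls may be interleaved. Purely deterministic: no judgment of
    argument quality, just presence and ordering."""
    it = iter(call["name"] for call in calls)
    # Membership tests consume the iterator, so order is enforced.
    return all(name in it for name in required)

# The failure mode from the table: first call correct, follow-up missing.
transcript = [{"name": "search_flights", "arguments": {}}]
print(chain_complete(transcript, ["search_flights", "book_flight"]))  # False

transcript.append({"name": "book_flight", "arguments": {}})
print(chain_complete(transcript, ["search_flights", "book_flight"]))  # True
```

A check like this is why the category is unforgiving: a model that stops after a correct first call scores the same as one that never started the chain correctly at all.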

Notable Findings

GPT-5.4 is the speed outlier

1.2s average latency — matched only by GPT-5.2 and significantly faster than every other model, including Sonnet. The tradeoff is accuracy: GPT-5.4 scores 91.7%, dropping points on tool selection, multi-step chains, and structured output where other models succeed. For latency-sensitive workloads that can tolerate some accuracy slippage, it's the obvious choice.

Local 397B matches cloud ceiling at lower cost

Qwen3.5-397B-A17B-4bit runs on 416GB of the Mac Studio's 512GB unified memory. Inference via mlx_lm.server 0.31.2, temperature=0, thinking mode disabled. It ties Opus 4.6 — the most capable cloud model tested — on overall score, with identical per-category results. The cost per call is $0 after hardware.
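Reproducing the local setup is straightforward if you have the memory. mlx_lm.server exposes an OpenAI-compatible /v1/chat/completions endpoint, so a benchmark call reduces to one deterministic POST. The port, model name, and helper names below are my own assumptions, not the benchmark's code:

```python
import json
from urllib import request

def build_payload(prompt, tools):
    """Request body matching the benchmark config: temperature=0,
    tools passed through in OpenAI format."""
    return {
        "model": "qwen3.5-397b-a17b-4bit",  # illustrative; the server uses whatever it loaded
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "temperature": 0,
    }

def call_local(prompt, tools, url="http://localhost:8080/v1/chat/completions"):
    """POST to a local OpenAI-compatible server, e.g. one started with
    `mlx_lm.server --model <path> --port 8080` (port is an assumption)."""
    body = json.dumps(build_payload(prompt, tools)).encode()
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_payload("What's the weather in Austin?", [])
print(payload["temperature"])  # 0
```

A full run is then just the 20 prompts looped through call_local three times each, with every response fed to the scorer.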

MiniMax M2.7 underperforms its size

M2.7 has roughly twice the parameters of the Qwen 397B but scores about 7 percentage points lower (89.4% vs 96.7%). Always-on thinking mode is the likely culprit — it reasons through every response, including simple JSON generation, which introduces drift. The mxfp4 quant (not yet available as of this writing) may tell a different story. The baa-ai MLX weights used here also require --trust-remote-code due to a non-standard model architecture.
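One mitigation worth testing for always-on thinkers: strip the reasoning block before parsing, rather than trusting the model to keep it out of the payload. A sketch, assuming the weights wrap reasoning in `<think>...</think>` tags (the delimiter varies by model and is an assumption here):

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def extract_json(text):
    """Drop thinking blocks, then parse the first JSON object found.
    raw_decode tolerates trailing chatter after the object."""
    cleaned = THINK_RE.sub("", text).strip()
    start = cleaned.find("{")
    if start == -1:
        raise ValueError("no JSON object in response")
    obj, _ = json.JSONDecoder().raw_decode(cleaned[start:])
    return obj

raw = '<think>The user wants weather, so...</think>\n{"city": "Austin"}'
print(extract_json(raw))  # {'city': 'Austin'}
```

This only papers over the drift in the structured-output category; it does nothing for a wrong tool choice made mid-reasoning.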

Hardware

Local runs used a Mac Studio M3 Ultra (512GB unified memory) for the MLX models and a DGX Spark 2 for Nemotron via Ollama. Cloud models were called through their standard APIs.

Caveats

This is a tool-calling benchmark, not a general capability benchmark. Models that score well here may not rank the same on coding, reasoning, or writing tasks. The 20-prompt suite is intentionally narrow — real agentic workloads are messier.

All runs, raw JSON, and scoring code are in jmeadlock/milo-bench.

