Three models, same benchmark. Two run locally on a Mac Studio M3 Ultra. One is Claude Sonnet 4.6 via API. The question: how close can local get to cloud on agentic tool calling?
Hardware: Mac Studio M3 Ultra, 512GB unified memory. All local models via mlx_lm.server 0.31.2.
Settings: Qwen3.5-397B at temperature=0.0; MiniMax M2.7 at temperature=1.0, top_p=0.95, top_k=40 per official docs, served with --trust-remote-code.

Benchmark: Milo-Bench — 20 prompts across 5 categories, 3 runs each, deterministic scoring (0–3 per prompt). No rubric, no subjectivity. 180 points total across the three runs; the table below reports per-run averages out of 60.
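Deterministic scoring on a 0–3 scale can be sketched as exact-match checks on the returned tool call. This is a minimal illustration, not the actual Milo-Bench scorer; the rules, function name, and example tool below are assumptions:

```python
import json

def score_response(expected: dict, tool_call: dict) -> int:
    """Score one prompt 0-3: +1 for the right tool, +1 if all required
    arguments are present, +1 if their values match exactly.
    Hypothetical rules for illustration only."""
    score = 0
    if tool_call.get("name") == expected["name"]:
        score += 1
        args = tool_call.get("arguments", {})
        if isinstance(args, str):  # some servers return arguments as a JSON string
            args = json.loads(args)
        if all(k in args for k in expected["arguments"]):
            score += 1
            if all(args[k] == v for k, v in expected["arguments"].items()):
                score += 1
    return score

expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
good = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
partial = {"name": "get_weather", "arguments": '{"city": "Lyon", "unit": "celsius"}'}
print(score_response(expected, good))     # 3
print(score_response(expected, partial))  # 2
```

Because every check is an exact comparison, two passes over the same output always produce the same score, which is what makes the benchmark rubric-free.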
| Category | Qwen3.5-397B 🖥️ 416GB RAM | Sonnet 4.6 ☁️ API | MiniMax M2.7 🖥️ 203GB RAM |
|---|---|---|---|
| Single tool | 3.00/3 · 4.5s | 3.00/3 · 1.6s | 3.00/3 · 2.3s |
| Tool selection | 3.00/3 · 1.5s | 3.00/3 · 1.7s | 3.00/3 · 2.0s |
| Multi-step chains | 2.50/3 · 2.1s | 2.50/3 · 2.0s | 2.00/3 · 2.7s |
| Structured output | 3.00/3 · 3.0s | 2.75/3 · 2.4s | 2.42/3 · 3.7s |
| Error recovery | 3.00/3 · 1.7s | 3.00/3 · 1.7s | 3.00/3 · 2.2s |
| Total | 58.0/60 — 96.7% · 2.6s avg | 57.0/60 — 95.0% · 1.9s avg | 53.7/60 — 89.4% · 2.6s avg |
397B wins outright, one point ahead of Sonnet 4.6. The entire gap is structured output: at temperature=0, 397B reliably hits schema requirements that Sonnet occasionally softens. Every other category, multi-step chains included, is a tie.
Sonnet is faster: 1.9s average vs 2.6s for both local models, API round-trips included. The latency advantage is real but not dramatic, about 0.7s per call.
M2.7 is the surprise underperformer. It has nearly twice the parameters of 397B and uses half the RAM, yet scores 4.3 points lower. The culprit is almost certainly the always-on thinking mode: M2.7 reasons through every response, including simple JSON generation, which introduces drift before it lands on an output. The mxfp4 quant (not tested here) may tell a different story.
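One way always-on thinking shows up downstream is reasoning text wrapped around the JSON payload. A hypothetical mitigation is to strip the reasoning span before parsing; the `<think>` tag name here is an assumption about M2.7's output format, not something verified in this benchmark:

```python
import json
import re

def parse_final_json(raw: str) -> dict:
    """Drop any <think>...</think> reasoning span, then parse the first JSON
    object in what remains. Tag name is assumed, not verified."""
    visible = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    match = re.search(r"\{.*\}", visible, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

raw = '<think>The user wants a city lookup...</think>{"city": "Paris"}'
print(parse_final_json(raw))  # {'city': 'Paris'}
```

Stripping the span only fixes outputs where valid JSON survives after the reasoning; it cannot recover cases where the drift corrupted the payload itself, which is what the structured-output scores suggest happened here.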
The practical takeaway for agentic workloads on Apple Silicon: Qwen3.5-397B is competitive with cloud Sonnet 4.6 on tool calling, runs entirely locally, and costs nothing per call after the hardware. The tradeoff is 416GB of unified memory and 2.6s latency instead of 1.9s.
The M2.7 checkpoint tested is baa-ai/MiniMax-M2.7-RAM-203GB-MLX, not the mlx-community mxfp4 quant. The baa-ai weights bundle a custom model class requiring --trust-remote-code and have thinking baked in with no disable flag. The mxfp4 quant may score higher; we'll rerun when it's available.
Three model runs, back to back, on the same machine. The benchmark script sends identical prompts to each endpoint, collects the tool_calls from each response, scores them deterministically, and saves the results to SQLite. No human evaluation involved. Total wall time: about 25 minutes across all three models.
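The loop described above could look roughly like this sketch. The table schema, the injected `call_model` callable, and the stand-in exact-match scorer are all assumptions for illustration; the real script lives in the Milo-Bench repo:

```python
import sqlite3
import time
from typing import Callable

def run_benchmark(call_model: Callable[[dict], dict], prompts: list[dict],
                  model: str, db_path: str = ":memory:") -> int:
    """Run every prompt through call_model (a wrapper around one HTTP request
    to a chat-completions endpoint), score the returned tool_calls, and
    persist (model, prompt_id, score, latency) rows to SQLite."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS results "
                "(model TEXT, prompt_id TEXT, score INTEGER, latency REAL)")
    total = 0
    for p in prompts:
        t0 = time.monotonic()
        message = call_model(p)          # e.g. {"tool_calls": [...]}
        latency = time.monotonic() - t0
        calls = message.get("tool_calls", [])
        # stand-in deterministic scorer: full marks for an exact match, else 0
        score = 3 if calls and calls[0] == p["expected"] else 0
        con.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                    (model, p["id"], score, latency))
        total += score
    con.commit()
    return total

# Usage with a stub in place of a live endpoint:
prompts = [{"id": "single-tool-01",
            "expected": {"name": "get_time", "arguments": {"tz": "UTC"}}}]
stub = lambda p: {"tool_calls": [{"name": "get_time", "arguments": {"tz": "UTC"}}]}
print(run_benchmark(stub, prompts, model="stub"))  # 3
```

Injecting the HTTP call as a callable keeps the scoring and persistence logic testable without a running server, and swapping endpoints between runs is a one-line change.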
All raw results are in the Milo-Bench repo as JSON files in bench_results/.