Three models, same benchmark. Two run locally on a Mac Studio M3 Ultra. One is Claude Sonnet 4.6 via API. The question: how close can local get to cloud on agentic tool calling?
Hardware: Mac Studio M3 Ultra, 512GB unified memory. All local models via mlx_lm.server 0.31.2.
Settings: Qwen3.5-397B at temperature=0.0; MiniMax M2.7 at temperature=1.0, top_p=0.95, top_k=40 per official docs, served with --trust-remote-code.

Benchmark: Milo-Bench — 20 prompts across 5 categories, 3 runs each, deterministic scoring (0–3 per prompt). No rubric, no subjectivity. 180 points total across the three runs; the table below reports per-run averages out of 60.
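Deterministic scoring on a 0–3 scale can be sketched as exact-match checks on the returned tool call. This is a minimal illustration, not the actual Milo-Bench scorer; the rules, function name, and example tool below are assumptions:

```python
import json

def score_response(expected: dict, tool_call: dict) -> int:
    """Score one prompt 0-3: +1 for the right tool, +1 if all required
    arguments are present, +1 if their values match exactly.
    Hypothetical rules for illustration only."""
    score = 0
    if tool_call.get("name") == expected["name"]:
        score += 1
        args = tool_call.get("arguments", {})
        if isinstance(args, str):  # some servers return arguments as a JSON string
            args = json.loads(args)
        if all(k in args for k in expected["arguments"]):
            score += 1
            if all(args[k] == v for k, v in expected["arguments"].items()):
                score += 1
    return score

expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
good = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
partial = {"name": "get_weather", "arguments": '{"city": "Lyon", "unit": "celsius"}'}
print(score_response(expected, good))     # 3
print(score_response(expected, partial))  # 2
```

Because every check is an exact comparison, two passes over the same output always produce the same score, which is what makes the benchmark rubric-free.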
| Category | Qwen3.5-397B 🖥️ 416GB RAM | Sonnet 4.6 ☁️ API | MiniMax M2.7 🖥️ 203GB RAM |
|---|---|---|---|
| Single tool | 3.00/3 · 4.5s | 3.00/3 · 1.6s | 3.00/3 · 2.3s |
| Tool selection | 3.00/3 · 1.5s | 3.00/3 · 1.7s | 3.00/3 · 2.0s |
| Multi-step chains | 2.50/3 · 2.1s | 2.50/3 · 2.0s | 2.00/3 · 2.7s |
| Structured output | 3.00/3 · 3.0s | 2.75/3 · 2.4s | 2.42/3 · 3.7s |
| Error recovery | 3.00/3 · 1.7s | 3.00/3 · 1.7s | 3.00/3 · 2.2s |
| Total | 58.0/60 — 96.7% · 2.6s avg | 57.0/60 — 95.0% · 1.9s avg | 53.7/60 — 89.4% · 2.6s avg |
397B wins outright, one point ahead of Sonnet 4.6. The entire gap is structured output: at temperature=0, 397B reliably hits schema requirements that Sonnet occasionally softens. Every other category, multi-step chains included, is a tie.
Sonnet is faster: 1.9s average vs 2.6s for both local models, API round-trips included. The latency advantage is real but not dramatic, about 0.7s per call.
M2.7 is the surprise underperformer. It has nearly twice the parameters of 397B and uses half the RAM, yet scores 4.3 points lower. The culprit is almost certainly the always-on thinking mode: M2.7 reasons through every response, including simple JSON generation, which introduces drift before it lands on an output. The mxfp4 quant (not tested here) may tell a different story.
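One way always-on thinking shows up downstream is reasoning text wrapped around the JSON payload. A hypothetical mitigation is to strip the reasoning span before parsing; the `<think>` tag name here is an assumption about M2.7's output format, not something verified in this benchmark:

```python
import json
import re

def parse_final_json(raw: str) -> dict:
    """Drop any <think>...</think> reasoning span, then parse the first JSON
    object in what remains. Tag name is assumed, not verified."""
    visible = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    match = re.search(r"\{.*\}", visible, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

raw = '<think>The user wants a city lookup...</think>{"city": "Paris"}'
print(parse_final_json(raw))  # {'city': 'Paris'}
```

Stripping the span only fixes outputs where valid JSON survives after the reasoning; it cannot recover cases where the drift corrupted the payload itself, which is what the structured-output scores suggest happened here.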
The practical takeaway for agentic workloads on Apple Silicon: Qwen3.5-397B is competitive with cloud Sonnet 4.6 on tool calling, runs entirely locally, and costs nothing per call after the hardware. The tradeoff is 416GB of unified memory and 2.6s latency instead of 1.9s.
The M2.7 checkpoint tested is baa-ai/MiniMax-M2.7-RAM-203GB-MLX, not the mlx-community mxfp4 quant. The baa-ai weights bundle a custom model class requiring --trust-remote-code and have thinking baked in with no disable flag. The mxfp4 quant may score higher; we'll rerun when it's available.
Three model runs, back to back, on the same machine. The benchmark script sends identical prompts to each endpoint, collects the tool_calls from each response, scores them deterministically, and saves the results to SQLite. No human evaluation involved. Total wall time: about 25 minutes across all three models.
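The loop described above could look roughly like this sketch. The table schema, the injected `call_model` callable, and the stand-in exact-match scorer are all assumptions for illustration; the real script lives in the Milo-Bench repo:

```python
import sqlite3
import time
from typing import Callable

def run_benchmark(call_model: Callable[[dict], dict], prompts: list[dict],
                  model: str, db_path: str = ":memory:") -> int:
    """Run every prompt through call_model (a wrapper around one HTTP request
    to a chat-completions endpoint), score the returned tool_calls, and
    persist (model, prompt_id, score, latency) rows to SQLite."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS results "
                "(model TEXT, prompt_id TEXT, score INTEGER, latency REAL)")
    total = 0
    for p in prompts:
        t0 = time.monotonic()
        message = call_model(p)          # e.g. {"tool_calls": [...]}
        latency = time.monotonic() - t0
        calls = message.get("tool_calls", [])
        # stand-in deterministic scorer: full marks for an exact match, else 0
        score = 3 if calls and calls[0] == p["expected"] else 0
        con.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                    (model, p["id"], score, latency))
        total += score
    con.commit()
    return total

# Usage with a stub in place of a live endpoint:
prompts = [{"id": "single-tool-01",
            "expected": {"name": "get_time", "arguments": {"tz": "UTC"}}}]
stub = lambda p: {"tool_calls": [{"name": "get_time", "arguments": {"tz": "UTC"}}]}
print(run_benchmark(stub, prompts, model="stub"))  # 3
```

Injecting the HTTP call as a callable keeps the scoring and persistence logic testable without a running server, and swapping endpoints between runs is a one-line change.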
All raw results are in the Milo-Bench repo as JSON files in bench_results/.