We run Qwen3.5-397B-A17B (4-bit, 416GB) on a Mac Studio M3 Ultra with 512GB unified memory as our daily-driver local LLM. It handles everything: cron jobs, agent reasoning, code generation, voice pipeline summaries. At ~38 tok/s baseline, it's fast enough to be useful. But "fast enough" isn't the same as "fast."
MLX supports speculative decoding: pair a small draft model with the big target model, and theoretically get free speedups. So we tested it properly: 8 configurations, 4 workload types, 2 draft model quantizations, 2 runs per test.
The headline: long reasoning got 58% faster. Everything else got slower. The full story is more interesting than that.
Target model: mlx-community/Qwen3.5-397B-A17B-4bit (416GB loaded)
Draft models: Qwen3.5-4B-MLX-4bit (1.5GB) · Qwen3.5-4B-MLX-8bit (3GB)
Server: mlx_lm.server with --speculative-draft flag
Methodology: Best of 2 reps per config, 4 prompt types, warm runs only
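For reference, each configuration was served roughly like this. This is a sketch, not our exact harness: the model paths and `--speculative-draft` come from the spec above, but `--num-draft-tokens` and `--port` are illustrative and flag names vary across mlx_lm versions.

```shell
# Launch one benchmark config (illustrative; check your mlx_lm version's flags).
mlx_lm.server \
  --model mlx-community/Qwen3.5-397B-A17B-4bit \
  --speculative-draft mlx-community/Qwen3.5-4B-MLX-4bit \
  --num-draft-tokens 5 \
  --port 8080
```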
What Is Speculative Decoding?
Normal autoregressive decoding generates one token at a time. Each token requires a full forward pass through the model. For a 397B-parameter model (even a mixture-of-experts variant where only 17B parameters activate per token), that forward pass is the bottleneck.
Speculative decoding adds a trick: a small "draft" model generates N candidate tokens cheaply. Then the target model verifies all N candidates in a single forward pass, at the same cost as generating one token. Every candidate the target model agrees with is a free token. Rejected candidates are discarded, and generation continues from the last accepted token.
The key insight: this only helps when the draft model's predictions match the target model's output distribution. High-entropy outputs (creative prose, diverse token choices) mean low acceptance rates. Low-entropy outputs (predictable reasoning chains, structured patterns) mean high acceptance rates.
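As a concrete illustration, here is a toy draft-and-verify loop. It uses greedy acceptance for clarity; production implementations (including MLX's) use a rejection-sampling rule so the output distribution matches the target model exactly. The `target`/`draft` callables stand in for real model forward passes.

```python
def speculative_decode(target, draft, prompt, num_draft=5, max_tokens=20):
    """Toy speculative decoding loop with greedy acceptance.

    target, draft: callables mapping a token list to the next token id.
    The output is identical to decoding with `target` alone; the draft
    model only changes how many target passes are needed, never the result.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # 1. Draft model cheaply proposes num_draft candidate tokens.
        ctx = list(tokens)
        candidates = []
        for _ in range(num_draft):
            t = draft(ctx)
            candidates.append(t)
            ctx.append(t)
        # 2. Target verifies the candidates (one real forward pass on GPU;
        #    simulated token-by-token here). Accepted candidates are free.
        for t in candidates:
            if target(tokens) != t:
                break            # first mismatch: discard the rest
            tokens.append(t)
        # 3. The verification pass also yields the target's own next token:
        #    the correction on a rejection, or a bonus token if all matched.
        tokens.append(target(tokens))
    return tokens[len(prompt):]
```

With a perfect draft, each cycle yields num_draft + 1 tokens for one target pass; with a useless draft, you pay the draft cost every cycle and still only advance one token, which is the regression pattern the results below show for hard-to-predict output.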
The Test Matrix
We tested 8 configurations across 4 real workload types that represent our actual production usage:
- short_cron: ~500 token context, ~200 token output. Structured JSON (health check, monitoring). This is what runs every 30 minutes.
- long_reasoning: ~1k token context, ~2k token output. Technical prose, chain-of-thought analysis. The bread and butter of agent work.
- large_context_prose: ~20k token context (real OpenClaw system prompt), ~600 token output. The agent's daily driver conversation mode.
- large_context_tool_call: ~20k token context, structured tool call output (~39 tokens). The tightest, most constrained generation pattern.
Draft token counts tested: 3, 4, 5, and 6. Two draft model quantizations: 4-bit (1.5GB) and 8-bit (3GB). Every config required a full server restart with the 397B model reload; no shortcuts.
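Enumerated from the results tables below, the sweep comes out to the headline count of 8 configurations: a no-draft baseline plus seven draft configs, crossed with 4 workloads and 2 reps each. The names here mirror the config labels used in the tables; the harness code itself looked different.

```python
# Benchmark grid as reconstructed from the results tables.
configs = ["no_draft"] \
    + [f"4bit_draft_{n}" for n in (3, 4, 5, 6)] \
    + [f"8bit_draft_{n}" for n in (3, 4, 5)]
workloads = ["short_cron", "long_reasoning",
             "large_context_prose", "large_context_tool_call"]
runs = [(c, w, rep) for c in configs for w in workloads for rep in (1, 2)]
# 8 configs -> 8 full server restarts; 8 * 4 * 2 = 64 timed generations.
```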
Results: 4-bit Draft Model
| Config | Short Cron (~200 tok out) | Long Reasoning (~2k tok out) | Large Ctx Prose (~600 tok out) | Large Ctx Tool (~39 tok out) |
|---|---|---|---|---|
| no_draft (baseline) | 35.5 tok/s | 38.7 tok/s | 37.5 tok/s | 30.5 tok/s |
| 4bit_draft_3 | 35.8 (1.01×) | 43.2 (1.12×) | 25.9 (0.69×) | 3.6 (0.12×) |
| 4bit_draft_4 | 29.0 (0.82×) | 49.1 (1.27×) | 25.1 (0.67×) | 3.7 (0.12×) |
| 4bit_draft_5 | 30.7 (0.86×) | 55.8 (1.44×) | 24.2 (0.65×) | 3.6 (0.12×) |
| 4bit_draft_6 | 8.5 (0.24×) | 61.1 (1.58×) | 26.9 (0.72×) | 3.7 (0.12×) |
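The multipliers in parentheses are simply the measured rate divided by that column's no-draft baseline. A quick sanity check against the table above:

```python
def speedup(tok_s: float, baseline_tok_s: float) -> float:
    """Multiplier shown in the tables: measured tok/s over the baseline."""
    return tok_s / baseline_tok_s

# Long reasoning, 4bit_draft_6 vs the 38.7 tok/s baseline: the 1.58x headline.
headline = round(speedup(61.1, 38.7), 2)
# Tool calls, 4bit_draft_3 vs the 30.5 tok/s baseline: the 0.12x collapse.
tool_call = round(speedup(3.6, 30.5), 2)
```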
Results: 8-bit Draft Model
| Config | Short Cron (~200 tok out) | Long Reasoning (~2k tok out) | Large Ctx Prose (~600 tok out) | Large Ctx Tool (~39 tok out) |
|---|---|---|---|---|
| no_draft (baseline) | 35.5 tok/s | 38.7 tok/s | 37.5 tok/s | 30.5 tok/s |
| 8bit_draft_3 | 24.4 (0.69×) | 31.8 (0.82×) | 23.1 (0.62×) | 3.6 (0.12×) |
| 8bit_draft_4 | 23.4 (0.66×) | 40.3 (1.04×) | 23.9 (0.64×) | n/a (error) |
| 8bit_draft_5 | 14.1 (0.40×) | 37.3 (0.96×) | 4.1 (0.11×) | 4.1 (0.13×) |
What the Data Says
✅ Long Reasoning: Massive Win
With 4-bit draft_6, long reasoning hits 61.1 tok/s, a 58% speedup over the 38.7 tok/s baseline. Even the more conservative draft_5 delivers 55.8 tok/s (44% faster). This makes sense: long reasoning chains are predictable. The 4B model learns the same chain-of-thought patterns as the 397B, so acceptance rates are high.
❌ Large Context: Consistent Regression
With a ~20k token system prompt (our real OpenClaw agent prompt), every draft configuration was slower than no draft. The best 4-bit config managed 26.9 tok/s vs the 37.5 baseline, a 28% penalty. The draft model can't predict what a 397B model will say when conditioned on 20k tokens of complex system instructions. The overhead of running the draft model and verifying its (mostly wrong) predictions exceeds the benefit of the few tokens it gets right.
❌ Tool Calls: Catastrophic Slowdown
Tool call generation dropped from 30.5 tok/s to ~3.6 tok/s, an 88% slowdown. Tool calls generate very few tokens (~39) in a highly structured format. The overhead per verification cycle is fixed regardless of how many tokens you generate, so short outputs are punished disproportionately. The draft model is also particularly bad at predicting the tool call syntax the target model would emit.
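In wall-clock terms the 0.12× multiplier is stark. For a ~39-token tool call at the measured rates:

```python
tokens = 39                   # typical tool call length from the benchmark
baseline_s = tokens / 30.5    # no draft:  roughly 1.3 s per tool call
with_draft_s = tokens / 3.6   # any draft: roughly 10.8 s per tool call
```

An agent that blocks on tool calls feels that difference on every single action.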
⚠️ Short Tasks: Break-Even at Best
draft_3 barely matched the baseline on short cron tasks (35.8 vs 35.5). Higher draft counts were all slower, and draft_6 collapsed to 8.5 tok/s with truncated output (13 tokens instead of 200). More draft tokens means more wasted speculation when the draft model can't predict short, structured JSON output.
4-bit vs 8-bit Draft: Not Even Close
The 8-bit draft model (3GB, double the size of the 4-bit) was uniformly worse across every single test. On long reasoning, the one workload where speculative decoding actually helps, the 8-bit model topped out at 40.3 tok/s (draft_4) compared to the 4-bit's 61.1 tok/s (draft_6). That's 34% slower at its best config.
Why? The 8-bit model takes longer to run each draft forward pass. In speculative decoding, the draft model's speed is critical: it needs to generate N tokens faster than the target model could generate one. With the 8-bit model, the draft overhead eats into the parallelism gain. The acceptance rate isn't meaningfully higher with 8-bit precision, so the extra inference time is pure loss.
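A first-order cost model (in the spirit of the original speculative decoding analysis by Leviathan et al.) captures both effects at once: acceptance rate drives the gain, draft cost drives the loss. The parameter values below are illustrative, not fitted to our measurements.

```python
def expected_speedup(alpha: float, n_draft: int, draft_cost: float) -> float:
    """First-order speculative decoding speedup model.

    alpha:      per-token probability the target accepts a draft token
    n_draft:    draft tokens proposed per verification cycle
    draft_cost: one draft forward pass as a fraction of one target pass
    """
    # Expected tokens per cycle: the accepted run plus the target's own token.
    tokens_per_cycle = (1 - alpha ** (n_draft + 1)) / (1 - alpha)
    # Cycle cost in target-pass units: n_draft drafts + one verification pass.
    cycle_cost = n_draft * draft_cost + 1
    return tokens_per_cycle / cycle_cost

# Predictable reasoning with a cheap (4-bit) draft: a clear win.
win = expected_speedup(alpha=0.85, n_draft=5, draft_cost=0.05)
# Same acceptance rate, draft twice as slow (8-bit): the gain shrinks.
slower_draft = expected_speedup(alpha=0.85, n_draft=5, draft_cost=0.10)
# Hard-to-predict output (low alpha): below 1.0, i.e., a net slowdown.
loss = expected_speedup(alpha=0.10, n_draft=5, draft_cost=0.05)
```

The model makes the verdict intuitive: doubling draft cost only subtracts from the speedup, while acceptance rate barely moves with draft precision.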
Verdict: The 4-bit draft model is the only viable option. The 8-bit model's extra precision buys nothing useful and costs real wall-clock time.
The Memory Budget
The 397B model occupies 416GB of our 512GB unified memory, leaving 96GB of headroom. Both draft models are trivially small: 1.5GB for 4-bit, 3GB for 8-bit. Memory isn't the constraint here.
The constraint is elsewhere: you cannot run a second large model alongside the 397B. No 32B coding assistant, no 70B specialist. The Mac Studio is a one-model machine at this scale. Speculative decoding works precisely because the draft model is small enough to be negligible: the 4B draft adds less than 0.3% to total memory usage.
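The headroom arithmetic, using the numbers from the setup above:

```python
total_gb, target_gb = 512, 416
draft_gb = {"4bit": 1.5, "8bit": 3.0}
headroom_gb = total_gb - target_gb         # 96 GB left after the 397B loads
draft_share = draft_gb["4bit"] / total_gb  # under 0.3% of unified memory
```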
Production Decision
We switched production to 4bit_draft_5. Here's the reasoning:
- draft_6 is tempting (61.1 tok/s on reasoning) but destroys short tasks (8.5 tok/s, 76% slower) and produces truncated outputs. Too fragile.
- draft_5 gives us 55.8 tok/s on reasoning (44% speedup) while keeping short tasks at 30.7 tok/s (only 13% slower). That's a trade we'll take.
- draft_4 is more conservative โ 49.1 tok/s reasoning, 29.0 tok/s short. Less upside, similar downside.
- draft_3 barely helps anywhere. Not worth the complexity.
The real production answer is that we need workload-aware routing: spec decode for reasoning-heavy tasks, no draft for short structured output and tool calls. That's a future optimization. For now, draft_5 is the best single compromise.
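A minimal sketch of what that routing could look like: a per-workload config table with a no-draft fallback. The workload tags match the benchmark categories; everything else (the config dict shape, where the tag comes from) is hypothetical.

```python
# Hypothetical workload-aware routing table. No-draft is the safe default,
# since the worst spec-decode regressions hit short/structured output.
ROUTES = {
    "long_reasoning":          {"draft_model": "4bit", "num_draft_tokens": 5},
    "short_cron":              {"draft_model": None},
    "large_context_prose":     {"draft_model": None},
    "large_context_tool_call": {"draft_model": None},
}

def route(workload: str) -> dict:
    """Pick a decode config per request; unknown workloads get no draft."""
    return ROUTES.get(workload, {"draft_model": None})
```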
A Note on the Tool Call Anomaly
The no_draft baseline for tool calls showed high variance: 4.44 tok/s on rep 1, then 30.5 tok/s on rep 2 (which became the "best" score). The 30.5 figure may reflect KV cache warmth or an outlier. Every draft configuration consistently measured ~3.6 tok/s across all reps with no such variance. The absolute takeaway stands: speculative decoding does not help tool calls, and likely hurts them. The relative magnitude of the slowdown (whether it's 0.12× or 0.82×) depends on which baseline you trust.
Key Takeaways
- Speculative decoding is not a universal speedup. It's a workload-dependent optimization that can be anywhere from +58% to -88% depending on output type.
- Output length and predictability matter more than context length. Long, predictable reasoning chains benefit enormously. Short, structured outputs are hurt.
- 4-bit draft models beat 8-bit. Draft speed matters more than draft precision. Use the smallest viable draft model.
- Large KV contexts penalize speculation. With 20k+ token contexts, the draft model's predictions are too divergent from the target to be useful.
- The real win is workload-aware routing. The optimal config depends on what you're generating. A static config is always a compromise.
Methodology Notes
- All tok/s figures = `completion_tokens / total_elapsed` (includes TTFT, no streaming)
- Each cell = best of 2 reps
- Large context prompts use the real OpenClaw system prompt (~20k tokens)
- Server fully restarted between configs (launchd bootout/bootstrap cycle, 180s load wait)
- The 397B model was the only loaded model for no_draft runs; draft model was co-loaded for all draft configs
- Raw data: `~/bench_results/spec_bench_20260412_060405.json`