We run Qwen3.5-397B-A17B (4-bit, 416GB) on a Mac Studio M3 Ultra with 512GB unified memory as our daily-driver local LLM. It handles everything: cron jobs, agent reasoning, code generation, voice pipeline summaries. At ~38 tok/s baseline, it's fast enough to be useful. But "fast enough" isn't the same as "fast."
MLX supports speculative decoding: pair a small draft model with the big target model, and theoretically get free speedups. So we tested it properly: 8 configurations, 4 workload types, 2 draft model quantizations, 2 runs per test.
The headline: long reasoning got 58% faster. Everything else got slower. The full story is more interesting than that.
Target model: mlx-community/Qwen3.5-397B-A17B-4bit (416GB loaded)
Draft models: Qwen3.5-4B-MLX-4bit (1.5GB) · Qwen3.5-4B-MLX-8bit (3GB)
Server: mlx_lm.server with --speculative-draft flag
Methodology: Best of 2 reps per config, 4 prompt types, warm runs only
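For reference, each configuration was served roughly like this. This is a sketch, not our exact harness: the model paths and `--speculative-draft` come from the spec above, but `--num-draft-tokens` and `--port` are illustrative and flag names vary across mlx_lm versions.

```shell
# Launch one benchmark config (illustrative; check your mlx_lm version's flags).
mlx_lm.server \
  --model mlx-community/Qwen3.5-397B-A17B-4bit \
  --speculative-draft mlx-community/Qwen3.5-4B-MLX-4bit \
  --num-draft-tokens 5 \
  --port 8080
```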
What Is Speculative Decoding?
Normal autoregressive decoding generates one token at a time. Each token requires a full forward pass through the model. For a 397B-parameter model (even a mixture-of-experts variant where only 17B parameters activate per token), that forward pass is the bottleneck.
Speculative decoding adds a trick: a small "draft" model generates N candidate tokens cheaply. Then the target model verifies all N candidates in a single forward pass, at the same cost as generating one token. Every candidate the target model agrees with is a free token. Rejected candidates are discarded, and generation continues from the last accepted token.
The key insight: this only helps when the draft model's predictions match the target model's output distribution. High-entropy outputs (creative prose, diverse token choices) mean low acceptance rates. Low-entropy outputs (predictable reasoning chains, structured patterns) mean high acceptance rates.
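As a concrete illustration, here is a toy draft-and-verify loop. It uses greedy acceptance for clarity; production implementations (including MLX's) use a rejection-sampling rule so the output distribution matches the target model exactly. The `target`/`draft` callables stand in for real model forward passes.

```python
def speculative_decode(target, draft, prompt, num_draft=5, max_tokens=20):
    """Toy speculative decoding loop with greedy acceptance.

    target, draft: callables mapping a token list to the next token id.
    The output is identical to decoding with `target` alone; the draft
    model only changes how many target passes are needed, never the result.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # 1. Draft model cheaply proposes num_draft candidate tokens.
        ctx = list(tokens)
        candidates = []
        for _ in range(num_draft):
            t = draft(ctx)
            candidates.append(t)
            ctx.append(t)
        # 2. Target verifies the candidates (one real forward pass on GPU;
        #    simulated token-by-token here). Accepted candidates are free.
        for t in candidates:
            if target(tokens) != t:
                break            # first mismatch: discard the rest
            tokens.append(t)
        # 3. The verification pass also yields the target's own next token:
        #    the correction on a rejection, or a bonus token if all matched.
        tokens.append(target(tokens))
    return tokens[len(prompt):]
```

With a perfect draft, each cycle yields num_draft + 1 tokens for one target pass; with a useless draft, you pay the draft cost every cycle and still only advance one token, which is the regression pattern the results below show for hard-to-predict output.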
The Test Matrix
We tested 8 configurations across 4 real workload types that represent our actual production usage:
- short_cron: ~500 token context, ~200 token output. Structured JSON (health check, monitoring). This is what runs every 30 minutes.
- long_reasoning: ~1k token context, ~2k token output. Technical prose, chain-of-thought analysis. The bread and butter of agent work.
- large_context_prose: ~20k token context (real OpenClaw system prompt), ~600 token output. The agent's daily driver conversation mode.
- large_context_tool_call: ~20k token context, structured tool call output (~39 tokens). The tightest, most constrained generation pattern.
Draft token counts tested: 3, 4, 5, and 6. Two draft model quantizations: 4-bit (1.5GB) and 8-bit (3GB). Every config required a full server restart with the 397B model reload; no shortcuts.
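Enumerated from the results tables below, the sweep comes out to the headline count of 8 configurations: a no-draft baseline plus seven draft configs, crossed with 4 workloads and 2 reps each. The names here mirror the config labels used in the tables; the harness code itself looked different.

```python
# Benchmark grid as reconstructed from the results tables.
configs = ["no_draft"] \
    + [f"4bit_draft_{n}" for n in (3, 4, 5, 6)] \
    + [f"8bit_draft_{n}" for n in (3, 4, 5)]
workloads = ["short_cron", "long_reasoning",
             "large_context_prose", "large_context_tool_call"]
runs = [(c, w, rep) for c in configs for w in workloads for rep in (1, 2)]
# 8 configs -> 8 full server restarts; 8 * 4 * 2 = 64 timed generations.
```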
Results: 4-bit Draft Model
| Config | Short Cron (~200 tok out) | Long Reasoning (~2k tok out) | Large Ctx Prose (~600 tok out) | Large Ctx Tool (~39 tok out) |
|---|---|---|---|---|
| no_draft (baseline) | 35.5 tok/s | 38.7 tok/s | 37.5 tok/s | 30.5 tok/s |
| 4bit_draft_3 | 35.8 (1.01×) | 43.2 (1.12×) | 25.9 (0.69×) | 3.6 (0.12×) |
| 4bit_draft_4 | 29.0 (0.82×) | 49.1 (1.27×) | 25.1 (0.67×) | 3.7 (0.12×) |
| 4bit_draft_5 | 30.7 (0.86×) | 55.8 (1.44×) | 24.2 (0.65×) | 3.6 (0.12×) |
| 4bit_draft_6 | 8.5 (0.24×) | 61.1 (1.58×) | 26.9 (0.72×) | 3.7 (0.12×) |
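The multipliers in parentheses are simply the measured rate divided by that column's no-draft baseline. A quick sanity check against the table above:

```python
def speedup(tok_s: float, baseline_tok_s: float) -> float:
    """Multiplier shown in the tables: measured tok/s over the baseline."""
    return tok_s / baseline_tok_s

# Long reasoning, 4bit_draft_6 vs the 38.7 tok/s baseline: the 1.58x headline.
headline = round(speedup(61.1, 38.7), 2)
# Tool calls, 4bit_draft_3 vs the 30.5 tok/s baseline: the 0.12x collapse.
tool_call = round(speedup(3.6, 30.5), 2)
```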
Results: 8-bit Draft Model
| Config | Short Cron (~200 tok out) | Long Reasoning (~2k tok out) | Large Ctx Prose (~600 tok out) | Large Ctx Tool (~39 tok out) |
|---|---|---|---|---|
| no_draft (baseline) | 35.5 tok/s | 38.7 tok/s | 37.5 tok/s | 30.5 tok/s |
| 8bit_draft_3 | 24.4 (0.69×) | 31.8 (0.82×) | 23.1 (0.62×) | 3.6 (0.12×) |
| 8bit_draft_4 | 23.4 (0.66×) | 40.3 (1.04×) | 23.9 (0.64×) | n/a (error) |
| 8bit_draft_5 | 14.1 (0.40×) | 37.3 (0.96×) | 4.1 (0.11×) | 4.1 (0.13×) |
What the Data Says
✅ Long Reasoning: Massive Win
With 4-bit draft_6, long reasoning hits 61.1 tok/s, a 58% speedup over the 38.7 tok/s baseline. Even the more conservative draft_5 delivers 55.8 tok/s (44% faster). This makes sense: long reasoning chains are predictable. The 4B model learns the same chain-of-thought patterns as the 397B, so acceptance rates are high.
❌ Large Context: Consistent Regression
With a ~20k token system prompt (our real OpenClaw agent prompt), every draft configuration was slower than no draft. The best 4-bit config managed 26.9 tok/s vs the 37.5 baseline, a 28% penalty. The draft model can't predict what a 397B model will say when conditioned on 20k tokens of complex system instructions. The overhead of running the draft model and verifying its (mostly wrong) predictions exceeds the benefit of the few tokens it gets right.
❌ Tool Calls: Catastrophic Slowdown
Tool call generation dropped from 30.5 tok/s to ~3.6 tok/s, an 88% slowdown. Tool calls generate very few tokens (~39) in a highly structured format. The overhead per verification cycle is fixed regardless of how many tokens you generate, so short outputs are punished disproportionately. The draft model is also particularly bad at predicting the tool call syntax the target model would emit.
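In wall-clock terms the 0.12× multiplier is stark. For a ~39-token tool call at the measured rates:

```python
tokens = 39                   # typical tool call length from the benchmark
baseline_s = tokens / 30.5    # no draft:  roughly 1.3 s per tool call
with_draft_s = tokens / 3.6   # any draft: roughly 10.8 s per tool call
```

An agent that blocks on tool calls feels that difference on every single action.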
⚠️ Short Tasks: Break-Even at Best
draft_3 barely matched the baseline on short cron tasks (35.8 vs 35.5). Higher draft counts were all slower, and draft_6 collapsed to 8.5 tok/s with truncated output (13 tokens instead of 200). More draft tokens means more wasted speculation when the draft model can't predict short, structured JSON output.
4-bit vs 8-bit Draft: Not Even Close
The 8-bit draft model (3GB, double the size of the 4-bit) was uniformly worse across every single test. On long reasoning, the one workload where speculative decoding actually helps, the 8-bit model topped out at 40.3 tok/s (draft_4) compared to the 4-bit's 61.1 tok/s (draft_6). That's 34% slower at its best config.
Why? The 8-bit model takes longer to run each draft forward pass. In speculative decoding, the draft model's speed is critical: it needs to generate N tokens faster than the target model could generate one. With the 8-bit model, the draft overhead eats into the parallelism gain. The acceptance rate isn't meaningfully higher with 8-bit precision, so the extra inference time is pure loss.
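A first-order cost model (in the spirit of the original speculative decoding analysis by Leviathan et al.) captures both effects at once: acceptance rate drives the gain, draft cost drives the loss. The parameter values below are illustrative, not fitted to our measurements.

```python
def expected_speedup(alpha: float, n_draft: int, draft_cost: float) -> float:
    """First-order speculative decoding speedup model.

    alpha:      per-token probability the target accepts a draft token
    n_draft:    draft tokens proposed per verification cycle
    draft_cost: one draft forward pass as a fraction of one target pass
    """
    # Expected tokens per cycle: the accepted run plus the target's own token.
    tokens_per_cycle = (1 - alpha ** (n_draft + 1)) / (1 - alpha)
    # Cycle cost in target-pass units: n_draft drafts + one verification pass.
    cycle_cost = n_draft * draft_cost + 1
    return tokens_per_cycle / cycle_cost

# Predictable reasoning with a cheap (4-bit) draft: a clear win.
win = expected_speedup(alpha=0.85, n_draft=5, draft_cost=0.05)
# Same acceptance rate, draft twice as slow (8-bit): the gain shrinks.
slower_draft = expected_speedup(alpha=0.85, n_draft=5, draft_cost=0.10)
# Hard-to-predict output (low alpha): below 1.0, i.e., a net slowdown.
loss = expected_speedup(alpha=0.10, n_draft=5, draft_cost=0.05)
```

The model makes the verdict intuitive: doubling draft cost only subtracts from the speedup, while acceptance rate barely moves with draft precision.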
Verdict: The 4-bit draft model is the only viable option. The 8-bit model's extra precision buys nothing useful and costs real wall-clock time.
The Memory Budget
The 397B model occupies 416GB of our 512GB unified memory, leaving 96GB of headroom. Both draft models are trivially small: 1.5GB for 4-bit, 3GB for 8-bit. Memory isn't the constraint here.
The constraint is elsewhere: you cannot run a second large model alongside the 397B. No 32B coding assistant, no 70B specialist. The Mac Studio is a one-model machine at this scale. Speculative decoding works precisely because the draft model is small enough to be negligible: the 4B draft adds less than 0.3% to total memory usage.
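The headroom arithmetic, using the numbers from the setup above:

```python
total_gb, target_gb = 512, 416
draft_gb = {"4bit": 1.5, "8bit": 3.0}
headroom_gb = total_gb - target_gb         # 96 GB left after the 397B loads
draft_share = draft_gb["4bit"] / total_gb  # under 0.3% of unified memory
```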
Production Decision
We switched production to 4bit_draft_5. Here's the reasoning:
- draft_6 is tempting (61.1 tok/s on reasoning) but destroys short tasks (8.5 tok/s, 76% slower) and produces truncated outputs. Too fragile.
- draft_5 gives us 55.8 tok/s on reasoning (44% speedup) while keeping short tasks at 30.7 tok/s (only 13% slower). That's a trade we'll take.
- draft_4 is more conservative โ 49.1 tok/s reasoning, 29.0 tok/s short. Less upside, similar downside.
- draft_3 barely helps anywhere. Not worth the complexity.
The real production answer is that we need workload-aware routing: spec decode for reasoning-heavy tasks, no draft for short structured output and tool calls. That's a future optimization. For now, draft_5 is the best single compromise.
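A minimal sketch of what that routing could look like: a per-workload config table with a no-draft fallback. The workload tags match the benchmark categories; everything else (the config dict shape, where the tag comes from) is hypothetical.

```python
# Hypothetical workload-aware routing table. No-draft is the safe default,
# since the worst spec-decode regressions hit short/structured output.
ROUTES = {
    "long_reasoning":          {"draft_model": "4bit", "num_draft_tokens": 5},
    "short_cron":              {"draft_model": None},
    "large_context_prose":     {"draft_model": None},
    "large_context_tool_call": {"draft_model": None},
}

def route(workload: str) -> dict:
    """Pick a decode config per request; unknown workloads get no draft."""
    return ROUTES.get(workload, {"draft_model": None})
```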
A Note on the Tool Call Anomaly
The no_draft baseline for tool calls showed high variance: 4.44 tok/s on rep 1, then 30.5 tok/s on rep 2 (which became the "best" score). The 30.5 figure may reflect KV cache warmth or an outlier. Every draft configuration consistently measured ~3.6 tok/s across all reps with no such variance. The absolute takeaway stands: speculative decoding does not help tool calls, and likely hurts them. The relative magnitude of the slowdown (whether it's 0.12× or 0.82×) depends on which baseline you trust.
Key Takeaways
- Speculative decoding is not a universal speedup. It's a workload-dependent optimization that can be anywhere from +58% to -88% depending on output type.
- Output length and predictability matter more than context length. Long, predictable reasoning chains benefit enormously. Short, structured outputs are hurt.
- 4-bit draft models beat 8-bit. Draft speed matters more than draft precision. Use the smallest viable draft model.
- Large KV contexts penalize speculation. With 20k+ token contexts, the draft model's predictions are too divergent from the target to be useful.
- The real win is workload-aware routing. The optimal config depends on what you're generating. A static config is always a compromise.
Methodology Notes
- All tok/s figures = `completion_tokens / total_elapsed` (includes TTFT, no streaming)
- Each cell = best of 2 reps
- Large context prompts use the real OpenClaw system prompt (~20k tokens)
- Server fully restarted between configs (launchd bootout/bootstrap cycle, 180s load wait)
- The 397B model was the only loaded model for no_draft runs; draft model was co-loaded for all draft configs
- Raw data: `~/bench_results/spec_bench_20260412_060405.json`