GLM-5.2 Optimization: Prefill-Step-Size Tuning & Spec-Decode Blockers
mlx_lm.server invocation — raising --prefill-step-size from 256 to 2048 — gave us a ~5× prefill throughput improvement (36 → 179 tok/s on a 4.8K prompt). Speculative decoding with a draft model is blocked: GLM-5 uses a 154,820-vocab Zhipu tokenizer that no small MLX-format model shares, making it a dead path until Zhipu releases a matching draft. Decision: Deployed the step-size bump to production (m3u-glm52-8026). No further speed optimizations are practical on the current MLX serving path.
- The one-flag change that worked:
--prefill-step-size 2048. - Benchmark before/after numbers.
- Why speculative decoding is blocked for GLM-5 on MLX.
- Other optimization paths we investigated and rejected.
- What we'd need to unlock the next tier of performance.
Background
We deployed GLM-5.2 MXFP4 (~368 GB, 76 shards) on an M3 Ultra Mac Studio (512 GB) following a separate recipe post. The serving stack is:
Hermes / OpenAI client → :8026 (curl-backed strip-model proxy) → :8025 (mlx_lm.server)
The initial server command used the default-ish --prefill-step-size 256, meaning the server
processed 256 prompt tokens per attention pass. For a model this large — 78 layers, 64 KV heads,
MoE with 10 active experts — each pass is memory-bandwidth-bound. Processing the prompt in
256-token bites meant many passes over the full 368 GB of weights.
The original recipe post measured ~36 tok/s on a prefill-heavy 4.5K prompt (256 per step). Extrapolated to a 200K agent transcript: ~29 minutes just to ingest the prompt.
The Change: --prefill-step-size 2048
mlx_lm.server has a --prefill-step-size N flag that controls how many prompt
tokens are processed in each prefill step. The tradeoff is straightforward:
- Larger step size = fewer passes over the weights = higher prefill throughput (up to the point where the KV cache allocation for that step exceeds available memory).
- Smaller step size = more passes, but less memory per pass and lower latency-to-first-token for very short prompts.
Given the M3 Ultra has 512 GB and the model is ~368-395 GB in memory, we had room to increase the
step by 8×. Updated com.milo.glm52-mlx-server.plist:
--prefill-step-size 2048
Benchmark: Before vs After
All measurements were taken against 127.0.0.1:8025 (direct, no proxy), temperature 0.0, after a fresh server restart. The "before" number for prefill rate is reconstructed from the old server logs (documented in the recipe post).
| Measurement | Before (step=256) | After (step=2048) | Speedup |
|---|---|---|---|
| Cold first inference (7-token prompt) | ~3.5 s | 3.4 s | ≈ 1× |
| Warm decode | ~18 tok/s | ~3-5 tok/s | — |
| Prefill throughput (4.8K prompt, first pass) | ~36 tok/s | 179 tok/s | ~5× |
| Prefill throughput (4.8K prompt, cached, second call) | — | 4,806 tok/s | near-instant |
The headline number: 179 tok/s prefill throughput, up from ~36 tok/s. That is 5× faster. On a 50K-token agent transcript, the prefill drops from ~23 minutes to ~4.7 minutes.
Prompt Cache Performance
The server is configured with --prompt-cache-size 2 --prompt-cache-bytes 2147483648
(2 concurrent cached sequences, 2 GB max). When the same long prompt was sent twice, the second
call took only 1.9 seconds (4,806 prompt tokens cached). This confirms the
prompt cache is working correctly and is a substantial win for repeated queries — e.g., the
same system prompt repeated across Terminal-Bench tasks.
What Didn't Work: Speculative Decoding
MLX's mlx_lm.server supports a --draft-model flag for speculative
decoding (spec-decode). The idea: run a small "draft" model to predict the next N tokens, then
have the large model verify them in a single forward pass — ideally speeding up decode without
losing accuracy.
For speculative decoding to work, the draft model must share the same tokenizer as the target model. A vocab mismatch means the draft's predictions don't map to the target's token IDs — acceptance rate drops to essentially zero and the overhead of running the draft makes things slower, not faster.
GLM-5 uses the Zhipu/GLM tokenizer with a vocabulary of 154,820 tokens
and special tokens like [gMASK] and <sop>. No small MLX-format model
shares this vocabulary. The candidates we evaluated:
| Candidate | Token Count | Tokenizer Family | Match? |
|---|---|---|---|
| Qwen3.5-35B-A3B-4bit (MLX, on disk) | 152,064 | Qwen | No |
| Llama-3.2-3B-Instruct-8bit (MLX, on disk) | 128,256 | Llama | No |
| parakeet-tdt-0.6b-v3 (on disk) | 8,192 | NVIDIA NeMo audio | No (speech model) |
| Any Qwen2.5 draft model | 151,936 | Qwen2.5 | No |
| Any Gemma / Mistral / DeepSeek draft | — | Different | No |
The parakeet-tdt-0.6b-v3 looked promising by name ("TDT" = trained draft model for
speculative decoding), but it is an audio speech recognition model (Conformer
encoder + RNNT decoder) with an 8,192-vocabulary — completely unsuitable as a text draft model.
Bottom line: Speculative decoding is blocked until Zhipu (or a third party) releases a small model using the GLM tokenizer in MLX format. Since GLM-5 itself has only been available for a few weeks and its MLX port is community-maintained, this is unlikely in the near term.
Other Paths Investigated
DSA (DeepSeek Sparse Attention)
GLM-5.2 uses the glm_moe_dsa arch, which supports DSA-style sparse attention.
However, MLX does not have Metal GPU kernels for DSA — the architecture name is inherited from
the model's training configuration, but MLX falls back to dense attention. Even if it worked,
DSA primarily benefits long-context prefill, and the prefill bottleneck is now largely addressed
by the step-size change. We did not investigate further.
llama.cpp
llama.cpp supports GLM-4 (via LLM_ARCH_GLM4_MOE) but we did not test GLM-5.2 on it
because the model would need a GGUF conversion. For the largest available GLM-5 GGUF quant
(DQ3, ~3.55 bits per weight), the file would be approximately 250 GB. The M3 Ultra has 512 GB
unified memory, so a llama.cpp load is technically possible, but:
- We don't have a GLM-5 GGUF on disk (~350 GB download for a speculative experiment).
- llama.cpp's DSA support only covers DeepSeek-V3.2, not the
GLM4_MOEarch. - The only theoretical advantage would be access to llama.cpp's draft model infrastructure — but the same tokenizer mismatch blocks it.
Verdict: not worth the download cost.
rapid-mlx (vllm_mlx)
rapid-mlx is faster than mlx_lm.server for agentic workloads (documented
at ~53 tok/s for Qwen3-Coder-Next), but it uses a different inference engine that would require
an additional MLX conversion pass for GLM-5.2, and the model would need to fit within rapid-mlx's
managed memory scheme. We did not test this path — the step-size improvement already closed most
of the prefill gap, and the remaining bottleneck (decode) is set by hardware bandwidth.
Updated Server Configuration
The current launchd plist (com.milo.glm52-mlx-server.plist) now runs:
/usr/bin/time -l python -m mlx_lm.server \
--model ~/models/GLM-5.2-mxfp4 \
--trust-remote-code \
--host 127.0.0.1 --port 8025 \
--max-tokens 2048 \
--prompt-cache-size 2 \
--prompt-cache-bytes 2147483648 \
--prefill-step-size 2048 \
--chat-template-args '{"enable_thinking":false,"reasoning_effort":null}' \
--temp 0.0 \
--log-level INFO
What's Next
- Zhipu releases a small GLM-tokenizer model: If and when a ~0.5B-3B model with the GLM vocabulary becomes available in MLX format, speculative decoding becomes viable. This would be the biggest remaining unlock for GLM-5.2 — potentially 2-3× decode speedup.
- MLX adds Metal DSA kernels: Unlikely in the near term, but would help especially at very long contexts (>64K tokens).
- Atlas inference engine: Atlas (Pure Rust) could theoretically serve GLM-5.2 with a different prefill/decode balance, but the model would need a SAFE-tensor-to-Atlas conversion. Not on the roadmap.
- Live monitoring: The server now runs under launchd with auto-restart. The proxy at
:8026is wired as them3u-glm52-8026Hermes provider (non-default). Smoke tests pass.