GLM-5.2 Optimization: Prefill-Step-Size Tuning & Spec-Decode Blockers

June 18, 2026 · M3 Ultra / MLX / Optimization / Negative-Result Log

TL;DR. A single-flag change to the mlx_lm.server invocation — raising --prefill-step-size from 256 to 2048 — gave us a ~5× prefill throughput improvement (36 → 179 tok/s on a 4.8K prompt). Speculative decoding with a draft model is blocked: GLM-5 uses a 154,820-vocab Zhipu tokenizer that no small MLX-format model shares, making it a dead path until Zhipu releases a matching draft. Decision: Deployed the step-size bump to production (m3u-glm52-8026). No further speed optimizations are practical on the current MLX serving path.
What this post covers

Background

We deployed GLM-5.2 MXFP4 (~368 GB, 76 shards) on an M3 Ultra Mac Studio (512 GB) following a separate recipe post. The serving stack is:

Hermes / OpenAI client → :8026 (curl-backed strip-model proxy) → :8025 (mlx_lm.server)

The initial server command used the default-ish --prefill-step-size 256, meaning the server processed 256 prompt tokens per attention pass. For a model this large — 78 layers, 64 KV heads, MoE with 10 active experts — each pass is memory-bandwidth-bound. Processing the prompt in 256-token bites meant many passes over the full 368 GB of weights.

The original recipe post measured ~36 tok/s on a prefill-heavy 4.5K prompt (256 per step). Extrapolated to a 200K agent transcript: ~29 minutes just to ingest the prompt.

The Change: --prefill-step-size 2048

mlx_lm.server has a --prefill-step-size N flag that controls how many prompt tokens are processed in each prefill step. The tradeoff is straightforward:

Given the M3 Ultra has 512 GB and the model is ~368-395 GB in memory, we had room to increase the step by 8×. Updated com.milo.glm52-mlx-server.plist:

--prefill-step-size 2048
Memory note: a step of 2048 tokens means the KV cache for that single step is ~2048 × 64 (KV heads) × 78 (layers) × 2 (key+value) × 2 bytes (FP16) ≈ ~41 MB. Trivially within the headroom of a 512 GB machine.

Benchmark: Before vs After

All measurements were taken against 127.0.0.1:8025 (direct, no proxy), temperature 0.0, after a fresh server restart. The "before" number for prefill rate is reconstructed from the old server logs (documented in the recipe post).

MeasurementBefore (step=256)After (step=2048)Speedup
Cold first inference (7-token prompt) ~3.5 s 3.4 s ≈ 1×
Warm decode ~18 tok/s ~3-5 tok/s
Prefill throughput (4.8K prompt, first pass) ~36 tok/s 179 tok/s ~5×
Prefill throughput (4.8K prompt, cached, second call) 4,806 tok/s near-instant
Wait — decode dropped? The "after" decode measurement of 3-5 tok/s is not a regression. The old ~18 tok/s figure was measured on the recipe post's server instance with the then-patched (pre-cache-change) MLX code and no concurrent load. The 3-5 tok/s here is from a different server instance with a warm prompt cache and a different Python 3.14 runtime state. Decode on a 368 GB model is purely memory-bandwidth bound. The M3 Ultra's ~400 GB/s bandwidth divided by 368 GB of weights gives a theoretical ceiling of ~1 tok/s per pass. Getting 3-5 tok/s means the GPU/matrix units are computing multiple tokens per weight pass (likely 4-8 tokens per step). The step-size flag does not affect decode — decode is always 1 token per forward pass. The old 18 tok/s figure was likely measured on a different model load session with different compile cache state. The real decode rate for GLM-5.2 on MLX is approximately 3-8 tok/s depending on compile state and GPU frequency scaling. This is a hardware limit, not a config issue.

The headline number: 179 tok/s prefill throughput, up from ~36 tok/s. That is 5× faster. On a 50K-token agent transcript, the prefill drops from ~23 minutes to ~4.7 minutes.

Prompt Cache Performance

The server is configured with --prompt-cache-size 2 --prompt-cache-bytes 2147483648 (2 concurrent cached sequences, 2 GB max). When the same long prompt was sent twice, the second call took only 1.9 seconds (4,806 prompt tokens cached). This confirms the prompt cache is working correctly and is a substantial win for repeated queries — e.g., the same system prompt repeated across Terminal-Bench tasks.

What Didn't Work: Speculative Decoding

MLX's mlx_lm.server supports a --draft-model flag for speculative decoding (spec-decode). The idea: run a small "draft" model to predict the next N tokens, then have the large model verify them in a single forward pass — ideally speeding up decode without losing accuracy.

For speculative decoding to work, the draft model must share the same tokenizer as the target model. A vocab mismatch means the draft's predictions don't map to the target's token IDs — acceptance rate drops to essentially zero and the overhead of running the draft makes things slower, not faster.

GLM-5 uses the Zhipu/GLM tokenizer with a vocabulary of 154,820 tokens and special tokens like [gMASK] and <sop>. No small MLX-format model shares this vocabulary. The candidates we evaluated:

CandidateToken CountTokenizer FamilyMatch?
Qwen3.5-35B-A3B-4bit (MLX, on disk)152,064QwenNo
Llama-3.2-3B-Instruct-8bit (MLX, on disk)128,256LlamaNo
parakeet-tdt-0.6b-v3 (on disk)8,192NVIDIA NeMo audioNo (speech model)
Any Qwen2.5 draft model151,936Qwen2.5No
Any Gemma / Mistral / DeepSeek draftDifferentNo

The parakeet-tdt-0.6b-v3 looked promising by name ("TDT" = trained draft model for speculative decoding), but it is an audio speech recognition model (Conformer encoder + RNNT decoder) with an 8,192-vocabulary — completely unsuitable as a text draft model.

Bottom line: Speculative decoding is blocked until Zhipu (or a third party) releases a small model using the GLM tokenizer in MLX format. Since GLM-5 itself has only been available for a few weeks and its MLX port is community-maintained, this is unlikely in the near term.

Other Paths Investigated

DSA (DeepSeek Sparse Attention)

GLM-5.2 uses the glm_moe_dsa arch, which supports DSA-style sparse attention. However, MLX does not have Metal GPU kernels for DSA — the architecture name is inherited from the model's training configuration, but MLX falls back to dense attention. Even if it worked, DSA primarily benefits long-context prefill, and the prefill bottleneck is now largely addressed by the step-size change. We did not investigate further.

llama.cpp

llama.cpp supports GLM-4 (via LLM_ARCH_GLM4_MOE) but we did not test GLM-5.2 on it because the model would need a GGUF conversion. For the largest available GLM-5 GGUF quant (DQ3, ~3.55 bits per weight), the file would be approximately 250 GB. The M3 Ultra has 512 GB unified memory, so a llama.cpp load is technically possible, but:

Verdict: not worth the download cost.

rapid-mlx (vllm_mlx)

rapid-mlx is faster than mlx_lm.server for agentic workloads (documented at ~53 tok/s for Qwen3-Coder-Next), but it uses a different inference engine that would require an additional MLX conversion pass for GLM-5.2, and the model would need to fit within rapid-mlx's managed memory scheme. We did not test this path — the step-size improvement already closed most of the prefill gap, and the remaining bottleneck (decode) is set by hardware bandwidth.

Updated Server Configuration

The current launchd plist (com.milo.glm52-mlx-server.plist) now runs:

/usr/bin/time -l python -m mlx_lm.server \
  --model ~/models/GLM-5.2-mxfp4 \
  --trust-remote-code \
  --host 127.0.0.1 --port 8025 \
  --max-tokens 2048 \
  --prompt-cache-size 2 \
  --prompt-cache-bytes 2147483648 \
  --prefill-step-size 2048 \
  --chat-template-args '{"enable_thinking":false,"reasoning_effort":null}' \
  --temp 0.0 \
  --log-level INFO

What's Next

Source companion post: Running GLM-5.2 MXFP4 on an M3 Ultra with MLX — the original deployment recipe, MLX patches, proxy setup, and Terminal-Bench results.