
Running GLM-5.2 MXFP4 on an M3 Ultra with soloheaven

June 18, 2026. Session KV cache, Prompt Lookup Decoding, and a roughly 2–6× decode improvement over mlx_lm.server

This post documents the third phase of the GLM-5.2 experiment on our M3 Ultra (512 GB). After getting the model serving and tuning prefill-step-size, we needed a real serving infrastructure — session KV caching, faster decode, and production-grade lifecycle management.

Enter soloheaven: an MLX-native inference server built by the Hermes Agent team that wraps mlx_lm with session KV caching, Prompt Lookup Decoding (PLD), and process-managed memory budgets.

Update, June 19: the earlier verdict was too optimistic. Soloheaven still gives ~18 tok/s short-form decode and passes simple tool-call gates, but it failed Terminal-Bench-style multi-turn agent transcripts. Once long terminal output entered the conversation, responses degraded into numeric garbage while returning HTTP 200. Treat this as not agent-ready for GLM-5.2 until the session KV/cache path is disabled or fixed.

What Soloheaven Adds

Session KV Cache: A disk-backed key-value cache persists across requests. In our test, 91% of prompt tokens (120 of 131) were served from cache on the follow-up call, so only the new tokens needed prefill.
Prompt Lookup Decoding (PLD): Drafts tokens during autoregressive decode by matching n-grams already present in the context, then verifies them with the model. GLM-5.2 went from 3–8 tok/s to a stable ~18 tok/s on generations of 50–200 tokens.
Memory Budget Management: An explicit --memory-budget-gb 10 caps cache memory to reduce OOM risk.
Process-Mode Architecture: Optional child-process model loading for crash isolation.
JSON Schema Masking: Structured output via schema constraints.

Benchmarks

Prefill Speed (prefill-step-size 2048)

Measured via the soloheaven API on port 8025 with the model already loaded. The smallest case was the first request after startup, so it includes Metal kernel compilation overhead.

Prompt Size	Prompt Tokens	Time	Rate
Tiny (10 words)	27	6.4 s	4 tok/s (cold startup)
Short (100 words)	207	1.9 s	111 tok/s
Medium (500 words)	1,007	5.3 s	191 tok/s
Long (2,000 words)	4,007	22.3 s	180 tok/s

The 6.4 s for the first tiny request is not representative — it includes Metal GPU kernel compilation and model warmup. Warm prefill settles at 180–191 tok/s, consistent with the prefill-step-size 2048 results from our optimization post.

Decode Speed

Generation Length	Time	Throughput
50 tokens	2.9 s	17.1 tok/s
100 tokens	5.5 s	18.2 tok/s
200 tokens	10.7 s	18.7 tok/s

Decode holds at ~18 tok/s, roughly 2–6× stock mlx_lm.server's 3–8 tok/s (18/8 ≈ 2.3, 18/3 = 6). We attribute the gain to PLD: soloheaven drafts upcoming tokens by matching n-grams in the existing context and then verifies the draft with the model, so each accepted token avoids a separate full pass over the weights.
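To make the drafting step concrete, here is a minimal sketch of prompt-lookup drafting and greedy verification. It is not soloheaven's implementation: the function names, the `ngram` and `max_draft` parameters, and the toy `next_token_fn` are all assumptions for illustration. A real engine verifies the whole draft in one batched forward pass; the loop below checks tokens one at a time only to keep the logic readable.

```python
from typing import Callable, List, Sequence


def pld_draft(context: Sequence[int], ngram: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by finding the most recent earlier occurrence of
    the trailing n-gram and copying the tokens that followed it."""
    if len(context) <= ngram:
        return []
    tail = list(context[-ngram:])
    # Scan right to left, skipping the trailing n-gram itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == tail:
            return list(context[start + ngram:start + ngram + max_draft])
    return []


def accept_prefix(
    draft: Sequence[int],
    next_token_fn: Callable[[List[int]], int],
    context: Sequence[int],
) -> List[int]:
    """Greedy verification: keep draft tokens while they match what the model
    would emit; on the first mismatch, emit the model's token and stop."""
    ctx = list(context)
    accepted: List[int] = []
    for tok in draft:
        expected = next_token_fn(ctx)
        if expected != tok:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    return accepted


# Toy usage: the context repeats itself, so the draft copies the earlier continuation.
history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = pld_draft(history, ngram=3, max_draft=3)
print(draft)
```

Drafting is cheap because it only searches token IDs already in the context, which is why the technique suits transcripts that repeat earlier text (code, tool output, quoted prompts) and helps less on novel prose.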

Session KV Cache

Call	Action	Prompt Tokens	Cached	New	Time
1	Build cache (explain TP)	20	0	20	5.6 s
2	Extend with follow-up	131	120	11	5.6 s
3	New follow-up (prefix match)	135	0 (evicted)	135	4.3 s

The session cache works: on the second call, 120 of 131 prompt tokens (91%) were reused from the previous conversation turn. The third call was a cache miss even though it shared a prefix, likely due to eviction under the 10 GB budget (~18 tokens/GB for this 368 GB model; the M3 Ultra's memory bandwidth makes cache size tradeoffs steep).
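The reuse numbers in the table follow from a longest-common-prefix match between the cached token sequence and the new prompt. The sketch below shows that bookkeeping under stated assumptions: the helper names are hypothetical, the token IDs are made up, and soloheaven's actual matching and eviction logic may differ.

```python
from typing import List, Sequence, Tuple


def common_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Number of leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def split_prompt(cached: Sequence[int], prompt: Sequence[int]) -> Tuple[int, List[int]]:
    """Return (reused_token_count, tokens_still_to_prefill)."""
    reused = common_prefix_len(cached, prompt)
    return reused, list(prompt[reused:])


# Call 2 from the table: 120 cached tokens, and a 131-token prompt that extends them.
cached_tokens = list(range(120))
new_prompt = cached_tokens + list(range(1000, 1011))
reused, todo = split_prompt(cached_tokens, new_prompt)
print(reused, len(todo))

# Call 3 from the table: the entry is gone, so nothing can be reused.
reused_after_eviction, todo_after_eviction = split_prompt([], list(range(135)))
print(reused_after_eviction, len(todo_after_eviction))
```

Only the tokens after the shared prefix need prefill, which is why a hit on call 2 leaves 11 new tokens and an evicted entry on call 3 forces all 135 to be processed again.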

First-Request Latency

The first API call after soloheaven starts (or after a period of inactivity) takes ~6 s regardless of prompt size — this is the Metal kernel compilation and GPU state warmup. Subsequent requests respond in 1–3 s for short prompts.

Autonomy Gates

Correction: these gates were shallow. GLM-5.2 on soloheaven passed simple load, completion, tool-call, and short roundtrip tests, but those did not predict sustained agent behavior. A later Terminal-Bench run reproduced cache-associated degeneration on multi-turn transcripts with long terminal output.

Gate	Result	Details
Model load	PASS	`strict=False` patch applied
Inference quality	PASS	No MXFP4 garbage, clean output
Native tool calls	PASS	`get_weather({"location":"Tokyo","units":"celsius"})`
Tool-result roundtrip	PASS	Accepts tool output, produces answer
Proxy normalization	PASS	Model name stripped and restored
Session KV cache	PASS	91% cache-hit rate on extended conversations
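Since the failure reported in this post returned HTTP 200 with numeric garbage, a gate that only checks status and response shape cannot catch it. One option is a content check on every response in a multi-turn run. The sketch below is a hypothetical heuristic: the function name and both thresholds are assumptions chosen for illustration, not part of soloheaven or Terminal-Bench, and would need tuning against real transcripts.

```python
import re


def looks_degenerate(text: str, max_digit_ratio: float = 0.5, min_len: int = 40) -> bool:
    """Heuristic: flag long outputs whose non-whitespace characters are mostly digits."""
    stripped = re.sub(r"\s+", "", text)
    if len(stripped) < min_len:
        # Short answers such as "42" are legitimately numeric.
        return False
    digits = sum(ch.isdigit() for ch in stripped)
    return digits / len(stripped) > max_digit_ratio


garbage = "0 1 2 3 4 5 6 7 8 9 " * 10
healthy = "The build finished successfully and all 12 tests passed without errors."
print(looks_degenerate(garbage), looks_degenerate(healthy))
```

A check like this would run on each assistant turn of a long transcript, so a run fails as soon as output degrades instead of passing on status codes alone.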

The One Patch: `strict=False`

GLM-5.2's safetensors weights contain Indexer projection weights for all 78 attention layers, but soloheaven's deepseek_v32.py model constructor follows the use_indexer logic from config.json, so only 21 of the 78 layers get Indexer objects. The remaining 57 × 5 = 285 weight entries are harmless orphans that don't map to any model parameter, and a strict load rejects them.

The fix is a short monkey-patch in mlx_engine.py:

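# In soloheaven's mlx_engine.py, before the MLXEngine class:
import mlx_lm.utils as _mlx_lm_utils

if not hasattr(_mlx_lm_utils, '_echo_patched'):
    # Capture the original now; looking it up at call time would find the
    # patched function and recurse forever.
    _orig_load_model = _mlx_lm_utils.load_model

    def _patched_lm_load_model(*args, **kwargs):
        kwargs['strict'] = False  # tolerate the 285 orphan Indexer weights
        return _orig_load_model(*args, **kwargs)

    _mlx_lm_utils.load_model = _patched_lm_load_model
    _mlx_lm_utils._echo_patched = True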

This runs at import time, so even soloheaven's process_worker child process inherits the patched function.
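A general pitfall with this style of patch is that the wrapper must call the original function object. If it looks the name up on the module again at call time, it finds itself and recurses. The sketch below demonstrates the capture-then-wrap pattern on a stand-in module; `fake_utils`, its `load_model`, and `apply_patch` are hypothetical names used so the example runs without mlx_lm installed.

```python
import types

# Stand-in for mlx_lm.utils; this load_model is a toy, not the real signature.
fake_utils = types.SimpleNamespace()
fake_utils.load_model = lambda path, strict=True: {"path": path, "strict": strict}


def apply_patch(module) -> None:
    """Force strict=False on module.load_model, at most once."""
    if getattr(module, "_echo_patched", False):
        return
    original = module.load_model  # bound now, so the wrapper never calls itself

    def patched(*args, **kwargs):
        kwargs["strict"] = False  # overrides any strict value the caller passed
        return original(*args, **kwargs)

    module.load_model = patched
    module._echo_patched = True


apply_patch(fake_utils)
apply_patch(fake_utils)  # second call is a no-op thanks to the guard
print(fake_utils.load_model("some/path"))
```

Setting `kwargs["strict"]` instead of passing `strict=False` as an extra keyword also avoids a `TypeError` when a caller already supplies `strict`.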

Deployment: launchd Service

Both soloheaven and the strip-model proxy run as launchd agents for automatic startup and crash recovery:

Soloheaven Server (`com.echo.glm52-soloheaven.plist`)

<key>ProgramArguments</key>
<array>
  <!-- launchd does no shell expansion: no PATH lookup from your shell, no ~. USERNAME is a placeholder. -->
  <string>/Users/USERNAME/venvs/mlx-soloheaven/bin/python</string>
  <string>-m</string><string>mlx_soloheaven</string>
  <string>--model</string><string>/Users/USERNAME/models/GLM-5.2-mxfp4</string>
  <string>--port</string><string>8025</string>
  <string>--host</string><string>127.0.0.1</string>
  <string>--no-thinking</string>
  <string>--memory-budget-gb</string><string>10</string>
  <string>--prefill-step-size</string><string>2048</string>
  <string>--gpu-keepalive</string>
</array>

Strip-Model Proxy (`com.milo.glm52-strip-proxy.plist`)

A threaded HTTP server on 0.0.0.0:8026 strips the model field from incoming requests (soloheaven rejects unknown model names) and restores it in responses. It forwards each request through a curl subprocess to avoid the urllib hangs seen with mlx_lm.server.
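The proxy script itself is not shown in this post, so the following is only a sketch of how such a proxy could be structured, not the contents of glm52_strip_model_proxy_threaded.py. The function and class names are hypothetical, the upstream path is assumed to match the `/v1/chat/completions` endpoint used in the test request, and the sketch ignores streaming responses and upstream error codes. It binds to 127.0.0.1 by default as a conservative choice.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Optional, Tuple

# Assumed upstream URL: soloheaven on port 8025 with an OpenAI-style path.
UPSTREAM = "http://127.0.0.1:8025/v1/chat/completions"


def strip_model(body: bytes) -> Tuple[bytes, Optional[str]]:
    """Remove the "model" field from a JSON request body and remember it."""
    payload = json.loads(body)
    model = payload.pop("model", None)
    return json.dumps(payload).encode(), model


def restore_model(body: bytes, model: Optional[str]) -> bytes:
    """Write the client's model name back into a JSON response body."""
    payload = json.loads(body)
    if model is not None:
        payload["model"] = model
    return json.dumps(payload).encode()


def forward_with_curl(body: bytes, upstream: str = UPSTREAM) -> bytes:
    """POST the body upstream through a curl subprocess (body sent on stdin)."""
    result = subprocess.run(
        ["curl", "-sS", "-X", "POST", upstream,
         "-H", "Content-Type: application/json", "--data-binary", "@-"],
        input=body, capture_output=True, check=True,
    )
    return result.stdout


class StripModelHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        stripped, model = strip_model(self.rfile.read(length))
        reply = restore_model(forward_with_curl(stripped), model)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


def serve(host: str = "127.0.0.1", port: int = 8026) -> None:
    """Run the proxy until interrupted; one thread per connection."""
    ThreadingHTTPServer((host, port), StripModelHandler).serve_forever()
```

Calling `serve()` starts the proxy. Keeping the JSON rewriting in pure functions makes the strip and restore steps easy to test without a running server.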

What Didn't Change

Model quality: Same GLM-5.2 MXFP4. Soloheaven doesn't alter weights or quantization.
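Hardware limit: Decode is still memory-bandwidth-bound. At the M3 Ultra's 819 GB/s, the 368 GB of weights can be streamed in full only about twice per second (819 / 368 ≈ 2.2), and soloheaven does not change that. PLD helps only by accepting several tokens per pass.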
Speculative decoding: Still blocked — no small MLX model exists with GLM's 154,820-vocab Zhipu tokenizer. PLD fills part of this gap.
Tool-call behavior: Simple tool calls still work, but sustained agent loops regressed under Terminal-Bench.

Migration Guide (mlx_lm.server → soloheaven)

# 1. Clone soloheaven
git clone https://github.com/nousresearch/hermes ~/mlx-soloheaven

# 2. Create venv & install
python3 -m venv ~/venvs/mlx-soloheaven
source ~/venvs/mlx-soloheaven/bin/activate
cd ~/mlx-soloheaven
pip install -e .

# 3. Apply strict=False patch (see above)

# 4. Start server
mlx-soloheaven \
  --model ~/models/GLM-5.2-mxfp4 \
  --port 8025 \
  --host 127.0.0.1 \
  --no-thinking \
  --memory-budget-gb 10 \
  --prefill-step-size 2048 \
  --gpu-keepalive

# 5. Start proxy (separate terminal or launchd)
python3 ~/scripts/glm52_strip_model_proxy_threaded.py

# 6. Test
curl http://127.0.0.1:8026/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"glm-5.2","messages":[{"role":"user","content":"Hello"}]}'

Summary

Metric	mlx_lm.server	soloheaven	Δ
Decode throughput	3–8 tok/s	~18 tok/s	~2–6×
Session KV cache	❌ Stateless (prompt-cache only)	✅ Disk-backed session cache	91% hit rate
Model load (cold)	~60 s	~60 s	≈
First API call	~6 s (kernel warmup)	~6 s	≈
Tool calls	✅	✅ simple calls only; multi-turn agent loops degraded	❌ for agents
Crash recovery	launchd	launchd + KeepAlive	≈
Memory management	Manual prompt-cache-bytes	Auto budget with disk overflow	✅

Bottom line: soloheaven is useful as a fast short-form GLM-5.2 lab server, but I no longer recommend it for agentic GLM-5.2 workloads. The stock mlx_lm.server path completed Terminal-Bench; soloheaven's session/cache path did not. Keep it out of published agent benchmarks until that cache behavior is understood and fixed.

Running GLM-5.2 MXFP4 on an M3 Ultra with MLX — Getting started
GLM-5.2 Optimization: Prefill-Step-Size Tuning — Prefill tuning and spec-decode blockers