June 18, 2026 — Session KV cache, Prompt Lookup Decoding, and 5× decode improvement vs mlx_lm.server
This post documents the third phase of the GLM-5.2 experiment on our M3 Ultra (512 GB). After getting the model serving and tuning prefill-step-size, we needed a real serving infrastructure — session KV caching, faster decode, and production-grade lifecycle management.
Enter soloheaven: an MLX-native inference server built by the Hermes Agent team that wraps mlx_lm with session KV caching, Prompt Lookup Decoding (PLD), and process-managed memory budgets.
--memory-budget-gb 10 controls cache memory without OOM risk.
Measured via the soloheaven API on port 8025, warm model. The smallest case includes Metal kernel compilation overhead.
| Prompt Size | Prompt Tokens | Time | Rate |
|---|---|---|---|
| Tiny (10 words) | 27 | 6.4 s | 4 tok/s (cold startup) |
| Short (100 words) | 207 | 1.9 s | 111 tok/s |
| Medium (500 words) | 1,007 | 5.3 s | 191 tok/s |
| Long (2,000 words) | 4,007 | 22.3 s | 180 tok/s |
The 6.4 s for the first tiny request is not representative — it includes Metal GPU kernel compilation and model warmup. Warm prefill settles at 180–191 tok/s, consistent with the prefill-step-size 2048 results from our optimization post.
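As a sanity check on the table above, the rates can be recomputed from the token counts and the rounded wall-clock times. This is only arithmetic on the published numbers; the helper name `prefill_rate` is made up for illustration. Recomputing from rounded times gives about 109 tok/s for the short prompt and 190 tok/s for the medium one, slightly below the reported 111 and 191, presumably because the reported times are rounded.

```python
# Rows copied from the prefill table: (label, prompt tokens, seconds).
PREFILL_ROWS = [
    ("Tiny", 27, 6.4),
    ("Short", 207, 1.9),
    ("Medium", 1007, 5.3),
    ("Long", 4007, 22.3),
]


def prefill_rate(tokens: int, seconds: float) -> float:
    """Prompt tokens processed per second of wall-clock time."""
    return tokens / seconds


rates = {label: prefill_rate(tok, sec) for label, tok, sec in PREFILL_ROWS}

for label, rate in rates.items():
    print(f"{label:<6} {rate:6.1f} tok/s")
```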
| Generation Length | Time | Throughput |
|---|---|---|
| 50 tokens | 2.9 s | 17.1 tok/s |
| 100 tokens | 5.5 s | 18.2 tok/s |
| 200 tokens | 10.7 s | 18.7 tok/s |
Stable ~18 tok/s decode, a 3–5× improvement over stock mlx_lm.server's 3–8 tok/s. This is entirely the PLD benefit: soloheaven guesses upcoming tokens from the existing context and validates them, so each accepted guess avoids a separate full pass over the weights.
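The guess-and-validate loop described above can be sketched in a few lines. This is a generic illustration of Prompt Lookup Decoding, not soloheaven's code: the function names, the n-gram size, and the draft length are all assumptions, and `target_next` stands in for the real model. A real implementation would verify the whole draft in one batched forward pass; the loop here checks tokens one at a time only to keep the logic readable.

```python
from typing import Callable, List, Sequence


def find_draft(tokens: Sequence[int], ngram: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by locating the most recent earlier occurrence
    of the trailing n-gram and copying what followed it."""
    if len(tokens) <= ngram:
        return []
    tail = list(tokens[-ngram:])
    # Scan right to left, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            follow = start + ngram
            return list(tokens[follow:follow + max_draft])
    return []


def accept_draft(
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
    context: Sequence[int],
) -> List[int]:
    """Keep the longest draft prefix the target model agrees with.
    On the first mismatch, emit the model's own token and stop."""
    accepted: List[int] = []
    ctx = list(context)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    return accepted


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
    print(find_draft(history, ngram=3, max_draft=3))  # [4, 5, 6]
```

The technique pays off most when the output repeats spans of the prompt (code edits, quoted terminal output, tool results), since that is where the trailing n-gram is likely to have an earlier match.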
| Call | Action | Prompt Tokens | Cached | New | Time |
|---|---|---|---|---|---|
| 1 | Build cache (explain TP) | 20 | 0 | 20 | 5.6 s |
| 2 | Extend with follow-up | 131 | 120 | 11 | 5.6 s |
| 3 | New follow-up (prefix match) | 135 | 0 (evicted) | 135 | 4.3 s |
The session cache works — 120 of 131 prompt tokens (91%) reused from the previous conversation turn. The third call was a cache miss, likely due to eviction under the 10 GB budget (~18 tokens/GB for this 368 GB model; the M3 Ultra's memory bandwidth makes cache size tradeoffs steep).
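The Cached and New columns follow from a longest-common-prefix match between what the session already holds and the incoming prompt. The sketch below shows that bookkeeping on plain token lists; it is an illustration under that assumption, not soloheaven's cache code, and both function names are hypothetical. The token values are placeholders chosen only to reproduce the counts from calls 2 and 3.

```python
from typing import Sequence, Tuple


def split_on_cached_prefix(cached: Sequence[int], prompt: Sequence[int]) -> Tuple[int, int]:
    """Return (reused, new) token counts for a prompt against one cached sequence."""
    reused = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        reused += 1
    return reused, len(prompt) - reused


def hit_rate(reused: int, prompt_tokens: int) -> float:
    """Fraction of the prompt served from cache."""
    return reused / prompt_tokens if prompt_tokens else 0.0


# Call 2: 120 cached tokens, 11 new ones appended.
cached = list(range(120))
prompt = cached + list(range(900, 911))
reused, new = split_on_cached_prefix(cached, prompt)
print(reused, new, f"{hit_rate(reused, len(prompt)):.3f}")

# Call 3: the entry was evicted, so nothing can be reused.
print(split_on_cached_prefix([], list(range(135))))
```

For call 2 this gives 120 reused and 11 new tokens, a hit rate of about 0.916. After eviction the whole 135-token prompt has to be prefilled again.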
The first API call after soloheaven starts (or after a period of inactivity) takes ~6 s regardless of prompt size — this is the Metal kernel compilation and GPU state warmup. Subsequent requests respond in 1–3 s for short prompts.
Correction: these gates were shallow. GLM-5.2 on soloheaven passed simple load, completion, tool-call, and short roundtrip tests, but those did not predict sustained agent behavior. A later Terminal-Bench run reproduced cache-associated degeneration on multi-turn transcripts with long terminal output.
| Gate | Result | Details |
|---|---|---|
| Model load | PASS | strict=False patch applied |
| Inference quality | PASS | No MXFP4 garbage, clean output |
| Native tool calls | PASS | get_weather({"location":"Tokyo","units":"celsius"}) |
| Tool-result roundtrip | PASS | Accepts tool output, produces answer |
| Proxy normalization | PASS | Model name stripped and restored |
| Session KV cache | PASS | 91% cache-hit rate on extended conversations |
strict=False
GLM-5.2's model.safetensors contains Indexer projection weights for all 78 attention layers, but soloheaven's deepseek_v32.py model constructor uses the use_indexer logic from config.json, so only 21 of 78 layers get Indexer objects. The remaining 57 × 5 = 285 weight entries are harmless orphans that don't map to any model parameter.
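The orphan count is a set difference between checkpoint keys and model parameter names. The sketch below reproduces the 285 figure under loudly stated assumptions: the key naming scheme, the five weight names, and the choice of which 21 layers carry an Indexer are all made up for illustration, and only the counts (78, 21, 5) come from the text above.

```python
NUM_LAYERS = 78          # attention layers in the checkpoint (from the post)
INDEXER_LAYERS = 21      # layers that get an Indexer object (from the post)
WEIGHTS_PER_INDEXER = 5  # implied by the post's 57 x 5 = 285

# Hypothetical weight names and layer selection, for illustration only.
INDEXER_WEIGHTS = [f"w{i}" for i in range(WEIGHTS_PER_INDEXER)]
layers_with_indexer = set(range(INDEXER_LAYERS))


def key(layer: int, name: str) -> str:
    """Build a hypothetical checkpoint key for one Indexer weight."""
    return f"layers.{layer}.indexer.{name}"


checkpoint_keys = {key(l, n) for l in range(NUM_LAYERS) for n in INDEXER_WEIGHTS}
model_keys = {key(l, n) for l in layers_with_indexer for n in INDEXER_WEIGHTS}

orphans = checkpoint_keys - model_keys   # present in the file, unused by the model
missing = model_keys - checkpoint_keys   # expected by the model, absent from the file

print(len(orphans), len(missing))  # 285 0
```

A strict load treats any nonempty `orphans` set as an error. A non-strict load ignores those keys, which is safe here only because `missing` is empty.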
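The fix is a small monkey-patch in mlx_engine.py:
# In soloheaven's mlx_engine.py, before the MLXEngine class:
import mlx_lm.utils as _mlx_lm_utils
if not hasattr(_mlx_lm_utils, '_echo_patched'):
    # Capture the original first: looking up _mlx_lm_utils.load_model inside
    # the wrapper would resolve to the wrapper itself once it is installed.
    _orig_lm_load_model = _mlx_lm_utils.load_model
    def _patched_lm_load_model(*a, **kw):
        kw['strict'] = False  # tolerate checkpoint keys with no matching parameter
        return _orig_lm_load_model(*a, **kw)
    _mlx_lm_utils.load_model = _patched_lm_load_model
    _mlx_lm_utils._echo_patched = True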
This runs at import time, so even soloheaven's process_worker child process inherits the patched function.
Both soloheaven and the strip-model proxy run as launchd agents for automatic startup and crash recovery:
com.echo.glm52-soloheaven.plist
<key>ProgramArguments</key>
<array>
  <string>python</string>
  <string>-m</string><string>mlx_soloheaven</string>
  <string>--model</string><string>~/models/GLM-5.2-mxfp4</string>
  <string>--port</string><string>8025</string>
  <string>--host</string><string>127.0.0.1</string>
  <string>--no-thinking</string>
  <string>--memory-budget-gb</string><string>10</string>
  <string>--prefill-step-size</string><string>2048</string>
  <string>--gpu-keepalive</string>
</array>
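For reference, a complete job file around that fragment could be generated with Python's standard plistlib. This is a sketch, not the plist actually deployed: the Label is assumed to match the filename, and RunAtLoad and KeepAlive are assumed settings for automatic startup and crash recovery. The argument list is copied verbatim from the fragment above.

```python
import plistlib

job = {
    "Label": "com.echo.glm52-soloheaven",  # assumption: label mirrors the filename
    "ProgramArguments": [
        "python", "-m", "mlx_soloheaven",
        "--model", "~/models/GLM-5.2-mxfp4",
        "--port", "8025",
        "--host", "127.0.0.1",
        "--no-thinking",
        "--memory-budget-gb", "10",
        "--prefill-step-size", "2048",
        "--gpu-keepalive",
    ],
    "RunAtLoad": True,  # assumption: start when the agent is loaded
    "KeepAlive": True,  # assumption: restart after a crash
}

# plistlib.dumps emits XML by default.
xml_bytes = plistlib.dumps(job)
print(xml_bytes.decode("utf-8"))
```

launchd does not run ProgramArguments through a shell, so a real deployment likely needs absolute paths in place of `~` and the bare `python`.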
com.milo.glm52-strip-proxy.plist
A threaded HTTP server on 0.0.0.0:8026 strips the model field from incoming requests (since soloheaven rejects unknown model names) and normalizes it back in responses. It uses curl subprocess forwarding to avoid the urllib hang issues seen with mlx_lm.server.
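The strip-and-restore step is simple enough to show on its own. The sketch below covers only the JSON rewriting, not the HTTP server or the curl forwarding, and it is not the contents of glm52_strip_model_proxy_threaded.py. Both function names are hypothetical, and the response shape in the usage lines is an assumed OpenAI-style body.

```python
import json
from typing import Optional, Tuple


def strip_model(request_body: bytes) -> Tuple[bytes, Optional[str]]:
    """Remove the "model" field from a request body and return it separately."""
    payload = json.loads(request_body)
    model = payload.pop("model", None)
    return json.dumps(payload).encode("utf-8"), model


def restore_model(response_body: bytes, model: Optional[str]) -> bytes:
    """Write the client's original model name back into the response body."""
    payload = json.loads(response_body)
    if model is not None:
        payload["model"] = model
    return json.dumps(payload).encode("utf-8")


request = b'{"model":"glm-5.2","messages":[{"role":"user","content":"Hello"}]}'
upstream_body, requested_model = strip_model(request)

# Stand-in for the upstream reply; the real one comes from soloheaven.
upstream_reply = b'{"model":"internal-name","choices":[]}'
client_reply = restore_model(upstream_reply, requested_model)
print(client_reply.decode("utf-8"))
```

Keeping the rewrite in two pure functions makes it easy to test without a running server, and leaves the transport choice (curl subprocess, in this setup) independent of the normalization logic.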
# 1. Install soloheaven
git clone https://github.com/nousresearch/hermes ~/mlx-soloheaven
# 2. Create venv & install
python3 -m venv ~/venvs/mlx-soloheaven
source ~/venvs/mlx-soloheaven/bin/activate
cd ~/mlx-soloheaven
pip install -e .
# 3. Apply strict=False patch (see above)
# 4. Start server
mlx-soloheaven \
  --model ~/models/GLM-5.2-mxfp4 \
  --port 8025 \
  --host 127.0.0.1 \
  --no-thinking \
  --memory-budget-gb 10 \
  --prefill-step-size 2048 \
  --gpu-keepalive
# 5. Start proxy (separate terminal or launchd)
python3 ~/scripts/glm52_strip_model_proxy_threaded.py
# 6. Test
curl http://127.0.0.1:8026/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"glm-5.2","messages":[{"role":"user","content":"Hello"}]}'
| Metric | mlx_lm.server | soloheaven | Δ |
|---|---|---|---|
| Decode throughput | 3–8 tok/s | ~18 tok/s | +3–5× |
| Session KV cache | ❌ Stateless (prompt-cache only) | ✅ Disk-backed session cache | 91% hit rate |
| Model load (cold) | ~60 s | ~60 s | ≈ |
| First API call | ~6 s (kernel warmup) | ~6 s | ≈ |
| Tool calls | ✅ | ✅ | ≈ |
| Crash recovery | launchd | launchd + KeepAlive | ≈ |
| Memory management | Manual prompt-cache-bytes | Auto budget with disk overflow | ✅ |
Bottom line: soloheaven is useful as a fast short-form GLM-5.2 lab server, but I no longer recommend it for agentic GLM-5.2 workloads. The stock mlx_lm.server path completed Terminal-Bench; soloheaven's session/cache path did not. Keep it out of published agent benchmarks until that cache behavior is understood and fixed.
© 2026 al-engr.com — James Meadlock. All data from live M3 Ultra measurements.