Running GLM-5.2 MXFP4 on an M3 Ultra with soloheaven

June 18, 2026 — Session KV cache, Prompt Lookup Decoding, and 5× decode improvement vs mlx_lm.server

This post documents the third phase of the GLM-5.2 experiment on our M3 Ultra (512 GB). After getting the model serving and tuning prefill-step-size, we needed real serving infrastructure: session KV caching, faster decode, and production-grade lifecycle management.

Enter soloheaven: an MLX-native inference server built by the Hermes Agent team that wraps mlx_lm with session KV caching, Prompt Lookup Decoding (PLD), and process-managed memory budgets.

Update — June 19: the earlier verdict was too optimistic. Soloheaven still gives ~18 tok/s short-form decode and passes simple tool-call gates, but it failed Terminal-Bench-style multi-turn agent transcripts. Once long terminal output entered the conversation, responses degraded into numeric garbage while returning HTTP 200. Treat this as not agent-ready for GLM-5.2 until the session KV/cache path is disabled or fixed.

What Soloheaven Adds

  1. Session KV Cache: A disk-backed key-value cache persists across requests. In our test, 91% of prompt tokens (120 of 131) were served from cache on a follow-up call, so only the 11 new tokens needed prefill.
  2. Prompt Lookup Decoding (PLD): Drafts upcoming tokens by matching n-grams already present in the context, then verifies them with the model. GLM-5.2 went from 3–8 tok/s to a stable ~18 tok/s on short and medium generations.
  3. Memory Budget Management: An explicit --memory-budget-gb 10 flag caps cache memory to limit OOM risk.
  4. Process-Mode Architecture: Optional child-process model loading for crash isolation.
  5. JSON Schema Masking: Structured output via schema constraints.

Benchmarks

Prefill Speed (prefill-step-size 2048)

Measured via the soloheaven API on port 8025 with the model already loaded. The smallest case was the first request after startup, so it includes Metal kernel compilation overhead.

| Prompt Size | Prompt Tokens | Time | Rate |
| --- | --- | --- | --- |
| Tiny (10 words) | 27 | 6.4 s | 4 tok/s (cold startup) |
| Short (100 words) | 207 | 1.9 s | 111 tok/s |
| Medium (500 words) | 1,007 | 5.3 s | 191 tok/s |
| Long (2,000 words) | 4,007 | 22.3 s | 180 tok/s |

The 6.4 s for the first tiny request is not representative — it includes Metal GPU kernel compilation and model warmup. Warm prefill settles at 180–191 tok/s, consistent with the prefill-step-size 2048 results from our optimization post.
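
As a quick sanity check, the rates in the table follow from prompt tokens divided by wall-clock time. A minimal sketch using only the figures reported above (the dictionary keys and function name are just labels chosen for this example):

```python
# Recompute prefill rates from the reported measurements (tokens / seconds).
measurements = {
    "tiny": (27, 6.4),
    "short": (207, 1.9),
    "medium": (1007, 5.3),
    "long": (4007, 22.3),
}


def prefill_rate(tokens: int, seconds: float) -> float:
    """Return prompt tokens processed per second."""
    return tokens / seconds


rates = {name: prefill_rate(t, s) for name, (t, s) in measurements.items()}
for name, rate in rates.items():
    print(f"{name:>6}: {rate:6.1f} tok/s")
```

The short case comes out near 109 tok/s rather than the 111 tok/s reported, presumably because the displayed time is rounded to one decimal place.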

Decode Speed

| Generation Length | Time | Throughput |
| --- | --- | --- |
| 50 tokens | 2.9 s | 17.1 tok/s |
| 100 tokens | 5.5 s | 18.2 tok/s |
| 200 tokens | 10.7 s | 18.7 tok/s |

Decode holds steady at ~18 tok/s, a 3–5× improvement over stock mlx_lm.server's 3–8 tok/s. We attribute this to PLD: soloheaven drafts upcoming tokens from the existing context and verifies them together, so each accepted draft token avoids a separate full pass over the weights.
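
To make the draft-and-verify idea concrete, here is a toy sketch of prompt lookup decoding. It is not soloheaven's implementation: the function names, the n-gram size, and the draft length are all assumptions picked for illustration, and the "model" is a stand-in callable.

```python
from typing import Callable, List, Sequence


def propose_draft(tokens: Sequence[int], ngram: int = 3, max_draft: int = 8) -> List[int]:
    """Find the most recent earlier occurrence of the trailing n-gram and
    return the tokens that followed it as a draft continuation."""
    if len(tokens) <= ngram:
        return []
    tail = list(tokens[-ngram:])
    # Scan backwards over start positions that precede the trailing n-gram.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            return list(tokens[start + ngram:start + ngram + max_draft])
    return []


def accept_draft(
    context: Sequence[int],
    draft: Sequence[int],
    next_token_fn: Callable[[List[int]], int],
) -> List[int]:
    """Keep the draft prefix the model agrees with, then add one model token.

    A real engine scores every draft position in a single forward pass; this
    loop calls the toy model per position only to show the acceptance rule.
    """
    ctx = list(context)
    accepted: List[int] = []
    for tok in draft:
        expected = next_token_fn(ctx)
        if tok != expected:
            accepted.append(expected)  # first disagreement: take the model's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(next_token_fn(ctx))  # whole draft accepted: one bonus token
    return accepted


def toy_model(ctx: List[int]) -> int:
    """Stand-in model that continues the cycle 0, 1, 2, 3, 4, 0, ..."""
    return (ctx[-1] + 1) % 5


context = [0, 1, 2, 3, 4, 0, 1, 2]
draft = propose_draft(context)
print(draft, accept_draft(context, draft, toy_model))
```

The gain depends entirely on how often the continuation repeats something already in the context, which is why repetitive material such as code or quoted tool output tends to benefit most.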

Session KV Cache

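| Call | Action | Prompt Tokens | Cached | New | Time |
| --- | --- | --- | --- | --- | --- |
| 1 | Build cache (explain TP) | 20 | 0 | 20 | 5.6 s |
| 2 | Extend with follow-up | 131 | 120 | 11 | 5.6 s |
| 3 | New follow-up (prefix match) | 135 | 0 (evicted) | 135 | 4.3 s |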

The session cache works on this short exchange: 120 of 131 prompt tokens (91%) were reused from the previous conversation turn. The third call was a cache miss, likely due to eviction under the 10 GB budget (~18 tokens/GB for this 368 GB model; the M3 Ultra's memory bandwidth makes cache size tradeoffs steep).
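
The cached/new split in the second call is what a longest-common-prefix match over token IDs would produce. The sketch below illustrates that bookkeeping under the assumption that reuse is prefix-based; the function names are hypothetical and this is not soloheaven's actual cache logic.

```python
from typing import Sequence, Tuple


def shared_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Count leading token IDs that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def plan_prefill(cached: Sequence[int], prompt: Sequence[int]) -> Tuple[int, int, float]:
    """Return (reused tokens, tokens still to prefill, cache hit rate)."""
    reused = shared_prefix_len(cached, prompt)
    new = len(prompt) - reused
    hit_rate = reused / len(prompt) if len(prompt) else 0.0
    return reused, new, hit_rate


# Mirror call 2 from the table: 120 cached tokens, 131-token prompt.
cached_tokens = list(range(120))
prompt_tokens = list(range(131))
reused, new, hit_rate = plan_prefill(cached_tokens, prompt_tokens)
print(f"reused={reused} new={new} hit_rate={hit_rate:.1%}")
```

An evicted session behaves like an empty cache in this model: every prompt token has to be prefilled again, as in call 3.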

First-Request Latency

The first API call after soloheaven starts (or after a period of inactivity) takes ~6 s regardless of prompt size — this is the Metal kernel compilation and GPU state warmup. Subsequent requests respond in 1–3 s for short prompts.

Autonomy Gates

Correction: these gates were shallow. GLM-5.2 on soloheaven passed simple load, completion, tool-call, and short roundtrip tests, but those did not predict sustained agent behavior. A later Terminal-Bench run reproduced cache-associated degeneration on multi-turn transcripts with long terminal output.

| Gate | Result | Details |
| --- | --- | --- |
| Model load | PASS | strict=False patch applied |
| Inference quality | PASS | No MXFP4 garbage, clean output |
| Native tool calls | PASS | get_weather({"location":"Tokyo","units":"celsius"}) |
| Tool-result roundtrip | PASS | Accepts tool output, produces answer |
| Proxy normalization | PASS | Model name stripped and restored |
| Session KV cache | PASS | 91% cache-hit rate on extended conversations |
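
Because the later failures returned HTTP 200, a gate has to inspect the content rather than the status code. The following is a rough sketch of such a check, not the gate used in these tests: the digit-ratio heuristic and its thresholds are untuned assumptions, and the response is assumed to follow the OpenAI-style chat completions shape.

```python
def looks_degenerate(text: str, max_digit_ratio: float = 0.5, min_chars: int = 40) -> bool:
    """Flag long outputs that consist mostly of digits (hypothetical heuristic)."""
    stripped = "".join(text.split())  # drop all whitespace before counting
    if len(stripped) < min_chars:
        return False
    digits = sum(ch.isdigit() for ch in stripped)
    return digits / len(stripped) > max_digit_ratio


def gate_response(status: int, body: dict) -> bool:
    """Pass only if the call succeeded AND the content looks like real text."""
    if status != 200:
        return False
    # Assumes an OpenAI-compatible layout: choices[0].message.content
    content = body["choices"][0]["message"].get("content") or ""
    return bool(content.strip()) and not looks_degenerate(content)


good = {"choices": [{"message": {
    "content": "Tensor parallelism splits each layer's weights across devices."}}]}
bad = {"choices": [{"message": {"content": "0 1 2 3 4 5 6 7 8 9 " * 10}}]}
print(gate_response(200, good), gate_response(200, bad))
```

A check like this only catches one failure shape. Running it on every turn of a long multi-turn transcript, rather than on a single short completion, is what would have exposed the problem earlier.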

The One Patch: strict=False

GLM-5.2's model.safetensors contains Indexer projection weights for all 78 attention layers, but soloheaven's deepseek_v32.py model constructor follows the use_indexer logic from config.json, so only 21 of 78 layers get Indexer objects. The weights for the remaining 57 layers (57 × 5 = 285 entries) are harmless orphans that don't map to any model parameter, but strict loading rejects them.
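
The orphan count can be reproduced with a set difference between checkpoint keys and model parameter names. This is only an illustration of the arithmetic: the key naming scheme is made up, and which 21 layers actually receive an Indexer is not stated here, so the first 21 are used arbitrarily.

```python
NUM_LAYERS = 78
INDEXER_LAYERS = 21        # layers that get an Indexer object
TENSORS_PER_INDEXER = 5    # Indexer weight entries per layer


def indexer_keys(layers):
    """Build hypothetical weight names; the real key names will differ."""
    return {
        f"model.layers.{i}.self_attn.indexer.w{j}"
        for i in layers
        for j in range(TENSORS_PER_INDEXER)
    }


def orphan_weights(checkpoint_keys, model_param_keys):
    """Checkpoint entries that have no matching model parameter."""
    return sorted(set(checkpoint_keys) - set(model_param_keys))


checkpoint = indexer_keys(range(NUM_LAYERS))    # weights exist for all 78 layers
model = indexer_keys(range(INDEXER_LAYERS))     # arbitrary choice of 21 layers
orphans = orphan_weights(checkpoint, model)
print(len(checkpoint), len(model), len(orphans))
```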

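The fix is a short monkey-patch in mlx_engine.py. It captures the original load_model before replacing it, so the wrapper cannot end up calling itself:

# In soloheaven's mlx_engine.py, before the MLXEngine class:
import mlx_lm.utils as _mlx_lm_utils

if not hasattr(_mlx_lm_utils, '_echo_patched'):
    # Keep a reference to the original so the wrapper never recurses.
    _orig_lm_load_model = _mlx_lm_utils.load_model

    def _patched_lm_load_model(*a, **kw):
        # Force non-strict loading so the orphaned Indexer weights are ignored.
        kw['strict'] = False
        return _orig_lm_load_model(*a, **kw)

    _mlx_lm_utils.load_model = _patched_lm_load_model
    _mlx_lm_utils._echo_patched = True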

This runs at import time, so even soloheaven's process_worker child process inherits the patched function.

Deployment: launchd Service

Both soloheaven and the strip-model proxy run as launchd agents for automatic startup and crash recovery:

Soloheaven Server (com.echo.glm52-soloheaven.plist)

<key>ProgramArguments</key>
<array>
  <string>python</string>
  <string>-m</string><string>mlx_soloheaven</string>
  <string>--model</string><string>~/models/GLM-5.2-mxfp4</string>
  <string>--port</string><string>8025</string>
  <string>--host</string><string>127.0.0.1</string>
  <string>--no-thinking</string>
  <string>--memory-budget-gb</string><string>10</string>
  <string>--prefill-step-size</string><string>2048</string>
  <string>--gpu-keepalive</string>
</array>
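
One thing to watch: launchd passes ProgramArguments to the process without shell expansion, so a leading ~ reaches the program as a literal character unless the program expands it itself. A hedged sketch of generating the full plist from Python with absolute paths follows. The interpreter path assumes the venv location from the migration guide, and the RunAtLoad and KeepAlive settings are assumptions about how the service is configured.

```python
import os
import plistlib


def build_plist(label: str, python_path: str, model_path: str) -> dict:
    """Assemble a launchd job definition with user paths expanded up front."""
    return {
        "Label": label,
        "ProgramArguments": [
            os.path.expanduser(python_path), "-m", "mlx_soloheaven",
            "--model", os.path.expanduser(model_path),
            "--port", "8025",
            "--host", "127.0.0.1",
            "--no-thinking",
            "--memory-budget-gb", "10",
            "--prefill-step-size", "2048",
            "--gpu-keepalive",
        ],
        "RunAtLoad": True,   # start when the agent is loaded (assumed setting)
        "KeepAlive": True,   # restart after a crash (assumed setting)
    }


job = build_plist(
    "com.echo.glm52-soloheaven",
    "~/venvs/mlx-soloheaven/bin/python",  # assumed interpreter location
    "~/models/GLM-5.2-mxfp4",
)
xml_bytes = plistlib.dumps(job)  # XML format is the default
print(xml_bytes.decode())
```

Writing xml_bytes to a file under ~/Library/LaunchAgents and loading it with launchctl would be the usual next step for a per-user agent.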

Strip-Model Proxy (com.milo.glm52-strip-proxy.plist)

A threaded HTTP server on 0.0.0.0:8026 strips the model field from incoming requests (since soloheaven rejects unknown model names) and restores it in responses. It forwards requests through a curl subprocess to avoid the urllib hang issues seen with mlx_lm.server.
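
The two body transformations the proxy performs can be sketched as pure functions. This is an illustration rather than the actual proxy script: the function names are hypothetical, and streaming responses are not handled.

```python
import json
from typing import Optional, Tuple


def strip_model(request_body: bytes) -> Tuple[bytes, Optional[str]]:
    """Remove the "model" field and return it alongside the rewritten body."""
    payload = json.loads(request_body)
    model = payload.pop("model", None)
    return json.dumps(payload).encode(), model


def restore_model(response_body: bytes, model: Optional[str]) -> bytes:
    """Put the client's original model name back into the response."""
    payload = json.loads(response_body)
    if model is not None:
        payload["model"] = model
    return json.dumps(payload).encode()


request = b'{"model":"glm-5.2","messages":[{"role":"user","content":"Hello"}]}'
upstream_body, model_name = strip_model(request)
reply = restore_model(b'{"id":"x","choices":[]}', model_name)
print(upstream_body.decode(), reply.decode())
```

Keeping the rewrite separate from the forwarding step makes it easy to test without a running server.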

What Didn't Change

Migration Guide (mlx_lm.server → soloheaven)

# 1. Clone soloheaven
git clone https://github.com/nousresearch/hermes ~/mlx-soloheaven

# 2. Create venv & install
python3 -m venv ~/venvs/mlx-soloheaven
source ~/venvs/mlx-soloheaven/bin/activate
cd ~/mlx-soloheaven
pip install -e .

# 3. Apply strict=False patch (see above)

# 4. Start server
mlx-soloheaven \
  --model ~/models/GLM-5.2-mxfp4 \
  --port 8025 \
  --host 127.0.0.1 \
  --no-thinking \
  --memory-budget-gb 10 \
  --prefill-step-size 2048 \
  --gpu-keepalive

# 5. Start proxy (separate terminal or launchd)
python3 ~/scripts/glm52_strip_model_proxy_threaded.py

# 6. Test
curl http://127.0.0.1:8026/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"glm-5.2","messages":[{"role":"user","content":"Hello"}]}'

Summary

| Metric | mlx_lm.server | soloheaven | Δ |
| --- | --- | --- | --- |
| Decode throughput | 3–8 tok/s | ~18 tok/s | +3–5× |
| Session KV cache | ❌ Stateless (prompt-cache only) | ✅ Disk-backed session cache | 91% hit rate |
| Model load (cold) | ~60 s | ~60 s | |
| First API call | ~6 s (kernel warmup) | ~6 s | |
| Tool calls | | | |
| Crash recovery | launchd | launchd + KeepAlive | |
| Memory management | Manual prompt-cache-bytes | Auto budget with disk overflow | |

Bottom line: soloheaven is useful as a fast short-form GLM-5.2 lab server, but I no longer recommend it for agentic GLM-5.2 workloads. The stock mlx_lm.server path completed Terminal-Bench; soloheaven's session/cache path did not. Keep it out of published agent benchmarks until that cache behavior is understood and fixed.

© 2026 al-engr.com — James Meadlock. All data from live M3 Ultra measurements.