Running GLM-5.2 MXFP4 on an M3 Ultra with MLX
--prefill-step-size (see the optimization post). At that rate, a 200k-token agent transcript would spend roughly 29 minutes just in prompt processing before useful generation. That makes it a lab curiosity / slow-orchestrator candidate, not a practical default route.
terminal-bench-core==0.1.1 with terminus-2, n-concurrent=1, and the default 1× task timeouts. That is 28.75%, with a wall-clock run time of about 12h28m. This is now a measured result, not a bring-up estimate.
m3u-glm52-8026 (non-default), the trust gates passed: native tool call, tool-result roundtrip, 3-way queue smoke, proxy normalization, and three Hermes child-agent tasks. The proxy also changed from a Python urllib forwarder to a curl-backed LAN-visible shim because the urllib path could leave mlx_lm.server stuck after client timeouts.
- The hardware and model layout we used.
- The MLX patches needed to load GLM-5.2 MXFP4.
- The server command that actually brought the model up.
- The OpenAI/LiteLLM proxy needed for Terminal-Bench.
- Measured speed, memory behavior, and the final Terminal-Bench result.
- How we keep the server/proxy running under launchd.
- What is still broken or not good enough.
Hardware and starting point
The target machine is a Mac Studio M3 Ultra with 512 GB unified memory. The model is GLM-5.2-mxfp4, stored locally under ~/models/GLM-5.2-mxfp4. On this machine the directory is about 368 GB on disk with 76 safetensor shards. During our first successful load/generation smoke test, MLX reported a peak around 395 GB. During the later benchmark run, the Python server process showed about 309 GB RSS in ps. Those are different measurements, but both matter: disk footprint, resident process size, and MLX peak memory are not the same thing.
| Component | Value we used |
|---|---|
| Host | M3 Ultra Mac Studio, 512 GB unified memory |
| Model path | ~/models/GLM-5.2-mxfp4 |
| Model size on disk | ~368 GB |
| Shard count | 76 .safetensors files |
| Python runtime for patched server | Homebrew Python 3.14 inside ~/venvs/glm52-mlx-patch |
| Public-ish API route used by tools | LiteLLM / OpenAI client → :8026 threaded strip-model proxy → :8025 mlx_lm.server |
1. Download the model
Use the Hugging Face CLI. Do not write a custom Python downloader for this; the CLI already handles auth, retries, cache layout, and resumability better than a one-off script.
# Example shape — adjust the repo id if the published path changes.
huggingface-cli download MODEL_OR_ORG/GLM-5.2-mxfp4 \
--local-dir ~/models/GLM-5.2-mxfp4 \
--local-dir-use-symlinks False
# Sanity check
find ~/models/GLM-5.2-mxfp4 -name '*.safetensors' | wc -l
du -sh ~/models/GLM-5.2-mxfp4
Our local copy had 76 safetensor shards. If your count is lower or you still have .incomplete files, stop and finish the download before debugging MLX.
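The same check can be scripted so it fails loudly before any MLX debugging starts. This is a minimal sketch, not part of our actual workflow: the helper name `check_download` is hypothetical, and the expected count of 76 is simply what our local copy had.

```python
from pathlib import Path


def check_download(model_dir: Path, expected_shards: int) -> list[str]:
    """Return a list of problems; an empty list means the copy looks complete."""
    if not model_dir.is_dir():
        return [f"{model_dir} is not a directory"]
    problems = []
    # Count weight shards anywhere under the model directory.
    shards = sorted(model_dir.rglob("*.safetensors"))
    if len(shards) != expected_shards:
        problems.append(f"expected {expected_shards} shards, found {len(shards)}")
    # Unfinished downloads are left behind with an .incomplete suffix.
    leftovers = sorted(model_dir.rglob("*.incomplete"))
    if leftovers:
        problems.append(f"{len(leftovers)} .incomplete file(s) remain")
    return problems


if __name__ == "__main__":
    issues = check_download(Path.home() / "models" / "GLM-5.2-mxfp4", expected_shards=76)
    print("OK" if not issues else "\n".join(issues))
```

If the published repo is resharded, the expected count will change, so treat 76 as a local observation rather than a constant.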
2. Create a patched MLX environment
We used a separate virtual environment so the patches did not contaminate the production MLX install. On our system, Python is externally-managed, so use a venv instead of installing into the global interpreter.
python3 -m venv ~/venvs/glm52-mlx-patch
source ~/venvs/glm52-mlx-patch/bin/activate
python -m pip install -U pip
python -m pip install -U mlx mlx-lm transformers huggingface_hub
3. Patch GLM-5.2 support in MLX-LM
The model did not load cleanly for us with the unmodified code. The failing path was the DeepSeek V3.2 / GLM MoE DSA implementation expecting indexer behavior that was not present for every GLM layer. The fix was to preserve indexer_types from the config and make the per-layer indexer optional. Then make_cache() must allocate one KV cache for layers without an indexer and two caches for layers with an indexer.
Patch A: preserve indexer_types in glm_moe_dsa.py
# File inside the venv:
# ~/venvs/glm52-mlx-patch/lib/python3.14/site-packages/mlx_lm/models/glm_moe_dsa.py
class ModelArgs(BaseModelArgs):
    ...
    attention_bias: bool
    indexer_types: Optional[list] = None
    rope_scaling: Dict = None
    rope_theta: Optional[float] = None
Patch B: make the DeepSeek/GLM indexer optional per layer
# File inside the venv:
# ~/venvs/glm52-mlx-patch/lib/python3.14/site-packages/mlx_lm/models/deepseek_v32.py
# In ModelArgs:
indexer_types: Optional[list] = None
# In the attention layer __init__, after rope scaling setup:
indexer_types = getattr(config, "indexer_types", None)
use_indexer = (
    indexer_types is None
    or layer_idx >= len(indexer_types)
    or indexer_types[layer_idx] == "full"
)
self.indexer = Indexer(config) if use_indexer else None
Patch C: allocate cache shape based on whether a layer has an indexer
# In deepseek_v32.py, Model.make_cache:
def make_cache(self):
    caches = []
    for layer in self.layers:
        if getattr(layer.self_attn, "indexer", None) is None:
            caches.append(CacheList(KVCache()))
        else:
            caches.append(CacheList(KVCache(), KVCache()))
    return caches
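To sanity-check Patches B and C without loading 368 GB of weights, the per-layer decision can be modeled in plain Python. This is a sketch under assumptions: the function names are hypothetical, and `"shared"` is only a placeholder for any `indexer_types` entry other than `"full"`; the real config may use different labels.

```python
from typing import Optional


def layer_uses_indexer(indexer_types: Optional[list], layer_idx: int) -> bool:
    """Mirror of the Patch B condition: default to an indexer unless the config opts out."""
    return (
        indexer_types is None
        or layer_idx >= len(indexer_types)
        or indexer_types[layer_idx] == "full"
    )


def caches_per_layer(indexer_types: Optional[list], num_layers: int) -> list[int]:
    """Mirror of Patch C: two KV caches with an indexer, one without."""
    return [2 if layer_uses_indexer(indexer_types, i) else 1 for i in range(num_layers)]


if __name__ == "__main__":
    # "shared" is a hypothetical non-"full" label used only for illustration.
    print(caches_per_layer(["full", "shared", "full"], num_layers=4))
```

Note that layers past the end of `indexer_types` fall back to having an indexer, which matches the behavior of the unpatched code when the config key is absent.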
That was enough to make a local mlx_lm generate smoke test work for us. If your first run fails before serving, validate this path directly before adding API servers, proxies, or benchmarks.
source ~/venvs/glm52-mlx-patch/bin/activate
python -m mlx_lm generate \
--model ~/models/GLM-5.2-mxfp4 \
--trust-remote-code \
--prompt "Reply with exactly OK." \
--max-tokens 16
4. Start the MLX server
This is the server command we used. It binds only to localhost. The chat template arguments disable thinking mode because we were testing agent/tool-loop behavior and wanted direct answers.
source ~/venvs/glm52-mlx-patch/bin/activate
mkdir -p ~/models/_logs
/usr/bin/time -l python -m mlx_lm.server \
--model ~/models/GLM-5.2-mxfp4 \
--trust-remote-code \
--host 127.0.0.1 \
--port 8025 \
--max-tokens 2048 \
--prompt-cache-size 2 \
--prompt-cache-bytes 2147483648 \
--prefill-step-size 2048 \
--chat-template-args '{"enable_thinking":false,"reasoning_effort":null}' \
--temp 0.0 \
--log-level INFO \
> ~/models/_logs/glm52_tb2_server8025.log 2>&1
For the benchmark run we started it manually so failures were obvious. After the run, we kept GLM up and converted the server into a LaunchAgent so it can start at login. That means this recipe now describes the live route we used, not just a temporary test process.
5. Add a threaded strip-model proxy
This is the stupid but necessary adapter. LiteLLM and Terminal-Bench send a model field like openai/glm52. The MLX server path we used tried to interpret that model field and could 404 instead of treating the loaded model as fixed. The proxy removes only the model key before forwarding to the real server.
Our first proxy was single-threaded. That was a mistake. During a long Terminal-Bench run, one stuck client socket blocked the whole proxy and left the M3 Ultra GPU idle. The second proxy was threaded, but still used Python urllib to call mlx_lm.server; after a client timeout, that path could wedge a generation for minutes. The current live proxy binds on 0.0.0.0:8026, uses curl subprocesses for upstream calls, strips the incoming model field, and normalizes the returned model name back to glm-5.2.
cat > /tmp/glm52_strip_model_proxy_threaded.py <<'PY'
#!/usr/bin/env python3
"""Threaded OpenAI-compatible proxy for GLM-5.2 on mlx_lm.server.

- Binds LAN-visible on :8026.
- Strips the client-supplied `model` field before forwarding to mlx_lm.server.
- Uses curl subprocesses rather than urllib for POSTs; mlx_lm.server has shown
  hangs with Python urllib clients while curl is reliable.
- Normalizes response `model` back to the requested model or glm-5.2.
"""
from __future__ import annotations

import json
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

UPSTREAM = "http://127.0.0.1:8025"
HOST = "0.0.0.0"
PORT = 8026
DEFAULT_MODEL = "glm-5.2"


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def _send(self, status: int, body: bytes, content_type: str = "application/json"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(body)
        self.close_connection = True

    def _curl(self, method: str, path: str, body: bytes | None = None, timeout: int = 900):
        url = UPSTREAM + path
        cmd = ["curl", "-sS", "--max-time", str(timeout), "-w", "\n__HTTP_STATUS__:%{http_code}", "-X", method]
        tmp = None
        try:
            if body is not None:
                # Pass the body via a temp file so large prompts never hit argv limits.
                tmp = tempfile.NamedTemporaryFile(delete=False)
                tmp.write(body)
                tmp.close()
                cmd += ["-H", "Content-Type: application/json", "-d", "@" + tmp.name]
            cmd.append(url)
            p = subprocess.run(cmd, text=False, capture_output=True, timeout=timeout + 10)
            out = p.stdout or b""
            marker = b"\n__HTTP_STATUS__:"
            if marker in out:
                payload, status_b = out.rsplit(marker, 1)
                try:
                    status = int(status_b.strip() or b"502")
                except Exception:
                    status = 502
            else:
                payload, status = out, 502
            if p.returncode != 0 and not payload:
                payload = json.dumps({"error": (p.stderr or b"").decode("utf-8", "replace")}).encode()
                status = 502
            return status, payload
        finally:
            if tmp is not None:
                Path(tmp.name).unlink(missing_ok=True)

    def do_GET(self):
        try:
            status, payload = self._curl("GET", self.path, timeout=60)
            self._send(status, payload)
        except BrokenPipeError:
            pass
        except Exception as e:
            self._send(502, json.dumps({"error": repr(e)}).encode())

    def do_POST(self):
        requested_model = DEFAULT_MODEL
        try:
            length = int(self.headers.get("Content-Length", "0"))
            raw = self.rfile.read(length)
            body = json.loads(raw.decode("utf-8")) if raw else {}
            if self.path.rstrip("/") == "/v1/chat/completions" and isinstance(body, dict):
                requested_model = body.pop("model", None) or DEFAULT_MODEL
            data = json.dumps(body).encode("utf-8")
            status, payload = self._curl("POST", self.path, data, timeout=900)
            # Normalize model field in successful JSON responses.
            if status == 200:
                try:
                    obj = json.loads(payload.decode("utf-8"))
                    if isinstance(obj, dict) and "model" in obj:
                        obj["model"] = requested_model
                        payload = json.dumps(obj).encode("utf-8")
                except Exception:
                    pass
            self._send(status, payload)
        except BrokenPipeError:
            pass
        except Exception as e:
            self._send(502, json.dumps({"error": repr(e)}).encode())

    def log_message(self, fmt, *args):
        print(f"{self.address_string()} - - [{self.log_date_time_string()}] " + fmt % args, flush=True)


if __name__ == "__main__":
    srv = ThreadingHTTPServer((HOST, PORT), Handler)
    srv.daemon_threads = True
    print(f"curl-backed strip-model proxy listening on http://{HOST}:{PORT} -> {UPSTREAM}", flush=True)
    srv.serve_forever()
PY
nohup python3 /tmp/glm52_strip_model_proxy_threaded.py \
> ~/models/_logs/glm52_strip_proxy8026.log 2>&1 &
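The two JSON transformations the proxy performs can be pulled out into pure functions and exercised without a running server. This is a sketch of the same logic, not a drop-in replacement: the function names `strip_model` and `normalize_model` are hypothetical, and unlike the live proxy it does not check the request path before stripping.

```python
import json

DEFAULT_MODEL = "glm-5.2"


def strip_model(raw: bytes) -> tuple[bytes, str]:
    """Remove the client's `model` key; return the forwardable body and the requested name."""
    body = json.loads(raw.decode("utf-8")) if raw else {}
    requested = DEFAULT_MODEL
    if isinstance(body, dict):
        requested = body.pop("model", None) or DEFAULT_MODEL
    return json.dumps(body).encode("utf-8"), requested


def normalize_model(payload: bytes, requested: str) -> bytes:
    """Rewrite the upstream `model` field; pass non-JSON payloads through untouched."""
    try:
        obj = json.loads(payload.decode("utf-8"))
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return payload
    if isinstance(obj, dict) and "model" in obj:
        obj["model"] = requested
        return json.dumps(obj).encode("utf-8")
    return payload


if __name__ == "__main__":
    forwarded, requested = strip_model(b'{"model": "openai/glm52", "messages": []}')
    print(forwarded, requested)
```

Keeping these as separate functions makes it cheap to confirm that only the `model` key changes and that every other request field reaches `mlx_lm.server` untouched.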
Keep it up at login with launchd
Our current M3 Ultra state keeps GLM running as two user LaunchAgents: one for the patched mlx_lm.server process on 127.0.0.1:8025, and one for the curl-backed strip-model proxy on 0.0.0.0:8026. The proxy depends on the server, but it is deliberately separate because the proxy is tiny and easy to restart without unloading the model.
# Service names used on our M3 Ultra:
~/Library/LaunchAgents/com.milo.glm52-mlx-server.plist
~/Library/LaunchAgents/com.milo.glm52-strip-proxy.plist
# Inspect state
launchctl print gui/$(id -u)/com.milo.glm52-mlx-server
launchctl print gui/$(id -u)/com.milo.glm52-strip-proxy
# Reload after editing a plist
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.milo.glm52-strip-proxy.plist 2>/dev/null || true
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.milo.glm52-strip-proxy.plist
:8012 Qwen service and the :8025 GLM server to coexist comfortably.
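For readers who have not written a LaunchAgent before, the proxy plist can be generated with Python's standard `plistlib`. This is a sketch under assumptions and not a copy of our actual plist: the helper name, the interpreter path `/usr/bin/python3`, and the script location passed in are all hypothetical. Only the label and the log path come from this post. The keys used (`Label`, `ProgramArguments`, `RunAtLoad`, `KeepAlive`, `StandardOutPath`, `StandardErrorPath`) are standard launchd keys.

```python
import plistlib
from pathlib import Path


def proxy_launch_agent(label: str, script: Path, log: Path) -> bytes:
    """Build a minimal user LaunchAgent plist that keeps a Python script running."""
    return plistlib.dumps({
        "Label": label,
        # Interpreter path is an assumption; point it at whichever python3 you use.
        "ProgramArguments": ["/usr/bin/python3", str(script)],
        "RunAtLoad": True,   # start at login
        "KeepAlive": True,   # restart if the process exits
        "StandardOutPath": str(log),
        "StandardErrorPath": str(log),
    })


if __name__ == "__main__":
    home = Path.home()
    xml = proxy_launch_agent(
        "com.milo.glm52-strip-proxy",
        home / "bin" / "glm52_strip_model_proxy_threaded.py",  # hypothetical location
        home / "models" / "_logs" / "glm52_strip_proxy8026.log",
    )
    print(xml.decode("utf-8"))
```

A script kept under `/tmp` will not survive a reboot, so a LaunchAgent should reference a persistent copy of the proxy.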
6. Smoke test the endpoint
curl -s http://192.168.1.10:8026/v1/models | head
python3 - <<'PY'
import json
payload = {
"model": "glm-5.2",
"messages": [{"role": "user", "content": "Reply exactly: OK"}],
"max_tokens": 20,
"temperature": 0,
}
open('/tmp/glm52_proxy_smoke.json', 'w').write(json.dumps(payload))
PY
/usr/bin/time -p curl -sS --max-time 360 \
-o /tmp/glm52_proxy_smoke.out \
-w 'HTTP:%{http_code}
' \
-X POST http://192.168.1.10:8026/v1/chat/completions \
-H 'Content-Type: application/json' \
-d @/tmp/glm52_proxy_smoke.json
python3 - <<'PY'
import json, pathlib
s = pathlib.Path('/tmp/glm52_proxy_smoke.out').read_text()
d = json.loads(s)
c = d['choices'][0]
print('model', d.get('model'))
print('finish', c.get('finish_reason'))
print('content', repr(c.get('message', {}).get('content')))
print('usage', d.get('usage'))
PY
A healthy warm response for our setup is on the order of seconds, not milliseconds. After the curl-backed proxy swap, a Forge-to-M3U proxy smoke test returned OK in 4.54 seconds with the response model normalized to glm-5.2. A raw localhost server probe immediately after a clean server restart took 43.34 seconds, which is the number to expect for a cold-ish first decode.
7. Speed we measured
These are direct endpoint probes, not Terminal-Bench token accounting. I am intentionally separating them because Terminal-Bench includes Docker setup, agent retries, timeouts, tests, queueing, and tool-loop overhead. Dividing Terminal-Bench token totals by wall time is not model decode speed.
| Probe | Prompt tokens | Output tokens | Wall time | Observed rate |
|---|---|---|---|---|
| Tiny warm OK | 11 | 3 | 6.10 s | Overhead dominated |
| 128-integer generation (warm decode, pre step-size fix) | 21 | 220 | 12.04 s | 18.3 output tok/s; measured under a different MLX build state and not reproducible (current decode is 3-8 tok/s) |
| 220-word prose (warm decode, pre step-size fix) | 22 | 249 | 13.66 s | 18.2 output tok/s; measured under a different MLX build state and not reproducible (current decode is 3-8 tok/s) |
| 4.5k-token needle prefill | 4510 | 8 | 39.51 s | ~114 total tok/s prefill-heavy |
| 200k-token prompt extrapolation | 200,000 | 0 | ~1,754 s / ~29.2 min | At 114 tok/s; extrapolated, not separately run |
So the short version is: decode is around 3-8 tokens/second once it is going, depending on compile and GPU state. Decode on a 368 GB model is bandwidth-bound, and the ~18 tok/s we saw during bring-up was not reproducible with the current server state. Long accumulated agent transcripts are the real killer, though. Terminal-Bench can build prompts of many thousands of tokens, and a fresh 200k-token context at our observed prefill rate of about 114 tok/s would take about 29 minutes just to ingest. In practice, that is far too slow for our normal agent workflow unless the prefix is already cached and reused.
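The 29-minute figure is plain arithmetic on the 4.5k-token probe, reproduced here so it can be rerun with other numbers. The helper names are hypothetical, and the extrapolation assumes prefill throughput stays flat as context grows, which we did not measure at 200k tokens.

```python
def total_token_rate(prompt_tokens: int, output_tokens: int, wall_seconds: float) -> float:
    """Total tokens per second for a prefill-heavy probe (prompt plus output over wall time)."""
    return (prompt_tokens + output_tokens) / wall_seconds


def ingest_minutes(context_tokens: int, tokens_per_second: float) -> float:
    """Minutes needed to ingest a fresh context at a constant rate."""
    return context_tokens / tokens_per_second / 60.0


if __name__ == "__main__":
    rate = total_token_rate(4510, 8, 39.51)  # the needle-prefill probe from the table
    print(f"prefill-heavy rate: {rate:.1f} tok/s")
    print(f"200k-token ingest: {ingest_minutes(200_000, rate):.1f} min")
```

If prefill slows down at longer contexts, as attention cost usually grows with sequence length, the real number would be worse than this linear estimate.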
8. Terminal-Bench setup
For Terminal-Bench we used a Python 3.12 venv because newer Python versions caused CLI dependency pain. Docker Desktop also needs to be reachable from non-login SSH shells, which means setting PATH explicitly.
cd ~/.hermes/bench/terminal-bench
python3.12 -m venv .venv312
source .venv312/bin/activate
python -m pip install 'terminal-bench==0.2.18'
export PATH="/usr/local/bin:/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin:$PATH"
export OPENAI_API_KEY=dummy
ulimit -n 8192
The ulimit line matters. A previous full run died after about 24 results with OSError: [Errno 24] Too many open files because the SSH shell inherited ulimit -n = 256. The hard limit on that host was unlimited, so raising the soft limit fixed that harness failure.
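If the harness is launched from a Python wrapper rather than an interactive shell, the same soft-limit bump can be done with the standard `resource` module. This is a sketch, assuming a Unix host; the function name is hypothetical and 8192 simply mirrors the `ulimit -n` value above.

```python
import resource


def raise_nofile_soft_limit(target: int = 8192) -> tuple[int, int]:
    """Raise the soft open-file limit toward `target`, never above the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("RLIMIT_NOFILE (soft, hard):", raise_nofile_soft_limit())
```

The change applies to the calling process and anything it spawns afterwards, so it has to run before the harness starts its Docker and tmux children.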
terminal-bench runs create \
--dataset terminal-bench-core==0.1.1 \
--agent terminus-2 \
--model openai/glm52 \
--agent-kwarg api_base=http://127.0.0.1:8026/v1 \
--agent-kwarg temperature=0 \
--n-concurrent 1 \
--n-attempts 1 \
--run-id glm52-full-core-1x-c1-threadproxy-$(date +%Y%m%d-%H%M%S) \
--output-path runs \
--no-upload-results \
--log-level info
--global-timeout-multiplier 3; those are useful for capability debugging but are not comparable to stock-timeout results.
Final stock-timeout result
| Run | Dataset / agent | Timeout regime | Resolved | Score | Wall time |
|---|---|---|---|---|---|
| glm52-full-core-1x-c1-threadproxy-20260617-042345 | terminal-bench-core==0.1.1 / terminus-2 | Stock 1×, n-concurrent=1 | 23 / 80 | 28.75% | ~12h28m |
The result is good enough to prove the model can operate through the harness, but not good enough to crown it as our default local agent route. The score is below the faster local routes in the comparison post, and the failure shape matters: many misses were timeouts, bad-gateway/proxy interactions, or long-context prefill pain rather than clean model refusals.
I also updated the main comparison page with this peer row: Local LLM Testing, June 2026.
9. Hermes autonomy gates
James asked for the gates that matter before trusting this as an autonomous slow orchestrator, not just a text-generation demo. These were run from Forge against the live route http://192.168.1.10:8026/v1 after the proxy was changed to the curl-backed LAN-visible shim.
| Gate | Result | Measured behavior |
|---|---|---|
| Native tool-call smoke | PASS | finish_reason=tool_calls, emitted get_weather with JSON args {"city":"Paris"} in 8.11 s. |
| Full tool-result roundtrip | PASS | Accepted a synthetic tool result (22°C, sunny) and produced a normal final answer in 3.24 s. |
| 2+ concurrent request queue | PASS | Three concurrent short requests all completed: per-request walls 0.98s, 1.55s, 1.55s; total wall 1.55 s. This is a smoke test, not a throughput benchmark. |
| Proxy strip/normalization consistency | PASS | Three repeated completions returned clean assistant messages with keys content, role, no <think>, no reasoning fields, and no leaked tool metadata. Walls: 1.04s, 0.72s, 0.72s. |
| Hermes child-agent path | PASS | hermes chat --provider m3u-glm52-8026 --model glm-5.2 -t safe completed a direct child smoke and then 3/3 small tasks. |
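The native tool-call gate reduces to one request shape and three checks on the response. The sketch below shows that shape using the OpenAI chat-completions tool format; it is not the script we ran. The helper names, the prompt text, and the tool description are assumptions, and only `get_weather` and the `{"city":"Paris"}` arguments come from the gate result above. The checker is kept separate from any HTTP code so it can be tested against a saved response.

```python
import json


def build_tool_call_request(model: str = "glm-5.2") -> dict:
    """Request body offering a single function tool (prompt and description are illustrative)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "max_tokens": 128,
        "temperature": 0,
    }


def check_tool_call(response: dict, expected_name: str) -> dict:
    """Validate finish_reason, tool name, and JSON arguments; return the parsed arguments."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        raise ValueError(f"finish_reason was {choice.get('finish_reason')!r}")
    calls = choice.get("message", {}).get("tool_calls") or []
    if not calls:
        raise ValueError("no tool_calls in message")
    fn = calls[0]["function"]
    if fn["name"] != expected_name:
        raise ValueError(f"unexpected tool {fn['name']!r}")
    # Arguments arrive as a JSON string and must parse cleanly.
    return json.loads(fn["arguments"])
```

To run it against the live route, POST the output of `build_tool_call_request()` to `/v1/chat/completions` and pass the decoded JSON response to `check_tool_call(response, "get_weather")`.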
Hermes child-agent task details
| Task | Exit | Wall time | Observed output |
|---|---|---|---|
| child_math | 0 | 10.16 s | RESULT=391 |
| child_summary | 0 | 11.43 s | Local orchestrator model passed tool-call gates. |
| child_json | 0 | 11.49 s | {"status":"ok","agent":"glm52"} |
What worked
- The model loads and serves on an M3 Ultra with 512 GB unified memory.
- Native OpenAI-style chat completions work through the LAN-visible curl-backed proxy.
- Direct decode speed is usable for experiments: roughly 3-8 output tokens/second in the current server state. The ~18 output tokens/second from our bring-up warm probes was not reproducible.
- GLM-5.2 can solve real Terminal-Bench tasks. In the clean stock 1× run it resolved 23 / 80 tasks, including nontrivial examples such as QEMU startup, bucket creation, Bitcoin node discovery, tmux workflow, pandas version repair, OpenSSL certificate generation, and SWE-style Astropy tasks.
What is not working well
- It needed source patches. If upstream MLX-LM does not include these changes yet, this is a local fork, not a turnkey install.
- The OpenAI compatibility path still needs a shim. LiteLLM, Terminal-Bench, and Hermes send a model field that the MLX server did not tolerate in our setup. The curl-backed strip-model proxy is small and now passes the gates above, but it is still another moving part.
- Two proxy failure modes showed up. The first single-threaded proxy blocked behind a stuck client socket. The second threaded urllib proxy could leave an upstream request stuck after client timeout. The current curl-backed proxy fixed the route for the gates above, but this history is why I still treat the serving path cautiously.
- Stock Terminal-Bench timeouts are harsh for this serving path. We saw LiteLLM request timeouts, agent timeouts, and output truncation. Some failures are model behavior; some are client/harness/server interaction. The final score needs to be read with that in mind.
- Long agent transcripts punish prefill badly. Direct decode peaked around 18 tok/s during bring-up (3-8 tok/s in the current server state), but a 4.5k-token prefill probe took 39.5 seconds, about 114 tok/s. A fresh 200k-token context at that rate is roughly 29 minutes of prompt ingestion. That is the practical deal-breaker for our agent workflow.
- File descriptor limits matter. A default non-login macOS SSH shell gave us ulimit -n = 256, which was too low for a long Docker/tmux/asciinema Terminal-Bench run.
- It consumes the M3 Ultra. With this model loaded, the machine is effectively dedicated to GLM. On our run the server process sat around 309 GB RSS and the normal Qwen service had to be stopped.
How I would reproduce it from scratch
- Download the MXFP4 weights with huggingface-cli.
- Create a separate MLX-LM venv; do not patch your production install.
- Apply the indexer_types and make_cache() patches if upstream still needs them.
- Validate with mlx_lm generate before starting the server.
- Start mlx_lm.server on localhost:8025.
- Start the curl-backed threaded strip-model proxy on 0.0.0.0:8026 if Forge/Hermes needs LAN access, or localhost only if testing entirely on the M3 Ultra.
- Run direct chat probes, a long-prefill probe, native tool-call smoke, tool-result roundtrip, and a concurrency smoke before any agent benchmark.
- For Terminal-Bench, use Python 3.12, set Docker's PATH explicitly, and raise ulimit -n.
- Keep timeout regimes separate: stock 1× for peer comparison, extended timeout only as a diagnostic.
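The concurrency smoke named in the list above needs very little code. This is a hedged sketch rather than the script we ran: `concurrency_smoke` is a hypothetical helper, and the `send` callable is injected so the timing logic can be checked with a stub before pointing it at the real endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def concurrency_smoke(send: Callable[[int], object], n: int = 3) -> dict:
    """Fire n requests at once; report per-request wall times and the total wall time."""
    def timed(i: int) -> float:
        start = time.perf_counter()
        send(i)  # any exception raised here propagates and fails the smoke test
        return time.perf_counter() - start

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        walls = list(pool.map(timed, range(n)))
    return {"per_request": walls, "total": time.perf_counter() - t0}


if __name__ == "__main__":
    # Stub sender; replace with a function that POSTs a short chat completion.
    result = concurrency_smoke(lambda i: time.sleep(0.1), n=3)
    print(result)
```

A total wall time close to the slowest single request only shows that requests are accepted concurrently. As with our gate, it is a smoke test, not a throughput benchmark.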
Current verdict
GLM-5.2 MXFP4 on a 512 GB M3 Ultra is technically real. It loads, generates, handles native tool calls, survives a tool-result roundtrip, queues a few simultaneous requests, and can run small Hermes child-agent tasks through the non-default m3u-glm52-8026 route. The honest status is now: technically successful, but VERY slow for us; credible only as a lab curiosity / explicitly slow orchestrator, not a default replacement for faster local agent routes.
The work left is straightforward, but the performance verdict is harsher now: upstream the MLX fixes or remove the local patch, make the OpenAI compatibility behavior native instead of proxy-based, and tune client/server timeouts without invalidating benchmark comparisons. Even if those are fixed, the observed prefill rate means this route is not practical for fresh long-context agent sessions unless prompt caching changes the economics. The full stock Terminal-Bench core run is now complete; the next useful experiment would be a clearly labeled extended-timeout diagnostic or a focused prefix-cache test, not silently replacing the 1× result.