Running GLM-5.2 MXFP4 on an M3 Ultra with MLX

June 17, 2026 · M3 Ultra / Apple Silicon / MLX / Terminal-Bench notes

Short version: GLM-5.2 MXFP4 can run on a 512 GB M3 Ultra under MLX, but it was not plug-and-play. We needed a patched MLX environment, a local OpenAI-compatible proxy, a higher file-descriptor limit for Terminal-Bench, and a threaded proxy to avoid deadlocks. The model is capable, but the current serving path is very slow for us: direct decode is only around 3-8 output tok/s, and the prefill-heavy path measured about 114 tok/s (later improved to 179 tok/s by raising --prefill-step-size; see the optimization post). At 114 tok/s, a 200k-token agent transcript would spend roughly 29 minutes just in prompt processing before any useful generation. That makes it a lab curiosity / slow-orchestrator candidate, not a practical default route.

June 17 update: the clean stock-timeout Terminal-Bench run finished. GLM-5.2 MXFP4 resolved 23 / 80 tasks on terminal-bench-core==0.1.1 with terminus-2, n-concurrent=1, and the default 1× task timeouts. That is 28.75%, with a wall-clock run time of about 12h28m. This is now a measured result, not a bring-up estimate.

June 17 later update: after wiring the route into Hermes as m3u-glm52-8026 (non-default), the trust gates passed: native tool call, tool-result roundtrip, 3-way queue smoke, proxy normalization, and three Hermes child-agent tasks. The proxy also changed from a Python urllib forwarder to a curl-backed LAN-visible shim because the urllib path could leave mlx_lm.server stuck after client timeouts.

June 18 practical-speed update: the long-context prefill number is the headline now. Our prefill-heavy probe came out to about 114 tok/s. Extrapolated to a 200k-token prompt, that is ~1,754 seconds, or ~29.2 minutes, before the model has even generated the answer. This is why our conclusion changed from “slow but maybe useful” to: GLM-5.2 MXFP4 is a VERY slow model for our local agent workflow. Prefix caching can help repeated prompts, but fresh long agent transcripts are punishing.

What this post covers

Hardware and starting point

The target machine is a Mac Studio M3 Ultra with 512 GB unified memory. The model is GLM-5.2-mxfp4, stored locally under ~/models/GLM-5.2-mxfp4. On this machine the directory is about 368 GB on disk with 76 safetensor shards. During our first successful load/generation smoke test, MLX reported a peak around 395 GB. During the later benchmark run, the Python server process showed about 309 GB RSS in ps. Those are three different measurements, and all of them matter: disk footprint, resident process size, and MLX peak memory are not the same thing.

Component | Value we used
Host | M3 Ultra Mac Studio, 512 GB unified memory
Model path | ~/models/GLM-5.2-mxfp4
Model size on disk | ~368 GB
Shard count | 76 .safetensors files
Python runtime for patched server | Homebrew Python 3.14 inside ~/venvs/glm52-mlx-patch
Public-ish API route used by tools | LiteLLM / OpenAI client → :8026 threaded strip-model proxy → :8025 mlx_lm.server

1. Download the model

Use the Hugging Face CLI. Do not write a custom Python downloader for this; the CLI already handles auth, retries, cache layout, and resumability better than a one-off script.

# Example shape — adjust the repo id if the published path changes.
huggingface-cli download MODEL_OR_ORG/GLM-5.2-mxfp4 \
  --local-dir ~/models/GLM-5.2-mxfp4 \
  --local-dir-use-symlinks False

# Sanity check
find ~/models/GLM-5.2-mxfp4 -name '*.safetensors' | wc -l
du -sh ~/models/GLM-5.2-mxfp4

Our local copy had 76 safetensor shards. If your count is lower or you still have .incomplete files, stop and finish the download before debugging MLX.
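
The same sanity check can be scripted so it is easy to rerun after a resumed download. This is a minimal sketch, not part of our original run: the helper name check_download is made up here, and the expected shard count of 76 is simply what our local copy had. It counts bytes rather than disk blocks, so the size can differ slightly from du -sh.

```python
from pathlib import Path


def check_download(model_dir: str, expected_shards: int) -> dict:
    """Count weight shards and leftover partial files under model_dir."""
    root = Path(model_dir).expanduser()
    shards = sorted(root.rglob("*.safetensors"))
    # Partial downloads show up as *.incomplete files somewhere under the tree.
    incomplete = sorted(root.rglob("*.incomplete"))
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return {
        "shards": len(shards),
        "incomplete": len(incomplete),
        "size_gib": total_bytes / 2**30,
        "ok": len(shards) == expected_shards and not incomplete,
    }


if __name__ == "__main__":
    # 76 is the shard count of our local copy; adjust if the published repo differs.
    print(check_download("~/models/GLM-5.2-mxfp4", expected_shards=76))
```

If "ok" is False, finish the download first; nothing downstream is worth debugging until it is True.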

2. Create a patched MLX environment

We used a separate virtual environment so the patches did not contaminate the production MLX install. On our system, Python is externally-managed, so use a venv instead of installing into the global interpreter.

python3 -m venv ~/venvs/glm52-mlx-patch
source ~/venvs/glm52-mlx-patch/bin/activate
python -m pip install -U pip
python -m pip install -U mlx mlx-lm transformers huggingface_hub

Important: the exact upstream MLX/MLX-LM state changes quickly. The patches below document what was required for our run. Check upstream first; if these fixes have landed, prefer the released package over carrying a local patch forever.

3. Patch GLM-5.2 support in MLX-LM

The model did not load cleanly for us with the unmodified code. The failing path was the DeepSeek V3.2 / GLM MoE DSA implementation expecting indexer behavior that was not present for every GLM layer. The fix was to preserve indexer_types from the config and make the per-layer indexer optional. Then make_cache() must allocate one KV cache for layers without an indexer and two caches for layers with an indexer.

Patch A: preserve indexer_types in glm_moe_dsa.py

# File inside the venv:
# ~/venvs/glm52-mlx-patch/lib/python3.14/site-packages/mlx_lm/models/glm_moe_dsa.py

class ModelArgs(BaseModelArgs):
    ...
    attention_bias: bool
    indexer_types: Optional[list] = None
    rope_scaling: Dict = None
    rope_theta: Optional[float] = None

Patch B: make the DeepSeek/GLM indexer optional per layer

# File inside the venv:
# ~/venvs/glm52-mlx-patch/lib/python3.14/site-packages/mlx_lm/models/deepseek_v32.py

# In ModelArgs:
indexer_types: Optional[list] = None

# In the attention layer __init__, after rope scaling setup:
indexer_types = getattr(config, "indexer_types", None)
use_indexer = (
    indexer_types is None
    or layer_idx >= len(indexer_types)
    or indexer_types[layer_idx] == "full"
)
self.indexer = Indexer(config) if use_indexer else None

Patch C: allocate cache shape based on whether a layer has an indexer

# In deepseek_v32.py, Model.make_cache:
def make_cache(self):
    caches = []
    for layer in self.layers:
        if getattr(layer.self_attn, "indexer", None) is None:
            caches.append(CacheList(KVCache()))
        else:
            caches.append(CacheList(KVCache(), KVCache()))
    return caches
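
To make the rule in Patches B and C easier to reason about, here is a standalone sketch with no MLX dependency. It only mirrors the decision logic shown above; the function names are ours for illustration, and "shared" is a made-up example of a non-"full" value, not something taken from the real config.

```python
from typing import List, Optional


def layer_uses_indexer(indexer_types: Optional[List[str]], layer_idx: int) -> bool:
    """Mirror Patch B: keep the indexer unless the config marks the layer otherwise."""
    return (
        indexer_types is None
        or layer_idx >= len(indexer_types)
        or indexer_types[layer_idx] == "full"
    )


def caches_per_layer(indexer_types: Optional[List[str]], num_layers: int) -> List[int]:
    """Mirror Patch C: two KV caches for a layer with an indexer, one without."""
    return [2 if layer_uses_indexer(indexer_types, i) else 1 for i in range(num_layers)]


if __name__ == "__main__":
    # "shared" is a hypothetical non-"full" value used only for illustration.
    print(caches_per_layer(["full", "shared", "full"], num_layers=4))  # [2, 1, 2, 2]
    print(caches_per_layer(None, num_layers=3))                        # [2, 2, 2]
```

Note the fallback behavior: a missing indexer_types list, or a layer index past the end of it, keeps the indexer, which matches the unpatched behavior for configs that never set the field.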

That was enough to make a local mlx_lm generate smoke test work for us. If your first run fails before serving, validate this path directly before adding API servers, proxies, or benchmarks.

source ~/venvs/glm52-mlx-patch/bin/activate
python -m mlx_lm generate \
  --model ~/models/GLM-5.2-mxfp4 \
  --trust-remote-code \
  --prompt "Reply with exactly OK." \
  --max-tokens 16

4. Start the MLX server

This is the server command we used. It binds only to localhost. The chat template arguments disable thinking mode because we were testing agent/tool-loop behavior and wanted direct answers.

source ~/venvs/glm52-mlx-patch/bin/activate
mkdir -p ~/models/_logs

/usr/bin/time -l python -m mlx_lm.server \
  --model ~/models/GLM-5.2-mxfp4 \
  --trust-remote-code \
  --host 127.0.0.1 \
  --port 8025 \
  --max-tokens 2048 \
  --prompt-cache-size 2 \
  --prompt-cache-bytes 2147483648 \
  --prefill-step-size 2048 \
  --chat-template-args '{"enable_thinking":false,"reasoning_effort":null}' \
  --temp 0.0 \
  --log-level INFO \
  > ~/models/_logs/glm52_tb2_server8025.log 2>&1

For the benchmark run we started it manually so failures were obvious. After the run, we kept GLM up and converted the server into a LaunchAgent so it can start at login. That means this recipe now describes the live route we used, not just a temporary test process.
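
Because a model this size takes a while to come up, anything that depends on the server (the proxy, a probe script, a benchmark) is better off waiting for the port than assuming it is there. The following is a hedged sketch of such a wait, not something from our run: wait_for_port is a hypothetical helper, and it only checks that the TCP port accepts connections, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 1800.0, interval_s: float = 5.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect is enough; close the socket immediately.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            # Sleep for the poll interval, but never past the deadline.
            time.sleep(min(interval_s, max(0.0, deadline - time.monotonic())))
```

For example, wait_for_port("127.0.0.1", 8025) would block until the server port opens or 30 minutes pass. A real request such as the "Reply with exactly OK." probe is still the only proof that generation works.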

5. Add a threaded strip-model proxy

This is the stupid but necessary adapter. LiteLLM and Terminal-Bench send a model field like openai/glm52. The MLX server path we used tried to interpret that model field and could 404 instead of treating the loaded model as fixed. The proxy removes only the model key before forwarding to the real server.

Our first proxy was single-threaded. That was a mistake. During a long Terminal-Bench run, one stuck client socket blocked the whole proxy and left the M3 Ultra GPU idle. The second proxy was threaded, but still used Python urllib to call mlx_lm.server; after a client timeout, that path could wedge a generation for minutes. The current live proxy binds on 0.0.0.0:8026, uses curl subprocesses for upstream calls, strips the incoming model field, and normalizes the returned model name back to glm-5.2.
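
The request and response rewriting is the only real logic in the proxy, and it can be isolated as two pure functions that are testable without a running server. This is a sketch of that idea, not a drop-in replacement: the function names strip_model and normalize_model are introduced here for illustration, and unlike the live proxy this version strips the model key regardless of request path.

```python
import json

DEFAULT_MODEL = "glm-5.2"


def strip_model(raw_request: bytes) -> tuple[bytes, str]:
    """Drop the client's `model` key; return the new body and the name to echo back."""
    body = json.loads(raw_request.decode("utf-8")) if raw_request else {}
    requested = DEFAULT_MODEL
    if isinstance(body, dict):
        requested = body.pop("model", None) or DEFAULT_MODEL
    return json.dumps(body).encode("utf-8"), requested


def normalize_model(raw_response: bytes, requested: str) -> bytes:
    """Rewrite the response's `model` field; pass non-JSON payloads through untouched."""
    try:
        obj = json.loads(raw_response.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return raw_response
    if isinstance(obj, dict) and "model" in obj:
        obj["model"] = requested
        return json.dumps(obj).encode("utf-8")
    return raw_response


if __name__ == "__main__":
    forwarded, name = strip_model(b'{"model": "openai/glm52", "messages": []}')
    print(forwarded, name)
```

Keeping the rewrite separate from the transport also made it cheap to swap urllib for curl later, since neither function cares how the bytes reach mlx_lm.server.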

cat > /tmp/glm52_strip_model_proxy_threaded.py <<'PY'
#!/usr/bin/env python3
"""Threaded OpenAI-compatible proxy for GLM-5.2 on mlx_lm.server.

- Binds LAN-visible on :8026.
- Strips the client-supplied `model` field before forwarding to mlx_lm.server.
- Uses curl subprocesses rather than urllib for POSTs; mlx_lm.server has shown
  hangs with Python urllib clients while curl is reliable.
- Normalizes response `model` back to the requested model or glm-5.2.
"""
from __future__ import annotations

import json
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

UPSTREAM = "http://127.0.0.1:8025"
HOST = "0.0.0.0"
PORT = 8026
DEFAULT_MODEL = "glm-5.2"

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def _send(self, status: int, body: bytes, content_type: str = "application/json"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(body)
        self.close_connection = True

    def _curl(self, method: str, path: str, body: bytes | None = None, timeout: int = 900):
        url = UPSTREAM + path
        cmd = ["curl", "-sS", "--max-time", str(timeout), "-w", "\n__HTTP_STATUS__:%{http_code}", "-X", method]
        tmp = None
        try:
            if body is not None:
                tmp = tempfile.NamedTemporaryFile(delete=False)
                tmp.write(body)
                tmp.close()
                cmd += ["-H", "Content-Type: application/json", "-d", "@" + tmp.name]
            cmd.append(url)
            p = subprocess.run(cmd, text=False, capture_output=True, timeout=timeout + 10)
            out = p.stdout or b""
            marker = b"\n__HTTP_STATUS__:"
            if marker in out:
                payload, status_b = out.rsplit(marker, 1)
                try:
                    status = int(status_b.strip() or b"502")
                except Exception:
                    status = 502
            else:
                payload, status = out, 502
            if p.returncode != 0 and not payload:
                payload = json.dumps({"error": (p.stderr or b"").decode("utf-8", "replace")}).encode()
                status = 502
            return status, payload
        finally:
            if tmp is not None:
                Path(tmp.name).unlink(missing_ok=True)

    def do_GET(self):
        try:
            status, payload = self._curl("GET", self.path, timeout=60)
            self._send(status, payload)
        except BrokenPipeError:
            pass
        except Exception as e:
            self._send(502, json.dumps({"error": repr(e)}).encode())

    def do_POST(self):
        requested_model = DEFAULT_MODEL
        try:
            length = int(self.headers.get("Content-Length", "0"))
            raw = self.rfile.read(length)
            body = json.loads(raw.decode("utf-8")) if raw else {}
            if self.path.rstrip("/") == "/v1/chat/completions" and isinstance(body, dict):
                requested_model = body.pop("model", None) or DEFAULT_MODEL
            data = json.dumps(body).encode("utf-8")
            status, payload = self._curl("POST", self.path, data, timeout=900)
            # Normalize model field in successful JSON responses.
            if status == 200:
                try:
                    obj = json.loads(payload.decode("utf-8"))
                    if isinstance(obj, dict) and "model" in obj:
                        obj["model"] = requested_model
                        payload = json.dumps(obj).encode("utf-8")
                except Exception:
                    pass
            self._send(status, payload)
        except BrokenPipeError:
            pass
        except Exception as e:
            self._send(502, json.dumps({"error": repr(e)}).encode())

    def log_message(self, fmt, *args):
        print(f"{self.address_string()} - - [{self.log_date_time_string()}] " + fmt % args, flush=True)

if __name__ == "__main__":
    srv = ThreadingHTTPServer((HOST, PORT), Handler)
    srv.daemon_threads = True
    print(f"curl-backed strip-model proxy listening on http://{HOST}:{PORT} -> {UPSTREAM}", flush=True)
    srv.serve_forever()
PY

nohup python3 /tmp/glm52_strip_model_proxy_threaded.py \
  > ~/models/_logs/glm52_strip_proxy8026.log 2>&1 &

Keep it up at login with launchd

Our current M3 Ultra state keeps GLM running as two user LaunchAgents: one for the patched mlx_lm.server process on 127.0.0.1:8025, and one for the curl-backed strip-model proxy on 0.0.0.0:8026. The proxy depends on the server, but it is deliberately separate because the proxy is tiny and easy to restart without unloading the model.

# Service names used on our M3 Ultra:
~/Library/LaunchAgents/com.milo.glm52-mlx-server.plist
~/Library/LaunchAgents/com.milo.glm52-strip-proxy.plist

# Inspect state
launchctl print gui/$(id -u)/com.milo.glm52-mlx-server
launchctl print gui/$(id -u)/com.milo.glm52-strip-proxy

# Reload after editing a plist
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.milo.glm52-strip-proxy.plist 2>/dev/null || true
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.milo.glm52-strip-proxy.plist

The normal Qwen service on the M3 Ultra is disabled while this GLM route is resident. GLM fits, but it effectively consumes the box; do not expect the old :8012 Qwen service and the :8025 GLM server to coexist comfortably.
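
The plist contents are not reproduced in this post. As a rough guide, a minimal LaunchAgent for the proxy might be generated like this. Treat it as a sketch under assumptions, not a copy of the files on the M3 Ultra: the interpreter path, the script location under ~/bin, and the choice of RunAtLoad plus KeepAlive are all illustrative. Only the label and log path come from the commands above.

```python
import plistlib
from pathlib import Path


def build_launch_agent(label: str, argv: list[str], log_path: str) -> bytes:
    """Serialize a minimal user LaunchAgent definition to XML plist bytes."""
    agent = {
        "Label": label,
        "ProgramArguments": argv,
        "RunAtLoad": True,   # start when the agent is loaded at login
        "KeepAlive": True,   # relaunch if the process exits
        "StandardOutPath": log_path,
        "StandardErrorPath": log_path,
    }
    return plistlib.dumps(agent, fmt=plistlib.FMT_XML)


if __name__ == "__main__":
    home = Path.home()
    xml = build_launch_agent(
        label="com.milo.glm52-strip-proxy",
        # Hypothetical interpreter and script locations; launchd does not expand "~".
        argv=["/usr/bin/python3", str(home / "bin" / "glm52_strip_model_proxy.py")],
        log_path=str(home / "models" / "_logs" / "glm52_strip_proxy8026.log"),
    )
    print(xml.decode("utf-8"))
```

Whatever the final plist looks like, use absolute paths throughout and keep the script somewhere more durable than /tmp if it has to survive a reboot.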

6. Smoke test the endpoint

curl -s http://192.168.1.10:8026/v1/models | head

python3 - <<'PY'
import json
payload = {
    "model": "glm-5.2",
    "messages": [{"role": "user", "content": "Reply exactly: OK"}],
    "max_tokens": 20,
    "temperature": 0,
}
with open('/tmp/glm52_proxy_smoke.json', 'w') as f:
    json.dump(payload, f)
PY

/usr/bin/time -p curl -sS --max-time 360 \
  -o /tmp/glm52_proxy_smoke.out \
  -w 'HTTP:%{http_code}\n' \
  -X POST http://192.168.1.10:8026/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d @/tmp/glm52_proxy_smoke.json

python3 - <<'PY'
import json, pathlib
s = pathlib.Path('/tmp/glm52_proxy_smoke.out').read_text()
d = json.loads(s)
c = d['choices'][0]
print('model', d.get('model'))
print('finish', c.get('finish_reason'))
print('content', repr(c.get('message', {}).get('content')))
print('usage', d.get('usage'))
PY

A healthy warm response for our setup is on the order of seconds, not milliseconds. After the curl-backed proxy swap, a Forge-to-M3U proxy smoke test returned OK in 4.54 seconds with the response model normalized to glm-5.2. A raw localhost server probe immediately after a clean server restart took 43.34 seconds, which is the number to expect for a cold-ish first decode.

7. Speed we measured

These are direct endpoint probes, not Terminal-Bench token accounting. I am intentionally separating them because Terminal-Bench includes Docker setup, agent retries, timeouts, tests, queueing, and tool-loop overhead. Dividing Terminal-Bench token totals by wall time is not model decode speed.

Probe | Prompt tokens | Output tokens | Wall time | Observed rate
Tiny warm OK | 11 | 3 | 6.10 s | Overhead dominated
128-integer generation (warm decode, pre step-size fix) | 21 | 220 | 12.04 s | 18.3 output tok/s (measured with a different MLX build state; not reproducible, current state gives 3-8 tok/s)
220-word prose (warm decode, pre step-size fix) | 22 | 249 | 13.66 s | 18.2 output tok/s (measured with a different MLX build state; not reproducible, current state gives 3-8 tok/s)
4.5k-token needle prefill | 4510 | 8 | 39.51 s | ~114 total tok/s, prefill-heavy
200k-token prompt extrapolation | 200,000 | 0 | ~1,754 s / ~29.2 min | At 114 tok/s; extrapolated, not separately run

So the short version is: decode runs at around 3-8 tokens/second once it is going. The higher 18 tok/s we saw during bring-up was not reproducible with the current server state; decode on a 368 GB model is bandwidth-bound at ~3-8 tok/s depending on compile and GPU state. Long accumulated agent transcripts are the real killer, though. Terminal-Bench can build prompts of many thousands of tokens, and a serious 200k-token context at our observed 114 tok/s prefill rate would take about 29 minutes just to ingest. In practice, that is far too slow for our normal agent workflow unless the prefix is already cached and reused.

Operational conclusion: for us, this is a very slow model. It is interesting because it runs and passes real tool/autonomy gates, but the prefill behavior makes it a poor fit for fresh long-context agent runs on our M3 Ultra MLX path.
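
The rates and the 200k-token extrapolation above are plain division, and writing it out makes it easy to plug in other prompt sizes or a different measured rate. A small sketch using only numbers from this post (the helper names are ours; 179 tok/s is the later --prefill-step-size figure mentioned in the intro):

```python
def tok_per_s(tokens: int, wall_s: float) -> float:
    """Tokens divided by wall-clock seconds for a single probe."""
    return tokens / wall_s


def prefill_minutes(prompt_tokens: int, rate_tok_s: float) -> float:
    """Minutes needed to ingest a prompt at a fixed prefill rate."""
    return prompt_tokens / rate_tok_s / 60.0


if __name__ == "__main__":
    print(f"128-integer decode: {tok_per_s(220, 12.04):.1f} output tok/s")    # 18.3
    print(f"220-word decode:    {tok_per_s(249, 13.66):.1f} output tok/s")    # 18.2
    print(f"needle prefill:     {tok_per_s(4510 + 8, 39.51):.0f} total tok/s")  # 114
    for rate in (114, 179):
        # 29.2 min at 114 tok/s, 18.6 min at 179 tok/s
        print(f"200k prompt at {rate} tok/s: {prefill_minutes(200_000, rate):.1f} min")
```

Even at the improved 179 tok/s, a fresh 200k-token prompt is still an 18-to-19 minute wait, so the conclusion does not change.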

8. Terminal-Bench setup

For Terminal-Bench we used a Python 3.12 venv because newer Python versions caused CLI dependency pain. Docker Desktop also needs to be reachable from non-login SSH shells, which means setting PATH explicitly.

cd ~/.hermes/bench/terminal-bench
python3.12 -m venv .venv312
source .venv312/bin/activate
python -m pip install 'terminal-bench==0.2.18'

export PATH="/usr/local/bin:/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin:$PATH"
export OPENAI_API_KEY=dummy
ulimit -n 8192

The ulimit line matters. A previous full run died after about 24 results with OSError: [Errno 24] Too many open files because the SSH shell inherited ulimit -n = 256. The hard limit on that host was unlimited, so raising the soft limit fixed that harness failure.
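
If the harness is launched from a Python wrapper rather than an interactive shell, the same soft-limit bump can be done in-process with the standard resource module. This is a sketch of that alternative, not what we ran; raise_nofile_limit is a hypothetical helper, and the change only applies to the current process and the children it spawns.

```python
import resource


def raise_nofile_limit(target: int = 8192) -> int:
    """Raise the soft RLIMIT_NOFILE toward target without exceeding the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unlimited hard limit places no cap on the soft limit we may request.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    # Never lower an existing limit, and leave an unlimited soft limit alone.
    if soft != resource.RLIM_INFINITY and wanted > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft RLIMIT_NOFILE is now", raise_nofile_limit(8192))
```

Either way, check the value the harness actually sees; a limit raised in one SSH session does not carry over to the next.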

terminal-bench runs create \
  --dataset terminal-bench-core==0.1.1 \
  --agent terminus-2 \
  --model openai/glm52 \
  --agent-kwarg api_base=http://127.0.0.1:8026/v1 \
  --agent-kwarg temperature=0 \
  --n-concurrent 1 \
  --n-attempts 1 \
  --run-id glm52-full-core-1x-c1-threadproxy-$(date +%Y%m%d-%H%M%S) \
  --output-path runs \
  --no-upload-results \
  --log-level info

Benchmark comparability note: the command above is the stock timeout / one-concurrent-trial run we wanted for the peer table. Earlier diagnostic runs used --global-timeout-multiplier 3; those are useful for capability debugging but are not comparable to stock-timeout results.

Final stock-timeout result

Run | Dataset / agent | Timeout regime | Resolved | Score | Wall time
glm52-full-core-1x-c1-threadproxy-20260617-042345 | terminal-bench-core==0.1.1 / terminus-2 | Stock 1×, n-concurrent=1 | 23 / 80 | 28.75% | ~12h28m

The result is good enough to prove the model can operate through the harness, but not good enough to crown it as our default local agent route. The score is below the faster local routes in the comparison post, and the failure shape matters: many misses were timeouts, bad-gateway/proxy interactions, or long-context prefill pain rather than clean model refusals.

I also updated the main comparison page with this peer row: Local LLM Testing, June 2026.

9. Hermes autonomy gates

James asked for the gates that matter before trusting this as an autonomous slow orchestrator, not just a text-generation demo. These were run from Forge against the live route http://192.168.1.10:8026/v1 after the proxy was changed to the curl-backed LAN-visible shim.

Gate | Result | Measured behavior
Native tool-call smoke | PASS | finish_reason=tool_calls, emitted get_weather with JSON args {"city":"Paris"} in 8.11 s.
Full tool-result roundtrip | PASS | Accepted a synthetic tool result (22°C, sunny) and produced a normal final answer in 3.24 s.
2+ concurrent request queue | PASS | Three concurrent short requests all completed: per-request walls 0.98 s, 1.55 s, 1.55 s; total wall 1.55 s. This is a smoke test, not a throughput benchmark.
Proxy strip/normalization consistency | PASS | Three repeated completions returned clean assistant messages with keys content, role, no <think>, no reasoning fields, and no leaked tool metadata. Walls: 1.04 s, 0.72 s, 0.72 s.
Hermes child-agent path | PASS | hermes chat --provider m3u-glm52-8026 --model glm-5.2 -t safe completed a direct child smoke and then 3/3 small tasks.
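
For anyone repeating the native tool-call gate, the check boils down to an OpenAI-style request that offers one function tool, plus a pass/fail test on the response shape. The sketch below is an approximation rather than the exact probe we sent: the prompt wording, the schema details, and both helper names are assumptions. Only the get_weather name, the Paris argument, and the finish_reason=tool_calls criterion come from the table.

```python
import json


def build_tool_call_probe(model: str = "glm-5.2") -> dict:
    """OpenAI-style chat request offering a single function tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "max_tokens": 256,
        "temperature": 0,
    }


def passes_tool_call_gate(response: dict, expected_name: str = "get_weather") -> bool:
    """True if the first choice stopped for tool calls and its arguments parse as a JSON object."""
    try:
        choice = response["choices"][0]
        if choice.get("finish_reason") != "tool_calls":
            return False
        call = choice["message"]["tool_calls"][0]["function"]
        args = json.loads(call["arguments"])
    except (KeyError, IndexError, TypeError, ValueError, AttributeError):
        return False
    return call["name"] == expected_name and isinstance(args, dict)
```

The request body can be written to a file and posted with the same curl pattern as the smoke test in section 6; passes_tool_call_gate then runs on the parsed JSON response.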

Hermes child-agent task details

Task | Exit | Wall time | Observed output
child_math | 0 | 10.16 s | RESULT=391
child_summary | 0 | 11.43 s | Local orchestrator model passed tool-call gates.
child_json | 0 | 11.49 s | {"status":"ok","agent":"glm52"}

Interpretation: this is enough to call GLM-5.2 a credible slow orchestrator candidate for Echo-style lab work. It is not a default-route recommendation. The safe-toolset Hermes smoke succeeded, but the full default Hermes prompt previously timed out at 240 seconds before producing a response, so production use needs explicit timeout budgeting and likely a trimmed tool surface.

What worked

What is not working well

This is not yet a clean local agent route. It is a successful bring-up, not a production recommendation.

How I would reproduce it from scratch

  1. Download the MXFP4 weights with huggingface-cli.
  2. Create a separate MLX-LM venv; do not patch your production install.
  3. Apply the indexer_types and make_cache() patches if upstream still needs them.
  4. Validate with mlx_lm generate before starting the server.
  5. Start mlx_lm.server on localhost :8025.
  6. Start the curl-backed threaded strip-model proxy on 0.0.0.0:8026 if Forge/Hermes needs LAN access, or localhost only if testing entirely on the M3 Ultra.
  7. Run direct chat probes, a long-prefill probe, native tool-call smoke, tool-result roundtrip, and a concurrency smoke before any agent benchmark.
  8. For Terminal-Bench, use Python 3.12, set Docker's PATH explicitly, and raise ulimit -n.
  9. Keep timeout regimes separate: stock 1× for peer comparison, extended timeout only as a diagnostic.

Current verdict

GLM-5.2 MXFP4 on a 512 GB M3 Ultra is technically real. It loads, generates, handles native tool calls, survives a tool-result roundtrip, queues a few simultaneous requests, and can run small Hermes child-agent tasks through the non-default m3u-glm52-8026 route. The honest status is now: technically successful, but VERY slow for us; credible only as a lab curiosity / explicitly slow orchestrator, not a default replacement for faster local agent routes.

The work left is straightforward, but the performance verdict is harsher now: upstream the MLX fixes or remove the local patch, make the OpenAI compatibility behavior native instead of proxy-based, and tune client/server timeouts without invalidating benchmark comparisons. Even if those are fixed, the observed prefill rate means this route is not practical for fresh long-context agent sessions unless prompt caching changes the economics. The full stock Terminal-Bench core run is now complete; the next useful experiment would be a clearly labeled extended-timeout diagnostic or a focused prefix-cache test, not silently replacing the 1× result.