Local LLM Testing — Terminal-Bench 2, Jun 2026

Only the Terminal-Bench terminus-2 results. The old multi-benchmark scorecard is gone.

June 22, 2026 · Echo · terminal-bench terminus-2 native tool-calling

Sections: Rankings, Interpretation, Failure Shapes, GLM-5.2 Deep Dive, Invalidated Runs, Protocol

This page used to mix multiple benchmark families and multiple Terminal-Bench harness cuts. That made the table noisy and easy to misread. This rewrite keeps one measurement regime only: Terminal-Bench core, terminus-2, native tool-calling.

What is comparable here: the rows below all use Terminal-Bench terminus-2 on the 80-task core set with stock task timeouts. The Qwen-family rows ran through rapid-mlx on the M3 Ultra. GLM now has two entries: the new llama.cpp GGUF UD-Q4_K_XL run via a curl-backed LAN proxy, and the older patched mlx_lm.server MXFP4 run. The 397B result is the clean low-concurrency rerun; the earlier 4-concurrent 397B score was overload-tainted and is not counted.

1. Terminal-Bench 2 Rankings

New winner: 46.25%. GLM-5.2 GGUF UD-Q4_K_XL, 37/80, llama.cpp, clean n=1 16K run

Best Qwen row: 42.5%. Qwen3.5-397B-A17B-4bit, 34/80, clean n=1 rerun

Small-model surprise: 41.25%. Qwen3.6-35B-A3B-4bit, 33/80

Rank	Model	Host / engine	Concurrency	Resolved	Score	Speed	Read
1	GLM-5.2 GGUF UD-Q4_K_XL	M3 Ultra / llama.cpp + curl-backed LAN proxy	1	37 / 80	46.25%	9h46m wall	New top score. Required 16K context; 4K/8K attempts were context-limit poisoned. A subsequent 32K Q4XL+Q8KV run underperformed (26/80, 32.5%): 16 tasks that passed at 16K timed out at 32K, attributed to KV-cache memory pressure. 16K remains the sweet spot.
2	Qwen3.5-397B-A17B-4bit	M3 Ultra / rapid-mlx	1	34 / 80	42.5%	15h32m wall	Best Qwen score after removing serving-stack overload.
3	Qwen3.6-35B-A3B-4bit	M3 Ultra / rapid-mlx	4	33 / 80	41.25%	8h22m wall	The standout efficiency result: near-397B score at much smaller active size.
4	Qwen3.5-122B-A10B-8bit	M3 Ultra / rapid-mlx	4	32 / 80	40.0%	8h40m wall	Strong, but not clearly ahead of the smaller 35B run.
5	Qwen3.6-27B-4bit	M3 Ultra / rapid-mlx	4	26 / 80	32.5%	9h03m wall	Lower absolute score, but very good per-parameter signal.
6	GLM-5.2-mxfp4	M3 Ultra / patched mlx_lm.server + threaded strip-model proxy	1	23 / 80	28.75%	12h28m wall	Technically works, but the serving path is fragile: 34 unresolved agent timeouts and 11 unknown-agent errors.
7	Qwen3-Coder-Next-MLX-4bit	M3 Ultra / rapid-mlx	4	20 / 80	25.0%	7h44m wall	Fast and tool-call clean in probes, but not the best terminal-bench finisher.
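The Score column is simply resolved tasks divided by the 80-task core set, so one task is worth 1.25 points. A minimal sketch that recomputes the column from the table above; `ROWS` and `score_pct` are illustrative names, not part of the harness:

```python
# Resolved counts copied from the ranking table; 80 is the core-set size.
TOTAL_TASKS = 80

ROWS = [
    ("GLM-5.2 GGUF UD-Q4_K_XL", 37),
    ("Qwen3.5-397B-A17B-4bit", 34),
    ("Qwen3.6-35B-A3B-4bit", 33),
    ("Qwen3.5-122B-A10B-8bit", 32),
    ("Qwen3.6-27B-4bit", 26),
    ("GLM-5.2-mxfp4", 23),
    ("Qwen3-Coder-Next-MLX-4bit", 20),
]


def score_pct(resolved: int, total: int = TOTAL_TASKS) -> float:
    """Return the pass rate as a percentage of the task set."""
    return 100.0 * resolved / total


for model, resolved in ROWS:
    print(f"{model:28s} {resolved:2d}/{TOTAL_TASKS}  {score_pct(resolved):.2f}%")
```

Because each task moves the score by 1.25 points, the gap between ranks 2, 3, and 4 is one task each.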

2. Interpretation

GLM-5.2 GGUF is the new top Terminal-Bench row. The Unsloth UD-Q4_K_XL quant under llama.cpp scored 37/80, or 46.25%, after raising the context window to 16K. The earlier 4K and 8K attempts were not capability measurements; they were dominated by exceed_context_size_error wrapped as unknown_agent_error.
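To make the context-limit failure concrete: a request is rejected when the prompt plus the requested completion budget does not fit in the server's context window. A minimal sketch under stated assumptions: `fits_context` is a hypothetical helper, the 512-token completion reserve is made up, and the window sizes assume "4K/8K/16K" mean 4096/8192/16384. The 9,143-token prompt is the 8K smoke failure listed under Invalidated Runs.

```python
def fits_context(prompt_tokens: int, context_window: int, reserve: int = 512) -> bool:
    """True if the prompt plus a completion reserve fits in the window.

    `reserve` is an assumed completion budget, not a harness setting.
    """
    return prompt_tokens + reserve <= context_window


# The prompt size that broke the 8K pilot, checked against each window.
for window in (4096, 8192, 16384):
    print(window, fits_context(9143, window))
```

A prompt that cannot fit says nothing about whether the model could solve the task, which is why those pilots are treated as invalid rather than as low scores.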

The big Qwen model is now second, but it remains the best Qwen result. Qwen3.5-397B takes 42.5%, but only after dropping to --n-concurrent 1. At 4-concurrent the serving stack logged 87 InternalServerErrors and the raw score fell to 18/80, which poisoned that run.

The 35B result remains the practical surprise. Qwen3.6-35B lands at 41.25%, essentially tied with the 122B and close to the 397B clean rerun.

122B is good but not dominant. 40.0% is a real score, but the 35B run erases the easy story that more parameters automatically won this harness.

27B remains the per-footprint standout. It is about 14 points below the new GLM GGUF top row, but close enough to matter given the huge difference in model size and serving cost.
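One rough way to put a number on "per-footprint" for the Qwen rows is score points per gigabyte of weights. This is a sketch, not the author's metric: it assumes weight size is approximately total parameters times bits per weight divided by 8, and it ignores KV cache, quantization scales, and runtime overhead. Parameter counts and bit widths are read off the model names; GLM is left out because this page does not give its parameter count.

```python
# (model, total params in billions, weight bits, score %) from the ranking table.
QWEN_ROWS = [
    ("Qwen3.5-397B-A17B-4bit", 397, 4, 42.5),
    ("Qwen3.6-35B-A3B-4bit", 35, 4, 41.25),
    ("Qwen3.5-122B-A10B-8bit", 122, 8, 40.0),
    ("Qwen3.6-27B-4bit", 27, 4, 32.5),
]


def weight_gb(params_b: float, bits: int) -> float:
    """Approximate weight size in GB (assumption: params * bits / 8)."""
    return params_b * bits / 8


def points_per_gb(params_b: float, bits: int, score: float) -> float:
    """Score points per approximate GB of weights."""
    return score / weight_gb(params_b, bits)


ranked = sorted(QWEN_ROWS, key=lambda r: points_per_gb(r[1], r[2], r[3]), reverse=True)
for name, params_b, bits, score in ranked:
    print(f"{name:26s} ~{weight_gb(params_b, bits):6.1f} GB  "
          f"{points_per_gb(params_b, bits, score):.2f} pts/GB")
```

Under this crude measure the 27B and 35B rows land close together at the top, and both are far ahead of the 122B and 397B rows.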

The GLM serving story split in two. The patched MLX MXFP4 path completes the harness but trails at 23/80. The llama.cpp GGUF path is slower to prefill, needs 16K context, and still logs some context overflows, but it produced the best score on this page.

GLM on MLX soloheaven is a hard fail. Across 4 runs and 170+ tasks there were 0 passes. unknown_agent_error appears in every run: the model either crashes silently, produces output the task parser can't classify, or soloheaven fails before reaching the agent loop. MLX itself is not dead (K2.6 works); the failure is specific to GLM-5.2 on soloheaven. This tells us nothing about the model's capability; it points to a backend-model mismatch.

3. Failure Shapes

Terminal-Bench failures are not all the same. agent_timeout usually means the model kept working but did not finish within the harness budget. unset means it completed the episode but failed the task oracle. unknown_agent_error points at protocol or harness-level failure.

Model	Resolved	Agent timeout	Unset	Unknown agent error	Other
GLM-5.2 GGUF UD-Q4_K_XL	37	19	9	13	1 parse_error, 1 test_timeout
Qwen3.5-397B-A17B-4bit	34	21	18	5	1 test_timeout, 1 parse_error
Qwen3.6-35B-A3B-4bit	33	33	8	5	1 parse_error
Qwen3.5-122B-A10B-8bit	32	37	6	3	2 parse_error
Qwen3.6-27B-4bit	26	45	5	3	1 parse_error
GLM-5.2-mxfp4	23	34	11	11	1 parse_error
Qwen3-Coder-Next-MLX-4bit	20	25	7	27	1 parse_error

The 35B, 122B, 27B, and Coder-Next rows were 4-concurrent runs, so some server pressure is visible in the logs. The 397B and both GLM rows are low-concurrency n=1 runs. The GLM GGUF run still shows 21 backend context-limit 400s in debug logs even at 16K, but the final score is no longer dominated by the context blowups that poisoned the 4K and 8K pilots.
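A quick consistency check on the failure table: every row should account for all 80 tasks, and the largest non-resolved bucket shows which failure shape dominates each run. A minimal sketch with the counts copied from the table above; the "Other" column is collapsed to a single count, and `dominant_failure` is an illustrative helper name.

```python
# Columns: resolved, agent_timeout, unset, unknown_agent_error, other.
LABELS = ("resolved", "agent_timeout", "unset", "unknown_agent_error", "other")

FAILURE_ROWS = {
    "GLM-5.2 GGUF UD-Q4_K_XL": (37, 19, 9, 13, 2),
    "Qwen3.5-397B-A17B-4bit": (34, 21, 18, 5, 2),
    "Qwen3.6-35B-A3B-4bit": (33, 33, 8, 5, 1),
    "Qwen3.5-122B-A10B-8bit": (32, 37, 6, 3, 2),
    "Qwen3.6-27B-4bit": (26, 45, 5, 3, 1),
    "GLM-5.2-mxfp4": (23, 34, 11, 11, 1),
    "Qwen3-Coder-Next-MLX-4bit": (20, 25, 7, 27, 1),
}


def dominant_failure(row: tuple) -> str:
    """Name of the largest non-resolved bucket in a row."""
    failures = dict(zip(LABELS[1:], row[1:]))
    return max(failures, key=failures.get)


for model, row in FAILURE_ROWS.items():
    assert sum(row) == 80, model  # every task must land in exactly one bucket
    print(f"{model:28s} total={sum(row)} dominant={dominant_failure(row)}")
```

Coder-Next is the only row where unknown_agent_error outweighs agent_timeout, which fits the protocol-level reading of that error.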

4a. GLM-5.2 Deep Dive — Rebench and Soloheaven

MLX soloheaven is dead for GLM-5.2. Four runs, 170+ total tasks, 0 passes. Every run fails with unknown_agent_error: the model runs, the task harness starts, the tests run, but the model's output format is something the harness can't classify. This isn't a parse error (which means the model produced code but it failed tests); it's an error before the harness even gets a usable result. MLX works fine for K2.6 on soloheaven, and GLM-5.2 FP8 has been proven on soloheaven for basic chat and tool calls. What fails is specifically GLM-5.2 on soloheaven under the Terminal-Bench harness.

The sweet spot is 16K, not 32K. A follow-up 32K Q4XL+Q8KV run scored 26/80, which is 32.5% of the 80-task core set and roughly 14 percentage points behind the 16K run. (The 34% quoted elsewhere on this page matches 26/76, the 76-task basis used under What's Next.) 16 tasks that passed at 16K timed out at 32K, and only 5 tasks were won back (build-tcc-qemu, conda-env-conflict-resolution, crack-7z-hash.hard, sanitize-git-repo ×2). The likely cause is that the extra context plus Q8KV memory pressure slows decode enough to push tasks over the 20-minute wall-clock limit.
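The pass counts in this paragraph reconcile as simple bookkeeping: start from the 16K run's 37 passes, subtract the tasks lost to timeouts, and add the tasks won back. A minimal sketch using only counts quoted on this page. Reading the 34% figure as a 76-task denominator is an inference from the arithmetic, not something the run logs confirm.

```python
passes_16k = 37        # 16K run, from the ranking table
lost_to_timeout = 16   # passed at 16K, timed out at 32K
won_back = 5           # failed at 16K, passed at 32K

passes_32k = passes_16k - lost_to_timeout + won_back

pct_of_80 = 100 * passes_32k / 80   # against the 80-task core set
pct_of_76 = 100 * passes_32k / 76   # against the 76-task set mentioned later

print(f"32K passes: {passes_32k}")
print(f"of 80: {pct_of_80:.1f}%   of 76: {pct_of_76:.1f}%")
print(f"gap to 16K on the 80-task basis: {100 * passes_16k / 80 - pct_of_80:.2f} points")
```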

Rebench suggests the model's results are task-dependent, not time-dependent. An 11-task partial rebench (stopped after task 11) passed 3/11 (27%). All 3 passes (decommissioning, fix-permissions, openssl-selfsigned-cert) were also passes in the full 16K run. With so few tasks this is suggestive rather than conclusive, but it is consistent with a simple picture: the model either can solve a task within 20 minutes or it can't, regardless of when the task runs in the queue.
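An 11-task sample is small, so it helps to see how wide the uncertainty on 3/11 is. The sketch below uses the Wilson score interval, a standard textbook method for a binomial proportion. It is not something the benchmark harness reports, and it treats tasks as independent trials, which is an assumption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(3, 11)
print(f"3/11 = {3 / 11:.1%}, interval {low:.1%} to {high:.1%}")
```

The interval for 3/11 spans roughly 10% to 57%, so the partial rebench is compatible with the full 16K score but cannot pin it down on its own.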

These additions are noted in the ranking table above. The 46.25% score is the primary benchmark; the 32K follow-up (26/80) and the 27% partial rebench are supporting data that show where the model's performance envelope lies.

4. Invalidated Runs

Run	Raw score	Why it is not published as capability
Qwen3.5-397B-A17B-4bit, n=4	18 / 80 = 22.5%	Overload-tainted. The run logged 87 InternalServerErrors. The clean n=1 rerun scored 34 / 80.
Qwen3.6-27B-4bit, earlier non-t2 run	1 / 80 = 1.25%	Invalid harness/environment state: thinking poisoning, memory pressure, and the wrong structured-output regime.
Terminus-1 rows from the old page	various	Removed from this page. Terminus-1 forces schema output; terminus-2 uses native tools. They are different cuts.
GLM-5.2 GGUF 4K / 8K pilots	partial / misleading	Context-window poisoned. The 4K run failed many tasks around 4.1K–4.8K prompts; the 8K count-dataset-tokens smoke failed at 9,143 prompt tokens. The published GGUF row is the 16K rerun.
GLM-5.2 early 1x attempts before threaded proxy / fd-limit fixes	partial / contaminated	Infrastructure-tainted. One run died from `ulimit -n = 256` / too many open files; another wedged behind a single-thread proxy while the M3 Ultra GPU sat idle. The published GLM row is the clean relaunch with `ulimit -n 8192` and threaded proxy.
GLM-5.2 3x timeout diagnostic	small pilot passed	Useful capability probe, but not a peer row. Extended timeout changes the benchmark regime; this page ranks stock task timeouts only.
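The `ulimit -n = 256` failure in the table above is cheap to catch before a long run. A minimal preflight sketch using Python's standard `resource` module (Unix only); the `fd_limit_ok` helper name is illustrative, and the 8192 threshold mirrors the value used for the clean relaunch.

```python
import resource


def fd_limit_ok(minimum: int = 8192) -> bool:
    """True if the soft open-file limit is at least `minimum`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True  # unlimited
    return soft >= minimum


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard} ok={fd_limit_ok()}")
```

The check reads the limit of the Python process itself, so it only reflects the serving process if both are launched from the same shell with the same limits.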

5. Protocol

Dataset: Terminal-Bench core, 80 tasks.

Agent: terminus-2, the native tool-calling Terminal-Bench agent.

Output paths: Qwen runs are under the earlier run directories used for this page. The GLM MXFP4 run is ~/.hermes/bench/terminal-bench/runs/glm52-full-core-1x-c1-threadproxy-20260617-042345/results.json. The GLM GGUF run is ~/.hermes/bench/terminal-bench/runs/glm52-gguf-q4xl-t2-16k-80t1x-20260619-200746/glm52-gguf-q4xl-t2-16k-80t1x-20260619-200746/results.json.

Speed column: wall-clock elapsed time from run start to final results.json write. Compare it alongside the concurrency column. I am intentionally not listing tok/s here: Terminal-Bench artifacts give harness-level token accounting, but not a clean model decode-throughput measurement. For the GGUF run, separate direct endpoint probes at 16K context showed roughly 70 tok/s prefill on a 5,140-token prompt and roughly 25 tok/s decode on a two-token completion; that is not the same thing as Terminal-Bench throughput.
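The probe numbers above still allow a back-of-envelope estimate of what one agent turn costs on the GGUF path. A minimal sketch, assuming the prefill and decode rates stay constant at the probed values, which real runs will not guarantee. The 200-token completion and the `cached_prompt_tokens` parameter are hypothetical illustrations, not measured harness behavior.

```python
PREFILL_TOK_S = 70.0  # probed prefill rate at 16K context
DECODE_TOK_S = 25.0   # probed decode rate


def turn_seconds(prompt_tokens: int, completion_tokens: int,
                 cached_prompt_tokens: int = 0) -> float:
    """Estimated seconds for one turn: uncached prefill plus decode."""
    uncached = max(prompt_tokens - cached_prompt_tokens, 0)
    return uncached / PREFILL_TOK_S + completion_tokens / DECODE_TOK_S


cold = turn_seconds(5140, 200)
warm = turn_seconds(5140, 200, cached_prompt_tokens=5140)
print(f"cold turn: {cold:.1f}s   fully cached prompt: {warm:.1f}s")
```

A cold 5,140-token prompt costs over a minute of prefill before the first output token, which is one reason the wall-clock column is more honest than a tok/s figure.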

Why only this cut: mixing unrelated benchmark families, terminus-1, and terminus-2 in one leaderboard made one page look precise while combining unlike measurements. This version is narrower and cleaner.

Local LLM Testing — Terminal-Bench 2 only. Updated June 22, 2026. Added rebench data (32K 34%, partial 27%), soloheaven dead results, and "What's Next" section on stack gaps and DGX Spark plans.

6. What's Next

Stack gaps

The GGUF path on the M3 Ultra works (46.25% on the 80-task core set; 49% on a 76-task set), but the stack for local GLM-5.2 inference is immature. GLM-5.2's architecture (GlmMoeDsaForCausalLM, 64 layers, 288 experts, 64 active per token) is designed for multi-GPU tensor-parallel execution. The critical gaps:

No SGLang on macOS. GLM-5.2's architecture is designed for SGLang natively (the PhalaCloud/GLM-5.2-W4AFP8 variant on HuggingFace is tagged library=sglang). There is no SGLang for Apple Silicon. This means no MTP (multi-token prediction), no speculative decoding, and no batch serving.

No draft models for GLM's 154K tokenizer. Speculative decoding requires a draft model with the same tokenizer. GLM-5.2 uses Zhipu's 154K vocabulary — no small model shares it (the Kimi K2.6 spec-decode lesson). This blocks the entire speculative decoding approach we've refined for other models.

Limited quantization. Unsloth's UD-Q4_K_XL (what we're running) is the best GGUF quant available. Q5, Q8, and NVFP4 variants exist on HF, but without the right inference engine, the quantization choice is academic.

The tokenization bottleneck. GLM-5.2's 154K vocabulary should encode the same raw text in fewer tokens than a smaller vocabulary would, which should help latency. However, the 288-expert routing adds variable compute per forward pass, which makes throughput harder to predict.
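The draft-model gap above comes down to a vocabulary check: in the classic form of speculative decoding, the target model verifies the draft's token ids directly, so both models must map the same tokens to the same ids. A minimal sketch with toy vocabularies; `vocab_mismatches` is a hypothetical helper, and the dictionaries stand in for real tokenizer vocabularies. This is a strict check, and some engines may relax it.

```python
def vocab_mismatches(target: dict, draft: dict) -> list:
    """Tokens missing from one vocabulary or mapped to different ids."""
    tokens = target.keys() | draft.keys()
    return sorted(t for t in tokens if target.get(t) != draft.get(t))


# Toy vocabularies for illustration only.
target_vocab = {"<s>": 0, "hello": 1, "world": 2}
good_draft = {"<s>": 0, "hello": 1, "world": 2}
bad_draft = {"<s>": 0, "hello": 2, "planet": 3}

print(vocab_mismatches(target_vocab, good_draft))
print(vocab_mismatches(target_vocab, bad_draft))
```

An empty list means the pair is a candidate for draft-and-verify; any mismatch means draft token ids cannot be checked against the target without translation.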

Looking ahead: 2x DGX Spark

The dual DGX Spark cluster (TP=2, ~200 GB/s aggregate memory bandwidth, 256 GB total, 2x GB10 Grace Blackwell with NVLink) could run GLM-5.2 natively. The real target is the PhalaCloud/GLM-5.2-W4AFP8 variant (listed at 392GB, library=sglang, W4AFP8 quant: 3.6GB int4 weights + 16.8GB FP8 experts + 7.8GB BF16 tensors). The listed 392GB is larger than the 256 GB total quoted here, so the fit has to be checked before this plan is firm. At SGLang throughput (est. 30-40 tok/s per Spark at TP=2), fewer tasks should hit the timeout, so the 46% pass rate might move into the 55%+ range; that figure is a guess, not a measurement. Meanwhile, GLM-5.2 stays on the bench as a slow reference model: #1 on current hardware, even with a bad stack.
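Before committing to that plan, it is worth running the memory arithmetic on the figures exactly as quoted above. A minimal sketch: all numbers are copied from this section, and the variable names are illustrative. It does not account for KV cache or runtime overhead, which would only make the fit tighter.

```python
CLUSTER_MEMORY_GB = 256   # total across both units, as quoted
LISTED_MODEL_GB = 392     # listed size of the W4AFP8 variant, as quoted

# Component breakdown as quoted for the W4AFP8 quant.
COMPONENTS_GB = {
    "int4 weights": 3.6,
    "FP8 experts": 16.8,
    "BF16 tensors": 7.8,
}

component_sum = sum(COMPONENTS_GB.values())
fits = LISTED_MODEL_GB <= CLUSTER_MEMORY_GB

print(f"component sum: {component_sum:.1f} GB vs listed {LISTED_MODEL_GB} GB")
print(f"listed size fits in {CLUSTER_MEMORY_GB} GB: {fits}")
```

The quoted components sum to far less than the listed size, and the listed size exceeds the cluster total, so at least one of these figures needs to be re-verified against the model card.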