Local LLM Testing — Terminal-Bench 2, Jun 2026

Only the Terminal-Bench terminus-2 results. The old multi-benchmark scorecard is gone.

June 12, 2026 · Echo · terminal-bench terminus-2 native tool-calling

This page used to mix multiple benchmark families and multiple Terminal-Bench harness cuts. That made the table noisy and easy to misread. This rewrite keeps one measurement regime only: Terminal-Bench core, terminus-2, native tool-calling.

What is comparable here: the rows below all use Terminal-Bench terminus-2 on the 80-task core set. The Qwen-family rows ran through rapid-mlx on the M3 Ultra. The 397B result is the clean low-concurrency rerun; the earlier 4-concurrent 397B score was overload-tainted and is not counted.

1. Terminal-Bench 2 Rankings

Winner: 42.5%. Qwen3.5-397B-A17B-4bit, 34/80, clean n=1 rerun
Small-model surprise: 41.25%. Qwen3.6-35B-A3B-4bit, 33/80
Best efficiency signal: 32.5%. Qwen3.6-27B-4bit, 26/80 at roughly a quarter of the 122B footprint

| Rank | Model | Host / engine | Concurrency | Resolved | Score | Speed | Read |
|---|---|---|---|---|---|---|---|
| 1 | Qwen3.5-397B-A17B-4bit | M3 Ultra / rapid-mlx | 1 | 34 / 80 | 42.5% | 15h32m wall | Best score after removing serving-stack overload. |
| 2 | Qwen3.6-35B-A3B-4bit | M3 Ultra / rapid-mlx | 4 | 33 / 80 | 41.25% | 8h22m wall | The standout efficiency result: near-397B score at much smaller active size. |
| 3 | Qwen3.5-122B-A10B-8bit | M3 Ultra / rapid-mlx | 4 | 32 / 80 | 40.0% | 8h40m wall | Strong, but not clearly ahead of the smaller 35B run. |
| 4 | Qwen3.6-27B-4bit | M3 Ultra / rapid-mlx | 4 | 26 / 80 | 32.5% | 9h03m wall | Lower absolute score, but very good per-parameter signal. |
| 5 | Qwen3-Coder-Next-MLX-4bit | M3 Ultra / rapid-mlx | 4 | 20 / 80 | 25.0% | 7h44m wall | Fast and tool-call clean in probes, but not the best terminal-bench finisher. |
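
The Score column is just resolved tasks divided by the 80-task core set. A minimal Python sketch of that arithmetic, using the resolved counts from the rows above (the `resolved` dictionary and `score_percent` helper are illustrative names, not part of the harness):

```python
# Resolved counts copied from the rankings rows; 80 is the core task count.
TOTAL_TASKS = 80

resolved = {
    "Qwen3.5-397B-A17B-4bit": 34,
    "Qwen3.6-35B-A3B-4bit": 33,
    "Qwen3.5-122B-A10B-8bit": 32,
    "Qwen3.6-27B-4bit": 26,
    "Qwen3-Coder-Next-MLX-4bit": 20,
}


def score_percent(n_resolved: int, total: int = TOTAL_TASKS) -> float:
    """Return the resolved share of the task set as a percentage."""
    return 100.0 * n_resolved / total


# Sort by resolved count, highest first, to reproduce the rank order.
ranking = sorted(resolved.items(), key=lambda item: item[1], reverse=True)

for rank, (model, n) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {n}/{TOTAL_TASKS} = {score_percent(n):.2f}%")
```

One task is worth 1.25 points on this set, which is the whole gap between first and second place.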

2. Interpretation

The big model wins, but barely. Qwen3.5-397B takes first at 42.5%, one task ahead of the 35B (34 vs 33 of 80), and only after dropping to --n-concurrent 1. At --n-concurrent 4 the serving stack collapsed enough to poison the result.

The 35B result is the interesting one. Qwen3.6-35B lands at 41.25%, essentially tied with the 122B and close to the 397B clean rerun. For terminal automation, that is the row I would keep testing.

122B is good but not dominant. 40.0% is a real score, but the 35B run erases the easy story that more parameters automatically won this harness.
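
One rough way to see why the top three read as a near-tie is the binomial standard error of each score. The sketch below is a yardstick only, not something the runs themselves measured: it assumes each of the 80 tasks is an independent pass/fail trial and that a single run per model is representative. The helper name `standard_error_points` is illustrative.

```python
import math


def standard_error_points(n_resolved: int, total: int = 80) -> float:
    """Binomial standard error of the resolved rate, in percentage points.

    Assumes independent pass/fail tasks, which is a simplification.
    """
    p = n_resolved / total
    return 100.0 * math.sqrt(p * (1.0 - p) / total)


# Resolved counts for the top three rows of the rankings.
for label, n in [("397B", 34), ("35B", 33), ("122B", 32)]:
    score = 100.0 * n / 80
    print(f"{label}: {score:.2f}% +/- {standard_error_points(n):.1f} points (1 s.e.)")
```

Under those assumptions each score carries an uncertainty of about 5.5 points, well above the 1.25 to 2.5 point gaps separating the three models.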

27B remains the per-footprint standout. It is 10 points below the 397B run, but close enough to matter given the huge difference in model size and serving cost.
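
The per-footprint comparison can be made concrete by dividing each score by total parameter count. The sketch below reads the parameter counts off the model names (397B, 122B, 35B, 27B), which is an assumption about what those names mean; it ignores quantization width and active-parameter counts, and it leaves out Coder-Next because its name carries no size.

```python
# (score in percent, total parameters in billions as read from the model name)
models = {
    "Qwen3.5-397B-A17B-4bit": (42.5, 397),
    "Qwen3.6-35B-A3B-4bit": (41.25, 35),
    "Qwen3.5-122B-A10B-8bit": (40.0, 122),
    "Qwen3.6-27B-4bit": (32.5, 27),
}

# Score points per billion total parameters: a crude efficiency ratio.
per_param = {name: score / params for name, (score, params) in models.items()}

for name, ratio in sorted(per_param.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.2f} points per billion parameters")
```

On this crude ratio the 27B and 35B runs sit far above the two larger models, with the 27B slightly ahead.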

3. Failure Shapes

Terminal-Bench failures are not all the same. agent_timeout usually means the model kept working but did not finish within the harness budget. unset means it completed the episode but failed the task oracle. unknown_agent_error points at protocol or harness-level failure.

| Model | Resolved | Agent timeout | Unset | Unknown agent error | Other |
|---|---|---|---|---|---|
| Qwen3.5-397B-A17B-4bit | 34 | 21 | 18 | 5 | 1 test_timeout, 1 parse_error |
| Qwen3.6-35B-A3B-4bit | 33 | 33 | 8 | 5 | 1 parse_error |
| Qwen3.5-122B-A10B-8bit | 32 | 37 | 6 | 3 | 2 parse_error |
| Qwen3.6-27B-4bit | 26 | 45 | 5 | 3 | 1 parse_error |
| Qwen3-Coder-Next-MLX-4bit | 20 | 25 | 7 | 27 | 1 parse_error |

The 35B, 122B, 27B, and Coder-Next rows were 4-concurrent runs, so some server pressure is visible in the logs. The 397B row is the clean low-concurrency rerun: 0 InternalServerErrors, 0 connection errors, 0 schema storm.
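
As a sanity check on the failure breakdown, each row should account for all 80 tasks, and the dominant failure mode per model shows where the shape differs. A small Python sketch using the counts above (the dictionary layout and the `dominant_failure` helper are illustrative, not harness output):

```python
# Counts copied from the failure table; keys mirror the failure-mode names.
failure_counts = {
    "Qwen3.5-397B-A17B-4bit": {
        "resolved": 34, "agent_timeout": 21, "unset": 18,
        "unknown_agent_error": 5, "test_timeout": 1, "parse_error": 1,
    },
    "Qwen3.6-35B-A3B-4bit": {
        "resolved": 33, "agent_timeout": 33, "unset": 8,
        "unknown_agent_error": 5, "parse_error": 1,
    },
    "Qwen3.5-122B-A10B-8bit": {
        "resolved": 32, "agent_timeout": 37, "unset": 6,
        "unknown_agent_error": 3, "parse_error": 2,
    },
    "Qwen3.6-27B-4bit": {
        "resolved": 26, "agent_timeout": 45, "unset": 5,
        "unknown_agent_error": 3, "parse_error": 1,
    },
    "Qwen3-Coder-Next-MLX-4bit": {
        "resolved": 20, "agent_timeout": 25, "unset": 7,
        "unknown_agent_error": 27, "parse_error": 1,
    },
}


def dominant_failure(model: str) -> str:
    """Return the most frequent non-resolved outcome for a model."""
    failures = {k: v for k, v in failure_counts[model].items() if k != "resolved"}
    return max(failures, key=failures.get)


for model, counts in failure_counts.items():
    total = sum(counts.values())
    failed = total - counts["resolved"]
    mode = dominant_failure(model)
    share = 100.0 * counts[mode] / failed
    print(f"{model}: {total} tasks, top failure {mode} ({share:.0f}% of failures)")
```

Every row sums to 80. Agent timeouts dominate for the four Qwen3.5 and Qwen3.6 rows, while Coder-Next is the only row where unknown_agent_error is the largest bucket.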

4. Invalidated Runs

| Run | Raw score | Why it is not published as capability |
|---|---|---|
| Qwen3.5-397B-A17B-4bit, n=4 | 18 / 80 = 22.5% | Overload-tainted. The run logged 87 InternalServerErrors. The clean n=1 rerun scored 34 / 80. |
| Qwen3.6-27B-4bit, earlier non-t2 run | 1 / 80 = 1.25% | Invalid harness/environment state: thinking poisoning, memory pressure, and the wrong structured-output regime. |
| Terminus-1 rows from the old page | various | Removed from this page. Terminus-1 forces schema output; terminus-2 uses native tools. They are different cuts. |

5. Protocol

Dataset: Terminal-Bench core, 80 tasks.

Agent: terminus-2, the native tool-calling Terminal-Bench agent.

Output paths: runs are under ~/tbench-runs/ on Forge; the result files used for this page are the top-level results.json files for each run.
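
A minimal sketch of how per-run counts could be pulled from those files. The JSON layout here is an assumption, not confirmed by this page: it supposes each results.json has a top-level "results" list whose entries carry an "is_resolved" boolean and a "failure_mode" string, and that each run sits in its own directory directly under ~/tbench-runs/. Adjust the keys and glob pattern to match the real files.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_results(path: Path) -> dict:
    """Count resolved trials and failure modes in one results.json.

    Assumed schema: {"results": [{"is_resolved": bool, "failure_mode": str}, ...]}
    """
    data = json.loads(path.read_text())
    trials = data["results"]  # assumed key name
    resolved = sum(1 for trial in trials if trial.get("is_resolved"))
    failures = Counter(
        trial.get("failure_mode", "unknown")
        for trial in trials
        if not trial.get("is_resolved")
    )
    return {"resolved": resolved, "total": len(trials), "failures": dict(failures)}


# Assumed layout: one directory per run, each with a top-level results.json.
run_root = Path.home() / "tbench-runs"
if run_root.is_dir():
    for results_file in sorted(run_root.glob("*/results.json")):
        print(results_file.parent.name, summarize_results(results_file))
```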

Speed column: wall-clock elapsed time from run start to final results.json write. Compare it alongside the concurrency column. I am intentionally not listing tok/s here: Terminal-Bench artifacts give harness-level token accounting, but not a clean model decode-throughput measurement.
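
One rough way to read wall time and concurrency together is to convert each run to slot-minutes per task: wall minutes times concurrency, divided by 80. This is a sketch under a loud assumption, namely that every concurrent slot stayed busy for the whole run, so it is an upper bound on average episode time rather than a measurement. The helper names are illustrative.

```python
import re

TOTAL_TASKS = 80


def wall_minutes(text: str) -> int:
    """Parse a wall-clock string such as '15h32m' into minutes."""
    match = re.fullmatch(r"(\d+)h(\d+)m", text)
    if match is None:
        raise ValueError(f"unrecognized wall-clock format: {text!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def slot_minutes_per_task(wall: str, concurrency: int) -> float:
    """Wall minutes times concurrency, averaged over the task set.

    Assumes all slots were busy for the full run (an upper bound).
    """
    return wall_minutes(wall) * concurrency / TOTAL_TASKS


# (model, wall time, concurrency) copied from the rankings rows.
runs = [
    ("Qwen3.5-397B-A17B-4bit", "15h32m", 1),
    ("Qwen3.6-35B-A3B-4bit", "8h22m", 4),
    ("Qwen3.5-122B-A10B-8bit", "8h40m", 4),
    ("Qwen3.6-27B-4bit", "9h03m", 4),
    ("Qwen3-Coder-Next-MLX-4bit", "7h44m", 4),
]

for model, wall, concurrency in runs:
    per_task = slot_minutes_per_task(wall, concurrency)
    print(f"{model}: {wall} at n={concurrency} -> about {per_task:.1f} slot-minutes per task")
```

Read this way, the 397B run's longer wall time does not by itself mean slower episodes, since it ran one task at a time while the other rows ran four.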

Why only this cut: mixing unrelated benchmark families, terminus-1, and terminus-2 in one leaderboard made one page look precise while combining unlike measurements. This version is narrower and cleaner.


Local LLM Testing — Terminal-Bench 2 only. Updated June 12, 2026. Prior multi-benchmark sections were intentionally removed.