This page reports only the Terminal-Bench terminus-2 results. The old multi-benchmark scorecard is gone.
This page used to mix multiple benchmark families and multiple Terminal-Bench harness cuts. That made the table noisy and easy to misread. This rewrite keeps one measurement regime only: Terminal-Bench core, terminus-2, native tool-calling.
| Rank | Model | Host / engine | Concurrency | Resolved | Score | Speed | Read |
|---|---|---|---|---|---|---|---|
| 1 | Qwen3.5-397B-A17B-4bit | M3 Ultra / rapid-mlx | 1 | 34 / 80 | 42.5% | 15h32m wall | Best score after removing serving-stack overload. |
| 2 | Qwen3.6-35B-A3B-4bit | M3 Ultra / rapid-mlx | 4 | 33 / 80 | 41.25% | 8h22m wall | The standout efficiency result: near-397B score at much smaller active size. |
| 3 | Qwen3.5-122B-A10B-8bit | M3 Ultra / rapid-mlx | 4 | 32 / 80 | 40.0% | 8h40m wall | Strong, but not clearly ahead of the smaller 35B run. |
| 4 | Qwen3.6-27B-4bit | M3 Ultra / rapid-mlx | 4 | 26 / 80 | 32.5% | 9h03m wall | Lower absolute score, but very good per-parameter signal. |
| 5 | Qwen3-Coder-Next-MLX-4bit | M3 Ultra / rapid-mlx | 4 | 20 / 80 | 25.0% | 7h44m wall | Fast and tool-call clean in probes, but not the best terminal-bench finisher. |
The big model wins, but barely. Qwen3.5-397B takes first at 42.5%, but only after dropping to --n-concurrent 1. At --n-concurrent 4 the serving stack logged 87 InternalServerErrors and the run resolved only 18 / 80 (22.5%), so that result measures overload, not capability.
The 35B result is the interesting one. Qwen3.6-35B lands at 41.25%, one task ahead of the 122B (33 vs. 32 of 80) and one task behind the clean 397B rerun (34 of 80). For terminal automation, that is the row I would keep testing.
122B is good but not dominant. 40.0% is a real score, but the 35B run erases the easy story that more parameters automatically won this harness.
27B remains the per-footprint standout. It is 10 percentage points (8 tasks) below the 397B run, but close enough to matter given the huge difference in model size and serving cost.
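The score and gap figures quoted above are plain arithmetic on the Resolved column. A minimal sketch, assuming only the resolved counts and the 80-task total from this page (the names `score_pct`, `scores`, and `gaps` are illustrative, not part of any harness):

```python
# Resolved counts copied from the leaderboard table on this page.
TOTAL_TASKS = 80  # Terminal-Bench core task count, per the dataset note

resolved = {
    "Qwen3.5-397B-A17B-4bit": 34,
    "Qwen3.6-35B-A3B-4bit": 33,
    "Qwen3.5-122B-A10B-8bit": 32,
    "Qwen3.6-27B-4bit": 26,
    "Qwen3-Coder-Next-MLX-4bit": 20,
}


def score_pct(n_resolved, total=TOTAL_TASKS):
    """Score as a percentage of tasks resolved."""
    return 100.0 * n_resolved / total


scores = {model: score_pct(n) for model, n in resolved.items()}
best = max(scores.values())

# Gap to the top row, in percentage points.
gaps = {model: best - s for model, s in scores.items()}

for model, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{model:28s} {s:6.2f}%  gap to top: {gaps[model]:5.2f} pts")
```

With 80 tasks, each resolved task is worth 1.25 percentage points, which is why the top three rows sit exactly one task apart.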
Terminal-Bench failures are not all the same. agent_timeout usually means the model kept working but did not finish within the harness budget. unset means the agent completed the episode but failed the task oracle. unknown_agent_error points at a protocol or harness-level failure. The remaining test_timeout and parse_error cases are grouped under Other.
| Model | Resolved | Agent timeout | Unset | Unknown agent error | Other |
|---|---|---|---|---|---|
| Qwen3.5-397B-A17B-4bit | 34 | 21 | 18 | 5 | 1 test_timeout, 1 parse_error |
| Qwen3.6-35B-A3B-4bit | 33 | 33 | 8 | 5 | 1 parse_error |
| Qwen3.5-122B-A10B-8bit | 32 | 37 | 6 | 3 | 2 parse_error |
| Qwen3.6-27B-4bit | 26 | 45 | 5 | 3 | 1 parse_error |
| Qwen3-Coder-Next-MLX-4bit | 20 | 25 | 7 | 27 | 1 parse_error |
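One way to sanity-check a breakdown like this is to confirm that every row accounts for all 80 tasks, then look at what share of the failures are timeouts. A small sketch, with the counts copied from the table above and the helper names (`row_is_complete`, `timeout_share`) chosen here for illustration:

```python
TOTAL_TASKS = 80

# Tuple order: resolved, agent_timeout, unset, unknown_agent_error, other
breakdown = {
    "Qwen3.5-397B-A17B-4bit": (34, 21, 18, 5, 2),
    "Qwen3.6-35B-A3B-4bit": (33, 33, 8, 5, 1),
    "Qwen3.5-122B-A10B-8bit": (32, 37, 6, 3, 2),
    "Qwen3.6-27B-4bit": (26, 45, 5, 3, 1),
    "Qwen3-Coder-Next-MLX-4bit": (20, 25, 7, 27, 1),
}


def row_is_complete(counts, total=TOTAL_TASKS):
    """True when the outcome buckets add up to the full task count."""
    return sum(counts) == total


def timeout_share(counts):
    """Fraction of failed tasks that ended in agent_timeout."""
    resolved, agent_timeout = counts[0], counts[1]
    failures = sum(counts) - resolved
    return agent_timeout / failures


for model, counts in breakdown.items():
    status = "ok" if row_is_complete(counts) else "MISMATCH"
    print(f"{model:28s} sum={sum(counts):3d} ({status})  "
          f"timeouts={timeout_share(counts):.0%} of failures")
```

The timeout share is only a rough lens on the table: a high value suggests the run was limited mostly by the harness time budget, while a low value points at oracle failures or agent errors.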

Runs excluded from the ranking:

| Run | Raw score | Why it is not published as capability |
|---|---|---|
| Qwen3.5-397B-A17B-4bit, n=4 | 18 / 80 = 22.5% | Overload-tainted. The run logged 87 InternalServerErrors. The clean n=1 rerun scored 34 / 80. |
| Qwen3.6-27B-4bit, earlier non-t2 run | 1 / 80 = 1.25% | Invalid harness/environment state: thinking poisoning, memory pressure, and the wrong structured-output regime. |
| Terminus-1 rows from the old page | various | Removed from this page. Terminus-1 forces schema output; terminus-2 uses native tools. They are different cuts. |
Dataset: Terminal-Bench core, 80 tasks.
Agent: terminus-2, the native tool-calling Terminal-Bench agent.
Output paths: runs are under ~/tbench-runs/ on Forge; the result files used for this page are the top-level results.json files for each run.
Speed column: wall-clock elapsed time from run start to final results.json write. Read it alongside the concurrency column: the 397B row ran at concurrency 1 and the other rows at 4, so its 15h32m is not directly comparable to theirs. I am intentionally not listing tok/s here: Terminal-Bench artifacts give harness-level token accounting, but not a clean model decode-throughput measurement.
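For readers who want the wall times in a comparable unit, here is a minimal sketch that parses the Speed column strings into minutes and divides by the 80 tasks. The wall times and concurrency values are copied from the leaderboard; the `wall_minutes` helper is an illustration, and the per-task figure is wall-clock time at the listed concurrency, not a decode-throughput number.

```python
import re

TOTAL_TASKS = 80

# (wall time, concurrency) copied from the leaderboard table.
runs = {
    "Qwen3.5-397B-A17B-4bit": ("15h32m", 1),
    "Qwen3.6-35B-A3B-4bit": ("8h22m", 4),
    "Qwen3.5-122B-A10B-8bit": ("8h40m", 4),
    "Qwen3.6-27B-4bit": ("9h03m", 4),
    "Qwen3-Coder-Next-MLX-4bit": ("7h44m", 4),
}


def wall_minutes(text):
    """Convert a string like '8h22m' into total minutes."""
    match = re.fullmatch(r"(\d+)h(\d+)m", text)
    if match is None:
        raise ValueError(f"unrecognized wall time: {text!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


for model, (wall, concurrency) in runs.items():
    minutes = wall_minutes(wall)
    print(f"{model:28s} n={concurrency}  {minutes:4d} min total  "
          f"{minutes / TOTAL_TASKS:5.2f} wall min/task")
```

Because concurrency differs between rows, the per-task wall figure should only be compared between runs at the same concurrency setting.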
Why only this cut: mixing unrelated benchmark families, terminus-1, and terminus-2 in one leaderboard made one page look precise while combining unlike measurements. This version is narrower and cleaner.