June 2026 LLM Testing

State of the local fleet — what we've run, what we like, what's next

June 5, 2026 · James · local-llm fleet agentic

The goal hasn't changed: run autonomous agentic loops on hardware I own, on my LAN, without burning Anthropic credits and without shipping a byte off-box. June was about pressure-testing the two stacks that earned a place on the fleet, and lining up the next candidate. Echo runs the bench; I make the calls. Here's where things stand.

Both of the models below were put through the same gauntlet — the tau-bench retail agentic faceoff: ten multi-turn tasks, structured tool-calling, conditional branching on intermediate API results, temperature=0.0, identical agent strategy. If you want the full failure-mode autopsy, read that post. This is the verdict layer on top.

The Scorecard

Kimi K2.6 · DQ3 · M3 Ultra
10/10
I really like it — but the lossy quant and the speed bother me.
DeepSeek V4 Flash · FP8 · DGX Spark ×2
8/10
I like it. Functions pretty well, and it's fast.

Kimi K2.6 — the one I want to love

K2.6 swept the tau-bench retail split 10/10, avg reward 1.00. Not a single task failure across multi-step reasoning chains, conditional branching on intermediate results, and the nasty meta-conditional tasks where correct behavior depends on how the agent itself phrases the interaction. That's the best agentic result anything local has posted on this bench. The reason is architectural: on mlx_lm this build emits a genuine chain-of-thought reasoning trace in a separate reasoning field before it commits to a tool call. It deliberates, then acts. That separation is what carries the hard conditionals.

So why the hesitation? Two things, and they're both real:

1. Lossy quant. What's running is DQ3 — roughly 3.55 bits per parameter, a ~4.5× squeeze, engineered to fit ~366 GB resident into the M3 Ultra's 512 GB. It held up shockingly well on tool-calling (the sub-4-bit-kills-tool-use folklore is wrong here). But I'm always aware I'm running a compressed model, not the real thing, and I don't love making fleet decisions on a quant I can't fully trust at the margins.

2. Speed. ~18.7–18.8 t/s decode, 2–4s to first token on a few-thousand-token context. Prefill is memory-bandwidth-bound on unified memory at ~800 GB/s, and you feel every bit of it. For accuracy-critical work it's worth the wait. For anything high-volume, it's a tax.

Verdict: keep it for the hard, conditional, accuracy-first jobs. It's the smartest local agent I have. I just wish it were lighter and faster.

DeepSeek V4 Flash — the workhorse

DS4-Flash on the dual-Spark vLLM cluster scored 8/10, avg reward 0.80 — at roughly 3–4× Kimi's throughput (~37 t/s decode, sub-second first token, native FP8, ~149 GB per GPU). I like this one. It functions well, it's quick, and the two tasks it dropped weren't random — they were all the same shape: conditional branching downstream of intermediate API results. The model parses the instruction correctly, makes the prerequisite calls correctly, reads the results correctly, then makes the wrong branch. That's the "flash" tradeoff — depth and width pruned for speed, single-turn tool-call processing instead of deliberate-then-act.

For the bulk of straightforward agentic work — unambiguous instructions, single-item actions — DS4-Flash is the right default. Same accuracy as Kimi on those, a fraction of the latency. It earns its slot.

Up Next: StepFun (8-bit), then Qwen3 97B

First on the bench is StepFun at 8-bit. The appeal is direct: 8-bit is a far less lossy quant than Kimi's 3.55-bit DQ3, which goes straight at my biggest hesitation — making fleet decisions on a heavily compressed model. If an 8-bit model can hold the reasoning-trace accuracy on the hard conditional tasks, that's a quant I trust without the asterisk.

Then Qwen3 97B on the M3 Ultra. The open question there is whether it can land in the gap I keep staring at: Kimi's reasoning-trace accuracy without Kimi's weight and latency.

No numbers yet for either — neither has been run. When Echo puts them through the same tau-bench gauntlet, the results go here.

Testing protocol stays fixed so the comparison holds: same tau-bench retail split, temperature=0.0, same agent strategy, same endpoint shape. Every quantitative claim above traces to a real run — the Kimi/DS4 figures are from the published tau-bench faceoff. Qwen3 97B gets no published number until it's actually measured.

See also: Kimi K2.6 vs DeepSeek V4 Flash — the tau-bench faceoff · Local LLM Fleet · June 2026 · Echo Fleet Report