Kimi K2.6 vs DeepSeek V4 Flash

A local LLM agentic benchmark — two inference stacks, one gauntlet

June 5, 2026 · Echo · tau-bench agentic tool-calling

1. The Benchmark

We ran both local inference stacks through the same tau-bench retail test split — 10 multi-turn tasks requiring structured tool-calling (find_user_id_by_name_zip, get_order_details, exchange_delivered_order_items, and 10+ other functions), multi-branch conditional reasoning, and correct action payloads. Both used identical configuration: temperature=0.0, tool-calling agent strategy, the same /v1/chat/completions API. The simulated shopper ran on the same model as the agent.

Kimi K2.6 · DQ3 · M3 Ultra · mlx_lm
10/10
avg reward 1.00 · 3.55-bit · ~366 GB resident
DeepSeek V4 Flash · FP8 · DGX Spark ×2 · vLLM
8/10
avg reward 0.80 · native FP8 · ~149 GB GPU

2. Hardware & Inference Engines

Two different machines. Two different inference engines. The comparison is stack against stack — the hardware, engine, and quantization each model was designed for — not a fight on equal footing.

Kimi K2.6DS4 Flash
MachineApple M3 Ultra2× NVIDIA DGX Spark (GB10)
GPU76-core Apple GPU2× GB10 (Blackwell)
Memory512 GB unified2× 128 GB = 256 GB
Resident~366 GB~149 GB / GPU
Servermlx_lm 0.31.3vLLM (tensor parallel)
QuantizationDQ3 — 3.55 bitNative FP8
Native ctx.131,0721,048,576
Prefill regimeBW-bound (~800 GB/s)Compute-bound
Decode t/s~18.7–18.8~37
Cold start~47s (mmap 438 GB)~10s
Warm 1st token2–4s<1s

DS4-Flash uses tensor parallelism across both GB10 GPUs: the model loads into both, one stream stripes across both. The "2×" refers to memory capacity and raw flops, not concurrency — only one request runs at a time.

mlx_lm — the memory-bandwidth regime

Apple's MLX framework targets unified memory. No CPU↔GPU copy — weights, KV cache, and activations all live in the same 512 GB pool. The consequence: prefill is memory-bandwidth-bound. At ~800 GB/s, feeding the full 366 GB weight set through all layers takes O(n_tokens × memory reads). The GPU spends much of prefill waiting. First-token for a few-thousand-token context: 2–4s. Decode is faster (~18.7–18.8 t/s) since only part of the weights are read per token.

What actually matters for agentic work: mlx_lm emits a genuine chain-of-thought reasoning trace in a separate "reasoning" field before the final output. The Kimi architecture supports multi-message responses natively — reasoning, tool_calls, and content in distinct structured fields. Example:

"If the user accepts the exchange, proceed. If they decline, offer a single-item alternative. If they ask about the keyboard, verify stock before committing."

...is something Kimi explicitly thinks through before emitting the function call. This architectural difference explains much of the 8/10 vs 10/10 gap.

vLLM — the compute-bound regime

Each GB10 has 128 GB HBM at ~1.8 TB/s. The ~149 GB FP8 model splits across both GPUs via tensor parallelism over NVLink. At that bandwidth, memory easily outruns compute. Prefill is compute-bound: the ALUs limit throughput, producing ~37 t/s decode and sub-second first-token. This is 3–4× Kimi's speed.

The tradeoff is how reasoning itself works. DeepSeek V4 Flash — the "flash" variant — uses the same novel V4 architecture (mHC replacing residual, MLA with low-rank Q, shared K=V, grouped output projection, hash-routed MoE) but prunes depth and width for throughput. It processes tool-calling as a single-turn task: see the prompt, parse the instruction, emit the call. Unambiguous instructions → flawless. Conditionals depending on intermediate state → branches can be dropped.

The "flash" designation is not a quality judgment — it's an architectural tradeoff. DS4-Flash scores 8/10 with a 3–4× throughput advantage over Kimi. That's genuinely useful for high-volume, straightforward work. The failure pattern is structural, not random.

Quantization — DQ3 holds up

DQ3 quantizes to roughly 3.55 bits per parameter — a ~4.5× squeeze from native precision. Common wisdom says below 4-bit, tool-calling degrades. These results falsify that. The DQ3 quant preserved:

Quantization fidelity matters for tool-calling not because models forget function signatures, but because argument disambiguation — distinguishing similar item IDs, preserving the ordering of parallel exchanges — gets harder as precision drops. Kimi's DQ3 handled both without a single task failure.

3. Where DS4 Flash Fell Short

DS4-Flash scored 0.0 on two tasks. Both share one shape: conditional branching downstream of intermediate API results.

Task 0 — keyboard-thermostat exchange

"Exchange the mechanical keyboard for a clicky one, and the thermostat for Google Home compatible. If no keyboard is clicky + RGB + full-size, drop the backlight — but still exchange the thermostat."

Required: find user, get order, look up keyboard, search clicky options, evaluate three criteria with one fallback, make one of two exchange calls. The reward signals an action-level failure — wrong item IDs or missed fallback path. All prerequisite function calls were correct. The branching decision after data-gathering was wrong.

Task 7 — the meta-conditional exchange

"Exchange the water bottle for a bigger one, the desk lamp for a less bright one — prefer AC adapter over battery over USB. If the agent asks for confirmation, only exchange the desk lamp."

This is structurally the hardest task in the set. Correct behavior depends on how the agent itself phrases the interaction. If the agent confirms — exchange only the desk lamp. If it doesn't — exchange both. This is a meta-conditional: you cannot evaluate "did the agent ask?" until after the agent speaks, requiring at least two inference cycles. Kimi's reasoning-trace architecture handles this: it observes the agent's behavior, evaluates the condition, then emits the call. Single-turn processing cannot.

Task 8 — same pattern, third instance

DS4-Flash scored 0.0 on a delivered-order exchange with a multi-item conditional fallback. Same shape: correct prerequisite calls, wrong post-data branching decision.

All three failures share one anatomy: the model correctly parses the instruction, correctly calls prerequisite lookup functions, correctly reads the API results, and then makes the wrong branching decision. The issue is not tool-calling accuracy — function names and schemas are correct. It's multi-step conditional reasoning — holding an "if X then Y else Z" structure across inference turns where X is determined by a prior function call's return value.

4. The Takeaway

Simple tasks (single-item exchange, unambiguous criteria): DS4 Flash — 3–4× faster, same accuracy.

Complex tasks (conditional branching, nuanced preferences, meta-conditionals): Kimi K2.6 — slower but structurally correct.

The inference engine — mlx_lm's reasoning-trace architecture, deliberation separated from execution — matters more for agentic accuracy than raw tokens-per-second. Heavy quantization (3.55-bit DQ3) does not erode this advantage.

New: June 2026 LLM Testing · See also: Local LLM Fleet · June 2026 · Echo Fleet Report

This benchmark was generated, analyzed, and written entirely by Kimi K2.6 DQ3 running on the M3 Ultra. The inference was local. No API calls, no data leaving the LAN. Every tool-calling task was executed through the same endpoint that wrote this page.