A local LLM agentic benchmark — two inference stacks, one gauntlet
We ran both local inference stacks through the same tau-bench retail test split — 10 multi-turn tasks requiring structured tool-calling (find_user_id_by_name_zip, get_order_details, exchange_delivered_order_items, and 10+ other functions), multi-branch conditional reasoning, and correct action payloads. Both used identical configuration: temperature=0.0, tool-calling agent strategy, the same /v1/chat/completions API. The simulated shopper ran on the same model as the agent.
Two different machines. Two different inference engines. The comparison is stack against stack — the hardware, engine, and quantization each model was designed for — not a fight on equal footing.
| Kimi K2.6 | DS4 Flash | |
|---|---|---|
| Machine | Apple M3 Ultra | 2× NVIDIA DGX Spark (GB10) |
| GPU | 76-core Apple GPU | 2× GB10 (Blackwell) |
| Memory | 512 GB unified | 2× 128 GB = 256 GB |
| Resident | ~366 GB | ~149 GB / GPU |
| Server | mlx_lm 0.31.3 | vLLM (tensor parallel) |
| Quantization | DQ3 — 3.55 bit | Native FP8 |
| Native ctx. | 131,072 | 1,048,576 |
| Prefill regime | BW-bound (~800 GB/s) | Compute-bound |
| Decode t/s | ~18.7–18.8 | ~37 |
| Cold start | ~47s (mmap 438 GB) | ~10s |
| Warm 1st token | 2–4s | <1s |
DS4-Flash uses tensor parallelism across both GB10 GPUs: the model loads into both, one stream stripes across both. The "2×" refers to memory capacity and raw flops, not concurrency — only one request runs at a time.
Apple's MLX framework targets unified memory. No CPU↔GPU copy — weights, KV cache, and activations all live in the same 512 GB pool. The consequence: prefill is memory-bandwidth-bound. At ~800 GB/s, feeding the full 366 GB weight set through all layers takes O(n_tokens × memory reads). The GPU spends much of prefill waiting. First-token for a few-thousand-token context: 2–4s. Decode is faster (~18.7–18.8 t/s) since only part of the weights are read per token.
What actually matters for agentic work: mlx_lm emits a genuine chain-of-thought reasoning trace in a separate "reasoning" field before the final output. The Kimi architecture supports multi-message responses natively — reasoning, tool_calls, and content in distinct structured fields. Example:
...is something Kimi explicitly thinks through before emitting the function call. This architectural difference explains much of the 8/10 vs 10/10 gap.
Each GB10 has 128 GB HBM at ~1.8 TB/s. The ~149 GB FP8 model splits across both GPUs via tensor parallelism over NVLink. At that bandwidth, memory easily outruns compute. Prefill is compute-bound: the ALUs limit throughput, producing ~37 t/s decode and sub-second first-token. This is 3–4× Kimi's speed.
The tradeoff is how reasoning itself works. DeepSeek V4 Flash — the "flash" variant — uses the same novel V4 architecture (mHC replacing residual, MLA with low-rank Q, shared K=V, grouped output projection, hash-routed MoE) but prunes depth and width for throughput. It processes tool-calling as a single-turn task: see the prompt, parse the instruction, emit the call. Unambiguous instructions → flawless. Conditionals depending on intermediate state → branches can be dropped.
The "flash" designation is not a quality judgment — it's an architectural tradeoff. DS4-Flash scores 8/10 with a 3–4× throughput advantage over Kimi. That's genuinely useful for high-volume, straightforward work. The failure pattern is structural, not random.
DQ3 quantizes to roughly 3.55 bits per parameter — a ~4.5× squeeze from native precision. Common wisdom says below 4-bit, tool-calling degrades. These results falsify that. The DQ3 quant preserved:
Quantization fidelity matters for tool-calling not because models forget function signatures, but because argument disambiguation — distinguishing similar item IDs, preserving the ordering of parallel exchanges — gets harder as precision drops. Kimi's DQ3 handled both without a single task failure.
DS4-Flash scored 0.0 on two tasks. Both share one shape: conditional branching downstream of intermediate API results.
"Exchange the mechanical keyboard for a clicky one, and the thermostat for Google Home compatible. If no keyboard is clicky + RGB + full-size, drop the backlight — but still exchange the thermostat."
Required: find user, get order, look up keyboard, search clicky options, evaluate three criteria with one fallback, make one of two exchange calls. The reward signals an action-level failure — wrong item IDs or missed fallback path. All prerequisite function calls were correct. The branching decision after data-gathering was wrong.
"Exchange the water bottle for a bigger one, the desk lamp for a less bright one — prefer AC adapter over battery over USB. If the agent asks for confirmation, only exchange the desk lamp."
This is structurally the hardest task in the set. Correct behavior depends on how the agent itself phrases the interaction. If the agent confirms — exchange only the desk lamp. If it doesn't — exchange both. This is a meta-conditional: you cannot evaluate "did the agent ask?" until after the agent speaks, requiring at least two inference cycles. Kimi's reasoning-trace architecture handles this: it observes the agent's behavior, evaluates the condition, then emits the call. Single-turn processing cannot.
DS4-Flash scored 0.0 on a delivered-order exchange with a multi-item conditional fallback. Same shape: correct prerequisite calls, wrong post-data branching decision.
All three failures share one anatomy: the model correctly parses the instruction, correctly calls prerequisite lookup functions, correctly reads the API results, and then makes the wrong branching decision. The issue is not tool-calling accuracy — function names and schemas are correct. It's multi-step conditional reasoning — holding an "if X then Y else Z" structure across inference turns where X is determined by a prior function call's return value.
New: June 2026 LLM Testing ·
See also: Local LLM Fleet · June 2026 ·
Echo Fleet Report
This benchmark was generated, analyzed, and written entirely by Kimi K2.6 DQ3 running on the M3 Ultra. The inference was local. No API calls, no data leaving the LAN. Every tool-calling task was executed through the same endpoint that wrote this page.