Local LLM Stack — Current Architecture and Benchmarks

Originally June 2026 • rewritten June 16, 2026 • updated June 18, 2026

This page used to describe an earlier M3-centric rebuild. That is no longer the interesting truth. The stack has settled into a small fleet with a fast local default, a giant-model test bench, and a few narrow always-on support models.

Current snapshot: Hermes routes ordinary local work to DS4-Flash on the dual-Spark vLLM cluster. The M3 Ultra is the large-model bench, not the primary route: Qwen-class models remain the useful big-model probes, while GLM-5.2 MXFP4 now runs through the MLX path as a non-default slow-orchestrator candidate. GLM-5.2's prefill is glacial for us — about 114 tok/s in a prefill-heavy probe, which extrapolates to roughly 29 minutes for a fresh 200k-token prompt. The M5 Max carries the small resident workers: embeddings, reranking, title/selection, vision, and a fast Qwen3.6 35B tool-capable endpoint.

June 18 GLM-5.2 note: GLM-5.2 is technically running and passes useful tool/autonomy gates, but it is too slow to be the primary local model. It is acceptable for non-time-sensitive lab work, background analysis, or deliberate slow-orchestrator experiments where waiting is fine; it is the wrong default for interactive agent loops with fresh long context.

Diagram callout: the M3 Ultra box is now explicitly GLM-5.2 on M3 Ultra. That path is the slow lane: ~114 prefill tok/s, ~29 minutes for a fresh 200k-token prompt, and non-urgent work only.

What Is Actually Running

Default local agent pathDeepSeek V4 Flash on the Spark pair. This is the stable local baseline: vLLM, tensor-parallel across both GB10 boxes, DeepSeek V4 tool parser enabled, and MTP configured.

Large-model test benchM3 Ultra is not the default route in this snapshot. It is where we test big local reasoning/tool behavior without destabilizing the default path. GLM-5.2 MXFP4 belongs here as a slow-lane model: useful when time does not matter, too glacial on prefill to be the primary interactive route.

Resident support layerM5 Max runs the cheap-but-important pieces: Qwen3.6 35B for fast tool-capable local calls, Llama 3B for tiny selection/title work, Qwen3 embeddings, Qwen3 reranker, and Qwen3-VL for image work.

Cloud escalationClaude Max still exists as a manual escalation path for high-stakes judgment, public actions, or tasks where local models are the wrong tool. It is not the default local stack.

Basic Endpoint Benchmarks

These are deliberately small operational probes, not a leaderboard. Most chat rows are the median of two direct OpenAI-compatible chat completions run on June 16, 2026. The GLM-5.2 row uses later direct probes from the MLX route and separates decode from prefill because prefill is the operational problem. The inference-engine column matters because vLLM, Rapid-MLX/vllm_mlx, plain MLX, patched mlx_lm.server, and mlx_vlm are not interchangeable performance regimes. Throughput is usage.completion_tokens / wall-clock seconds unless labeled as prefill-heavy. Tool smoke is a single OpenAI tools request asking for get_weather(city="Paris").

Model / role	Host	Inference engine	Served model / quant	Completion throughput	Median wall	Tool smoke
DeepSeek V4 Flash / DS4-Flash default local agent baseline	Spark1+Spark2 GB10 TP=2	vLLM TP=2 across dual GB10	deepseek-v4-flash	30.7 29.7 t/s, 31.7 t/s	5.3s	PASS — structured OpenAI tool_calls[]
GLM-5.2 MXFP4 M3 slow-lane / non-time-sensitive work	M3 Ultra	patched `mlx_lm.server` + strip-model proxy	glm-5.2 via :8026 strip-model proxy → :8025 mlx_lm.server	18.2 output tok/s prefill-heavy probe: ~114 tok/s; 200k prompt ≈ 29.2 min	13.66s decode probe 39.51s for 4.5k-token prefill probe	PASS — native tool call and tool-result roundtrip; too slow for primary route
Qwen3.5-397B-A17B-4bit large MoE test bench	M3 Ultra	Rapid-MLX / vllm_mlx bench route	/Users/jamesmeadlock/models/Qwen3.5-397B-A17B-4bit	26.4 17.6 t/s, 35.2 t/s	4.8s	PASS — structured OpenAI tool_calls[]
Qwen3.6-35B-A3B-4bit M5 local tool worker	M5 Max	MLX chat server	qwen3.6-35b-stock4	102.9 91.1 t/s, 114.6 t/s	1.5s	PASS — structured OpenAI tool_calls[]
Llama-3.2-3B-Instruct-8bit cheap selection/title-class tasks	M5 Max	MLX chat server	mlx-community/Llama-3.2-3B-Instruct-8bit	88.6 73.2 t/s, 104.1 t/s	1.4s	FAIL — understands request, but emits raw text instead of tool_calls[]
Qwen3-VL-30B-A3B-MLX-4bit vision / image-text endpoint	M5 Max	`mlx_vlm` server	/Users/jamesmeadlock/models/Qwen3-VL-30B-A3B-MLX-4bit	47.5 33.1 t/s, 61.9 t/s	2.8s	PASS — structured OpenAI tool_calls[]

Model / role

Host

Inference engine

Served model / quant

Completion throughput

Median wall

Tool smoke

DeepSeek V4 Flash / DS4-Flash
default local agent baseline

Spark1+Spark2 GB10 TP=2

vLLM TP=2 across dual GB10

deepseek-v4-flash

30.7
29.7 t/s, 31.7 t/s

5.3s

PASS — structured OpenAI tool_calls[]

GLM-5.2 MXFP4
M3 slow-lane / non-time-sensitive work

M3 Ultra

patched mlx_lm.server + strip-model proxy

glm-5.2 via :8026 strip-model proxy → :8025 mlx_lm.server

18.2 output tok/s
prefill-heavy probe: ~114 tok/s; 200k prompt ≈ 29.2 min

13.66s decode probe
39.51s for 4.5k-token prefill probe

PASS — native tool call and tool-result roundtrip; too slow for primary route

Qwen3.5-397B-A17B-4bit
large MoE test bench

M3 Ultra

Rapid-MLX / vllm_mlx bench route

/Users/jamesmeadlock/models/Qwen3.5-397B-A17B-4bit

26.4
17.6 t/s, 35.2 t/s

4.8s

PASS — structured OpenAI tool_calls[]

Qwen3.6-35B-A3B-4bit
M5 local tool worker

M5 Max

MLX chat server

qwen3.6-35b-stock4

102.9
91.1 t/s, 114.6 t/s

1.5s

PASS — structured OpenAI tool_calls[]

Llama-3.2-3B-Instruct-8bit
cheap selection/title-class tasks

M5 Max

MLX chat server

mlx-community/Llama-3.2-3B-Instruct-8bit

88.6
73.2 t/s, 104.1 t/s

1.4s

FAIL — understands request, but emits raw text instead of tool_calls[]

Qwen3-VL-30B-A3B-MLX-4bit
vision / image-text endpoint

M5 Max

mlx_vlm server

/Users/jamesmeadlock/models/Qwen3-VL-30B-A3B-MLX-4bit

47.5
33.1 t/s, 61.9 t/s

2.8s

PASS — structured OpenAI tool_calls[]

Read the numbers as ops signals. DS4-Flash is the default because it combines acceptable speed with the right serving stack and tool parser. Qwen3.6 35B is the speed surprise on M5. The 397B bench is useful because it is big and tool-capable, not because this short probe makes it look uniformly faster than DS4; its two samples varied from 17.6 to 35.2 t/s. GLM-5.2 is a different lesson: it can run and pass gates, but its prefill is glacial enough that it belongs only in non-time-sensitive work.

Embedding and Reranking Benchmarks

The aux models do not speak chat. They were measured with their native endpoints using two tiny smoke inputs.

Model / role	Host	Latency	Smoke result
Qwen3-Embedding-8B-mxfp8 embedding	M5 Max	0.118s median 0.180s, 0.057s	4096 dims
Qwen3-Reranker-4B-mxfp8 rerank	M5 Max	0.498s median 0.852s, 0.143s	top picks correct in both smoke prompts

Model / role

Host

Latency

Smoke result

Qwen3-Embedding-8B-mxfp8
embedding

M5 Max

0.118s median
0.180s, 0.057s

4096 dims

Qwen3-Reranker-4B-mxfp8
rerank

M5 Max

0.498s median
0.852s, 0.143s

top picks correct in both smoke prompts

Routing Decisions

What Changed From the Old Version

Bottom Line

The stack is no longer “which single Mac should host the local brain?” It is a routed fleet. DS4-Flash handles the default local agent path, M3 Ultra is the big-model proving ground and slow-lane host for models like GLM-5.2, and M5 Max handles the resident support services that make the agent feel fast.