Six boxes on the LAN. Four do inference. Two run agents. One cloud escape hatch for anything that needs frontier capability without fighting quant artifacts. This is the topology as of June 4, 2026 — M3 Ultra is dedicated to Kimi K2.6, and M5 Max handles the embedding + reranker stack.
The Two Inference Engines
Fleet inference now runs on two engine classes (Apple Silicon MLX + CUDA vLLM cluster), each chosen for what it does best:
mlx_lm (Apple MLX) — stable, proven, one-model-per-port. Runs Kimi K2.6 on the M3 Ultra and the embedding/reranker stack on M5 Max. Reliable tool-call handling and genuine reasoning traces. ~19 t/s for K2.6.
vLLM (CUDA) — the Spark cluster. Two DGX GB10s bonded over QSFP56 200G in a TP=2 configuration, serving DeepSeek V4 Flash (official FP8, ~149 GB). ~37 t/s end-to-end, ~44 t/s pure decode. This is our only CUDA inference path and handles the largest open-weight frontier model we host.
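The case for bonding two Sparks comes down to simple arithmetic on the figures above. The following is a rough back-of-envelope sketch, not a measurement: it assumes tensor parallelism splits the weights evenly across ranks (real sharding may replicate some tensors) and it ignores KV cache, activations, and OS overhead.

```python
# Back-of-envelope memory budget for the TP=2 Spark cluster.
# Figures are taken from the post; the perfectly even split is an assumption.
WEIGHTS_GB = 149        # DeepSeek V4 Flash, official FP8 (approximate)
BOX_MEMORY_GB = 128     # unified memory per GB10
TP_DEGREE = 2           # tensor-parallel ranks


def per_rank_weights_gb(weights_gb: float, tp: int) -> float:
    """Weights held by each rank if sharding were perfectly even."""
    return weights_gb / tp


def headroom_gb(weights_gb: float, tp: int, box_gb: float) -> float:
    """Memory left per box for KV cache, activations, and everything else."""
    return box_gb - per_rank_weights_gb(weights_gb, tp)


fits_single_box = WEIGHTS_GB <= BOX_MEMORY_GB
shard = per_rank_weights_gb(WEIGHTS_GB, TP_DEGREE)
spare = headroom_gb(WEIGHTS_GB, TP_DEGREE, BOX_MEMORY_GB)

print(f"fits on one box: {fits_single_box}")
print(f"per-rank weights: ~{shard:.1f} GB, headroom: ~{spare:.1f} GB")
```

Under these assumptions the weights alone exceed a single box, while each rank of the pair holds roughly 74.5 GB and keeps about 53.5 GB spare. On Spark 2 that spare memory is also shared with ComfyUI and Chatterbox TTS.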
Machine by Machine
DGX Spark 1 • .11 • vLLM
GB10, 128 GB unified. vLLM head for DS4-Flash TP=2 cluster. Rank 0. Serves API on :8000. QSFP56 connected to Spark 2.
DGX Spark 2 • .12 • vLLM+Comfy
GB10, 128 GB unified. Headless vLLM worker (rank 1). Also hosts ComfyUI (image gen) and Chatterbox TTS. All three share the same box without conflict.
M3 Ultra • .10 • mlx_lm
512 GB, 800 GB/s. Dedicated to a single frontier-class model: Kimi K2.6 DQ3 on mlx_lm (:8013). All other services migrated to the M5 Max.
M5 Max • .18 • mlx_lm
128 GB, 400 GB/s. Dedicated to the semantic retrieval stack (Qwen3-Embedding-8B :8002, Qwen3-Reranker-4B :8003). Powers semantic skill search. Qwen3-Coder-Next was migrated off to free the memory budget.
Mac Studio M4 Max • .5 • OpenClaw
Milo's machine. No local LLM serving; purely an agent host. Routes to all fleet endpoints via OpenClaw.
Forge • .19 • Hermes
Linux lab node. Cohosts Bandit (OpenClaw) and Echo (Hermes Agent). LAN hub, Docker host, skill search index. Automates the fleet: config management, model swap scripts, blog publishing.
What Changed
Compared to the June 3 topology, two significant shifts:
M3 Ultra consolidated to Kimi-only. GLM-5.1, MiniMax M2.7, and the embedding/reranker stack all migrated off the M3 Ultra. The 512 GB is now dedicated to a single frontier model — Kimi K2.6 DQ3 on :8013 — giving it the full memory budget without competing services.
M5 Max dedicated to retrieval. The Qwen3-Coder-Next (80B 4-bit) was migrated off M5. The semantic stack (embed :8002 / reranker :8003) remains. The coder's agentic coding duties moved to the DS4-Flash cluster.
Model Summary
| Model | Host | Engine | Throughput | Role |
|---|---|---|---|---|
| DeepSeek V4 Flash | Spark 1 (.11:8000) | vLLM TP=2 | ~37 t/s | Frontier open-weight, CUDA cluster |
| Kimi K2.6 DQ3 | M3 Ultra (.10:8013) | mlx_lm | ~19 t/s | Default reflex, only local LLM on M3 Ultra |
| Qwen3-Embedding-8B | M5 Max (.18:8002) | mlx_lm (embed) | n/a | Semantic embeddings, skill search |
| Qwen3-Reranker-4B | M5 Max (.18:8003) | mlx_lm (embed) | n/a | Relevance reranking, skill search |
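For scripts that need to address these endpoints, the summary can be captured as a small registry. This is a minimal sketch, not the fleet's actual config: the `Endpoint` class, `FLEET`, `find`, and `base_url` are hypothetical names, the `http` scheme is an assumption, and the LAN prefix is deliberately left as a caller-supplied parameter because the post only gives each host's last octet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    model: str
    host: str         # machine name as used in the post
    last_octet: int   # last octet of the LAN address, as given in the post
    port: int
    engine: str


# One entry per model served on the LAN.
FLEET = (
    Endpoint("DeepSeek V4 Flash", "Spark 1", 11, 8000, "vLLM TP=2"),
    Endpoint("Kimi K2.6 DQ3", "M3 Ultra", 10, 8013, "mlx_lm"),
    Endpoint("Qwen3-Embedding-8B", "M5 Max", 18, 8002, "mlx_lm (embed)"),
    Endpoint("Qwen3-Reranker-4B", "M5 Max", 18, 8003, "mlx_lm (embed)"),
)


def find(model: str) -> Endpoint:
    """Look up an endpoint by exact model name; raise KeyError if absent."""
    for ep in FLEET:
        if ep.model == model:
            return ep
    raise KeyError(model)


def base_url(ep: Endpoint, lan_prefix: str) -> str:
    """Build the base URL from a caller-supplied prefix (first three octets)."""
    return f"http://{lan_prefix}.{ep.last_octet}:{ep.port}"
```

With the documentation-only placeholder prefix `192.0.2`, `base_url(find("Kimi K2.6 DQ3"), "192.0.2")` returns `http://192.0.2.10:8013`. Substitute the real LAN prefix when using it.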
Note on cloud fallbacks. Anything that needs true frontier capability without fighting quantization artifacts or context limits goes to the cloud: DeepSeek V4 Pro on Fireworks or Claude Opus 4.7 on Anthropic. The local fleet handles 90% of daily agentic work. The escape hatch is always there.
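The escape-hatch rule can be written down as a tiny routing decision. This is an illustrative sketch only: the function name, the `needs_frontier` flag, and the `local_context_limit` parameter are hypothetical, and the post does not state the actual context limits or how the agents decide.

```python
def pick_backend(prompt_tokens: int, needs_frontier: bool, local_context_limit: int) -> str:
    """Return "cloud" when the task exceeds what the local fleet should handle.

    local_context_limit is a caller-supplied assumption, not a figure from the post.
    """
    if needs_frontier or prompt_tokens > local_context_limit:
        return "cloud"
    return "local"
```

Everything else stays local, which matches the post's intent that the cloud is the exception rather than the default path.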
What's Next
The fleet is in a good state: each box has a clear role, and bonding the two Sparks lets us serve a model that would not fit on a single GB10. A few things percolating:
NVE (Never-Visited-Empty) KV cache for the DS4-Flash cluster. The MTP spec decode gave decent gains (1.7x). NVE would reach further into the attention mechanics.
Embedding + reranker monitoring on M5 Max. The M5 Max now runs two critical services — embed and reranker. A simple uptime + latency scrape per port would catch drift before it breaks tooling.
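A per-port uptime and latency check could start as small as the sketch below. It only measures TCP connect time, so it confirms the port is accepting connections rather than that the model is answering; a fuller check would send a real request to the service. The function names and the timeout default are assumptions, not existing fleet tooling.

```python
import socket
import time
from typing import Dict, Iterable, Optional, Tuple


def probe(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return TCP connect latency in milliseconds, or None if unreachable."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return None


def scrape(targets: Iterable[Tuple[str, str, int]]) -> Dict[str, Optional[float]]:
    """Probe each (name, host, port) target once and collect latencies by name."""
    return {name: probe(host, port) for name, host, port in targets}
```

Called with one entry each for the embed port (:8002) and the reranker port (:8003) on the M5 Max, `scrape` returns a latency in milliseconds per service, or `None` for any port that is down.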
A proper fleet monitoring dashboard. Right now it's manual curl loops and SSH. One Grafana panel per inference endpoint, maybe.
This is a living document. As the fleet evolves, this post gets updated.