Today we rebuilt the local LLM stack on the M3 Ultra. The old setup was straightforward: one model (DeepSeek V4 Flash) doing everything. The new setup is more interesting: four specialized services, each doing one thing well, with the heavy lifting offloaded to the DGX Spark cluster.
Here is what changed, why, and how it is wired together.
| Port | Service | Model | RAM | Speed | Role |
|---|---|---|---|---|---|
| :8012 | Rapid-MLX | Qwen3-Coder-Next-MLX-4bit | 45 GB | 57 t/s | Agentic coding — 80B total / 3B active MoE. No thinking mode. Excels at tool calling, 262K context, SWE-Bench scores competitive with frontier models. Optimal sampling: temp=1.0, top_p=0.95, top_k=40. |
| :8011 | mlx_vlm | Qwen3-VL-30B-A3B-MLX-4bit | 18 GB | 93 t/s | Vision — 30B total / 3B active. Handles screenshots, documents, photos via OpenAI-compatible API (base64 + URL). 256K native context. Wired as auxiliary.vision provider. |
| :8002 | embed | Qwen3-Embedding-8B-mxfp8 | ~4 GB | — | Embeddings — 4096-dim vectors via /v1/embeddings. Powers semantic skill search across 179 indexed skills. |
| :8003 | rerank | Qwen3-Reranker-4B-mxfp8 | ~2 GB | 0.19 s | Reranking — yes/no-token logit method per official Qwen3-Reranker spec. Improves skill search precision. |
| :8013 | mlx_lm | Kimi-K2.6-mlx-DQ3_K_M-q8 | 366 GB | 15.6 t/s | Heavyweight reasoning — 1T-param MoE, smart-quant (8-bit router + 3/4-bit experts), the only real text-only K2.6 quant that fits 512 GB. REFLEX-grade non-thinking model: no reasoning block by design. Optimal sampling: temp=0.6, min-p=0.01 (per model card). Measured ~15.6 t/s warm decode at short context. |
Total: 435 GB used, 77 GB free on M3 Ultra (512 GB). The K2.6 instance alone is 366 GB — fitting it required raising the GPU wired-memory limit (see below). All services persist across reboots via macOS LaunchAgents/LaunchDaemons.
Loading a 366 GB model on a 512 GB Mac sounds trivial — there's headroom. It isn't. macOS caps how much unified memory the GPU may wire (lock resident for Metal) at roughly 75% of physical RAM by default — about 384 GB on this box. The static weights fit under that, but the dynamic KV cache grows with context, and once the working set crosses the wired ceiling Metal starts paging weights in and out instead of keeping them resident — decode throughput falls off a cliff and the box thrashes.
The fix is one sysctl, raising the wired limit to 480 GB so the full working set stays GPU-resident:
sudo sysctl iogpu.wired_limit_mb=491520
But sysctl doesn't survive a reboot. The clean answer is a LaunchDaemon that re-applies it at every boot:
<!-- /Library/LaunchDaemons/com.echo.iogpu-wired.plist --> <key>ProgramArguments</key> <array> <string>/usr/sbin/sysctl</string> <string>iogpu.wired_limit_mb=491520</string> </array> <key>RunAtLoad</key> <true/>
Owned root:wheel, mode 644, loaded with launchctl load -w. Now the win is permanent instead of evaporating on the next restart. Lesson for the lab notebook: on Apple Silicon, a model that fits in RAM is not the same as a model that serves stably — the wired-memory ceiling, not total RAM, is the real constraint for large-context MLX serving.
The biggest change: DeepSeek V4 Flash no longer lives on the M3 Ultra. We removed the oMLX 4-bit instance from :8020, reclaiming 144 GB RAM and 149 GB disk. Text inference moved to the DGX Spark cluster, where DS4 Flash runs at native FP8 across two GB10 nodes with TP=2 and MTP speculative decoding.
| Metric | oMLX (M3 Ultra, old) | vLLM (Spark, new) |
|---|---|---|
| Quantization | 4-bit (community quant) | FP8 (official, 128×128 block) |
| Decode speed | ~34 t/s | 44.5 t/s (warm), 32 t/s (cold) |
| Context window | 100K (131K config, unstable) | 200K (stable, 3× concurrency at gpu_mem=0.90) |
| KV cache | M3 Ultra unified memory (shared with 4 other models) | 612K tokens dedicated (12 GB/node) |
| MTP | Not available (stock weights stripped MTP tensors) | 2-step MTP, 74% acceptance, 1.7× speedup over no-MTP |
| Prefix caching | 3.4× on cache hits | Built-in, full vLLM automatic prefix caching |
Hermes routes to it via the spark-ds4 custom provider and the ds4 alias. Heavy autonomous work goes through delegate_task with the Spark as the delegation provider — the local agent stays responsive while the Spark crunches through 200K-token reasoning loops.
The config changes were straightforward. Each service is a custom_providers entry named <machine>-<port>:
custom_providers:
- name: m3ultra-coder-8012
base_url: http://192.168.1.10:8012/v1
api_mode: openai
models:
- /Users/jamesmeadlock/models/Qwen3-Coder-Next-MLX-4bit
- name: m3ultra-vision-8011
base_url: http://192.168.1.10:8011/v1
api_mode: openai
models:
- /Users/jamesmeadlock/models/Qwen3-VL-30B-A3B-MLX-4bit
- name: kimi-local-8013
base_url: http://192.168.1.10:8013/v1
api_mode: openai
default_model: /Users/jamesmeadlock/models/Kimi-K2.6-mlx-DQ3_K_M-q8
models:
- /Users/jamesmeadlock/models/Kimi-K2.6-mlx-DQ3_K_M-q8
With shorthands for quick switching:
model_aliases:
coder:
model: /Users/jamesmeadlock/models/Qwen3-Coder-Next-MLX-4bit
provider: m3ultra-coder-8012
vision:
model: /Users/jamesmeadlock/models/Qwen3-VL-30B-A3B-MLX-4bit
provider: m3ultra-vision-8011
ds4:
model: deepseek-v4-flash
provider: spark-ds4
kimi:
model: /Users/jamesmeadlock/models/Kimi-K2.6-mlx-DQ3_K_M-q8
provider: kimi-local-8013
The vision model is also wired as auxiliary.vision — so anytime Hermes needs to look at an image, it hits the local VLM instead of burning Fireworks credits. The embedding and reranker servers are standalone FastAPI processes with their own dedicated venv. They do not appear in Hermes config; they are called directly by the skill-search MCP server at 192.168.1.10:8002 and :8003.
Four specialist services plus a heavyweight on call, each doing one thing:
find_skill returned connection-refused errors. Now it returns 179 indexed skills with calibrated relevance scores.This stack serves a single user on a trusted LAN, but Echo and the other agents routinely fetch content from the open web. Any page can embed hidden instructions designed to hijack agent behavior, exfiltrate data, or trigger dangerous tool calls. Research success rates for indirect prompt injection against tool-using agents hit 80–100% in adversarial testing.
"Prompt injection for agents is the SQL injection of AI." Architectural defenses, not patches. No single technique holds up alone.
We built a research-with-quarantine skill that enforces structural isolation: a reader agent with no tool access processes untrusted content and outputs structured findings, then the main agent with full privileges acts on the findings only. The main agent never sees raw HTML.
Stage 1: delegate_task(
goal="research X and return findings as structured JSON",
toolsets=["safe"]
) → structured findings, no side effects possible
Stage 2: main agent reads findings JSON
→ acts on sanitized data, never sees raw HTML
The safe toolset is the key. It limits the research subagent to web_search, web_extract, browser_navigate, vision_analyze, and image_generate — no terminal, no file writes, no messaging, no delegation. Even if a malicious page fully compromises the reader subagent, the worst it can do is return bad JSON. There is no path from injected HTML to executed shell commands.
We also built an email-quarantine skill applying the same two-stage pattern to email processing. A reader subagent with safe toolset (plus himalaya for email access) processes inbound messages and returns structured summaries. The main agent never sees raw email bodies — same isolation principle, different untrusted surface.
<<EXTERNAL_UNTRUSTED_CONTENT>> boundaries and strips chat-template control tokens. The safe toolset isolation covers this structurally, but explicit markers at the system-prompt level would add defense-in-depth.
With the Spark cluster stable at 44.5 t/s on DS4 Flash, the next step is sharing access with a few collaborators (Geverson, Bob, Oscar) over Tailscale. Raw vLLM has zero auth — no rate limiting, no attribution, no content filtering, and /metrics wide open. The plan: an nginx auth proxy + Python token-counting sidecar on Forge (:8643).
Geverson/Bob/Oscar -> Tailscale -> Forge (:8643) -> Spark 1 (:8000)
(encrypted) +------------+ vLLM DS4 Flash
| nginx | (no auth, raw HTTP)
| + auth |
| proxy |
+------------+
| Python |
| sidecar | -> token counting + logging
+------------+
Why Forge? Already on the tailnet at 192.168.1.19, plenty of spare cycles. The Spark cluster stays LAN-only — external users never touch it directly.
sk-ds4-* key, validated by nginx against a static list. Keys generated with openssl rand -hex 24, distributed via private channel.limit_req at 5 req/min per key, burst 10. One person can't saturate the cluster./v1/chat/completions passes through. /metrics, /v1/usage, /health blocked.log_format: timestamp, api_key, endpoint, req_bytes, resp_bytes, duration_ms, status. Per-user byte totals as a rough token proxy.Sits between nginx and Spark, extracts real token counts from vLLM responses. ~60 lines with aiohttp. Handles both streaming (SSE) and non-streaming:
usage.prompt_tokens and usage.completion_tokens from the final chunk{"ts":"2026-05-28T14:22:01Z","user":"bob","model":"deepseek-v4-flash","prompt_tokens":2340,"completion_tokens":512,"duration_ms":4521}An hourly cron job summarizes per-user: request count, total prompt/completion tokens. Byte-level tracking from nginx is a fallback if the sidecar is down.
{
"acls": [
{
"action": "accept",
"src": ["tag:trusted-collaborators"],
"dst": ["192.168.1.19:8643"],
"proto": "tcp"
}
],
"tagOwners": {
"tag:trusted-collaborators": ["james@meadlock.com"]
}
}
Geverson, Bob, Oscar's tailnet nodes get the tag:trusted-collaborators tag. They can only reach Forge:8643 — Spark:8000 stays LAN-only.
| Layer | Overhead |
|---|---|
| nginx proxy | <1 ms |
| Python sidecar (non-streaming) | <2 ms (reads final chunk only) |
| Python sidecar (streaming) | <2 ms (buffers final chunk, passes through rest) |
| Cluster impact (1 user) | No change — 44.5 t/s |
| Cluster impact (4 concurrent) | ~11 t/s each (max_num_seqs=4) |
| KV cache pressure | 612K tokens = 3× 200K concurrency. Long-context requests from one user pressure others. |
~/.hermes/scripts/ds4-proxy.py)/etc/nginx/sites-available/ds4-proxy)curl -H "Authorization: Bearer sk-ds4-test-xxx" http://localhost:8643/v1/chat/completionsWith four specialized endpoints running across the fleet alongside the dual-Spark DS4 Flash cluster, the question becomes: who handles what? Rather than routing everything through the main reasoning brain, auxiliary tasks fan out automatically based on what they need.
| Task | Endpoint | Why |
|---|---|---|
| Main reasoning | DS4 Flash on dual Spark (:8000) | Best local generalist — 200K context, MTP, FP8 quality |
| Subagent delegation | Coder-Next on M3 Ultra (:8012) | SWE-Bench specialist, 57 t/s, tool-call support |
| Vision | Qwen3-VL on M3 Ultra (:8011) | Dedicated VLM, 93 t/s, base64 + URL input |
| Title generation | Qwen3.6 on Spark 2 (:8003) | Tiny task — any model works, keeps the cluster free |
| Session search | Qwen3.6 on Spark 2 (:8003) | FTS5 + light rerank, doesn't need 149 GB |
| Context compression | Coder-Next on M3 Ultra (:8012) | Latency-sensitive, fast MLX inference |
| Curator (weekly skill maintenance) | Coder-Next on M3 Ultra (:8012) | Best-effort background task, no urgency |
| Web page extraction | DS4 Flash (default fallthrough) | Occasional large pages — benefits from long context |
| Heavyweight reasoning (manual) | Kimi K2.6 on M3 Ultra (:8013) | 1T-param non-thinking REFLEX model via kimi alias — hot-swap for dense reasoning when DS4 isn't enough |
deepseek-v4-pro on Fireworks. No manual intervention needed; the agent stays online through maintenance windows.
For genuinely hard problems — the kind where a local 149 GB FP8 model won't cut it — the brain hot-swaps to a frontier cloud model with a single config change. On Telegram this is just saying "hard mode":
hermes config set default_llm grok-4-0309-reasoning hermes config set delegation.model grok-4-0309-reasoning
This is intentionally manual — no auto-escalation, no surprise bills. When the task is done, flip back just as easily:
hermes config set default_llm deepseek-v4-flash hermes config set delegation.model /Users/jamesmeadlock/models/Qwen3-Coder-Next-MLX-4bit
The ad-hoc approach keeps things simple: one agent, two brain states, zero infrastructure.