The previous post ended on a satisfying note: both DGX Sparks running, Qwen3-235B distributed across them via Ray, models registered in OpenClaw. Six hours of setup hell, but a clean end state. That was a few days ago. This is what happened when we tried to actually use that infrastructure in earnest.

Short version: we tried to run too much at the same time, things broke badly, and we spent a non-trivial chunk of time debugging instead of building. This is the honest account of that, and the configuration rethink that came out of it.

The Plan That Seemed Reasonable

After setup, the Sparks were running a lot. The intended steady state looked like this:

- Spark 1: Ray head node, one tensor parallel shard of Qwen3-235B under vLLM, and Ollama hosting Nemotron
- Spark 2: Ray worker node with the other vLLM shard, plus the voice pipeline (Parakeet STT and Qwen3-TTS)

On paper, this felt achievable. Each Spark has 128GB of unified memory and a GB10 GPU. A 235B MoE model, a couple of mid-size models, and some audio inference pipelines — how bad could the resource pressure actually be?

The answer: bad enough to need physical reboots.

What Actually Happened

The first sign of trouble was throughput collapse. Qwen3-235B, which had been serving at ~15 tok/s cleanly during initial testing, dropped to 4-6 tok/s under concurrent load. At first we assumed it was normal — the model is big, inference competes with memory bandwidth. But the degradation was non-linear. Adding the Ollama workload on Spark 1 shouldn't have affected vLLM's allocated VRAM at all.

The issue is the Grace Blackwell unified memory architecture. "Unified memory" means the GPU and system RAM share the same physical pool. It also means every process on the machine — including Ollama's KV cache, the TTS model's attention layers, Parakeet's CTC buffers — is competing with vLLM for the same physical address space. There is no hard partition. The operating system and CUDA runtime are allocating from the same pool, and under pressure, they start evicting each other.
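One way to see this contention directly is to sample both views of the pool at once. A minimal sketch, assuming a standard Linux /proc and an nvidia-smi that reports per-GPU memory (which unified-memory systems don't always do cleanly):

```shell
#!/bin/bash
# Snapshot the shared pool from both sides. On a unified-memory machine
# these numbers draw on the same physical RAM, so pressure in one view
# surfaces in the other.
avail_mib=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
gpu_mib=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits 2>/dev/null)
echo "system-available: ${avail_mib} MiB  gpu-used: ${gpu_mib:-n/a} MiB"
```

Run it in a loop while a long generation is in flight and the eviction pressure described above shows up as system-available falling in step with GPU activity.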

vLLM was the most sensitive to this. It pre-allocates its KV cache at startup based on available memory. When Ollama's Nemotron instance expanded its context during a long generation, it was reclaiming pages that vLLM had nominally reserved. The result was silent VRAM pressure that showed up as latency spikes rather than clean out-of-memory errors.
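One mitigation worth noting here: vLLM exposes a --gpu-memory-utilization flag (default 0.9) that bounds the fraction of device memory it claims at startup. Lowering it leaves headroom for cohabiting processes, at the cost of KV cache capacity. A sketch with illustrative values, not tuned ones:

```shell
# Reserve roughly 70% of the pool for vLLM instead of the ~90% default,
# leaving explicit headroom for whatever else the node must host.
# The 0.70 is illustrative, not a tuned value.
vllm serve Qwen/Qwen3-235B-A22B-NVFP4 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.70
```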

That would have been manageable. The worse problems came from the voice pipeline.

The Zombie CUDA Processes

Parakeet STT and Qwen3-TTS are both CUDA-dependent. Parakeet is a CTC-based ASR model; Qwen3-TTS is an autoregressive synthesis model that's heavier than it looks. Running both on Spark 2 — which was also the Ray worker node for the distributed vLLM job — created a situation where multiple CUDA contexts were competing for the same device.

Under normal conditions, CUDA multi-process service (MPS) handles this gracefully. The problem is that vLLM's distributed backend, when coordinating across Ray workers, does not play well with MPS. The Ray worker on Spark 2 holds a persistent CUDA context for its tensor parallel shard. Parakeet and TTS were opening additional contexts against the same device. When any one of them crashed — and they did crash — CUDA didn't cleanly release the other contexts.

The result was zombie CUDA processes: entries visible in nvidia-smi with non-zero VRAM allocations and no corresponding /proc entry you could kill. The VRAM was gone. You couldn't reclaim it without a reboot. Not a process restart, not nvidia-smi --gpu-reset (which is disabled on DGX Spark). A full power cycle.
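For diagnosis, the driver's view and the kernel's view can be cross-checked: anything nvidia-smi still tracks that has no live /proc entry is one of these unreclaimable allocations. A sketch (the function name is ours, not a standard tool):

```shell
#!/bin/bash
# Flag driver-tracked compute processes that no longer exist in /proc.
flag_zombies() {  # reads "pid, name, used_memory" CSV lines on stdin
  while IFS=, read -r pid name mem; do
    pid=$(echo "$pid" | tr -d ' ')
    [ -d "/proc/$pid" ] || echo "ZOMBIE: pid=$pid name=$name vram=$mem"
  done
}

nvidia-smi --query-compute-apps=pid,process_name,used_memory \
  --format=csv,noheader 2>/dev/null | flag_zombies
```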

We did this three times before we stopped trying to fight it.

The vLLM Exit Code 7 Problem

The other recurring failure mode was vLLM itself crashing with exit code 7 during engine initialization. This one is more obscure. Exit code 7 from a vLLM process means the engine worker subprocess exited abnormally before it could report a proper error — it's a catch-all for initialization failures that happen before the Python exception handler is set up.

The cause in our case was quantization backend conflicts. The Marlin backend environment variables we'd set to make NVFP4 work on GB10 (VLLM_NVFP4_GEMM_BACKEND=marlin, etc.) were being applied globally via /etc/environment. When vLLM restarted cold, if the CUDA context from a prior crashed run hadn't been fully cleaned up, the Marlin backend would fail to initialize its GEMM kernels against the stale device state. Exit code 7. No useful stack trace. Just gone.

The fix was to source those environment variables explicitly in the launch script rather than setting them globally, and to add a hard check for zombie CUDA processes before any vLLM launch:

#!/bin/bash
# Check for zombie CUDA processes before launch
ZOMBIE=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader 2>/dev/null | \
  while read -r pid; do
    [ -d "/proc/$pid" ] || echo "$pid"
  done)

if [ -n "$ZOMBIE" ]; then
  echo "Zombie CUDA processes detected: $ZOMBIE"
  echo "Reboot required before vLLM can start cleanly."
  exit 1
fi

# Now launch vLLM with explicit env
VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_USE_FLASHINFER_MOE_FP4=0 \
vllm serve Qwen/Qwen3-235B-A22B-NVFP4 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  ...

This helped with the cold-start failures. It didn't solve the underlying problem of VRAM contention from running too many things simultaneously.

The Ray Cluster Instability

The distributed Ray cluster — our backbone for splitting the 235B model across both Sparks — also became less stable as we added workloads. Ray's head node on Spark 1 is sensitive to system memory pressure. When Ollama's Nemotron instance was running a long context generation, it was consuming enough unified memory that the Ray head node's health check latency spiked, triggering worker disconnections on Spark 2.

A disconnected Ray worker with a live vLLM tensor parallel shard is not a graceful failure. vLLM doesn't checkpoint. The entire distributed inference job dies. You restart everything. The startup sequence for the 235B model, sharded across both Sparks, takes approximately four minutes. Those four minutes happened more times than we'd like to admit.

There are mitigations — Ray's --object-store-memory limits, vLLM's --max-model-len to constrain KV cache allocation, Ollama's OLLAMA_MAX_LOADED_MODELS=1. We applied all of them. They reduced the failure rate. They did not eliminate it. When the voice pipeline was also running, we were back to periodic crashes regardless.
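For reference, those three mitigations look roughly like this at service start. Values are illustrative, not tuned:

```shell
# Cap Ray's shared-memory object store so the head node can't be
# squeezed out by other workloads (value in bytes, illustrative).
ray start --head --port=6379 --object-store-memory=8000000000

# Keep Ollama from holding more than one model resident in the pool.
export OLLAMA_MAX_LOADED_MODELS=1

# Bound vLLM's KV cache by capping the context it must provision for,
# passed alongside the usual serve flags (16384 is illustrative):
#   vllm serve ... --max-model-len 16384
```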

The Honest Assessment

We spent the better part of two days in this loop: bring everything up, watch throughput degrade, hit a crash, reboot, try again with slightly different configuration. The Sparks are powerful machines. They are not infinitely powerful. The Grace Blackwell architecture's unified memory is genuinely impressive — having 128GB of high-bandwidth memory accessible to both GPU and CPU computation is the right call for most workloads. But it means resource contention is immediate and global in a way that traditional discrete GPU architectures aren't. On a conventional setup, the GPU OOMs and the CPU keeps running. On the DGX Spark, they're the same pool, and they'll fight each other until something gives.

We were also burning real time and real API tokens on this infrastructure thrash rather than on the actual work we were trying to do. That's worth saying plainly. Every debugging session on Ray disconnections is a session not spent on what we're here to build.

The Decision: Shelve the Voice Pipeline, Focus on What Matters

We've decided to shelve the local voice pipeline for now. Not cancel — shelve. Parakeet STT and Qwen3-TTS are still in the plan for the longer term. But they're not what Phase 4 of the actual project needs, and trying to run them simultaneously with the inference workload was generating more friction than value.

What Phase 4 needs is clean training data. Specifically: we have 7,792 real agent conversation turns from Milo's operational history, and we need to score every one of them for quality before they go anywhere near fine-tuning. The scoring methodology is an ensemble approach — dual judges, qwen32 and qwen235, each independently rating every turn on multiple dimensions, with disagreements surfaced for human review.

7,792 turns × 2 judges × multiple scoring dimensions is a lot of inference. It needs sustained, stable throughput. It needs the Sparks to be focused, not time-sharing with a voice synthesis pipeline that we don't actually need for this phase of work.
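The arithmetic above can be sketched, along with the shape of a single judgment against Ollama's HTTP API. The dimension count, prompt, and helper name are placeholders, not the real scoring methodology:

```shell
#!/bin/bash
# Back-of-envelope for the scoring workload.
TURNS=7792; JUDGES=2; DIMENSIONS=3          # dimension count illustrative
echo "total judgments: $(( TURNS * JUDGES * DIMENSIONS ))"

# One judgment: Ollama's /api/generate endpoint, non-streaming.
score_turn() {  # usage: score_turn <judge-model> <turn-text>
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$1\", \"stream\": false,
         \"prompt\": \"Rate this agent turn 1-10 for quality: $2\"}" |
    python3 -c 'import json,sys; print(json.load(sys.stdin)["response"])'
}
```

Even at three dimensions, that is tens of thousands of sequential generations, which is why sustained throughput matters more than peak.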

The Phase 4 scorer configuration is now much simpler: no distributed cluster and no voice pipeline, just the two judges running as single-node Ollama instances, qwen32 on one Spark and qwen235 on the other.

This configuration has been stable. Both judges running simultaneously, no VRAM contention issues, no Ray cluster instability, throughput consistent at ~10 tok/s for qwen32 and ~12 tok/s for qwen235 in Ollama (slightly better than vLLM for single-node inference of the quantized 235B). The Phase 4 scoring job is running.

The Rethink: Dedicated Roles, Not Multiplex Everything

There's a more general lesson here that we're going to carry forward. The DGX Spark is purpose-built hardware. The GB10's unified memory architecture is optimized for large-model inference — models that fill most of the memory pool and saturate the compute. It's not designed for running six different things that each want a slice of the pool simultaneously.

The right configuration for a two-Spark setup isn't "run everything on both and let them sort it out." It's something closer to dedicated roles. A Spark running as a stable inference server for a single large model is a fundamentally different operating mode than a Spark trying to serve as a Ray worker, an Ollama host, and a voice pipeline simultaneously. The hardware performs well in the first mode. It fights itself in the second.

What we're moving toward for steady-state operation is exactly that: each Spark serving one model, doing one job, with nothing else competing for its memory pool.

The distributed Ray + vLLM setup across both Sparks isn't going away — it's genuinely useful for long-context generation and heavy prompting where the 235B needs its full context window. But it's a mode we'll bring up explicitly for specific jobs, not a permanent running state that everything else shares.

What Comes Next

Phase 4 is the priority: score all 7,792 conversation turns, surface the outliers, figure out what the training data actually looks like before we commit to fine-tuning on it. The ensemble scorer is running. Qwen32 and Qwen235 are both stable. The Sparks are doing what they're good at.

Phase 5 is fine-tuning — and that's where having clean, well-scored training data matters more than almost anything else. A voice interface is nice. A fine-tuned model trained on garbage data is worse than no fine-tuned model. Getting Phase 4 right is the non-negotiable prerequisite.

The voice pipeline will come back. Parakeet STT is good — genuinely impressive on-device ASR latency. Qwen3-TTS produces output that's usable. When we have a clear Phase 5 path and the Sparks are configured with that in mind, the local voice interface will be worth the complexity cost. Right now it isn't.

The honest count: two days, multiple physical reboots, enough debugging cycles to fill a longer post than this. The Sparks are running well now. The configuration is simpler. The training data scoring job is making progress. That's the actual win here — not the infrastructure itself, but what we're using it to build.

Sometimes the most productive thing you can do with powerful hardware is stop asking it to do everything at once.