In the last post we pivoted from DeepSeek V4 Flash (blocked on a B200-only FlashMLA kernel) to MiniMax M2.7 — the model NVIDIA co-engineered with SGLang specifically for GB10. Day-0 validated. Should just work. Here's what "just work" looks like in practice on 2× DGX Spark.
The model downloaded fine. rsync over bond0 to Spark 1 completed at ~370 MB/s. SGLang TP=2 launch scripts ready. This should have been the easy part.
First launch: both nodes initialized NCCL, loaded 67 GB of weights each in about 5 minutes, allocated KV cache. Then SGLang began capturing CUDA graphs for all 36 batch sizes (bs=1 through bs=256).
It never finished. After 34 minutes, both nodes were still at 0/36 batch sizes captured. GPU utilization looked busy:
# nvidia-smi dmon -s u
# gpu sm mem enc dec
0 96 0 0 0
0 96 0 0 0
96% SM utilization, 0% memory bandwidth. That's a spin-lock, not real compute. Both nodes were burning cycles waiting on each other through an NCCL collective operation inside the CUDA graph — a known deadlock pattern in multi-node CUDA graph capture (confirmed in SGLang issue #19991). No amount of waiting would fix it.
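The "busy but not doing anything" signature above can be checked mechanically. The following is a minimal sketch, assuming the text format shown in the dmon sample; the function names and the thresholds (`sm_min`, `mem_max`) are hypothetical choices for illustration, not something SGLang or NVIDIA ships.

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon -s u` style output into one dict per sample row."""
    columns = None
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            # The column-name row starts with "gpu"; other comment rows
            # (units, an echoed command) are ignored.
            if fields and fields[0] == "gpu":
                columns = fields
            continue
        values = line.split()
        if columns is None or len(values) != len(columns):
            continue
        # dmon prints "-" for unsupported metrics; store those as None.
        samples.append({c: (int(v) if v.isdigit() else None)
                        for c, v in zip(columns, values)})
    return samples


def looks_like_spin(samples, sm_min=90, mem_max=1):
    """True if every usable sample shows high SM use with near-zero memory use.

    The thresholds are assumptions picked to match the pattern in this post.
    """
    usable = [s for s in samples
              if s.get("sm") is not None and s.get("mem") is not None]
    return bool(usable) and all(
        s["sm"] >= sm_min and s["mem"] <= mem_max for s in usable)


SAMPLE = """# nvidia-smi dmon -s u
# gpu sm mem enc dec
0 96 0 0 0
0 96 0 0 0"""

print(looks_like_spin(parse_dmon(SAMPLE)))
```

A check like this is only a hint: a genuinely compute-bound kernel can also show low memory utilization, so it is worth pairing with the capture progress counter in the SGLang log.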
The SGLang CLI flag --disable-cuda-graph skips graph capture entirely and falls back to eager execution. We also tried the env var SGLANG_DISABLE_CUDA_GRAPH=1; it had no effect in v0.5.12. Use the flag, not the env var.
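As a sketch of where the flag goes, the helper below assembles a per-node launch command. The model path and rendezvous port are hypothetical placeholders, and the helper itself is illustrative; only the flags are SGLang's own.

```python
import shlex


def sglang_launch_argv(model_path, node_rank, dist_init_addr, tp=2, nnodes=2):
    """Build the argv for one node of a multi-node SGLang launch, graphs disabled."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--dist-init-addr", dist_init_addr,
        # Passed as a CLI flag; the env var route did not work for us.
        "--disable-cuda-graph",
    ]


# Hypothetical model path and port, for illustration only.
argv = sglang_launch_argv("/models/MiniMax-M2.7", node_rank=0,
                          dist_init_addr="10.0.0.1:50000")
print(shlex.join(argv))
```

Run the same command on the second node with `node_rank=1` and the same `dist_init_addr`.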
This one is more insidious. On GB10, the GPU and system RAM share one physical pool — there's no separate VRAM. When a container crashes or is docker stop'd hard, the CUDA driver holds onto that GPU memory allocation at the kernel level. It doesn't release it when the container exits. It doesn't release it when you docker rm. nvidia-smi shows 0 MB used. free -g shows plenty of RAM. But the memory is gone.
In practice: each failed container run on our Spark 2 leaked ~33 GB of CUDA memory. The next launch attempt fails immediately with:
RuntimeError: The memory capacity is unbalanced. Some GPUs may be occupied by other processes. pre_model_load_memory=33.77, local_gpu_memory=101.33, local_gpu_memory * 0.9=91.20
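SGLang runs a memory balance check before loading begins. Going by the numbers in the message, pre_model_load_memory (33.77) is compared against 90% of local_gpu_memory (101.33 × 0.9 = 91.20), and 33.77 falls far short. With ~33 GB leaked, you fail that check instantly, before a single weight is loaded.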
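The arithmetic in that error message can be reproduced by hand. This is a sketch of the comparison as it appears from the message text; the function name is hypothetical and the direction of the inequality is inferred from the printed values, not copied from SGLang's source.

```python
def memory_balance_ok(pre_model_load_memory, local_gpu_memory, ratio=0.9):
    """Return True if the reported memory passes the 90% balance threshold.

    Assumption: the launch fails when pre_model_load_memory is below
    local_gpu_memory * ratio, which is what the error message implies.
    """
    return pre_model_load_memory >= local_gpu_memory * ratio


# Values taken from the error message above (GB).
threshold = round(101.33 * 0.9, 2)
print(threshold, memory_balance_ok(33.77, 101.33))
```

The useful takeaway is that the check is relative: it passes or fails on the gap between the two numbers, so a leak on one node is enough to block the whole launch.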
We found no way to release leaked CUDA memory on GB10 other than a reboot. Not docker restart. Not echo 3 > /proc/sys/vm/drop_caches. Not killing processes. Plan for ~2 minutes of downtime per failed attempt. If you're iterating on launch configs, this compounds fast.
After a fresh reboot, our two Sparks don't start equal:
| Node | Available after reboot | Available after model load |
|---|---|---|
| Spark 1 (rank 0, head) | 107 GB | 35 GB |
| Spark 2 (rank 1, worker) | 100 GB | 28 GB |
The 7 GB gap is consistent — Spark 2 had the original HuggingFace download cached in its page cache, plus it runs a few more background services. With --mem-fraction-static 0.85, Spark 2 was allocating KV cache it literally didn't have headroom for, hitting swap (5 GB in use), and loading the model 4× slower than Spark 1. Rank 0 then timed out waiting for rank 1 and crashed.
Before launch, run sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' to free the Linux page cache (~29 GB in our case). Set --mem-fraction-static 0.75 instead of 0.85. Also add --dist-timeout 1800: SGLang's default rank-sync timeout is short enough to cause spurious failures when nodes have different load times.
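A small pre-flight check makes the page-cache and swap situation visible before each launch. The sketch below parses /proc/meminfo text; the function names and the `cache_limit_gib` threshold are hypothetical, and it reports GiB (1024-based), which differs slightly from the rounded GB figures in the table.

```python
KIB_PER_GIB = 1024 ** 2


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}; most values are in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def preflight(info, cache_limit_gib=4.0):
    """Summarize free memory, page cache, and swap use for one node."""
    cached_gib = info.get("Cached", 0) / KIB_PER_GIB
    swap_used_gib = (info.get("SwapTotal", 0) - info.get("SwapFree", 0)) / KIB_PER_GIB
    return {
        "free_gib": info.get("MemFree", 0) / KIB_PER_GIB,
        "cached_gib": cached_gib,
        "swap_used_gib": swap_used_gib,
        # Assumption: a large page cache or any swap use means the node
        # should drop caches before the next launch attempt.
        "drop_caches_first": cached_gib > cache_limit_gib or swap_used_gib > 0,
    }
```

On a Spark, feed it the live file with `preflight(parse_meminfo(open("/proc/meminfo").read()))` and compare the result across both nodes before starting the ranks.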
After hitting these walls, we dug into SGLang GitHub issues and community reports. A few things worth knowing:
TP=2 may not be the right parallelism strategy for MoE models on multi-node DGX Spark. Standard tensor parallelism (TP) was designed for dense models. MoE models route sparsely: only ~10B of M2.7's 230B parameters activate per token. With TP, you still broadcast all activations to all GPUs and split the KV cache, which wastes bandwidth and memory. Expert parallelism (EP) instead routes each token's computation to the nodes where its experts live. On two Sparks with 109 Gb/s RoCE, EP could be meaningfully more efficient.
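To make the expert-parallel idea concrete, here is a toy dispatch sketch: each token picks a few experts, and work is grouped by the node that owns each expert. The contiguous placement scheme, the expert count, and the function names are all assumptions for illustration; this is not how SGLang implements EP.

```python
from collections import defaultdict


def expert_owner(expert_id, num_experts, num_nodes):
    """Contiguous placement: experts are split into equal blocks, one per node."""
    per_node = num_experts // num_nodes
    return min(expert_id // per_node, num_nodes - 1)


def dispatch(token_experts, num_experts, num_nodes):
    """Group (token, expert) pairs by the node that owns each expert."""
    plan = defaultdict(list)
    for token, experts in enumerate(token_experts):
        for expert_id in experts:
            node = expert_owner(expert_id, num_experts, num_nodes)
            plan[node].append((token, expert_id))
    return dict(plan)


# Three tokens, each routed to two of eight toy experts, across two nodes.
plan = dispatch([[0, 5], [6, 7], [1, 2]], num_experts=8, num_nodes=2)
print(plan)
```

Token 1 never touches node 0 in this example, which is the bandwidth argument in miniature: only tokens whose experts live remotely need to cross the link.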
That "loading now" status in the previous section never resolved. The final SGLang attempt hit a new wall: deep_gemm — a C extension for FP8 matrix scale factor layout transforms — crashed on SM 12.1. It's not in the supported hardware list. Patching it in-container (a try/except around the failing function, falling back to natural-layout scale factors) got weights loaded but post-load quantization failed in a different place. After five launch attempts and five reboots, the pattern was clear: community SGLang + MiniMax M2.7 FP8 + TP=2 on GB10 doesn't have a working recipe. Nobody has published one.
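The in-container patch described above followed a plain try/except fallback pattern. The sketch below shows that pattern in isolation; `transform_scale_layout` and `natural_layout` are hypothetical stand-ins, not deep_gemm's actual function names, and the simulated failure is only there to exercise the fallback.

```python
import functools
import logging

log = logging.getLogger("fallback")


def with_fallback(fallback):
    """Decorator: call `fallback` with the same arguments if the primary raises."""
    def decorate(primary):
        @functools.wraps(primary)
        def wrapper(*args, **kwargs):
            try:
                return primary(*args, **kwargs)
            except (RuntimeError, NotImplementedError) as exc:
                log.warning("%s failed (%s); using fallback", primary.__name__, exc)
                return fallback(*args, **kwargs)
        return wrapper
    return decorate


def natural_layout(scales):
    """Hypothetical fallback: leave scale factors in their original layout."""
    return list(scales)


@with_fallback(natural_layout)
def transform_scale_layout(scales):
    """Hypothetical primary path that fails on unsupported hardware."""
    raise RuntimeError("unsupported architecture")


print(transform_scale_layout([0.5, 0.25]))
```

As the post notes, this kind of patch only moved the failure: the weights loaded, but a later step that expected the transformed layout failed instead.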
It's worth noting: SGLang has excellent MoE support and GB10 compatibility in general. The specific failure is FP8-weight MiniMax M2.7 on multi-node TP, which isn't a common combination in the wild. More on SGLang below.
After the fifth reboot, we did what we should have done earlier: searched the NVIDIA DGX Spark forums instead of trying to brute-force a new recipe. That's where we found two things.
First: someone had already built a GB10-specific NVFP4 quantization of M2.7. saricles/MiniMax-M2.7-NVFP4-GB10-AC — 130.6 GB on disk vs 229 GB for FP8. Complete with a DEPLOYMENT.md documenting a five-phase tuning pass on exactly our hardware. Benchmarks included.
Second: the model was already on our Sparks. Downloaded at some point during earlier experimentation, sitting at ~/models/MiniMax-M2.7-NVFP4-GB10-AC on both nodes. 27 days of iteration, 25+ failed launch attempts, and the working quant had been sitting on disk the whole time.
The FP8 model is 229 GB. Split across TP=2, each node holds ~115 GB of weights — which leaves ~5 GB for KV cache and runtime overhead in a 119 GB node. That's not enough room for the KV cache that 196K context requires.
NVFP4 brings that to 130.6 GB total — ~65 GB per node with TP=2. Each node has ~50 GB headroom for KV cache. With --gpu-memory-utilization 0.82 (lowered from 0.88 to account for parakeet-asr and csm-tts consuming ~16 GB between both nodes), the memory check passes cleanly.
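The per-node budget is simple enough to check directly. This sketch uses the sizes quoted above and assumes weights split evenly across TP ranks; the function name is hypothetical, and the result is headroom before runtime overhead, other services, or the utilization cap are taken into account.

```python
def per_node_budget(model_gb, node_gb=119.0, tp=2):
    """Weights held per node and what is left over, assuming an even TP split."""
    weights = model_gb / tp
    return {
        "weights_gb": round(weights, 1),
        "headroom_gb": round(node_gb - weights, 1),
    }


fp8 = per_node_budget(229.0)     # FP8 checkpoint size from the post
nvfp4 = per_node_budget(130.6)   # NVFP4 checkpoint size from the post
print(fp8, nvfp4)
```

The raw NVFP4 headroom comes out a little above the ~50 GB quoted above, which is consistent once a few GB of runtime overhead are subtracted.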
NVFP4 on GB10 turns out to be its own rabbit hole. When vLLM initializes NVFP4 MoE on SM 12.1, it logs:
Your GPU does not have native support for FP4 computation but FP4 quantization is being used.
Weight-only FP4 compression will be used leveraging the Marlin kernel.
This may degrade performance for compute-heavy workloads.
That warning sounds bad. It isn't. SM 12.1 (GB10) doesn't expose native FP4 to vLLM's FlashInfer CUTLASS MoE path, which has its own SM 12.1 maturity issues. The Marlin NVFP4 kernel — 4-bit weights, FP16 compute — is actually the faster path on this hardware. The DEPLOYMENT.md measured 35.8 tok/s decode (throughput-stable profile) and 48.3 tok/s peak on agentic coding prompts (ngram speculative decoding, 5 tokens lookahead), versus ~26 tok/s for the FP8 model run that came closest to working.
The env vars that unlock it:
export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_MARLIN_USE_ATOMIC_ADD=1
We're using eugr/spark-vllm-docker — a community project purpose-built for dual-Spark vLLM. It handles Ray cluster startup, NCCL/RoCE interface detection, and container distribution across both nodes from a single command. The image (vllm-node:latest) was already on both Sparks. Launch:
GPU_MEM_UTIL=0.82 \
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v /home/milo/models:/models -e GPU_MEM_UTIL=0.82" \
./launch-cluster.sh \
-n 10.0.0.1,10.0.0.2 \
--eth-if bond0 \
--ib-if rocep1s0f0,roceP2p1s0f0 \
--launch-script /home/milo/models/MiniMax-M2.7-NVFP4-GB10-AC/run_vllm.sh
Inside run_vllm.sh: vLLM serve with --tensor-parallel-size 2 --distributed-executor-backend ray, the Marlin env vars above, --kv-cache-dtype fp8_e4m3, cudagraph_mode:none (PIECEWISE was tested and regressed 12–20% on dual-Spark Ray TP), and ngram speculative decoding for the agentic profile.
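For readers without the script at hand, the settings listed above can be sketched as an environment plus an argv. This is a reconstruction from the description, not the contents of run_vllm.sh: the in-container model path and the JSON form of the cudagraph setting are assumptions, and the ngram speculative-decoding options are left out because the post does not give their exact form.

```python
import json
import os

# The four Marlin-related variables listed earlier in the post.
MARLIN_ENV = {
    "VLLM_NVFP4_GEMM_BACKEND": "marlin",
    "VLLM_USE_FLASHINFER_MOE_FP4": "0",
    "VLLM_TEST_FORCE_FP8_MARLIN": "1",
    "VLLM_MARLIN_USE_ATOMIC_ADD": "1",
}


def vllm_serve_argv(model_path, gpu_mem_util=0.82):
    """Build a `vllm serve` argv matching the settings described in the post."""
    return [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", "2",
        "--distributed-executor-backend", "ray",
        "--kv-cache-dtype", "fp8_e4m3",
        "--gpu-memory-utilization", str(gpu_mem_util),
        # Assumption: cudagraph_mode is passed through the compilation config.
        "--compilation-config", json.dumps({"cudagraph_mode": "NONE"}),
    ]


def launch_env(base=None):
    """Copy the base environment and overlay the Marlin variables."""
    env = dict(os.environ if base is None else base)
    env.update(MARLIN_ENV)
    return env


# Assumed in-container path, based on the /models mount in the launch command.
argv = vllm_serve_argv("/models/MiniMax-M2.7-NVFP4-GB10-AC")
print(" ".join(argv))
```

Keeping the environment and argv in one place makes it easier to diff a working configuration against a failing one between reboots.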
Both Sparks up, Ray TP=2 over bond0 (ConnectX-7 RoCE), vLLM + Marlin NVFP4 MoE kernel. Measured throughput on warm requests: 28.8 tok/s (512-token code generation) and 36.6 tok/s (256-token math reasoning). In line with the published benchmark: 33–36 tok/s average on the agentic prompt set. Total time from clean nodes to first inference: ~18 minutes.
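Throughput figures like these come down to generated tokens divided by wall-clock time. The sketch below shows that calculation with a timing wrapper; the function names are hypothetical and the numbers in the example are neutral placeholders, not measurements from our cluster.

```python
import time


def tokens_per_second(completion_tokens, elapsed_s):
    """Generated tokens divided by wall-clock seconds for one request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def timed(generate):
    """Run `generate()` (which returns a token count) and return tok/s."""
    start = time.perf_counter()
    completion_tokens = generate()
    elapsed = time.perf_counter() - start
    return tokens_per_second(completion_tokens, elapsed)


# Placeholder numbers: 100 tokens in 4 seconds.
print(tokens_per_second(100, 4.0))
```

Measured this way, the rate includes prefill time, so short prompts with long outputs give the closest approximation to pure decode speed. Send one or two warm-up requests first so model and cache initialization do not skew the result.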
| Problem | Fix |
|---|---|
| SGLang CUDA graph capture deadlock (96% SM, 0% mem BW) | --disable-cuda-graph flag — env var doesn't work in v0.5.12 |
| 33 GB CUDA memory leak after container kill on GB10 | Reboot — nothing else releases unified-memory CUDA allocs |
| Rank sync timeout (node load speed mismatch) | --dist-timeout 1800 |
| Spark 2 OOM / swap during model load | echo 3 > /proc/sys/vm/drop_caches + --mem-fraction-static 0.75 |
| deep_gemm FP8 crash on SM 12.1 (SGLang) | Pivot — FP8 M2.7 + SGLang TP=2 has no working GB10 recipe |
| NVFP4 memory check failure (vLLM, 0.88 util) | Lower to --gpu-memory-utilization 0.82 when other services are running |
| NVFP4 FlashInfer CUTLASS MoE path (SM 12.1 maturity) | Marlin NVFP4 backend — VLLM_NVFP4_GEMM_BACKEND=marlin |
The conclusion here isn't "SGLang bad, vLLM good." M2.7 FP8 + SGLang TP=2 is a specific combination that nobody in the community has published a working recipe for yet. SGLang has better agentic performance characteristics (RadixAttention prefix cache, EAGLE-3 speculative decoding with dedicated draft models) and is the right choice for several other models that run well on GB10.
The dual-Spark cluster is now working infrastructure. M2.7 NVFP4 is loaded and serving. The SGLang experiments will continue, just on models where the community has already validated the path.