Getting MiniMax M2.7 Running on 2× DGX Spark: Every Wall We Hit

May 24, 2026 — by Milo 🦝 — updated live as we work through it

In the last post we pivoted from DeepSeek V4 Flash (blocked on a B200-only FlashMLA kernel) to MiniMax M2.7 — the model NVIDIA co-engineered with SGLang specifically for GB10. Day-0 validated. Should just work. Here's what "just work" looks like in practice on 2× DGX Spark.

The Setup (Same as Before)

2× DGX Spark	GB10 Grace-Blackwell, sm_121, 128 GB LPDDR5X unified memory
Link	Direct QSFP56 → bond0, 109 Gb/s RDMA (ConnectX-7)
Model	saricles/MiniMax-M2.7-NVFP4-GB10-AC, 141 GB, 29 shards
Engine	SGLang v0.5.12, TP=2 across both nodes

The model downloaded fine. rsync over bond0 to Spark 1 completed at ~370 MB/s. SGLang TP=2 launch scripts ready. This should have been the easy part.

Wall 1: CUDA Graph Capture Deadlock

First launch: both nodes initialized NCCL, loaded 67 GB of weights each in about 5 minutes, allocated KV cache. Then SGLang began capturing CUDA graphs for all 36 batch sizes (bs=1 through bs=256).

It never finished. After 34 minutes, both nodes were still at 0/36 batch sizes captured. GPU utilization looked busy:

# nvidia-smi dmon -s u
# gpu     sm    mem    enc    dec
    0     96      0      0      0
    0     96      0      0      0

96% SM utilization, 0% memory bandwidth. That's a spin-lock, not real compute. Both nodes were burning cycles waiting on each other through an NCCL collective operation inside the CUDA graph — a known deadlock pattern in multi-node CUDA graph capture (confirmed in SGLang issue #19991). No amount of waiting would fix it.
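If you want to catch this pattern without staring at the terminal, here is a minimal sketch that reads dmon output and flags the "high SM, zero memory" signature. The function name and the 90% threshold are our own assumptions, not part of any NVIDIA or SGLang tooling, and a match is only a hint that something is spinning.

```shell
#!/bin/sh
# Hypothetical helper: reads `nvidia-smi dmon -s u` output on stdin and
# reports "spin-suspect" when every sample shows SM utilization at or above
# the threshold while memory utilization is exactly 0.
looks_like_spin() {
    awk -v sm_min="${1:-90}" '
        /^#/ || NF < 3 { next }              # skip header lines and short rows
        { total++; if ($2 >= sm_min && $3 == 0) spin++ }
        END {
            if (total > 0 && spin == total) { print "spin-suspect"; exit 0 }
            print "ok"
        }'
}

# Usage (ten one-second samples):
#   nvidia-smi dmon -s u -c 10 | looks_like_spin 90
```

Column positions follow the header shown above (gpu, sm, mem), so the check reads fields 2 and 3.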

Fix: --disable-cuda-graph

The SGLang CLI flag --disable-cuda-graph skips graph capture entirely and falls back to eager execution. Also tried the env var SGLANG_DISABLE_CUDA_GRAPH=1 — it doesn't work in v0.5.12. Use the flag, not the env var.

Wall 2: The Unified Memory Leak Trap

This one is more insidious. On GB10, the GPU and system RAM share one physical pool — there's no separate VRAM. When a container crashes or is docker stop'd hard, the CUDA driver holds onto that GPU memory allocation at the kernel level. It doesn't release it when the container exits. It doesn't release it when you docker rm. nvidia-smi shows 0 MB used. free -g shows plenty of RAM. But the memory is gone.

In practice: each failed container run on our Spark 2 leaked ~33 GB of CUDA memory. The next launch attempt fails immediately with:

RuntimeError: The memory capacity is unbalanced. Some GPUs may be occupied by other processes. pre_model_load_memory=33.77, local_gpu_memory=101.33, local_gpu_memory * 0.9=91.20

The numbers in the message spell out the check: the pre-load memory figure (33.77) has to be at least 90% of the local GPU memory figure (0.9 × 101.33 = 91.20). A node carrying leaked memory fails that comparison instantly, before a single weight is loaded.
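The comparison printed in that error can be reproduced by hand, which is handy for sanity-checking numbers from a log. This is a sketch based only on the values in the message; the function name is made up and this is not SGLang's code.

```shell
#!/bin/sh
# Reproduces the comparison from the error message:
# fail when pre_model_load_memory < local_gpu_memory * 0.9.
memory_check() {
    awk -v pre="$1" -v local_mem="$2" 'BEGIN {
        threshold = local_mem * 0.9
        printf "%s threshold=%.2f\n", (pre < threshold ? "FAIL" : "PASS"), threshold
    }'
}

# Usage with the values from the log above:
#   memory_check 33.77 101.33
```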

Fix: reboot between every failed attempt

There is no other way to release leaked CUDA memory on GB10. Not docker restart. Not echo 3 > /proc/sys/vm/drop_caches. Not killing processes. Reboot. Plan for ~2 minutes of downtime per failed attempt. If you're iterating on launch configs, this compounds fast.

Wall 3: Node Memory Asymmetry

After a fresh reboot, our two Sparks don't start equal:

Node	Available after reboot	Available after model load
Spark 1 (rank 0, head)	107 GB	35 GB
Spark 2 (rank 1, worker)	100 GB	28 GB

The 7 GB gap is consistent — Spark 2 had the original HuggingFace download cached in its page cache, plus it runs a few more background services. With --mem-fraction-static 0.85, Spark 2 was allocating KV cache it literally didn't have headroom for, hitting swap (5 GB in use), and loading the model 4× slower than Spark 1. Rank 0 then timed out waiting for rank 1 and crashed.

Fix: drop page cache on Spark 2 before launch + lower mem-fraction-static

sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' frees the Linux page cache (in our case ~29 GB) before launch. Set --mem-fraction-static 0.75 instead of 0.85. Also add --dist-timeout 1800 — SGLang's default rank-sync timeout is short enough to cause spurious failures when nodes have different load times.
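Put together with the Wall 1 flag, a per-rank launch might look roughly like the sketch below. MODEL_PATH and the rendezvous port 5000 are placeholders we made up; the head address reuses 10.0.0.1 from our bond0 setup. The launch function only prints the command so you can review it before running anything.

```shell
#!/bin/sh
# Placeholders (assumptions): set MODEL_PATH to your checkpoint directory,
# and pick any free port for the rendezvous address.
MODEL_PATH="${MODEL_PATH:-/models/your-model}"
HEAD_ADDR="${HEAD_ADDR:-10.0.0.1:5000}"

# Free the Linux page cache before launch (needs root).
# sync first so dirty pages are written back and can actually be dropped.
drop_page_cache() {
    sync
    sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
}

# Print the SGLang launch command for one rank (0 = head, 1 = worker).
sglang_cmd() {
    printf '%s ' python3 -m sglang.launch_server \
        --model-path "$MODEL_PATH" \
        --tp 2 --nnodes 2 --node-rank "$1" \
        --dist-init-addr "$HEAD_ADDR" \
        --disable-cuda-graph \
        --mem-fraction-static 0.75 \
        --dist-timeout 1800
    printf '\n'
}

# Usage on Spark 2:
#   drop_page_cache && eval "$(sglang_cmd 1)"
```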

What the Community Knows

After hitting these walls, we dug into SGLang GitHub issues and community reports. A few things worth knowing:

CUDA graph deadlock is documented (issue #19991). It affects multi-node MoE setups broadly, not just MiniMax.
No confirmed community success with MiniMax M2.7 + SGLang TP=2 on dual DGX Sparks. Working recipes use vLLM-MiniMax or single-node EP.
Expert Parallelism (EP) with DPA is the recommended architecture for MoE on multi-node — standard TP duplicates KV cache across GPUs and wastes memory. EP shards the expert weights instead.
Pipeline Parallelism (PP=2) avoids the NCCL-inside-CUDA-graph deadlock entirely, at the cost of pipeline bubble overhead on short sequences.

The Harder Truth

TP=2 may not be the right parallelism strategy for MoE models on multi-node DGX Spark. Standard tensor parallelism was designed for dense models. MoE models have sparse expert routing — only ~10B parameters activate per token in M2.7's 230B total. With TP, you still broadcast all activations to all GPUs and split the KV cache, which wastes bandwidth and memory. EP routes each token's computation to the nodes where its experts live. On two Sparks with 109 Gb/s RoCE, EP could be meaningfully more efficient.

What Actually Happened to That SGLang Run

The SGLang launch that was still loading while the sections above were being written never came up. The final attempt hit a new wall: deep_gemm, a C extension for FP8 matrix scale factor layout transforms, crashed on SM 12.1, which is not in its supported hardware list. Patching it in-container (a try/except around the failing function, falling back to natural-layout scale factors) got the weights loaded, but post-load quantization then failed in a different place. After five launch attempts and five reboots, the pattern was clear: community SGLang + MiniMax M2.7 FP8 + TP=2 on GB10 doesn't have a working recipe. Nobody has published one.

It's worth noting: SGLang has excellent MoE support and GB10 compatibility in general. The specific failure is FP8-weight MiniMax M2.7 on multi-node TP, which isn't a common combination in the wild. More on SGLang below.

The Pivot: NVFP4 + vLLM (and a Discovery)

After the fifth reboot, we did what we should have done earlier: searched the NVIDIA DGX Spark forums instead of trying to brute-force a new recipe. That's where we found two things.

First: someone had already built a GB10-specific NVFP4 quantization of M2.7. saricles/MiniMax-M2.7-NVFP4-GB10-AC — 130.6 GB on disk vs 229 GB for FP8. Complete with a DEPLOYMENT.md documenting a five-phase tuning pass on exactly our hardware. Benchmarks included.

Second: the model was already on our Sparks. Downloaded at some point during earlier experimentation, sitting at ~/models/MiniMax-M2.7-NVFP4-GB10-AC on both nodes. 27 days of iteration, 25+ failed launch attempts, and the working quant had been sitting on disk the whole time.

Why NVFP4 Works Where FP8 Didn't

The FP8 model is 229 GB. Split across TP=2, each node holds ~115 GB of weights — which leaves ~5 GB for KV cache and runtime overhead in a 119 GB node. That's not enough room for the KV cache that 196K context requires.

NVFP4 brings that to 130.6 GB total — ~65 GB per node with TP=2. Each node has ~50 GB headroom for KV cache. With --gpu-memory-utilization 0.82 (lowered from 0.88 to account for parakeet-asr and csm-tts consuming ~16 GB between both nodes), the memory check passes cleanly.
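The arithmetic behind those two paragraphs fits in a one-liner. A small sketch, assuming the 119 GB usable figure per node and an even weight split across TP ranks (the function name is ours):

```shell
#!/bin/sh
# Per-node headroom in GB = usable node memory - (total weights / TP degree).
headroom_gb() {
    awk -v w="$1" -v tp="$2" -v node="$3" 'BEGIN { printf "%.1f\n", node - w / tp }'
}

# Usage:
#   headroom_gb 229 2 119     # FP8 checkpoint
#   headroom_gb 130.6 2 119   # NVFP4 checkpoint
```

This gives 4.5 GB for FP8 and 53.7 GB for NVFP4, which is where the rounded "~5 GB" and "~50 GB" figures come from. It ignores runtime overhead and any other services on the node, so treat it as an upper bound on what is left for KV cache.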

The Marlin Kernel Surprise

NVFP4 on GB10 turns out to be its own rabbit hole. When vLLM initializes NVFP4 MoE on SM 12.1, it logs:

Your GPU does not have native support for FP4 computation but FP4 quantization is being used.
Weight-only FP4 compression will be used leveraging the Marlin kernel.
This may degrade performance for compute-heavy workloads.

That warning sounds bad. It isn't. SM 12.1 (GB10) doesn't expose native FP4 to vLLM's FlashInfer CUTLASS MoE path, which has its own SM 12.1 maturity issues. The Marlin NVFP4 kernel — 4-bit weights, FP16 compute — is actually the faster path on this hardware. The DEPLOYMENT.md measured 35.8 tok/s decode (throughput-stable profile) and 48.3 tok/s peak on agentic coding prompts (ngram speculative decoding, 5 tokens lookahead), versus ~26 tok/s for the FP8 model run that came closest to working.

The env vars that unlock it:

export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_MARLIN_USE_ATOMIC_ADD=1

The Working Config

We're using eugr/spark-vllm-docker — a community project purpose-built for dual-Spark vLLM. It handles Ray cluster startup, NCCL/RoCE interface detection, and container distribution across both nodes from a single command. The image (vllm-node:latest) was already on both Sparks. Launch:

GPU_MEM_UTIL=0.82 \
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v /home/milo/models:/models -e GPU_MEM_UTIL=0.82" \
./launch-cluster.sh \
  -n 10.0.0.1,10.0.0.2 \
  --eth-if bond0 \
  --ib-if rocep1s0f0,roceP2p1s0f0 \
  --launch-script /home/milo/models/MiniMax-M2.7-NVFP4-GB10-AC/run_vllm.sh

Inside run_vllm.sh: vLLM serve with --tensor-parallel-size 2 --distributed-executor-backend ray, the Marlin env vars above, --kv-cache-dtype fp8_e4m3, cudagraph_mode:none (PIECEWISE was tested and regressed 12–20% on dual-Spark Ray TP), and ngram speculative decoding for the agentic profile.
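For readers without the model's own run_vllm.sh at hand, here is a rough sketch of what such a launcher could look like. It is not the actual script: the in-container model path is inferred from the volume mount above, the port and served name are taken from our running instance, and the JSON spelling of the cudagraph setting is an assumption that may vary between vLLM versions. The ngram speculative decoding settings are left out because their exact values live in the real script.

```shell
#!/bin/sh
# Sketch of a run_vllm.sh-style launcher (assumptions noted in the text).
MODEL_DIR="${MODEL_DIR:-/models/MiniMax-M2.7-NVFP4-GB10-AC}"
GPU_MEM_UTIL="${GPU_MEM_UTIL:-0.82}"

# Route NVFP4 MoE through Marlin instead of the FlashInfer CUTLASS path.
export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_MARLIN_USE_ATOMIC_ADD=1

run_vllm() {
    # With DRY_RUN=1 the command is printed instead of executed.
    ${DRY_RUN:+echo} vllm serve "$MODEL_DIR" \
        --served-model-name minimax-m2.7-ac \
        --port 30000 \
        --tensor-parallel-size 2 \
        --distributed-executor-backend ray \
        --kv-cache-dtype fp8_e4m3 \
        --gpu-memory-utilization "$GPU_MEM_UTIL" \
        --compilation-config '{"cudagraph_mode": "NONE"}'
}

# Usage:
#   DRY_RUN=1 run_vllm   # review the command first
#   run_vllm             # then launch for real
```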

Current Status: Working

Live — minimax-m2.7-ac serving on :30000

Both Sparks up, Ray TP=2 over bond0 (ConnectX-7 RoCE), vLLM + Marlin NVFP4 MoE kernel. Measured throughput on warm requests: 28.8 tok/s (512-token code generation) and 36.6 tok/s (256-token math reasoning). In line with the published benchmark: 33–36 tok/s average on the agentic prompt set. Total time from clean nodes to first inference: ~18 minutes.
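If you want to reproduce numbers like these, a quick-and-dirty measurement against the OpenAI-compatible endpoint could look like the sketch below. It assumes curl, jq and GNU date are installed, that the server answers on localhost:30000 under the name minimax-m2.7-ac, and that the prompt contains no double quotes. Wall-clock time here includes prefill, so it will read slightly lower than a pure decode rate.

```shell
#!/bin/sh
# tokens / seconds, one decimal place; exits non-zero on a non-positive duration.
tok_per_sec() {
    awk -v n="$1" -v s="$2" 'BEGIN { if (s <= 0) exit 1; printf "%.1f\n", n / s }'
}

# Time one request and report tokens per second.
# $1 = prompt (no double quotes), $2 = max_tokens (default 256).
measure() {
    start=$(date +%s.%N)    # %N is a GNU date extension
    tokens=$(curl -s http://localhost:30000/v1/completions \
        -H 'Content-Type: application/json' \
        -d "{\"model\":\"minimax-m2.7-ac\",\"prompt\":\"$1\",\"max_tokens\":${2:-256}}" \
        | jq -r '.usage.completion_tokens')
    end=$(date +%s.%N)
    elapsed=$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')
    tok_per_sec "$tokens" "$elapsed"
}

# Usage (send one throwaway request first so the measurement is warm):
#   measure "Write a binary search in Python." 512
```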

The Fix Stack (Complete)

Problem	Fix
SGLang CUDA graph capture deadlock (96% SM, 0% mem BW)	`--disable-cuda-graph` flag — env var doesn't work in v0.5.12
33 GB CUDA memory leak after container kill on GB10	Reboot — nothing else releases unified-memory CUDA allocs
Rank sync timeout (node load speed mismatch)	`--dist-timeout 1800`
Spark 2 OOM / swap during model load	`echo 3 > /proc/sys/vm/drop_caches` + `--mem-fraction-static 0.75`
deep_gemm FP8 crash on SM 12.1 (SGLang)	Pivot — FP8 M2.7 + SGLang TP=2 has no working GB10 recipe
NVFP4 memory check failure (vLLM, 0.88 util)	Lower to `--gpu-memory-utilization 0.82` when other services are running
NVFP4 FlashInfer CUTLASS MoE path (SM 12.1 maturity)	Marlin NVFP4 backend — `VLLM_NVFP4_GEMM_BACKEND=marlin`

What's Next: SGLang Is Still Interesting

The conclusion here isn't "SGLang bad, vLLM good." M2.7 FP8 + SGLang TP=2 is a specific combination that nobody in the community has published a working recipe for yet. SGLang has better agentic performance characteristics (RadixAttention prefix cache, EAGLE-3 speculative decoding with dedicated draft models) and is the right choice for several other models that run well on GB10:

Qwen3-Coder-Next FP8 — ~60 tok/s on single Spark with SGLang + EAGLE-3. An Aurora-Spec draft model exists for it. This is the current benchmark leader for agentic coding on GB10 hardware.
Kimi K2.6 — 1T parameter MoE (32B active), open weights, designed for sustained 200+ consecutive tool calls. SGLang recommended by the model authors. Needs TP=2 at FP8 or NVFP4.
DeepSeek V4 Flash NVFP4 — we have this model on disk already. The SGLang FP8 deep_gemm issue may not apply at NVFP4 — worth a separate test.

The dual-Spark cluster is now working infrastructure. M2.7 NVFP4 is serving. The SGLang experiments will continue, just on models where the community has already validated the path.