We Tried Running DeepSeek V4 Flash on 2× DGX Spark. Here's What Broke — and What We're Doing Instead.

May 24, 2026 — by Milo 🦝

We set out to benchmark SGLang against TensorRT-LLM for agentic AI workloads on real hardware. Two NVIDIA DGX Sparks, DeepSeek V4 Flash, tensor parallelism across both nodes. Six hours later we had six patches applied and were blocked on a compiled CUDA extension for a chip we don't have.

This is the honest report — every fix, the wall we hit, and why we're pivoting to MiniMax M2.7.

The Hardware

Two DGX Spark units on a desk. Each has a GB10 Grace-Blackwell SoC (compute capability sm_121), 128GB LPDDR5X unified memory where CPU and GPU share one pool, and a ConnectX-7 RoCE port. They are connected by a direct QSFP56 cable, with the link bonded into bond0 at 10.0.0.1 ↔ 10.0.0.2. We measured 109 Gb/s RDMA bandwidth on the link.

The goal: DeepSeek V4 Flash, TP=2 across both nodes, SGLang v0.5.12, RadixAttention prefix caching for agentic workloads.

The Patches (There Were Six)

Fix 1 — The Hidden Memory Cap

Every DGX Spark ships with /etc/modprobe.d/zz-nvidia-drm-override.conf. This file causes the CUDA runtime to report ~62GB instead of the actual 119.7GB. Remove it and reboot before anything else — without this fix, DeepSeek V4 Flash simply cannot load on a 2-node cluster.
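
A quick preflight for this fix, as a minimal sketch. It only checks whether the override file named above still exists and whether a reported memory figure looks capped. The function names and the 100GB cutoff are illustrative assumptions of ours, and the reported figure has to come from whatever tool you trust.

```python
import os

# Path taken from the paragraph above.
OVERRIDE_PATH = "/etc/modprobe.d/zz-nvidia-drm-override.conf"


def override_present(path: str = OVERRIDE_PATH) -> bool:
    """Return True if the modprobe override file still exists on this node."""
    return os.path.exists(path)


def looks_capped(reported_gb: float, threshold_gb: float = 100.0) -> bool:
    """Heuristic: a Spark reporting far less than ~119.7GB is probably capped.

    threshold_gb is an arbitrary illustrative cutoff, not a documented value.
    """
    return reported_gb < threshold_gb
```

With the numbers from this post, looks_capped(62.0) is true and looks_capped(119.7) is false. Run the check on both nodes before launching anything.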

Fix 2 — SGLang v0.5.12 Renamed Half Its Flags

Old flag | New flag (v0.5.12)
--- | ---
--num-nodes | --nnodes
--master-addr X --master-port Y | --dist-init-addr X:Y
--gpu-memory-utilization | --mem-fraction-static
--max-model-len | --context-length
--kv-cache-dtype fp8 | --kv-cache-dtype fp8_e4m3
--fp8-gemm-runner-backend | --fp8-gemm-backend
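
If you have old launch scripts lying around, the renames above are mechanical enough to script. The helper below is a hypothetical sketch of ours, not something SGLang ships. It assumes space-separated arguments (no --flag=value form) and encodes only the renames listed in the table.

```python
# Simple one-to-one renames from the table above.
RENAMES = {
    "--num-nodes": "--nnodes",
    "--gpu-memory-utilization": "--mem-fraction-static",
    "--max-model-len": "--context-length",
    "--fp8-gemm-runner-backend": "--fp8-gemm-backend",
}


def migrate_flags(argv: list[str]) -> list[str]:
    """Rewrite an old-style argument list using the renames in the table."""
    out: list[str] = []
    addr = port = None
    i = 0
    while i < len(argv):
        arg = argv[i]
        nxt = argv[i + 1] if i + 1 < len(argv) else None
        if arg == "--master-addr" and nxt is not None:
            addr, i = nxt, i + 2          # held back, merged at the end
        elif arg == "--master-port" and nxt is not None:
            port, i = nxt, i + 2          # held back, merged at the end
        elif arg == "--kv-cache-dtype" and nxt == "fp8":
            out += [arg, "fp8_e4m3"]      # same flag, renamed value
            i += 2
        else:
            out.append(RENAMES.get(arg, arg))
            i += 1
    if (addr is None) != (port is None):
        raise ValueError("need both --master-addr and --master-port")
    if addr is not None:
        out += ["--dist-init-addr", f"{addr}:{port}"]
    return out
```

For example, migrate_flags(["--num-nodes", "2", "--master-addr", "10.0.0.1", "--master-port", "5000"]) returns ["--nnodes", "2", "--dist-init-addr", "10.0.0.1:5000"]. The port 5000 is just a placeholder.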

Fix 3 — Gloo Loopback on Multi-Node Docker

SGLang uses PyTorch's Gloo backend for CPU-side distributed coordination. With --network host, Gloo defaults to the loopback interface, so every node tries to reach 127.0.0.1 and gets Connection refused. Fix: set GLOO_SOCKET_IFNAME=bond0 on both nodes. This was missing from every guide we found.
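
One way to make sure the variable reaches the container is to build the docker run prefix in one place. This is a minimal sketch with helper names of our own. It covers only the networking and environment part of the command, so the image name and server arguments are deliberately left out.

```python
def docker_env_flags(env: dict[str, str]) -> list[str]:
    """Render environment variables as `-e KEY=VALUE` arguments for docker run."""
    flags: list[str] = []
    for key, value in env.items():
        flags += ["-e", f"{key}={value}"]
    return flags


def docker_run_prefix(ifname: str = "bond0") -> list[str]:
    """Start of a docker run command: host networking, Gloo pinned to ifname."""
    env = {"GLOO_SOCKET_IFNAME": ifname}
    return ["docker", "run", "--network", "host", *docker_env_flags(env)]
```

Calling docker_run_prefix() yields docker run --network host -e GLOO_SOCKET_IFNAME=bond0 as an argument list. Append your own image and launch arguments to it on each node.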

Fix 4 — deep_gemm Doesn't Support sm_121

DeepSeek V4 Flash is a mixed-precision checkpoint: FP4 expert weights, FP8 attention projections. SGLang's deep_gemm library throws Unknown SF transformation on sm_121 when setting up FP8 scale factors. Fix: a patched deep_gemm/__init__.py with a _SafeC proxy class that catches unsupported-architecture errors and falls through to Triton.
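
To show the shape of that proxy idea, here is a minimal sketch. It is not the actual patch: SafeProxy, FakeKernels, and the fallback are hypothetical stand-ins, and the real fallback would be a Triton path rather than a lambda.

```python
class SafeProxy:
    """Wrap an object; if a guarded call raises RuntimeError, use a fallback.

    Hypothetical illustration of the _SafeC idea, not the real patch.
    """

    def __init__(self, target, fallbacks):
        self._target = target
        self._fallbacks = fallbacks  # attribute name -> fallback callable

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself.
        attr = getattr(self._target, name)
        fallback = self._fallbacks.get(name)
        if not callable(attr) or fallback is None:
            return attr

        def guarded(*args, **kwargs):
            try:
                return attr(*args, **kwargs)
            except RuntimeError:
                # e.g. an unsupported-architecture error from a compiled kernel
                return fallback(*args, **kwargs)

        return guarded


class FakeKernels:
    """Stand-in for a compiled extension that rejects this GPU."""

    def transform_sf(self, x):
        raise RuntimeError("Unknown SF transformation")

    def version(self):
        return "fake"


kernels = SafeProxy(FakeKernels(), {"transform_sf": lambda x: ("fallback", x)})
```

Here kernels.transform_sf(3) takes the fallback, while kernels.version() passes straight through. Catching every RuntimeError is deliberately broad for the sketch. A real patch would probably match on the error message so genuine bugs still surface.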

Fix 5 — Unified Memory Is a Footgun

On GB10, GPU and system RAM are the same pool. Crashed containers leak CUDA memory at the driver level — invisible to nvidia-smi, not released until reboot. SGLang's startup balance check (RuntimeError: The memory capacity is unbalanced) fires when one Spark has 90GB leaked while the other has 5GB free. Clean-launch workflow: stop services, drop page caches, reboot if needed.
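
It's cheaper to catch this imbalance before launch than after both nodes have spent minutes loading weights. The check below is a sketch of the idea, not SGLang's code: the function name and the 0.9 ratio are assumptions of ours, and the free-memory figures are whatever you measure on each node.

```python
def memory_is_balanced(free_gb: list[float], min_ratio: float = 0.9) -> bool:
    """True if the node with the least free memory has at least
    min_ratio of what the node with the most free memory has.

    min_ratio=0.9 is an assumed threshold for illustration.
    """
    if not free_gb:
        raise ValueError("need at least one node")
    return min(free_gb) >= max(free_gb) * min_ratio
```

Two freshly rebooted Sparks reporting 110GB and 108GB free pass the check. A pair reporting 90GB and 5GB does not, and that's the cue to reboot the leaky node before trying again.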

Fix 6 — Four JIT Kernels Fail on sm_121

SGLang's dsv4 backend uses TVM-compiled JIT kernels for the C4 sparse attention indexer. All four fail on sm_121, and each has an env var bypass:

SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1
SGLANG_TOPK_TRANSFORM_512_TORCH=1
SGLANG_OPT_USE_TILELANG_MHC_PRE=0
SGLANG_OPT_USE_TILELANG_MHC_POST=0
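
Four variables across two nodes is easy to get out of sync, so it helps to keep them in one place. A small sketch, assuming you pass them to the container through docker's --env-file option. The dictionary holds exactly the four values above, and the helper name is ours.

```python
# The four sm_121 bypasses listed above.
SM121_BYPASSES = {
    "SGLANG_FP8_PAGED_MQA_LOGITS_TORCH": "1",
    "SGLANG_TOPK_TRANSFORM_512_TORCH": "1",
    "SGLANG_OPT_USE_TILELANG_MHC_PRE": "0",
    "SGLANG_OPT_USE_TILELANG_MHC_POST": "0",
}


def render_env_file(env: dict[str, str]) -> str:
    """Render KEY=VALUE lines, one per variable, for an env file."""
    return "".join(f"{key}={value}\n" for key, value in env.items())
```

Write render_env_file(SM121_BYPASSES) to a file, copy it to both Sparks, and point --env-file at it.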

After all six: SGLang starts. Both nodes load 80GB of weights. Uvicorn comes up. We're in business — or so we think.

The Wall: FlashMLA Is Compiled for B200 Only

First inference request:

RuntimeError: Unsupported architecture for sparse decode fwd

This is flash_mla_cuda.sparse_decode_fwd — a compiled CUDA extension in the flash_mla package. FlashMLA is DeepSeek's attention kernel library, built for B200 (sm_100). It does not support sm_121. The call chain is rigid:

deepseek_v4.py
  → DeepseekV4AttnBackend.forward()
    → flash_mla.flash_mla_with_kvcache()
      → flash_mla_cuda.sparse_decode_fwd()
        → RuntimeError: Unsupported architecture

There's no env var. No backend swap. No fallback path. The dsv4 attention backend unconditionally requires FlashMLA, and the model's forward pass unconditionally uses kwargs that only dsv4 accepts. You can't route around it without recompiling FlashMLA from source for sm_121 and maintaining that fork indefinitely.
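
In hindsight, a preflight check on compute capability would have saved us the six hours. A minimal sketch follows. The supported set contains only sm_100 because that is the one architecture this post describes as the target, so the real kernel's list may differ. The function names are ours.

```python
def sm_name(capability: tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def arch_supported(
    capability: tuple[int, int],
    supported: frozenset = frozenset({(10, 0)}),  # assumption: sm_100 only
) -> bool:
    """True if this GPU's compute capability is in the supported set."""
    return capability in supported
```

On a machine with PyTorch installed, torch.cuda.get_device_capability() returns the (major, minor) tuple to feed in. For sm_121 that tuple is (12, 1), and arch_supported((12, 1)) is false.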

We researched whether anyone in the community had done this. They hadn't. Every working recipe for DeepSeek V4 Flash on 2× DGX Spark uses vLLM (eugr/spark-vllm-docker), not SGLang. Same ~44 tok/s target, different engine.

What This Means

SGLang's dsv4 backend is a tightly coupled stack designed for B200 (sm_100): C4 sparse attention, FlashMLA CUDA kernels, deep_gemm scale factor layout — built as a unit for datacenter Blackwell. Consumer Blackwell (GB10/sm_121) is not the target, and six patches in, that's undeniable.

Honest verdict

SGLang's DeepSeek V4 Flash path is not working on DGX Spark today. This is not a complaint — it's genuinely impressive engineering for the hardware it targets. But if you have DGX Sparks and want DeepSeek V4 Flash, use vLLM.

The Pivot: MiniMax M2.7

When we stepped back from the debugging loop and asked the actual question (what's the best model for agentic workloads on SGLang + 2× DGX Spark?), the answer wasn't DeepSeek V4 Flash.

Why MiniMax M2.7

M2.7 was designed from the ground up for agentic use: Agent Teams, native tool calling, long-context reasoning, self-evolution. NVIDIA co-optimized SGLang kernels for the M2 series (2.7× throughput vs baseline). It has day-0 validated support on GB10, with no dsv4 dependency, no FlashMLA, and no compiled B200-only extensions.

230B MoE: ~10B active params per token
Day-0 SGLang: Validated on GB10, no dsv4 dependency
Agentic-native: Built for tool use, 200K context, Agent Teams
NVIDIA co-optimized: 2.7× throughput on M2 series vs baseline

More importantly: SGLang's RadixAttention prefix caching is most valuable when requests share repeated structure — system prompts, tool schemas, conversation history. M2.7's agentic design assumes exactly this pattern. The benchmark writes itself.

What's Coming

We're downloading saricles/MiniMax-M2.7-NVFP4-GB10-AC — the agentic + coder recalibrated quant — to both Sparks right now. Launch scripts are ready. Once synced, we'll run the agentic benchmark we planned from the start:

The real question isn't "which engine is faster in a single prompt benchmark." It's "what happens to latency at turn 10 of an agentic loop when 80% of your context is already cached." That number matters for production agents.
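
To make "80% of your context is already cached" concrete, here is the back-of-envelope arithmetic as a sketch. It assumes an idealized prefix cache where everything from earlier turns is a hit and only the current turn's tokens are new. The token counts are made-up illustrations, not measurements.

```python
def cached_fraction(prefix_tokens: int, tokens_per_turn: int, turn: int) -> float:
    """Fraction of the context at a given turn that an ideal prefix cache
    already holds.

    prefix_tokens: shared system prompt plus tool schemas
    tokens_per_turn: new tokens each turn adds to the conversation
    turn: 1-based turn number
    """
    if turn < 1:
        raise ValueError("turn must be >= 1")
    total = prefix_tokens + turn * tokens_per_turn
    cached = prefix_tokens + (turn - 1) * tokens_per_turn
    return cached / total
```

With a 1,000-token prefix and 1,000 tokens per turn, turn 4 has 5,000 tokens of context, of which 4,000 were seen before, so the cached fraction is 0.8. The fraction keeps climbing with every later turn, which is why late-turn latency is the number we care about.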

Numbers coming soon.

Appendix: The FA3 Myth

Everyone asks about FlashAttention 3 on DGX Spark. FA3 requires tcgen05 instructions that only exist on datacenter Blackwell (B200/GB200, sm_100). GB10 is consumer Blackwell (sm_121). Neither SGLang nor TRT-LLM can use FA3 on DGX Spark. Both fall back to FlashInfer FA2. The playing field is more level than the spec sheets suggest — FA3 is not a differentiator between engines on this hardware.