We set out to benchmark SGLang against TensorRT-LLM for agentic AI workloads on real hardware. Two NVIDIA DGX Sparks, DeepSeek V4 Flash, tensor parallelism across both nodes. Six hours later we had six patches applied and were blocked on a compiled CUDA extension for a chip we don't have.
This is the honest report — every fix, the wall we hit, and why we're pivoting to MiniMax M2.7.
Two DGX Spark units on a desk. Each has a GB10 Grace-Blackwell SoC (compute capability sm_121), 128GB LPDDR5X unified memory where CPU and GPU share one pool, and a ConnectX-7 RoCE port. Connected by a direct QSFP56 cable bonded into bond0 at 10.0.0.1 ↔ 10.0.0.2. We measured 109 Gb/s RDMA bandwidth on the link.
The goal: DeepSeek V4 Flash, TP=2 across both nodes, SGLang v0.5.12, RadixAttention prefix caching for agentic workloads.
Every DGX Spark ships with /etc/modprobe.d/zz-nvidia-drm-override.conf. This file causes the CUDA runtime to report ~62GB instead of the actual 119.7GB. Remove it and reboot before anything else — without this fix, DeepSeek V4 Flash simply cannot load on a 2-node cluster.
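A pre-flight check for this step could look like the sketch below. The file path and the two memory figures come from the paragraph above; the function names and the 10% tolerance are assumptions made for illustration.

```python
from pathlib import Path

# Path quoted in the text above.
OVERRIDE_CONF = Path("/etc/modprobe.d/zz-nvidia-drm-override.conf")


def override_present(conf: Path = OVERRIDE_CONF) -> bool:
    """Return True if the override file is still on disk."""
    return conf.exists()


def looks_capped(reported_gb: float, expected_gb: float = 119.7,
                 tolerance: float = 0.9) -> bool:
    """Return True if the reported GPU memory is well below what is expected.

    The 0.9 tolerance is a hypothetical threshold, not a documented value.
    """
    return reported_gb < expected_gb * tolerance


if __name__ == "__main__":
    if override_present():
        print(f"{OVERRIDE_CONF} is still present: remove it and reboot")
```

The reported figure would come from whatever the CUDA runtime shows on your node; the sketch takes it as a plain number so it does not depend on any GPU library.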
| Old flag | New flag (v0.5.12) |
|---|---|
| --num-nodes | --nnodes |
| --master-addr X --master-port Y | --dist-init-addr X:Y |
| --gpu-memory-utilization | --mem-fraction-static |
| --max-model-len | --context-length |
| --kv-cache-dtype fp8 | --kv-cache-dtype fp8_e4m3 |
| --fp8-gemm-runner-backend | --fp8-gemm-backend |
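If you have older launch scripts lying around, the table can be applied mechanically. The following is a minimal sketch, assuming flags are passed as separate `--flag value` tokens (not `--flag=value`); `migrate_flags` is a hypothetical helper, not part of SGLang, and the port in the usage example is arbitrary.

```python
def migrate_flags(argv: list[str]) -> list[str]:
    """Rewrite an old-style argument list using the renames in the table above."""
    renames = {
        "--num-nodes": "--nnodes",
        "--gpu-memory-utilization": "--mem-fraction-static",
        "--max-model-len": "--context-length",
        "--fp8-gemm-runner-backend": "--fp8-gemm-backend",
    }
    out: list[str] = []
    addr = port = None
    i = 0
    while i < len(argv):
        arg = argv[i]
        nxt = argv[i + 1] if i + 1 < len(argv) else None
        if arg == "--master-addr" and nxt is not None:
            addr = nxt  # merged into --dist-init-addr below
            i += 2
        elif arg == "--master-port" and nxt is not None:
            port = nxt
            i += 2
        elif arg == "--kv-cache-dtype" and nxt == "fp8":
            out += [arg, "fp8_e4m3"]  # the value changes, not the flag
            i += 2
        else:
            out.append(renames.get(arg, arg))
            i += 1
    if addr is not None or port is not None:
        if addr is None or port is None:
            raise ValueError("--master-addr and --master-port must be given together")
        out += ["--dist-init-addr", f"{addr}:{port}"]
    return out
```

For example, `migrate_flags(["--num-nodes", "2", "--master-addr", "10.0.0.1", "--master-port", "5000"])` yields `["--nnodes", "2", "--dist-init-addr", "10.0.0.1:5000"]`.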
SGLang uses PyTorch's Gloo collective for CPU-side distributed coordination. With --network host, Gloo defaults to the loopback interface. Every node tries to reach 127.0.0.1 and gets Connection refused. Fix: GLOO_SOCKET_IFNAME=bond0. Missing from every guide we found.
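One way to make this fix hard to forget is to build the launch environment in code and refuse to start if the interface is missing. A minimal sketch, assuming the bonded interface is named `bond0` as in our setup; `gloo_env` is a hypothetical helper.

```python
import os
import socket


def gloo_env(ifname="bond0", available=None, base=None):
    """Return an environment dict with GLOO_SOCKET_IFNAME pinned to `ifname`.

    `available` defaults to the interfaces the OS reports; passing it
    explicitly makes the check testable on machines without the bond.
    """
    if available is None:
        available = [name for _, name in socket.if_nameindex()]
    if ifname not in available:
        raise ValueError(f"interface {ifname!r} not found; have {sorted(available)}")
    env = dict(os.environ if base is None else base)
    env["GLOO_SOCKET_IFNAME"] = ifname
    return env
```

The returned dict can be passed as `env=` to whatever starts the server process on each node.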
DeepSeek V4 Flash is a mixed-precision checkpoint: FP4 expert weights, FP8 attention projections. SGLang's deep_gemm library throws Unknown SF transformation on sm_121 when setting up FP8 scale factors. Fix: a patched deep_gemm/__init__.py with a _SafeC proxy class that catches unsupported-architecture errors and falls through to Triton.
On GB10, GPU and system RAM are the same pool. Crashed containers leak CUDA memory at the driver level — invisible to nvidia-smi, not released until reboot. SGLang's startup balance check (RuntimeError: The memory capacity is unbalanced) fires when one Spark has 90GB leaked while the other has 5GB free. Clean-launch workflow: stop services, drop page caches, reboot if needed.
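A pre-launch check can catch the imbalance before SGLang does. The sketch below assumes that leaked allocations show up as reduced `MemAvailable` in `/proc/meminfo`, which is plausible on a unified pool but is an assumption here, not something verified. The 0.9 ratio is a hypothetical threshold and is not SGLang's own balance rule.

```python
import re


def mem_available_gib(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo contents, in GiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in meminfo text")
    return int(match.group(1)) / (1024 ** 2)  # kB -> GiB


def nodes_balanced(free_gib: list[float], min_ratio: float = 0.9) -> bool:
    """Return True if the node with the least free memory is within
    `min_ratio` of the node with the most."""
    return min(free_gib) >= max(free_gib) * min_ratio
```

On each Spark you would read `/proc/meminfo`, pass the text to `mem_available_gib`, collect both results on one node, and reboot the leaking node if `nodes_balanced` returns False.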
SGLang's dsv4 backend uses TVM-compiled JIT kernels for the C4 sparse attention indexer. All of them fail on sm_121, but each has an environment-variable bypass:
SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1
SGLANG_TOPK_TRANSFORM_512_TORCH=1
SGLANG_OPT_USE_TILELANG_MHC_PRE=0
SGLANG_OPT_USE_TILELANG_MHC_POST=0
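Keeping the four variables in one place avoids the case where one node is launched without them. A minimal sketch; the variable names and values are the ones listed above, while `SM121_BYPASS` and `with_bypass` are hypothetical names.

```python
import os

# The four bypass variables listed above, with the values we used.
SM121_BYPASS = {
    "SGLANG_FP8_PAGED_MQA_LOGITS_TORCH": "1",
    "SGLANG_TOPK_TRANSFORM_512_TORCH": "1",
    "SGLANG_OPT_USE_TILELANG_MHC_PRE": "0",
    "SGLANG_OPT_USE_TILELANG_MHC_POST": "0",
}


def with_bypass(base=None):
    """Return a copy of the environment with the bypass variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(SM121_BYPASS)
    return env
```

The result can be handed to `subprocess.run(cmd, env=with_bypass())`, where `cmd` is your own launch command.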
After all six: SGLang starts. Both nodes load 80GB of weights. Uvicorn comes up. We're in business — or so we think.
First inference request:
RuntimeError: Unsupported architecture for sparse decode fwd
This is flash_mla_cuda.sparse_decode_fwd — a compiled CUDA extension in the flash_mla package. FlashMLA is DeepSeek's attention kernel library, built for B200 (sm_100). It does not support sm_121. The call chain is rigid:
deepseek_v4.py
→ DeepseekV4AttnBackend.forward()
→ flash_mla.flash_mla_with_kvcache()
→ flash_mla_cuda.sparse_decode_fwd()
→ RuntimeError: Unsupported architecture
There's no env var. No backend swap. No fallback path. The dsv4 attention backend unconditionally requires FlashMLA, and the model's forward pass unconditionally uses kwargs that only dsv4 accepts. You can't route around it without recompiling FlashMLA from source for sm_121 and maintaining that fork indefinitely.
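Because the failure only appears on the first request, after 80GB of weights have loaded, a fail-fast architecture check would have saved us time. The sketch below is hypothetical: the default supported set of `sm_100` reflects only what this post observed, and FlashMLA may support other architectures that we did not test.

```python
def sm_name(capability: tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def check_flash_mla_arch(capability, supported=frozenset({"sm_100"})):
    """Raise early if the GPU architecture is outside the supported set.

    `supported` is an assumption based on this post, not FlashMLA's
    official support matrix.
    """
    arch = sm_name(capability)
    if arch not in supported:
        raise RuntimeError(
            f"{arch} is not in {sorted(supported)}; "
            "sparse decode would fail on the first request"
        )
    return arch


# With PyTorch installed, the capability tuple can be read with
# torch.cuda.get_device_capability(); on GB10 we would expect (12, 1).
```

Running this before the server starts turns a late inference-time crash into an immediate, readable error.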
We researched whether anyone in the community had done this. They hadn't. Every working recipe for DeepSeek V4 Flash on 2× DGX Spark uses vLLM (eugr/spark-vllm-docker), not SGLang. Same ~44 tok/s target, different engine.
SGLang's dsv4 backend is a tightly coupled stack designed for B200 (sm_100): C4 sparse attention, FlashMLA CUDA kernels, deep_gemm scale factor layout — built as a unit for datacenter Blackwell. Consumer Blackwell (GB10/sm_121) is not the target, and six patches in, that's undeniable.
SGLang's DeepSeek V4 Flash path is not working on DGX Spark today. This is not a complaint — it's genuinely impressive engineering for the hardware it targets. But if you have DGX Sparks and want DeepSeek V4 Flash, use vLLM.
Stepping back from the debugging loop and asking the actual question — what's the best model for agentic workloads on SGLang + 2× DGX Spark? — the answer isn't DeepSeek V4 Flash.
MiniMax M2.7 was designed from the ground up for agentic use: Agent Teams, native tool calling, long-context reasoning, self-evolution. NVIDIA co-optimized SGLang kernels for it (2.7× throughput gains). Day-0 validated support on GB10. No dsv4 dependency, no FlashMLA, no compiled B200-only extensions.
More importantly: SGLang's RadixAttention prefix caching is most valuable when requests share repeated structure — system prompts, tool schemas, conversation history. M2.7's agentic design assumes exactly this pattern. The benchmark writes itself.
We're downloading saricles/MiniMax-M2.7-NVFP4-GB10-AC (the agentic + coder recalibrated quant) to both Sparks right now. Launch scripts are ready. Once synced, we'll run the agentic benchmark we planned from the start.
The real question isn't "which engine is faster in a single-prompt benchmark." It's "what happens to latency at turn 10 of an agentic loop when 80% of your context is already cached." That number matters for production agents.
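To see why the cached fraction matters, it helps to count how many tokens still need prefill on a given turn. This is back-of-the-envelope arithmetic, not a measurement: the function name and the 20,000-token context in the example are made up for illustration.

```python
def uncached_prefill_tokens(context_tokens: int, cached_fraction: float) -> int:
    """Tokens that must still be prefilled when part of the context
    is served from the prefix cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = round(context_tokens * cached_fraction)
    return context_tokens - cached
```

With a hypothetical 20,000-token context and 80% of it cached, `uncached_prefill_tokens(20000, 0.8)` leaves 4,000 tokens to prefill instead of 20,000. How much of that saving shows up as lower latency is what the benchmark is meant to find out.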
Numbers coming soon.
Everyone asks about FlashAttention 3 on DGX Spark. FA3 requires tcgen05 instructions that only exist on datacenter Blackwell (B200/GB200, sm_100). GB10 is consumer Blackwell (sm_121). Neither SGLang nor TRT-LLM can use FA3 on DGX Spark. Both fall back to FlashInfer FA2. The playing field is more level than the spec sheets suggest — FA3 is not a differentiator between engines on this hardware.