Single NVIDIA DGX Spark with speculative decoding token streams

Our previous post covered MiniMax M2.7 across two DGX Sparks with vLLM and Ray. The conclusion was honest: 12 t/s single-request generation is slow, model scale is the ceiling, and the setup is most compelling for batch workloads rather than interactive use. The post ended with a teaser about trying Qwen3.6-27B-FP8 on a single Spark with SGLang and speculative decoding. This is that post.

The question was simple: is a smaller, faster model on fewer machines a better fit for interactive and agentic workloads than a larger model spread across two? The answer turned out to be a fairly clear yes — with some interesting mechanics behind why.

The Setup

Same hardware: DGX Spark (GB10 SoC), 119 GiB unified memory, SM 12.1. But instead of two units with Ray coordinating over the 200Gbps copper cluster link, we used just Spark 1. No distributed overhead, no Gloo allreduce, no inter-node coordination at all.

The stack:

Qwen3.6-27B-FP8 weighs about 27 GB. With 119 GiB available, the entire model fits in one Spark with ~90 GiB left over for KV cache. That's a fundamentally different operating regime than M2.7, which at 115 GB required the full memory of two units just to load.

Why SGLang Instead of vLLM

For single-GPU deployments on GB10, SGLang has a smoother path right now. The scitrera/dgx-spark-sglang image is purpose-built for the SM 12.1 architecture and ships with a functional CUDA toolkit for GB10 — no compilation needed, no wheel hunting. vLLM on GB10 works but requires more environment coaxing, especially around the sm_121 vs sm_120 architecture mismatch and custom CUDA build flags.

SGLang also has first-class support for speculative decoding via NEXTN, which was the main reason to try it here.

NEXTN Speculative Decoding

Speculative decoding is a throughput technique that exploits idle compute during autoregressive generation. The standard decode loop is sequential: generate one token, pass it forward, repeat. The GPU spends most of each step waiting — it's memory-bandwidth-bound, not compute-bound.

Speculative decoding adds a fast draft mechanism that proposes multiple tokens at once. The verifier (the full model) checks the entire draft in one forward pass, accepting correct tokens and stopping at the first mismatch. If the draft is right, you get multiple tokens from the cost of one verification step. If it's wrong, you fall back to the verified token and try again.

NEXTN is a variant where the draft heads are trained directly into the model's architecture rather than being a separate smaller model. Qwen3.6 ships with these heads built in. The practical effect is speculative decoding with no additional model to load and no separate process to manage — it's just a flag at serve time.

Two DGX Spark units with tensor parallelism data flow
The M2.7 approach: split 115 GB of weights across two units connected by 200Gbps copper. Fast link, but the memory bandwidth per token is the real cost.

Launch Command

Drop page caches first — same rule as with M2.7, always:

sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'

Then launch SGLang:

docker run -d --name sglang-qwen36 \
  --gpus all --shm-size 16g \
  --network host \
  -v /home/milo/models:/models \
  scitrera/dgx-spark-sglang:0.5.12 \
  python -m sglang.launch_server \
    --model-path /models/Qwen3.6-27B-FP8 \
    --host 0.0.0.0 --port 8003 \
    --served-model-name qwen3.6-27b \
    --tp 1 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 4 \
    --mem-fraction-static 0.82 \
    --trust-remote-code

Startup is fast — about 90 seconds on a warm filesystem cache. Watch for The server is fired up and ready to roll! in the logs.

Benchmark Results

Measured with llama-benchy at the same sweep as the M2.7 benchmark for direct comparison. All numbers from the generation_tokens_per_second metric.

Generation Throughput (depth=0)

Config c1 c4 c8
tg128 22.6 t/s 54.3 t/s 95.3 t/s
tg256 18.7 t/s 56.1 t/s 97.8 t/s
tg128 peak burst (NEXTN) 170 t/s

Prefill Throughput

Config c1 t/s c1 TTFT c4 t/s
pp2048, depth=0 1,500 t/s 1,377 ms 950 t/s
pp2048, depth=16K 1,541 t/s 1,340 ms 949 t/s

Head-to-Head: Qwen3.6 vs MiniMax M2.7

Metric Qwen3.6-27B-FP8 (1 Spark) MiniMax M2.7 (2 Sparks) Delta
TG c1 (t/s) 22.6 12.4 +82%
TG c8 aggregate (t/s) 95 ~65 +46%
Peak burst (t/s) 170 ~65 +162%
Prefill c1 (t/s) 1,500 1,004 +49%
Hardware required 1 Spark 2 Sparks
Benchmark runtime ~3 hours 7+ hours (OOM at depth=65K)
OOM events None 1 (c10, depth=100K)

The throughput gap is larger than expected. Qwen3.6 is faster at every concurrency level despite being a smaller model, running on half the hardware. The throughput difference comes down to model scale and memory bandwidth. The two Sparks are connected via 200Gbps copper — the link is fast. The real cost is that M2.7's 115 GB of weights require streaming a huge amount of data through unified memory on every decode step, regardless of how fast the nodes communicate. Qwen3.6 at 27 GB reads far less per token, and on a single Spark there's no synchronization overhead at all.

NEXTN in Practice

The 170 t/s peak burst at c8 is the NEXTN effect. Sustained throughput at c8 is about 95 t/s — the speculative decoding bursts happen when drafts are accepted at a high rate, which varies with prompt content. Code and structured data tend to accept more drafts than open-ended prose, because the distribution is more predictable.

At c1, NEXTN gives about 20% uplift (22.6 vs ~18-19 without it). The benefit is most visible at moderate concurrency where the GPU has spare compute to run draft verification in parallel with normal batch generation. At very high concurrency, the parallelism headroom shrinks and the advantage narrows.

The key point is that NEXTN costs nothing to enable for Qwen3.6 — the draft heads are in the checkpoint. It's not a separate model, not a separate process, and it doesn't increase memory footprint. There's no reason not to use it.

What We Gave Up

Being honest about the tradeoff: Qwen3.6-27B is a smaller model than MiniMax M2.7. The M2.7 benchmark here was throughput-only; we didn't run quality evaluations. Larger models generally have better reasoning depth, stronger instruction following on complex tasks, and more factual breadth. If you're running research workloads that benefit from model scale, the M2.7 dual-Spark setup is still the right call despite the throughput penalty.

For interactive agent workloads — where response latency matters and most requests are <4K tokens — Qwen3.6 on a single Spark is the clearer choice. The 22 t/s sustained single-request generation is comfortable for interactive use. The 12 t/s from M2.7 is noticeable as lag.

Spark 2 Is Now Free

One practical consequence: with Qwen3.6 on Spark 1 only, Spark 2 is available for independent work. We're running Orpheus TTS on Spark 2 simultaneously without any coordination overhead. That's a configuration that was impossible when both Sparks were locked into the M2.7 Ray cluster.

The dual-Spark setup is still available for workloads that genuinely require model scale. But for the majority of our agent workloads, the single-Spark configuration with a well-fit model is the better operating point.

An Honest Note on TP=2 for Qwen3.6

The two Sparks are connected via 200Gbps copper cluster ports — fast enough that allreduce overhead is negligible for a 27B model. That means TP=2 across both Sparks would likely split the memory bandwidth work in half, and could push single-request throughput from ~22 t/s toward ~35–40 t/s. We didn't claim single-Spark is the fastest possible configuration — we chose it because 22 t/s is comfortable for interactive use and Spark 2 is more valuable free than locked into a cluster. We're running a TP=2 benchmark to verify the actual number; results will follow in a separate post.

The Working Launch Config Reference

docker run -d --name sglang-qwen36 \
  --gpus all --shm-size 16g \
  --network host \
  -v /home/milo/models:/models \
  scitrera/dgx-spark-sglang:0.5.12 \
  python -m sglang.launch_server \
    --model-path /models/Qwen3.6-27B-FP8 \
    --host 0.0.0.0 --port 8003 \
    --served-model-name qwen3.6-27b \
    --tp 1 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 4 \
    --mem-fraction-static 0.82 \
    --trust-remote-code

Tested on DGX Spark GB10 (SM 12.1), SGLang 0.5.12. Drop page caches before every cold start. Model: Qwen/Qwen3.6-27B-FP8 from HuggingFace.