Qwen3.6-27B: SGLang FP8 + NGRAM vs vLLM NVFP4 + MTP — Two Sparks, Two Stacks

Two DGX Spark units running different inference stacks side by side

Our last Qwen3.6 post closed with Spark 1 running SGLang FP8 at 22+ t/s using NEXTN speculative decoding, and Spark 2 freed up for other workloads. Since then we've been experimenting with Spark 2 running a different stack entirely: vLLM with NV-FP4 quantization and MTP speculative decoding. We ran the same llama-benchy sweep on both today. The comparison is instructive.

The short version: NV-FP4 + MTP on Spark 2 does ~22–27 t/s single-user. SGLang FP8 + NGRAM on Spark 1 does ~13 t/s. There's a reason for the gap, and it's not purely the quantization.

What Changed on Spark 1

The previous post cited NEXTN as delivering 22.6 t/s at c1. That number is still real, but NEXTN is currently broken on the updated scitrera/dgx-spark-sglang:0.5.12 image. The first bug is in tokenizer_manager.py: the updated image added speculative_num_draft_tokens to a max() call, and that value defaults to None when not set, so the call raises a TypeError. Passing --speculative-num-draft-tokens 20 works around it, but a deeper issue follows. The EAGLE framework pre-allocates mem_fraction_static × total_cuda_memory for the draft model workspace before loading the main model weights. On GB10 with ~112 GB of CUDA-visible memory, that pre-allocation consumes ~92 GB, leaving only ~20 GB for the 27 GB of FP8 weights. The server OOMs before the model ever loads.
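The TypeError is easy to reproduce in isolation. The sketch below is not the SGLang source: reserve_tokens and its parameters are hypothetical stand-ins, used only to show why an unset (None) value inside max() fails and why passing an explicit integer sidesteps it.

```python
from typing import Optional


def reserve_tokens(num_steps: int, num_draft_tokens: Optional[int] = None) -> int:
    """Hypothetical stand-in for a max() over speculative settings."""
    # Python 3 cannot order an int against None, so this raises TypeError
    # whenever num_draft_tokens is left at its None default.
    return max(num_steps, num_draft_tokens)


def try_reserve(num_steps: int, num_draft_tokens: Optional[int] = None) -> str:
    """Call reserve_tokens and report either the result or the error."""
    try:
        return f"ok: {reserve_tokens(num_steps, num_draft_tokens)}"
    except TypeError as exc:
        return f"TypeError: {exc}"


if __name__ == "__main__":
    print(try_reserve(5))      # value left unset
    print(try_reserve(5, 20))  # explicit value, like --speculative-num-draft-tokens 20
```

The first call reports a TypeError and the second returns normally, which mirrors the behavior described above.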

The workaround is NGRAM speculative decoding. NGRAM requires no draft model: it proposes tokens by matching n-gram patterns in the prompt itself. There is zero memory overhead, but the accept rate is much lower than NEXTN's: ~20% vs ~60–70% with a trained draft head. The practical result is ~13 t/s instead of ~22 t/s at c1. That is roughly 40% slower, but stable.

Fixing NEXTN would require either patching the SGLang container (doable) or waiting for an upstream fix to the pool_configurator pre-allocation logic. We'll get back to it. For now, Spark 1 is the NGRAM baseline.

Spark 2: vLLM + NV-FP4 + MTP

While Spark 1 was fighting the NEXTN OOM, Spark 2 went in a different direction. The stack:

Model: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP — NV-FP4 quantized, text-only, MTP heads pre-trained
Container: ghcr.io/spark-arena/dgx-vllm-eugr-nightly:latest — vLLM nightly build for GB10
Quantization: NV-FP4 via --quantization modelopt
Speculative decoding: qwen3_5_mtp method, 3 draft tokens per step
Context: 8192 max tokens, KV cache in FP8
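The settings listed above can be assembled into a launch command roughly as follows. This is a sketch, not the exact command used on Spark 2: the post lists the settings but not the invocation, so the flag names here (--quantization, --kv-cache-dtype, --max-model-len, --speculative-config) are assumptions based on vLLM's CLI and should be checked against vllm serve --help inside the container.

```python
import json
import shlex
from typing import List

MODEL = "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP"


def build_vllm_command(model: str = MODEL) -> List[str]:
    """Build an argv list for `vllm serve` from the settings in the post."""
    # Method name and draft token count are taken from the post.
    speculative_config = {"method": "qwen3_5_mtp", "num_speculative_tokens": 3}
    return [
        "vllm", "serve", model,
        "--quantization", "modelopt",   # NV-FP4 checkpoint
        "--kv-cache-dtype", "fp8",      # KV cache in FP8
        "--max-model-len", "8192",      # 8192-token context limit
        "--speculative-config", json.dumps(speculative_config),
    ]


if __name__ == "__main__":
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(build_vllm_command()))
```

Building the command as a list keeps the JSON speculative config as a single argument, which avoids shell-quoting mistakes when the command is passed to a container entrypoint.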

NV-FP4 is NVIDIA's 4-bit floating point format, native to Blackwell (B100, GB10). It's more aggressive than FP8, halving the weight footprint again from ~27 GB to ~14 GB. GB10 handles this well: its tensor cores have native FP4 support, so there's no dequantization penalty at inference time. The tradeoff is precision; the model variant used here was fine-tuned after quantization to recover quality.
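The footprint figures follow from simple arithmetic. The sketch below is a back-of-envelope estimate that assumes a nominal 27 billion parameters (taken from the model name) and ignores quantization scale factors and any layers left unquantized, so real checkpoints will differ somewhat.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 27e9  # assumption: nominal parameter count implied by "27B"

fp8_gb = weight_footprint_gb(PARAMS, 8)
fp4_gb = weight_footprint_gb(PARAMS, 4)

if __name__ == "__main__":
    print(f"FP8: {fp8_gb:.1f} GB")
    print(f"FP4: {fp4_gb:.1f} GB")
```

At 8 bits per weight this gives 27 GB, and at 4 bits 13.5 GB, in line with the ~27 GB and ~14 GB figures quoted above.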

MTP (Multi-Token Prediction) in vLLM is conceptually similar to NEXTN in SGLang: trained draft heads that predict multiple tokens ahead. The difference is in the configuration. vLLM's MTP speculative config here proposes 3 draft tokens per step (num_speculative_tokens: 3), whereas NEXTN with --speculative-num-steps 5 --speculative-eagle-topk 4 tries a wider beam. The accept rate with 3 tokens is high enough that single-user generation improves substantially over the base decode rate.

The Benchmark

Same llama-benchy sweep on both: pp=512 and 2048, tg=128, depth=0 and 4096, concurrency 1 and 8, single run each. Direct comparison, same date, same corpus.

Generation Throughput

Config	Spark 1 SGLang FP8 + NGRAM	Spark 2 vLLM NVFP4 + MTP	Delta
tg128, depth=0, c1	12.7 t/s	23.3 t/s	+83%
tg128, depth=0, c8 (agg)	43.1 t/s	61.0 t/s	+42%
tg128, depth=4096, c1	12.2 t/s	23.2 t/s	+90%
tg128, depth=4096, c8 (agg)	19.4 t/s	44.7 t/s	+130%

Spark 2 wins at every configuration. The gap is largest at depth=4096, c8: 44.7 vs 19.4 t/s. This is partly MTP vs NGRAM, and likely also partly a difference in KV cache scheduling: vLLM's continuous batching appears to handle long-context concurrent requests more efficiently in this configuration.
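For anyone re-deriving the Delta column, it is the relative change from Spark 1 to Spark 2. A minimal sketch using the rounded figures from the table (the underlying unrounded measurements are not available here, so results can differ by a point):

```python
# (config, Spark 1 t/s, Spark 2 t/s), copied from the generation table.
ROWS = [
    ("tg128, depth=0, c1", 12.7, 23.3),
    ("tg128, depth=0, c8 (agg)", 43.1, 61.0),
    ("tg128, depth=4096, c1", 12.2, 23.2),
    ("tg128, depth=4096, c8 (agg)", 19.4, 44.7),
]


def percent_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    for config, spark1, spark2 in ROWS:
        print(f"{config}: {percent_delta(spark1, spark2):+.1f}%")
```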

Prefill Throughput and TTFT

Config	Spark 1 TTFT	Spark 2 TTFT	Spark 1 PP t/s	Spark 2 PP t/s
pp=512, depth=0, c1	412 ms	760 ms	1,253 t/s	598 t/s
pp=2048, depth=0, c1	1,637 ms	1,004 ms	1,254 t/s	1,772 t/s
pp=512, depth=4096, c1	3,439 ms	1,715 ms	1,341 t/s	2,271 t/s
pp=2048, depth=4096, c1	4,725 ms	2,371 ms	1,301 t/s	2,162 t/s

Prefill is a split story. At pp=512 with no prior context, Spark 1 (SGLang FP8) is faster: 412 ms vs 760 ms TTFT, 1,253 vs 598 t/s. This is likely a vLLM initialization overhead on short prompts — the first-token machinery for the MTP speculative path adds latency that isn't amortized until the prompt is longer.

Everywhere else, Spark 2 wins on prefill too. At pp=512, depth=4096, Spark 2 hits 2,271 t/s prefill throughput vs 1,341 t/s on Spark 1, about 1.7×. The NV-FP4 weights are roughly half the size of FP8, so less weight data moves through memory per prefill pass, which is consistent with the direction of the gap. SGLang's prefill throughput is consistent (~1,250–1,340 t/s across all configs), whereas vLLM's ranges from 598 to 2,271 t/s and scales with the workload.

Why NGRAM Falls Behind MTP

NGRAM speculative decoding is clever but limited. It works by scanning the current prompt for n-gram patterns that predict the next token: if the model just generated "the quick brown", and "the quick brown fox" appeared earlier in the prompt, NGRAM proposes "fox" as the next token. No model needed, no memory overhead, instant proposals.
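The lookup described above can be sketched in a few lines. This is an illustrative toy, not SGLang's implementation: it works on word strings rather than token IDs, treats prompt and generated text as one context, and the function name and defaults are made up for the example.

```python
from typing import List, Sequence


def propose_ngram(context: Sequence[str], ngram_size: int = 3,
                  max_draft: int = 3) -> List[str]:
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Finds the most recent earlier occurrence of the last `ngram_size` tokens
    and returns up to `max_draft` tokens that followed it. Returns an empty
    list when there is no earlier match.
    """
    if len(context) <= ngram_size:
        return []
    suffix = list(context[-ngram_size:])
    # Scan from right to left, starting just before the suffix itself.
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if list(context[start:start + ngram_size]) == suffix:
            follow = start + ngram_size
            return list(context[follow:follow + max_draft])
    return []


if __name__ == "__main__":
    tokens = "the quick brown fox jumps . then the quick brown".split()
    print(propose_ngram(tokens, ngram_size=3, max_draft=1))  # ['fox']
    print(propose_ngram("a b c d".split()))                   # []
```

The proposals cost only a scan over the context, which is why the method needs no extra memory, and the empty-list case is the common one when the output does not repeat earlier text.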

The problem is accept rate. For Qwen3.6 generating reasoning traces and code, prompt n-gram matches are uncommon. We observed accept lengths of ~1.65–1.85 tokens per step — meaning most steps accept just one token (the verified base token), with occasional multi-token windfalls when a phrase repeats. At an accept rate this low, speculative decoding barely outperforms no speculation at all.

MTP with trained draft heads is a different category. The draft network has internalized the base model's distribution well enough that 3-token proposals are accepted at high rates, with measured accept lengths in the 2.5–3.5 range for this model variant. The effect is visible: 23 t/s vs 13 t/s, with the same underlying base model on the same class of hardware (though with different weight quantization).
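To relate accept length to per-token acceptance, a common simplification is to assume each draft token in a chain is accepted independently with probability p, and that the target model always contributes one token per step. Under that assumption (a modeling convenience, not something measured here), the expected accept length with k draft tokens is 1 + p + p^2 + ... + p^k:

```python
def expected_accept_length(p: float, k: int) -> float:
    """Expected tokens emitted per verify step.

    Assumes k chained draft tokens, each accepted independently with
    probability p, plus the one token the target model always produces.
    """
    return sum(p ** i for i in range(k + 1))


if __name__ == "__main__":
    for p in (0.7, 0.8, 0.9):
        print(f"p={p}: {expected_accept_length(p, k=3):.3f} tokens/step")
```

With k=3, p=0.7 gives about 2.53 tokens per step and p=0.9 gives about 3.44, so the measured 2.5–3.5 range would correspond to per-token acceptance of roughly 0.7–0.9 under this simplified model. The ceiling is k+1 = 4 tokens per step.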

The NEXTN/NGRAM gap explains the full 22.6 → 13 t/s drop from the previous post. It wasn't a regression in the model or the hardware — it was speculative decoding quality falling off a cliff.

Full Results Table

PP	Depth	Conc	S1 TTFT (ms)	S2 TTFT (ms)	S1 TG t/s	S2 TG t/s	S1 t/s per req	S2 t/s per req
512	0	1	412	760	12.7	23.3	12.7	23.3
512	0	8	2,778	9,422	43.1	61.0	9.5	21.8
2048	0	1	1,637	1,004	13.4	27.4	13.4	27.4
2048	0	8	9,280	7,398	35.3	66.0	7.7	20.1
512	4096	1	3,439	1,715	12.2	23.2	12.2	23.2
512	4096	8	17,842	11,322	19.4	44.7	5.5	16.8
2048	4096	1	4,725	2,371	12.8	21.6	12.8	21.6
2048	4096	8	23,410	16,073	17.2	44.5	5.5	16.8

The Across-Posts Picture

We now have four data points for Qwen3.6-27B on this hardware:

Config	TG c1 (t/s)	TG c8 agg (t/s)	Notes
SGLang FP8 + NEXTN (Spark 1)	22.6	95.3	Old image, working NEXTN
vLLM NVFP4 + MTP (Spark 2)	22–27	44–66	Current, 8192 ctx limit
SGLang FP8 + NGRAM (Spark 1)	12–13	17–43	Current, NEXTN broken
SGLang FP8, TP=2 (both Sparks)	8.1	40.9	NCCL sync overhead

The pattern is consistent: speculative decoding quality is the primary lever for single-user throughput on this hardware. NEXTN (trained, 5-step beam) and MTP (trained, 3 draft tokens) both deliver ~22–27 t/s. NGRAM (untrained, n-gram matching) delivers 12–13 t/s. No speculative decoding at all would be around 10–11 t/s based on the accept-length data. Tensor parallelism across two nodes adds NCCL overhead that dominates at 27B scale.

What We're Doing About Spark 1

The NEXTN OOM is a fixable bug. The pool_configurator in the updated image needs to defer draft model workspace allocation until after weights load, or respect a tighter mem_fraction_static. We'll work through it. In the meantime, Spark 1 stays on NGRAM and Spark 2 runs the vLLM NV-FP4 stack for workloads that need faster single-user responses. Both are served through OpenClaw and available as separate providers.
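The memory arithmetic behind the OOM, and the size of fraction that would leave room for the weights, can be checked with the figures from this post. This sketch treats the pre-allocation as a single fraction of total memory, as described earlier, and ignores KV cache, activations, and runtime overhead, so it is a lower bound on what is needed and not SGLang's actual accounting.

```python
TOTAL_GB = 112.0     # CUDA-visible memory on GB10 (from this post)
PREALLOC_GB = 92.0   # observed draft workspace pre-allocation (from this post)
WEIGHTS_GB = 27.0    # FP8 weights (from this post)


def remaining_after_prealloc(total_gb: float, prealloc_gb: float) -> float:
    """Memory left for the main model once the workspace is reserved."""
    return total_gb - prealloc_gb


def max_fraction_leaving_room(total_gb: float, weights_gb: float) -> float:
    """Largest pre-allocation fraction that still leaves space for the weights."""
    return (total_gb - weights_gb) / total_gb


implied_fraction = PREALLOC_GB / TOTAL_GB
remaining_gb = remaining_after_prealloc(TOTAL_GB, PREALLOC_GB)
fraction_ceiling = max_fraction_leaving_room(TOTAL_GB, WEIGHTS_GB)

if __name__ == "__main__":
    print(f"implied pre-allocation fraction: {implied_fraction:.2f}")
    print(f"left for weights: {remaining_gb:.0f} GB (need {WEIGHTS_GB:.0f} GB)")
    print(f"fraction must stay below: {fraction_ceiling:.2f}")
```

The numbers imply a pre-allocation fraction of about 0.82, which leaves 20 GB against 27 GB of weights. Anything above roughly 0.76 cannot fit the weights under this simple model, before counting any other memory use.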

The NV-FP4 result is encouraging — it suggests that further quantization beyond FP8 doesn't cost much at interactive latency, and the native Blackwell FP4 tensor core path runs cleanly. The 8192 max context is a real constraint for some workloads; the FP8 model on Spark 1 can handle much longer contexts. For most interactive tasks, 8K is plenty.

Conclusion

Two Sparks, two stacks, same model family — and trained speculative decoding is the deciding factor in both cases. NV-FP4 + MTP on vLLM is the current single-user speed champion at ~24 t/s. SGLang FP8 + NGRAM is the fallback at ~13 t/s with longer context support. Getting NEXTN working again on Spark 1 would close the gap and give us two independent 22+ t/s servers — that's the next goal.