We Ran Qwen3.6-27B on Two DGX Sparks. Single-Spark Still Wins.

Two DGX Spark units with tensor parallelism — Qwen3.6 TP=2 benchmark

In our previous post we benchmarked Qwen3.6-27B-FP8 on a single DGX Spark and admitted that our prediction of ~35–40 t/s for TP=2 was untested and might be wrong. The two Sparks connect via 200Gbps copper cluster ports, which we assumed was fast enough to make allreduce overhead negligible. We said we'd run it. Here's what happened.

The Setup

Same model, same container (scitrera/dgx-spark-sglang:0.5.12), both Sparks. Multi-node SGLang with --nnodes 2 --tp 2 --dist-init-addr 10.0.0.1:20000, using the 200Gbps bond0 cluster interface for NCCL. No speculative decoding — NEXTN multi-node support in this SGLang version requires topk=1, which is a different workload than our single-Spark config. The TP=2 run is the clean baseline: distributed inference, no draft heads.
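
For readers who want to reproduce the layout, here is a minimal sketch of how the per-node launch commands could be assembled. The flags --nnodes, --tp and --dist-init-addr and the bond0 interface come from the setup above; the model identifier, the use of --node-rank to distinguish the two Sparks, and pinning NCCL to bond0 through the NCCL_SOCKET_IFNAME environment variable are assumptions, not our exact invocation.

```python
import shlex

# Values quoted in the setup description.
DIST_INIT_ADDR = "10.0.0.1:20000"
CLUSTER_IFACE = "bond0"

# Assumption: placeholder model identifier, adjust to your local path.
MODEL_PATH = "Qwen/Qwen3.6-27B-FP8"


def launch_command(node_rank: int, nnodes: int = 2, tp: int = 2) -> str:
    """Build the shell command for one node of a multi-node SGLang server."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")

    # Assumption: NCCL is steered onto the cluster interface via this env var.
    env = {"NCCL_SOCKET_IFNAME": CLUSTER_IFACE}

    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        "--tp", str(tp),
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--dist-init-addr", DIST_INIT_ADDR,
    ]
    env_prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{env_prefix} {shlex.join(args)}"


if __name__ == "__main__":
    # Run the rank-0 command on the first Spark and the rank-1 command on the second.
    for rank in range(2):
        print(f"# Spark {rank + 1}")
        print(launch_command(rank))
```

Both commands would be run inside the same container image on each Spark; the only difference between the two nodes in this sketch is the rank.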

The Numbers

Test	TP=1 single Spark (NEXTN)	TP=2 dual Spark (no NEXTN)	Delta
tg128 c1	22.6 t/s	8.1 t/s	−64%
tg128 c4 (agg)	54.3 t/s	25.5 t/s	−53%
tg128 c8 (agg)	95.3 t/s	40.9 t/s	−57%
tg256 c1	18.7 t/s	8.7 t/s	−53%
tg256 c8 (agg)	97.8 t/s	41.7 t/s	−57%
prefill pp512 c1	931 t/s	601 t/s	−35%

TP=2 delivers 36–47% of TP=1's decode throughput at every concurrency level tested, and about 65% on prefill. The single-Spark number includes NEXTN speculative decoding; without it, a rough estimate is ~18–19 t/s at c1. TP=2 still loses, at less than half of that.
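
The Delta column can be recomputed directly from the two throughput columns. The short script below does that, and also compares TP=2 against the non-speculative single-Spark estimate; the 18.5 t/s figure is an assumed midpoint of the ~18–19 t/s estimate, not a measurement.

```python
# (test, TP=1 t/s, TP=2 t/s), copied from the table above.
RESULTS = [
    ("tg128 c1", 22.6, 8.1),
    ("tg128 c4 (agg)", 54.3, 25.5),
    ("tg128 c8 (agg)", 95.3, 40.9),
    ("tg256 c1", 18.7, 8.7),
    ("tg256 c8 (agg)", 97.8, 41.7),
    ("prefill pp512 c1", 931.0, 601.0),
]

# Assumption: midpoint of the rough ~18-19 t/s estimate for TP=1 without NEXTN.
TP1_NO_NEXTN_C1 = 18.5


def delta_percent(baseline_tps: float, candidate_tps: float) -> int:
    """Relative change of candidate vs baseline, rounded to a whole percent."""
    return round((candidate_tps / baseline_tps - 1.0) * 100.0)


if __name__ == "__main__":
    for name, tp1, tp2 in RESULTS:
        print(f"{name:<18} {delta_percent(tp1, tp2):+d}%")
    # Same comparison with speculative decoding taken out of the baseline.
    print(f"{'tg128 c1, no NEXTN':<18} {delta_percent(TP1_NO_NEXTN_C1, 8.1):+d}%")
```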

Why TP=2 Loses on a 27B Model

The theoretical argument for TP=2 was: split the weights across two GPUs, each reads half as much memory per decode step, throughput doubles. This reasoning is sound for the memory bandwidth component. What it underestimates is the allreduce latency.

With TP=2, every decode step requires synchronization across both nodes at least once per transformer layer, and typically twice, since standard tensor-parallel layouts all-reduce after both the attention block and the MLP. For a 28-layer model, that's 28+ round-trips per token. Even at 200Gbps, each inter-node NCCL call involves:

Kernel launch overhead and buffer setup on each side
Actual data transfer (small per layer — ~20KB of activations)
Synchronization barrier wait
CPU-GPU synchronization on each step

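The data transfer itself is fast. The synchronization overhead per layer is not. At c1, with no compute to hide behind, it dominates: 8.1 t/s is about 123 ms per token, versus about 54 ms for the estimated ~18–19 t/s single-Spark run without NEXTN. That is ~70ms of extra time per token relative to simply staying on one Spark, and closer to 100 ms relative to the theoretical ~40 t/s (25 ms per token).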
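
The per-token arithmetic behind those figures is easy to check. The sketch below converts throughput to latency and spreads the gap across layers. It assumes the whole gap is synchronization cost and that there is one synchronization per layer, both of which are simplifications; 18.5 t/s is an assumed midpoint of the non-speculative estimate.

```python
TP2_TPS = 8.1           # measured, tg128 c1
TP1_NO_SPEC_TPS = 18.5  # assumption: midpoint of the ~18-19 t/s estimate
IDEAL_TP2_TPS = 40.0    # the theoretical TP=2 figure quoted in the post
NUM_LAYERS = 28         # layer count as stated in the post


def ms_per_token(tokens_per_second: float) -> float:
    """Convert decode throughput into per-token latency in milliseconds."""
    return 1000.0 / tokens_per_second


tp2_ms = ms_per_token(TP2_TPS)
overhead_vs_single_ms = tp2_ms - ms_per_token(TP1_NO_SPEC_TPS)
overhead_vs_ideal_ms = tp2_ms - ms_per_token(IDEAL_TP2_TPS)

# Simplification: attribute the full gap to one sync per layer.
per_layer_sync_ms = overhead_vs_ideal_ms / NUM_LAYERS

if __name__ == "__main__":
    print(f"TP=2 latency:              {tp2_ms:6.1f} ms/token")
    print(f"extra vs single Spark:     {overhead_vs_single_ms:6.1f} ms/token")
    print(f"extra vs ideal TP=2:       {overhead_vs_ideal_ms:6.1f} ms/token")
    print(f"implied cost per layer:    {per_layer_sync_ms:6.2f} ms")
```

Under these assumptions each layer's synchronization costs a few milliseconds, which is orders of magnitude more than the wire time for ~20KB at 200Gbps.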

This is a well-known distributed inference problem: tensor parallelism benefits are only realized when the model is large enough that compute time per step dominates the communication cost. For a 27B model on fast hardware, we're not in that regime. For a 500B+ model, the calculus is different — compute per step is much larger, and the communication cost is proportionally smaller.
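
A toy model makes the regime change concrete. Suppose a TP=1 decode step takes T ms, TP=2 halves that work perfectly, and every step pays a fixed synchronization cost S. TP=2 then only wins when S is less than T/2. The function names are illustrative, the 18.5 t/s baseline is an assumed midpoint of our non-speculative estimate, and holding S constant is a deliberate simplification: in reality it grows with layer count and hidden size.

```python
def tp2_step_ms(tp1_step_ms: float, sync_ms: float) -> float:
    """Toy model: TP=2 halves the per-step work, then adds a fixed sync cost."""
    return tp1_step_ms / 2.0 + sync_ms


def tp2_speedup(tp1_step_ms: float, sync_ms: float) -> float:
    """Speedup of TP=2 over TP=1; values below 1.0 mean TP=2 is slower."""
    return tp1_step_ms / tp2_step_ms(tp1_step_ms, sync_ms)


def break_even_tp1_step_ms(sync_ms: float) -> float:
    """TP=1 step time above which TP=2 starts to win (T/2 + S < T)."""
    return 2.0 * sync_ms


# Fit the sync cost to our c1 numbers: 8.1 t/s measured on TP=2,
# ~18.5 t/s assumed for TP=1 without speculative decoding.
TP1_STEP_MS = 1000.0 / 18.5
SYNC_MS = 1000.0 / 8.1 - TP1_STEP_MS / 2.0

if __name__ == "__main__":
    print(f"fitted sync cost:     {SYNC_MS:.1f} ms/step")
    print(f"TP=2 speedup at 27B:  {tp2_speedup(TP1_STEP_MS, SYNC_MS):.2f}x")
    threshold = break_even_tp1_step_ms(SYNC_MS)
    print(f"break-even TP=1 step: {threshold:.0f} ms ({1000.0 / threshold:.1f} t/s)")
```

With the cost fitted this way, TP=2 would only pay off for a model that decodes at roughly 5 t/s or slower on a single Spark, which is the territory of much larger models.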

The MiniMax M2.7 Case Makes More Sense Now

Our MiniMax M2.7 benchmark showed 12.4 t/s at c1 with TP=2 over the same cluster link. At the time we attributed the slowness to memory bandwidth (115 GB of weights). In hindsight, the communication overhead is a larger factor than we initially credited — even M2.7 on TP=2 is limited by the per-step synchronization cost. The fact that M2.7 gets 12.4 t/s vs Qwen3.6 TP=2's 8.1 t/s is likely because M2.7 has more compute per token (more parameters active per step) which partially amortizes the communication cost.

The Corrected Conclusion

The original post said single-Spark was the right choice. That conclusion stands. The explanation got refined: it's not primarily memory bandwidth splitting that makes TP=2 uncompetitive here — it's that per-step synchronization overhead dominates for sub-100B models on this hardware. 200Gbps is fast; NCCL kernel overhead is not free regardless of link speed.

Single-Spark TP=1 with NEXTN speculative decoding is back online at 22+ t/s. Spark 2 is free for TTS and other workloads. The original operational choice was correct for the wrong stated reason, and now we have the data to explain it properly.

TP=2 Full Benchmark Results

For completeness, the full TP=2 sweep across all depths tested (d = context depth in tokens, c = concurrent requests):

Test	t/s total	t/s per req	peak t/s
tg128 d=0 c1	8.1	8.1	9.7
tg128 d=0 c4	25.5	6.9	33.3
tg128 d=0 c8	40.9	5.9	58.7
tg256 d=0 c1	8.7	8.7	10.3
tg256 d=0 c8	41.7	5.6	50.7
tg128 d=4096 c1	6.9	6.9	8.0
tg128 d=16384 c1	8.7	8.7	9.7
tg128 d=65536 c1	6.0	6.0	7.0
pp512 d=0 c1	601	601	—

No OOM events across the full sweep. The server was stable. It just wasn't fast.