This is the companion post to our STT research. We'd settled the input side — FluidAudio CoreML Parakeet running on the Apple Neural Engine at ~245ms (warm real utterances), zero Metal contention. Now we needed to settle the output side: which TTS engine should Milo actually speak through?

The use case is a real-time voice pipeline running on our local infrastructure: a DGX Spark 2 and a Mac Studio M3 Ultra. No cloud dependency, no per-character billing, sub-500ms first audio chunk ideally. And there's a longer-term goal that shapes every decision here: eventually fine-tuning a custom voice so Milo actually sounds like Milo — not like a generic AI assistant, not like a YouTube narrator, and not like a text-to-speech demo from 2019.

Six models made the bench. Here's what we found.


The Candidates

Kokoro is an 82M-parameter model, Apache 2.0, with 67 preset voices. It's tiny, fast, and runs on GPU with no voice cloning. Think of it as the Parakeet of TTS — small, efficient, surprisingly capable.

ElevenLabs Flash v2.5 is our cloud reference point. It's what we'd been using for Twilio calls via the voice pipeline. Best-in-class quality, no fine-tuning, pay per character.

Orpheus Q4_K_M is a 3B parameter model from Canopy Labs, quantized to Q4, running on Spark 2 at about 96 tok/s via llama.cpp with CUDA. It supports fine-tuning and has an expressive, emotion-capable voice. Apache 2.0.

Qwen3-TTS 1.7B is Alibaba's open TTS model — and the one we're most interested in long-term because it has a clear fine-tuning path. Apache 2.0. Running on Spark 2 GPU.

Chatterbox Turbo from Resemble AI made headlines for beating ElevenLabs in blind listening tests (63.75% preference). MIT license. Sub-300ms on GPU.

Sesame CSM-1B is the most interesting architecture in the field right now. It conditions on prior conversational turns with their original audio — meaning it can match emotional tone and speaking style from earlier in the conversation. We had to request HuggingFace gated access and wait for approval during the session. Once live, first inference took 5.3 seconds on the GB10 Blackwell with 6.1GB VRAM.


The Benchmark

We measured time-to-first-audio (TTFA) — the wall time from sending the TTS request to receiving the first playable audio chunk. This is what actually matters for conversational feel. Total generation time matters less; the first chunk determines whether the conversation feels responsive or robotic.
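
Concretely, TTFA can be measured against any streaming transport with a small timer. A sketch, assuming the engine exposes its audio as a chunk iterator (the helper names here are ours, not any library's API):

```python
import time
from typing import Iterable, Tuple

def measure_ttfa(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Time-to-first-audio and total generation time for a chunk stream.

    `chunks` is any iterable yielding raw audio chunks (bytes), e.g. a
    streaming HTTP response body or a WebSocket message iterator.
    Returns (ttfa_seconds, total_seconds, n_chunks).
    """
    start = time.perf_counter()
    ttfa = None
    n = 0
    for chunk in chunks:
        if ttfa is None:
            ttfa = time.perf_counter() - start  # first playable chunk arrived
        n += 1
    total = time.perf_counter() - start
    if ttfa is None:
        raise RuntimeError("stream produced no audio")
    return ttfa, total, n

# Usage with a fake stream that delays before the first chunk:
def fake_stream():
    time.sleep(0.05)          # model "thinks" for 50ms
    yield b"\x00" * 1024      # first chunk: TTFA clock stops here
    time.sleep(0.02)
    yield b"\x00" * 1024

ttfa, total, n = measure_ttfa(fake_stream())
```

The same helper works for cloud and local engines alike, which is what makes the table below apples-to-apples.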

All tests used the same sentence. Each model ran on its best available hardware (Spark 2 GPU where possible, Mac Studio MPS otherwise).

Model                     TTFA                          Hardware             License          Fine-tune?     Voice Cloning?
Orpheus Q4_K_M            ⚡ 52ms (llama-server TTFA)   Spark 2 GPU          Apache 2.0       Yes            No
Kokoro                    129ms                         Spark 2 GPU          Apache 2.0       No             No
FasterQwenTTS 0.6B-Base   ✨ 274ms (streaming)          Spark 2 GPU          Apache 2.0       Yes            Yes (ref audio)
Chatterbox Turbo          <300ms*                       Spark 2 GPU          MIT              No             Yes
ElevenLabs Flash v2.5     369ms                         Cloud                Commercial SaaS  No             Yes
Sesame CSM-1B             ❌ 5,300ms                    Spark 2 GPU (6.1GB)  Non-commercial   Yes (Unsloth)  Yes (context audio)

* Chatterbox server was offline during final benchmark pass; figure from initial install test.
✨ FasterQwenTTS result uses CUDA graph capture (chunk_size=4 streaming). Earlier 5,203ms figure was measured with TORCH_CUDNN_V8_API_DISABLED=1 which prevented CUDA graphs entirely — not a model limitation, a config bug.
⚡ Orpheus 52ms is the llama-server TTFA measured directly against :8766. The earlier 6,832ms figure was from the FastAPI buffered endpoint which generates the full audio before sending — not a model limitation. Direct streaming from llama-server bypasses the buffer entirely.
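
The buffered-endpoint failure mode generalizes: if a server accumulates the entire clip before sending, TTFA degenerates into total generation time. A toy simulation (the per-chunk timings are hypothetical, not Orpheus's real ones):

```python
def ttfa_streamed(chunk_times_s):
    """TTFA when chunks are forwarded as they are generated:
    the listener hears audio as soon as the first chunk exists."""
    return chunk_times_s[0]

def ttfa_buffered(chunk_times_s):
    """TTFA when the server generates the whole clip before sending
    (the buffered FastAPI behavior described above): TTFA collapses
    into total generation time."""
    return sum(chunk_times_s)

# Hypothetical per-chunk generation times for a ~7s clip:
chunks = [0.05] + [0.07] * 96   # first chunk ready at 50ms, ~97 chunks total

fast = ttfa_streamed(chunks)    # 0.05s: what the listener could have
slow = ttfa_buffered(chunks)    # ~6.8s: what the buffer made them wait
```

Same model, same tokens per second; the two orders of magnitude live entirely in the plumbing.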

Sesame CSM-1B was benchmarked live on Spark 2's GB10 Blackwell (13.9s model load, 6.1GB VRAM). First inference took 5.3 seconds — slow enough to disqualify it for real-time use even with sentence chunking. The architecture remains the most interesting in the field (conditioning on prior conversation audio enables genuine emotional continuity), but the non-commercial license is a hard stop for any business use. Filed as a research bookmark for when the license situation and speed change.

April 2026 Update — FasterQwenTTS changes the picture
After fixing a broken CUDA config (TORCH_CUDNN_V8_API_DISABLED=1 was preventing CUDA graph capture entirely), Qwen3-TTS via the FasterQwenTTS library went from 5,203ms to 274ms TTFA in streaming mode. It also supports zero-shot voice cloning via reference audio — no 60-minute recording session needed for a first approximation. This makes it competitive with Kokoro on speed while adding voice cloning. We're now running 0.6B-Base (not 1.7B-CustomVoice) to unlock the cloning path.
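
Config bugs in this class are cheap to guard against at startup. A minimal sketch (the helper and the variable list are ours, not part of FasterQwenTTS or PyTorch):

```python
import os
import warnings

# Env vars that silently disable CUDA graph capture paths in PyTorch.
# TORCH_CUDNN_V8_API_DISABLED is the one that bit us; audit your own
# stack for others.
_GRAPH_BREAKERS = ("TORCH_CUDNN_V8_API_DISABLED",)

def check_cuda_graph_env() -> list:
    """Return (and warn about) any set-and-truthy graph-breaking env
    vars, so a misconfigured shell can't quietly cost 5 seconds of
    TTFA again."""
    offenders = []
    for name in _GRAPH_BREAKERS:
        if os.environ.get(name, "") not in ("", "0"):
            warnings.warn(
                f"{name}={os.environ[name]} will prevent CUDA graph capture"
            )
            offenders.append(name)
    return offenders
```

Calling this once before model load turns a 5,203ms-vs-274ms mystery into a one-line warning.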

A seventh model — Fish Speech S2-Pro — was tested but eliminated for two reasons: the bridge code had a framing bug that produced no audio output, and more importantly, the Fish Audio Research License explicitly prohibits commercial use without a separate paid agreement, including internal business operations.


What The Numbers Actually Mean

In the initial pass there was a gap between Kokoro (129ms) and everything else. ElevenLabs at 369ms is nearly 3x slower — and that's a cloud round-trip. Kokoro is running locally on a GB10 Blackwell and still wins. That's a meaningful result.

The models that benchmarked slow in that pass (Qwen3-TTS, Orpheus, CSM-1B) all landed in the 5-7 second range. (The table footnotes later traced the first two to config and buffering bugs; the figure remains real for CSM-1B.) In an end-to-end voice pipeline where LLM latency dominates at roughly 6-7 seconds, adding another 5+ seconds of TTS latency pushes the total past 12 seconds, which is conversationally dead. At those speeds a model needs streaming sentence-chunked output to work at all (and we built that; first-sentence delivery brings the felt latency down considerably).
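
The sentence-chunking layer is mostly a re-buffering problem: accumulate LLM tokens, hand a sentence to TTS the moment its boundary appears. A simplified sketch (real boundary detection also needs to handle abbreviations, numbers, and so on):

```python
import re
from typing import Iterable, Iterator

_SENTENCE_END = re.compile(r"([.!?])\s+")

def sentence_chunks(token_stream: Iterable[str]) -> Iterator[str]:
    """Re-chunk a streaming LLM token stream into sentences so each
    sentence can go to TTS as soon as it completes. Felt latency
    becomes first-sentence time, not full-response time."""
    buf = ""
    for token in token_stream:
        buf += token
        while True:
            m = _SENTENCE_END.search(buf)
            if not m:
                break
            yield buf[: m.end(1)]   # emit up to and including the punctuation
            buf = buf[m.end():]     # keep the remainder for the next sentence
    if buf.strip():
        yield buf.strip()           # flush whatever trails the last boundary

sentences = list(sentence_chunks(["Hello there. How", " are you? Fine."]))
```

Each yielded sentence is an independent TTS request, so the pipeline's felt TTFA is bounded by one sentence, not the whole response.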

But the real reason those slower models are interesting isn't their current speed. It's that they're fine-tunable.


The Actual Goal: A Voice That Sounds Like Milo

Every voice assistant you've heard sounds like a voice assistant. There's a specific quality — slightly too smooth, slightly too neutral, slightly too not-a-person — that telegraphs "this is a TTS engine" within the first sentence. ElevenLabs gets closer than most. Kokoro gets surprisingly close for its size. But neither sounds like a specific person.

We want Milo to sound like Milo. And since Milo doesn't have a body, the practical version of that is: Milo sounds like James, but speaking as Milo.

The plan is to fine-tune Qwen3-TTS 1.7B on a voice corpus recorded by James — 30 to 60 minutes of clean audio, varied tone and pace, scripts written by Milo for breadth of phoneme coverage. The fine-tuning pipeline (sruckh/Qwen3-TTS-finetune) supports automated data preparation, fits in 12GB VRAM on Spark 2, and takes roughly 2-4 hours to train. The result deploys to the existing Qwen3-TTS endpoint — no infrastructure changes.
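
For the recording session itself, it's worth validating the corpus before burning a training run. A sketch (the manifest fields and per-clip bounds are our guesses, not requirements of the sruckh/Qwen3-TTS-finetune pipeline):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Clip:
    path: str
    duration_s: float
    transcript: str

def validate_corpus(clips: List[Clip],
                    min_total_s: float = 30 * 60,   # 30-minute floor
                    max_total_s: float = 60 * 60,   # 60-minute ceiling
                    min_clip_s: float = 2.0,
                    max_clip_s: float = 20.0) -> Tuple[bool, List[str]]:
    """Check a recording session against the corpus targets in the post.
    Per-clip bounds are assumptions: very short clips carry little
    prosody; very long ones complicate alignment."""
    issues = []
    total = sum(c.duration_s for c in clips)
    for c in clips:
        if not (min_clip_s <= c.duration_s <= max_clip_s):
            issues.append(f"{c.path}: {c.duration_s:.1f}s outside "
                          f"[{min_clip_s}, {max_clip_s}]s")
        if not c.transcript.strip():
            issues.append(f"{c.path}: missing transcript")
    if not (min_total_s <= total <= max_total_s):
        issues.append(f"total {total / 60:.1f} min outside [30, 60] min target")
    return (len(issues) == 0, issues)
```

Running this against the session output catches the expensive failure (too little usable audio) before the 2-4 hour train, not after.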

This path has clean IP ownership (James's voice, James's model, running on James's hardware), Apache 2.0 licensing throughout, and no dependency on ElevenLabs ToS. It's also what unlocks a consistent voice for the YouTube channel if that happens.

So the current state is: Kokoro as the working interim voice, Qwen3-TTS as the fine-tune target once the recording session happens.


A Note on Architecture: Context-Aware TTS

The model we're watching most carefully long-term is Sesame CSM-1B — not because it's fast (it isn't), but because its architecture is genuinely different. Most TTS models take text in, produce audio out, and forget everything. CSM-1B conditions on the actual audio waveforms from prior turns in the conversation — so if James sounds tired in his last question, Milo's response can carry a slightly softer tone. If the previous exchange was tense, CSM-1B knows that.
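
The shape of that conditioning is easy to sketch. The Segment structure below mirrors the idea behind CSM-1B's published interface, but treat the names as assumptions; the actual generate call is commented out because it needs the gated weights:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Segment:
    """One prior conversational turn: its text, speaker id, and (the
    part most TTS models throw away) the original audio."""
    text: str
    speaker: int
    audio_path: Optional[str] = None  # waveform of how it was actually said

def build_context(history: List[Segment], max_turns: int = 4) -> List[Segment]:
    """Keep the last few turns as conditioning context. If the user
    sounded tired in the last question, that audio rides along."""
    return history[-max_turns:]

history = [
    Segment("Long day. Can you check my calendar?", speaker=0,
            audio_path="turn_07.wav"),
    Segment("Sure, tomorrow's clear after 2pm.", speaker=1,
            audio_path="turn_08.wav"),
]
context = build_context(history)

# With the gated CSM-1B weights loaded (call shape assumed, not verified):
# audio = generator.generate(text="Want me to block the morning off?",
#                            speaker=1, context=context)
```

A text-in/audio-out model sees none of `history`; a context-conditioned one sees all of it, audio included. That is the architectural difference in one line.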

That's not a feature. That's a different relationship with what a voice is.

We'll revisit CSM-1B once the speed situation improves, and once the non-commercial license constraint resolves. For now it's a research bookmark.


Where We Landed

For the MiloBridge v2 voice pipeline:

  • STT: FluidAudio CoreML Parakeet TDT v3 — ~245ms (warm), local, free, ANE (no Metal contention)
  • LLM: OpenClaw (Haiku) — full tools, memory, personality; 1,116ms TTFT vs ~2,300ms for Sonnet
  • TTS (speed): Kokoro — 129ms, local, Apache 2.0, am_eric voice
  • TTS (voice clone): FasterQwenTTS 0.6B-Base — 274ms streaming, zero-shot clone via ref audio, Apache 2.0
  • TTS (long-term): Qwen3-TTS fine-tuned on James's voice corpus

End-to-end first audio depends heavily on which LLM is in the loop. LLM response time is the bottleneck — roughly 60–85% of total latency depending on the model. The STT and TTS together are now under 500ms.

E2E Pipeline Latency — Measured (April 2026)

Benchmark: question "Tell me something interesting about the ocean", 3-pass average. STT: FluidAudio CoreML (ANE). TTS: FasterQwenTTS streaming. Real-time WebSocket voice pipeline.

LLM                          STT      LLM TTFT   TTS First Chunk   First Audio (TTFA)
Haiku (claude-haiku-4-5)     255ms    1,116ms    190ms             1,561ms
Sonnet (claude-sonnet-4-6)   ~255ms   ~6,000ms   ~274ms            ~7,200ms

✨ Haiku measured live, 3-pass avg. Sonnet estimated from prior blog measurements. TTFA = time from end of user speech to first audio byte.

Haiku is 4.6× faster to first audio than Sonnet — 1.6 seconds vs 7.2 seconds. For a voice conversation, that's the difference between a natural pause and an awkward silence. The tradeoff is response quality and tool-calling capability (Sonnet handles complex tasks, memory, and agentic work that Haiku can't).
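
The comparison reduces to additive latency in a serial pipeline. A minimal sketch with the measured Haiku numbers from the table (the function name is ours):

```python
def first_audio_ms(stt_ms: float, llm_ttft_ms: float,
                   tts_first_chunk_ms: float) -> float:
    """In a serial pipeline, end-to-end TTFA is the sum of stages:
    STT finalizes, the LLM emits its first sentence, TTS emits its
    first chunk."""
    return stt_ms + llm_ttft_ms + tts_first_chunk_ms

haiku_ttfa = first_audio_ms(255, 1116, 190)   # matches the measured 1,561ms row

# The estimated Sonnet row (~7,200ms) sits above the plain sum of its
# stage estimates (255 + 6,000 + 274 = 6,529ms); the remainder is
# pipeline overhead baked into the earlier measurement it was derived from.
```

The sum also shows where optimization effort pays off: the LLM TTFT term dwarfs everything else, which is exactly why the Haiku/Sonnet choice dominates the felt latency.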

With FasterQwenTTS replacing Kokoro and FluidAudio CoreML replacing Parakeet MLX (running on the ANE, zero Metal contention with the LLM), the combined STT+TTS overhead drops to roughly 400–500ms — and voice cloning is now possible without a dedicated training run.

Orpheus TTS update: The existing Orpheus llama-server (Q4_K_M, :8766) is already at 52ms TTFA — faster than previously measured. vLLM adds nothing here; it's optimized for batched concurrent requests, not single-stream TTS. The current setup is already near the ceiling for this model size.

The recording session is the next real milestone. When that happens, we'll write the follow-up.

All proxy server code, benchmark scripts, and the iOS app are open source at github.com/jmeadlock/MiloBridge.