What This Is
Milo needs a voice. Not a generic TTS voice — a real one. The plan is to fine-tune a voice model on James's audio and deploy it as the primary TTS for the MiloBridge voice pipeline: real-time, local, zero cloud dependency.
Will Prowse is the test case. His voice is clean, expressive, and he has hundreds of hours of podcast-quality audio publicly available. If the system can clone Will convincingly, it can clone anyone with a decent corpus.
The model is Qwen3-TTS-12Hz-1.7B, Alibaba's open TTS architecture, which uses a 12Hz codec and a 16-layer codec head to generate audio autoregressively. Fine-tuning injects a new speaker identity into a specific embedding slot rather than modifying the architecture.
Training Configuration — v8
Version History — What Failed and Why
| Run | LR | Epochs | Result | Status |
|---|---|---|---|---|
| v1–v2 | varied | — | FasterQwen3TTS (0.6B) + LoRA adapter approach. Wrong architecture — v8 uses full 1.7B model with merged weights, not LoRA delta. | dead end |
| v3 | unknown | 25 | Epoch-3 checkpoint was the only one that ever produced valid audio. Higher epochs degraded. Set the benchmark for what "working" sounds like. | reference |
| v4 | — | — | Data pipeline only — produced train_with_codes.jsonl (361 samples). No training output; used as data source for v5+. | data |
| v5 | 1e-4 | ~10 | Catastrophic overfit. Loss collapsed, output became noise. lr=1e-4 is too aggressive for a 361-sample corpus. | overfit |
| v6 | 1e-5 | unknown | Produced a 15MB output file but garbage audio. Too conservative — model didn't shift far enough from the base in available epochs. | underfit |
| v7 | — | — | Intermediate experiment, results not documented. | skipped |
| v8 | 2e-5 | 25/50 | Loss stable 9.7–10.8, no collapse, inference confirmed working. RTF 1.15, 7.28s audio generated clean. Current production candidate. | current |
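Since every run from v5 onward trains on the v4 data file, a quick sample count is a cheap sanity check before launching a run. The sketch below only assumes that train_with_codes.jsonl holds one JSON value per line. The function name and the demo records are illustrative and do not reflect the real schema.

```python
import json
import tempfile
from pathlib import Path


def count_jsonl_samples(path):
    """Count non-empty lines that parse as JSON; raise on a malformed line."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}: line {line_number} is not valid JSON") from exc
            count += 1
    return count


# Demo on a throwaway file; the "id" field is a placeholder, not the real schema.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.jsonl"
    demo_file.write_text('{"id": 1}\n\n{"id": 2}\n{"id": 3}\n', encoding="utf-8")
    demo_count = count_jsonl_samples(demo_file)

print(demo_count)
```

Pointed at the real file, the expected count for the v4 corpus is 361.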
The Inference Fix That Took Too Long
Every test before v8 failed at inference with a Hugging Face repo-format mismatch. The training script (sft_12hz.py) copies the base model into each checkpoint directory with shutil.copytree(), but the config.json inside retains the original _name_or_path field, which still points to the Hugging Face registry slug.
The correct call needs no patches to config.json: load the checkpoint with Qwen3TTSModel directly. Do not use FasterQwen3TTS (the 0.6B model) with PeftModel.from_pretrained() for LoRA adapter loading. v8 checkpoints are full merged model weights, so the LoRA approach is wrong and will produce garbage or fail outright.
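To see what a copied checkpoint actually carries, the _name_or_path field can be read straight out of its config.json. This is a minimal diagnostic sketch using only the standard library. The helper names are hypothetical, and the demo value "example-org/example-model" is a placeholder rather than the real registry slug.

```python
import json
import tempfile
from pathlib import Path


def read_name_or_path(checkpoint_dir):
    """Return the _name_or_path value from a checkpoint's config.json, or None if absent."""
    config_path = Path(checkpoint_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("_name_or_path")


def looks_like_local_path(value):
    """Heuristic: True only when the value names an existing directory on disk."""
    return value is not None and Path(value).expanduser().is_dir()


# Demo: a fake checkpoint directory whose config still names a registry-style slug.
with tempfile.TemporaryDirectory() as tmp:
    fake_config = {"_name_or_path": "example-org/example-model"}  # placeholder value
    (Path(tmp) / "config.json").write_text(json.dumps(fake_config), encoding="utf-8")
    demo_value = read_name_or_path(tmp)
    demo_is_local = looks_like_local_path(demo_value)

print(demo_value, demo_is_local)
```

A value that is not a local directory is the condition described above: the checkpoint was copied locally, but its config still refers to the registry.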
Results
The epoch-24 checkpoint loads cleanly and generates audio without errors. An RTF of 1.15 without flash-attn means generation is slightly slower than real time. That is acceptable for now, and RTF is expected to drop below 1.0 once flash-attn is installed.
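The RTF figure can be unpacked with simple arithmetic. The sketch assumes the usual definition, RTF = generation time / audio duration, which this write-up does not spell out. The function names are illustrative.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = wall-clock generation time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def generation_seconds_from_rtf(rtf, audio_seconds):
    """Invert the definition: how long generation took for a clip of a given length."""
    return rtf * audio_seconds


def is_faster_than_real_time(rtf):
    """An RTF below 1.0 means audio is produced faster than it plays back."""
    return rtf < 1.0


# Figures reported for v8: RTF 1.15 on a 7.28 s clip.
v8_generation_seconds = generation_seconds_from_rtf(1.15, 7.28)
print(f"{v8_generation_seconds:.2f} s to generate 7.28 s of audio")
print("faster than real time:", is_faster_than_real_time(1.15))
```

Under that definition, the 7.28 s clip took about 8.37 s to generate, roughly 1.1 s longer than its own playback time.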
The voice quality question is still open — "working" means the pipeline runs without errors and produces valid audio. Whether it actually sounds like Will Prowse is a human judgment call. If it doesn't, the most likely culprit is reference audio quality, not epoch count or learning rate. Fix the corpus, retrain for 5 epochs.
What's Next
- Evaluate ep-24 audio quality against the v3 ep-3 reference
- If voice character is wrong: audit will-corpus-v4 for noise, compression artifacts, and speaker consistency
- If voice character is right: build a FastAPI inference server wrapping gen_will_v8.py, wire into MiloBridge v2 as the primary TTS endpoint
- Record James's voice corpus (30–60 min, scripted) for Milo's actual production voice
- LoRA fine-tune Cindy's voice using the same pipeline
~/clawd/projects/voice-pipeline/VOICE-TRAINING-PLAYBOOK.md.