You press a button on your AirPods. You say something. Ninety milliseconds later, text appears on the inside of your glasses — your own words, transcribed. A beat. Then a voice answers through your earbuds while the response scrolls across the lens. The voice isn't a stock TTS. It's a cloned voice you picked, running on a GPU under your desk.
That's MiloBridge v2. Today we validated the full end-to-end pipeline and shipped Phase 3: zero-shot voice cloning. This is a lab log of how it works, what broke, and why captions matter more than latency.
MiloBridge is an iOS app that turns AirPods and Even G2 smart glasses into a hands-free interface for Milo — the AI agent running on our Mac Studio. You tap to talk. The audio stays local: speech-to-text runs on-device via Apple Neural Engine. Everything else flows through a WebSocket proxy on the LAN to backend services — no cloud STT, no cloud TTS.
The components, in order:
- iOS app: push-to-talk triggered through MPRemoteCommandCenter. Audio captured at 24kHz PCM through AVAudioEngine.
- WebSocket proxy: binary frames typed 0x01 TTS audio, 0x02 transcript, 0x03 metadata, 0x04 end-of-utterance. A LaunchAgent keeps it alive across reboots.
- TTS: faster-qwen3-tts with Qwen3-TTS-0.6B-Base on DGX Spark 2, port 8767. Zero-shot cloning from a 4.5-second reference clip. 1.4s generation time, RTF 0.46 (faster than real-time).

The pipeline diagram looks clean. Getting there was not. Five distinct issues had to be found and fixed today before the first end-to-end utterance worked. These are the interesting parts.
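Before the debugging stories, the proxy's frame types can be made concrete. This is a hypothetical Python sketch assuming a 1-byte type plus 4-byte big-endian length header; only the type bytes come from the protocol description above:

```python
import struct

# Frame type bytes from the MiloBridge proxy protocol.
FRAME_TTS_AUDIO = 0x01
FRAME_TRANSCRIPT = 0x02
FRAME_METADATA = 0x03
FRAME_END_OF_UTTERANCE = 0x04


def encode_frame(frame_type: int, payload: bytes) -> bytes:
    """Pack [type:1][length:4, big-endian][payload].
    Header layout is an assumption, not the documented wire format."""
    return struct.pack(">BI", frame_type, len(payload)) + payload


def decode_frame(buf: bytes) -> tuple[int, bytes, bytes]:
    """Return (frame_type, payload, remaining buffer)."""
    frame_type, length = struct.unpack(">BI", buf[:5])
    return frame_type, buf[5:5 + length], buf[5 + length:]


frame = encode_frame(FRAME_TRANSCRIPT, "turn on the lights".encode())
ftype, payload, rest = decode_frame(frame)
```

Typed frames like this are why the G2 can show the transcript (0x02) the moment STT finishes, independently of the TTS audio (0x01) arriving later.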
The STT engine returns confidence scores alongside the transcript — useful for the LLM to know how much to trust what it heard, useless for a human reading a HUD. The annotate_transcript() function was sending the annotated version everywhere: the G2 lens was displaying text cluttered with bracketed confidence values.
Fix: have the function return a (clean, annotated) tuple. The HUD gets the clean text. The LLM gets the annotated version with confidence metadata. Two consumers, two formats.
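A minimal sketch of the split, assuming the STT engine yields (word, confidence) pairs (the real annotate_transcript() signature isn't shown here):

```python
def annotate_transcript(words):
    """Return (clean, annotated) from STT output.
    `words` is assumed to be a list of (word, confidence) pairs."""
    clean = " ".join(w for w, _ in words)                      # for the G2 HUD
    annotated = " ".join(f"{w} [{c:.2f}]" for w, c in words)   # for the LLM
    return clean, annotated


clean, annotated = annotate_transcript(
    [("turn", 0.97), ("on", 0.95), ("the", 0.99), ("lites", 0.41)]
)
# clean: "turn on the lites" -- no brackets on the lens
```

The LLM sees that "lites" carried low confidence; the human just reads the sentence.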
The proxy was sending inference requests to the OpenClaw gateway with model identifier openclaw/milo-voice. That agent doesn't exist. Never did — it was a placeholder from early development that never got updated.
Switched to Anthropic direct via API key. This is deliberate: the voice path has no tools. Claude Haiku for voice is a pure text-in/text-out call with a dynamic system prompt. No tool calling means no accidental side effects from a misheard utterance — important when your AI can control your house.
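A sketch of what that tool-free call might look like as a request body. Field names follow the Anthropic Messages API; the helper function, model id, and token budget are illustrative assumptions:

```python
def build_voice_request(system_prompt: str, history: list, utterance: str) -> dict:
    """Request body for a pure text-in/text-out Messages API call.
    Deliberately no "tools" key: a misheard utterance can at worst
    produce wrong text, never a side effect."""
    return {
        "model": "claude-3-5-haiku-latest",  # placeholder model id
        "max_tokens": 300,                   # short replies suit a voice channel
        "system": system_prompt,             # rebuilt fresh each turn
        "messages": history + [{"role": "user", "content": utterance}],
    }


req = build_voice_request("You are Milo.", [], "dim the office lights")
```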
The TTS endpoint the proxy was hitting — Kokoro on Spark 2 — had moved. The port was still open but the service behind it had changed. Rerouted first to Orpheus FastAPI on Mac Studio as a fallback, then today replaced the whole thing with the Will Prowse voice clone on Spark 2 at port 8767.
Three TTS backends in one debugging session. The joy of local infrastructure: when something moves, nothing tells you.
This one was subtle. AirPods would connect, start recording, then silently disconnect after a few seconds. The AVAudioSession configuration had .allowBluetoothA2DP set — which enables Bluetooth output (speakers). But push-to-talk needs the microphone, which requires the HFP (Hands-Free Profile) Bluetooth profile.
The missing flag: .allowBluetooth. A2DP is output-only. HFP is the profile that enables the mic. Without it, iOS connects the AirPods for playback but drops the recording route.
The Anthropic inference path rebuilds the system prompt dynamically on every turn — it injects fresh context from OpenViking (our memory system) so the LLM always has current information about the user. But the conversation history array still had a stale static copy of the system prompt sitting at conversation[0] from initial setup. It wasn't causing errors today, but it was a footgun: two system prompts diverging silently, with the stale one accumulating dead context over time.
Removed it. One system prompt, rebuilt fresh each turn. No ghosts.
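The resulting turn shape, sketched with hypothetical names: the system prompt is assembled fresh each call and never stored in the history, so nothing stale can accumulate at index 0:

```python
def build_turn(conversation: list, utterance: str, memory_context: str) -> dict:
    """Assemble one inference call. The system prompt is rebuilt here
    every turn from fresh memory context and is never appended to
    `conversation` -- so no stale copy can sit at conversation[0]."""
    conversation.append({"role": "user", "content": utterance})
    return {
        "system": "You are Milo. Current context:\n" + memory_context,
        "messages": list(conversation),
    }


history = []
call = build_turn(history, "status?", "battery at 82%")
# history holds only user/assistant turns; the system prompt lives outside it
```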
Phase 3 of MiloBridge is voice. Not stock TTS — a cloned voice.
We picked Will Prowse. He's a solar and off-grid YouTuber with a clean, calm speaking voice — the kind of voice you'd actually want answering questions in your ear all day. The clone uses faster-qwen3-tts with the Qwen3-TTS-0.6B-Base model, zero-shot from a 4.5-second reference clip extracted from one of his videos.
First test output: "Hey James, your voice pipeline is running on the Spark."
Generation time: 1.4 seconds. Real-time factor: 0.46 — meaning the model generates speech roughly twice as fast as you'd speak it. Running on DGX Spark 2's GPU at port 8767.
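RTF here is generation time divided by audio duration, so those two numbers imply roughly three seconds of speech from 1.4 seconds of compute:

```python
generation_time = 1.4                    # seconds of GPU time (measured)
rtf = 0.46                               # real-time factor: generation / audio
audio_duration = generation_time / rtf   # ~3.04 s of speech produced
speedup = 1 / rtf                        # ~2.2x faster than real time
```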
Zero-shot cloning from under five seconds of audio is surprisingly good. Not perfect — there's a slight flatness in prosody on longer sentences, and it occasionally picks up background texture from the reference clip. But for a voice assistant in your ear, it's well past the threshold of usable.
A LoRA fine-tune is running overnight on Spark 2 for better quality. That's Phase 3.5 — not done yet, just cooking.
The Even G2 glasses aren't just an output device. They're the trust layer.
When you speak, the G2 lens shows your transcript immediately — frame type 0x02 fires as soon as STT finishes, before the LLM even starts processing. You see what Milo heard. If the transcript is wrong, you know before it acts on bad input.
After the LLM responds, the response text appears on the lens simultaneously with audio playback through AirPods. Read or listen. Both work.
There's also confidence-aware routing: if STT confidence is low, the system responds with "🔄 Say again?" instead of firing the LLM call. Cheap early exit that avoids wasting inference on garbage input.
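A minimal sketch of that early exit, with hypothetical names; the actual confidence floor isn't given here, so 0.55 is an assumption:

```python
def route_utterance(transcript: str, confidence: float, call_llm, floor: float = 0.55):
    """Confidence-aware routing: below the floor, ask the user to
    repeat instead of spending an LLM call on garbage input.
    The 0.55 floor is an assumed threshold."""
    if confidence < floor:
        return "🔄 Say again?"
    return call_llm(transcript)


reply = route_utterance("mmph lights", 0.31, call_llm=lambda t: "(LLM reply)")
# low confidence -> the LLM lambda is never invoked
```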
Design principle: Any real-time agentic audio interface must include a caption option. Human speaking → show partial transcripts in real time. Agent speaking → display response text simultaneously with audio. This is a baseline requirement, not a feature. Captions enable error correction, mute-and-read as a valid use mode, and transparency about what the agent understood.
...5450, CRC-16/CCITT packet framing

Everything on the LAN. The only external call is Claude Haiku for inference. STT is on-device. TTS is on the Spark. The proxy is on the Mac Studio. If the internet goes down, you lose the LLM — but transcription and display still work.
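Since the G2 framing names CRC-16/CCITT, here is one common variant as a sketch: CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF, no reflection). Whether the glasses use this exact variant is an assumption:

```python
def crc16_ccitt_false(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection,
    no final XOR. A common choice for short-packet framing."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc
```

The standard check value for this variant is 0x29B1 over the ASCII string "123456789", which makes it easy to confirm you've implemented the same flavor as the device on the other end.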
MiloBridge is open source. The pipeline runs on hardware you can buy — a Mac Studio and a DGX Spark — with no cloud dependencies except the LLM call. If you're building something similar with local voice and wearables, we'd like to hear about it.