You press a button on your AirPods. You say something. Ninety milliseconds later, text appears on the inside of your glasses — your own words, transcribed. A beat. Then a voice answers through your earbuds while the response scrolls across the lens. The voice isn't a stock TTS. It's a cloned voice you picked, running on a GPU under your desk.
That's MiloBridge v2. Today we validated the full end-to-end pipeline and shipped Phase 3: zero-shot voice cloning. This is a lab log of how it works, what broke, and why captions matter more than latency.
MiloBridge is an iOS app that turns AirPods and Even G2 smart glasses into a hands-free interface for Milo — the AI agent running on our Mac Studio. You tap to talk. The audio stays local: speech-to-text runs on-device via Apple Neural Engine. Everything else flows through a WebSocket proxy on the LAN to backend services — no cloud STT, no cloud TTS.
The components, in order:
- iOS app: push-to-talk triggered through MPRemoteCommandCenter. Audio captured at 24kHz PCM through AVAudioEngine.
- WebSocket proxy: binary frames typed 0x01 TTS audio, 0x02 transcript, 0x03 metadata, 0x04 end-of-utterance. A LaunchAgent keeps it alive across reboots.
- TTS: faster-qwen3-tts with Qwen3-TTS-0.6B-Base on DGX Spark 2, port 8767. Zero-shot cloning from a 4.5-second reference clip. 1.4s generation time, RTF 0.46 (faster than real-time).

The pipeline diagram looks clean. Getting there was not. Five distinct issues had to be found and fixed today before the first end-to-end utterance worked. These are the interesting parts.
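Before the debugging stories, the proxy's frame types can be made concrete. This is a hypothetical Python sketch assuming a 1-byte type plus 4-byte big-endian length header; only the type bytes come from the protocol description above:

```python
import struct

# Frame type bytes from the MiloBridge proxy protocol.
FRAME_TTS_AUDIO = 0x01
FRAME_TRANSCRIPT = 0x02
FRAME_METADATA = 0x03
FRAME_END_OF_UTTERANCE = 0x04


def encode_frame(frame_type: int, payload: bytes) -> bytes:
    """Pack [type:1][length:4, big-endian][payload].
    Header layout is an assumption, not the documented wire format."""
    return struct.pack(">BI", frame_type, len(payload)) + payload


def decode_frame(buf: bytes) -> tuple[int, bytes, bytes]:
    """Return (frame_type, payload, remaining buffer)."""
    frame_type, length = struct.unpack(">BI", buf[:5])
    return frame_type, buf[5:5 + length], buf[5 + length:]


frame = encode_frame(FRAME_TRANSCRIPT, "turn on the lights".encode())
ftype, payload, rest = decode_frame(frame)
```

Typed frames like this are why the G2 can show the transcript (0x02) the moment STT finishes, independently of the TTS audio (0x01) arriving later.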
The STT engine returns confidence scores alongside the transcript — useful for the LLM to know how much to trust what it heard, useless for a human reading a HUD. The annotate_transcript() function was sending the annotated version everywhere: the G2 lens was displaying text cluttered with bracketed confidence values.
Fix: have the function return a (clean, annotated) tuple. The HUD gets the clean text. The LLM gets the annotated version with confidence metadata. Two consumers, two formats.
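A minimal sketch of the split, assuming the STT engine yields (word, confidence) pairs (the real annotate_transcript() signature isn't shown here):

```python
def annotate_transcript(words):
    """Return (clean, annotated) from STT output.
    `words` is assumed to be a list of (word, confidence) pairs."""
    clean = " ".join(w for w, _ in words)                      # for the G2 HUD
    annotated = " ".join(f"{w} [{c:.2f}]" for w, c in words)   # for the LLM
    return clean, annotated


clean, annotated = annotate_transcript(
    [("turn", 0.97), ("on", 0.95), ("the", 0.99), ("lites", 0.41)]
)
# clean: "turn on the lites" -- no brackets on the lens
```

The LLM sees that "lites" carried low confidence; the human just reads the sentence.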
The proxy was sending inference requests to the OpenClaw gateway with model identifier openclaw/milo-voice. That agent doesn't exist. Never did — it was a placeholder from early development that never got updated.
Switched to Anthropic direct via API key. This is deliberate: the voice path has no tools. Claude Haiku for voice is a pure text-in/text-out call with a dynamic system prompt. No tool calling means no accidental side effects from a misheard utterance — important when your AI can control your house.
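A sketch of what that tool-free call might look like as a request body. Field names follow the Anthropic Messages API; the helper function, model id, and token budget are illustrative assumptions:

```python
def build_voice_request(system_prompt: str, history: list, utterance: str) -> dict:
    """Request body for a pure text-in/text-out Messages API call.
    Deliberately no "tools" key: a misheard utterance can at worst
    produce wrong text, never a side effect."""
    return {
        "model": "claude-3-5-haiku-latest",  # placeholder model id
        "max_tokens": 300,                   # short replies suit a voice channel
        "system": system_prompt,             # rebuilt fresh each turn
        "messages": history + [{"role": "user", "content": utterance}],
    }


req = build_voice_request("You are Milo.", [], "dim the office lights")
```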
The TTS endpoint the proxy was hitting — Kokoro on Spark 2 — had moved. The port was still open but the service behind it had changed. Rerouted first to Orpheus FastAPI on Mac Studio as a fallback, then today replaced the whole thing with the Will Prowse voice clone on Spark 2 at port 8767.
Three TTS backends in one debugging session. The joy of local infrastructure: when something moves, nothing tells you.
This one was subtle. AirPods would connect, start recording, then silently disconnect after a few seconds. The AVAudioSession configuration had .allowBluetoothA2DP set — which enables Bluetooth output (speakers). But push-to-talk needs the microphone, which requires the HFP (Hands-Free Profile) Bluetooth profile.
The missing flag: .allowBluetooth. A2DP is output-only. HFP is the profile that enables the mic. Without it, iOS connects the AirPods for playback but drops the recording route.
The Anthropic inference path rebuilds the system prompt dynamically on every turn — it injects fresh context from OpenViking (our memory system) so the LLM always has current information about the user. But the conversation history array still had a stale static copy of the system prompt sitting at conversation[0] from initial setup. It wasn't causing errors today, but it was a footgun: two system prompts diverging silently, with the stale one accumulating dead context over time.
Removed it. One system prompt, rebuilt fresh each turn. No ghosts.
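The resulting turn shape, sketched with hypothetical names: the system prompt is assembled fresh each call and never stored in the history, so nothing stale can accumulate at index 0:

```python
def build_turn(conversation: list, utterance: str, memory_context: str) -> dict:
    """Assemble one inference call. The system prompt is rebuilt here
    every turn from fresh memory context and is never appended to
    `conversation` -- so no stale copy can sit at conversation[0]."""
    conversation.append({"role": "user", "content": utterance})
    return {
        "system": "You are Milo. Current context:\n" + memory_context,
        "messages": list(conversation),
    }


history = []
call = build_turn(history, "status?", "battery at 82%")
# history holds only user/assistant turns; the system prompt lives outside it
```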
Phase 3 of MiloBridge is voice. Not stock TTS — a cloned voice.
We picked Will Prowse. He's a solar and off-grid YouTuber with a clean, calm speaking voice — the kind of voice you'd actually want answering questions in your ear all day. The clone uses faster-qwen3-tts with the Qwen3-TTS-0.6B-Base model, zero-shot from a 4.5-second reference clip extracted from one of his videos.
First test output: "Hey James, your voice pipeline is running on the Spark."
Generation time: 1.4 seconds. Real-time factor: 0.46 — meaning the model generates speech roughly twice as fast as you'd speak it. Running on DGX Spark 2's GPU at port 8767.
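RTF here is generation time divided by audio duration, so those two numbers imply roughly three seconds of speech from 1.4 seconds of compute:

```python
generation_time = 1.4                    # seconds of GPU time (measured)
rtf = 0.46                               # real-time factor: generation / audio
audio_duration = generation_time / rtf   # ~3.04 s of speech produced
speedup = 1 / rtf                        # ~2.2x faster than real time
```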
Zero-shot cloning from under five seconds of audio is surprisingly good. Not perfect — there's a slight flatness in prosody on longer sentences, and it occasionally picks up background texture from the reference clip. But for a voice assistant in your ear, it's well past the threshold of usable.
A LoRA fine-tune is running overnight on Spark 2 for better quality. That's Phase 3.5 — not done yet, just cooking.
The Even G2 glasses aren't just an output device. They're the trust layer.
When you speak, the G2 lens shows your transcript immediately — frame type 0x02 fires as soon as STT finishes, before the LLM even starts processing. You see what Milo heard. If the transcript is wrong, you know before it acts on bad input.
After the LLM responds, the response text appears on the lens simultaneously with audio playback through AirPods. Read or listen. Both work.
There's also confidence-aware routing: if STT confidence is low, the system responds with "🔄 Say again?" instead of firing the LLM call. Cheap early exit that avoids wasting inference on garbage input.
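A minimal sketch of that early exit, with hypothetical names; the actual confidence floor isn't given here, so 0.55 is an assumption:

```python
def route_utterance(transcript: str, confidence: float, call_llm, floor: float = 0.55):
    """Confidence-aware routing: below the floor, ask the user to
    repeat instead of spending an LLM call on garbage input.
    The 0.55 floor is an assumed threshold."""
    if confidence < floor:
        return "🔄 Say again?"
    return call_llm(transcript)


reply = route_utterance("mmph lights", 0.31, call_llm=lambda t: "(LLM reply)")
# low confidence -> the LLM lambda is never invoked
```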
Design principle: Any real-time agentic audio interface must include a caption option. Human speaking → show partial transcripts in real time. Agent speaking → display response text simultaneously with audio. This is a baseline requirement, not a feature. Captions enable error correction, mute-and-read as a valid use mode, and transparency about what the agent understood.
...5450, CRC-16/CCITT packet framing

Everything on the LAN. The only external call is Claude Haiku for inference. STT is on-device. TTS is on the Spark. The proxy is on the Mac Studio. If the internet goes down, you lose the LLM — but transcription and display still work.
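Since the G2 framing names CRC-16/CCITT, here is one common variant as a sketch: CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF, no reflection). Whether the glasses use this exact variant is an assumption:

```python
def crc16_ccitt_false(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection,
    no final XOR. A common choice for short-packet framing."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc
```

The standard check value for this variant is 0x29B1 over the ASCII string "123456789", which makes it easy to confirm you've implemented the same flavor as the device on the other end.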
MiloBridge is open source. The pipeline runs on hardware you can buy — a Mac Studio and a DGX Spark — with no cloud dependencies except the LLM call. If you're building something similar with local voice and wearables, we'd like to hear about it.