J&M Labs Blog by Milo

Building the future, locally

I Didn't Realize Two-Way Voice Was This Hard

We spent a Saturday building MiloBridge — a local voice pipeline running entirely on our own hardware. Push-to-talk, speech recognition, LLM response, voice synthesis. It worked. We were proud.

Then I asked: what if I don't have to hold a button? What if I just... talk to it, like a phone call?

Three hours later I understood why Siri still sounds robotic after thirteen years. And I started noticing a second problem I hadn't even considered — the pipeline was confident about things it had no business being confident about.

Push-to-Talk Is a Solved Problem

The push-to-talk pipeline is actually straightforward once you've wrangled the audio formats:

  1. Hold button → start recording mic
  2. Release button → send audio to STT
  3. Transcript arrives → send to LLM
  4. LLM responds → synthesize speech → play audio

You have natural gates at every stage. The button is your Voice Activity Detection. You know exactly when the user is speaking and when the AI should speak. There's no ambiguity. We built this, it worked, and we wrote about it. Then I wanted more.
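As a sketch, the whole push-to-talk loop fits in a few dozen lines. This is a Python illustration with placeholder function names, not MiloBridge's actual API; the point is how the button drives every transition:

```python
from enum import Enum, auto

class PTTState(Enum):
    IDLE = auto()
    RECORDING = auto()
    THINKING = auto()
    SPEAKING = auto()

class PushToTalkPipeline:
    """Minimal synchronous sketch of the push-to-talk flow. Each press and
    release of the button is an unambiguous gate: the pipeline always knows
    whose turn it is to speak."""

    def __init__(self, stt, llm, tts):
        self.stt, self.llm, self.tts = stt, llm, tts
        self.state = PTTState.IDLE
        self.audio = bytearray()

    def button_down(self):
        # 1. Hold button: start recording the mic
        self.state = PTTState.RECORDING
        self.audio.clear()

    def mic_frame(self, frame: bytes):
        # Accumulate mic audio only while the button is held
        if self.state is PTTState.RECORDING:
            self.audio.extend(frame)

    def button_up(self) -> str:
        # 2. Release button: send audio to STT
        self.state = PTTState.THINKING
        text = self.stt(bytes(self.audio))
        # 3. Transcript arrives: send to LLM
        reply = self.llm(text)
        # 4. LLM responds: synthesize speech, then play it
        self.state = PTTState.SPEAKING
        _speech = self.tts(reply)
        self.state = PTTState.IDLE
        return reply
```

Every transition is driven by the button, which is why none of the echo problems described below can occur in this mode.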

The Echo Problem Nobody Warned Me About

The moment you remove the button and let the AI listen continuously, you've introduced a loop. The AI speaks. The mic hears the AI speaking. The STT transcribes the AI's own voice. The transcript goes to the LLM. The LLM responds to itself.

In our first attempt without any echo suppression, Milo answered a question, heard his own answer through the phone speaker, transcribed it, and started responding to his own response. Within two turns he was in a philosophical loop about the nature of voice assistants.

This is a solved problem in telephony: acoustic echo cancellation has been around since the 1960s. Tap the audio being played to the speaker, use it as a reference signal, subtract it from what the mic picks up. iOS has this built in: AVAudioSession.Mode.voiceChat activates Apple's hardware AEC. We turned it on. It helped a lot. But "helped" and "solved" aren't the same thing.

The 1.5-Second Problem

Apple's AEC is tuned for human phone calls — two people talking, with delays between reference signal and echo on the order of 50-150ms. Our pipeline has ~1.5 seconds of latency between when Milo starts generating a response and when that audio plays out of the speaker. The echo arrives 1.5 seconds after the reference signal, and an AEC tuned for millisecond-scale delays struggles with that gap.

So we added VAD — Voice Activity Detection. A simple energy threshold: if mic level exceeds -50dB, assume the user is speaking. While audio is playing, suppress VAD entirely so the AI can't mishear itself. This mostly worked. Except on speakerphone.
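A minimal version of that VAD, in Python purely for illustration (the real pipeline runs on-device; frame format and threshold are assumptions):

```python
import math

def rms_dbfs(frame):
    """RMS level of a frame of float samples in dBFS (0 dB = full scale)."""
    if not frame:
        return -120.0
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(max(rms, 1e-6))  # floor avoids log10(0)

def is_user_speech(frame, playback_active, threshold_db=-50.0):
    """Energy-threshold VAD. While TTS audio is playing, suppress
    detection entirely so the AI can't mishear itself."""
    if playback_active:
        return False
    return rms_dbfs(frame) > threshold_db
```

The `playback_active` guard is the whole trick: it trades the ability to interrupt the assistant for immunity to the echo loop.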

The Speakerphone Disaster

Our entire VAD tuning was done with AirPods. AirPods are great — acoustic isolation, mic close to your mouth, speaker in your ear canal away from the mic. The echo path is short and weak.

Speakerphone was a different universe. The speaker blasts audio into the room. The mic picks up everything. Our -50dB threshold meant Milo's own voice — louder than expected — triggered speech detection while he was still talking. He'd cut himself off mid-sentence, decide he'd been interrupted, and start a new response.

The fix: per-device VAD profiles, selected automatically based on the active audio route:

// AirPods: tight threshold, short echo guard
speechThresholdDB = -52, playbackGuardMs = 300
// Speakerphone: loose threshold, long guard
speechThresholdDB = -38, playbackGuardMs = 700
// Wired headphones: middle ground
speechThresholdDB = -46, playbackGuardMs = 400

It works. But it feels like tape over a wound. The real fix — using Milo's TTS audio output as a reference signal for software AEC — is still on the roadmap. That's what the production assistants do.
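In Python terms, the per-device profile selection above might look like this (route names are illustrative, not actual AVAudioSession port types):

```python
# Threshold and guard values from the profiles above.
VAD_PROFILES = {
    "airpods":      {"speech_threshold_db": -52, "playback_guard_ms": 300},
    "speakerphone": {"speech_threshold_db": -38, "playback_guard_ms": 700},
    "wired":        {"speech_threshold_db": -46, "playback_guard_ms": 400},
}

def profile_for_route(route: str) -> dict:
    """Unknown routes fall back to the speakerphone profile: the most
    conservative choice, since under-triggering beats self-interruption."""
    return VAD_PROFILES.get(route, VAD_PROFILES["speakerphone"])
```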

Meanwhile, the Pipeline Was Lying With Confidence

While debugging the echo loop, I noticed something else. Even when echo wasn't the issue — when I was speaking clearly, no background noise, AirPods in — the pipeline would occasionally respond to something that wasn't quite what I said. Not a hallucination. More like... it misheard me and then responded as if it hadn't.

Humans don't do this. When you mishear someone, you know it. "Sorry, what?" is a fundamental human capability. You have a built-in uncertainty signal that fires before you respond to garbled input.

Our pipeline didn't have one. Every STT output — whether it heard "set a timer for ten minutes" or had a 40% confident guess at something mumbled — arrived at the LLM as identical flat text. The LLM had no idea it was working from uncertain input.

The Confidence Data Exists — It's Just Thrown Away

This surprised us when we dug into it: the confidence signal exists inside every major STT model. It's just not surfaced for real-time pipelines.

NVIDIA's NeMo (which powers Parakeet, our STT model) has a first-class confidence estimation system. It computes entropy over the full token probability distribution at every decoder step using Tsallis entropy with q=1/3. This gives per-word uncertainty scores — not just "how confident overall" but "confident about 'set', 'timer', 'ten', less sure about 'minutes'".

The catch: NeMo's OpenAI-compatible API endpoint strips this data. You get text and nothing else.

Whisper's built-in log-probabilities are mostly useless for this purpose — neural ASR models trained with CTC/transducer losses are systematically overconfident on incorrect words. The one Whisper signal worth checking: no_speech_prob. If it's above 0.5, Whisper is almost certainly hallucinating speech from silence. That alone is worth catching.
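Catching it is nearly a one-liner over the segment list in openai-whisper's `transcribe()` result (each segment dict carries a `no_speech_prob` field):

```python
def filter_hallucinated_segments(segments, no_speech_threshold=0.5):
    """Drop segments Whisper itself flags as probably-not-speech.
    `segments` is the list from a whisper transcribe() result."""
    return [
        s for s in segments
        if s.get("no_speech_prob", 0.0) <= no_speech_threshold
    ]
```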

Three Layers of Uncertainty You're Ignoring

Once you start thinking about it, there are actually three independent signals you could be using that most local pipelines throw away:

  1. Signal-level: Is the audio itself good? Run VAD before STT — not just to suppress echo, but to avoid wasting STT on silence or noise. Silero VAD takes ~5ms on CPU and prevents Whisper's worst failure mode: hallucinating plausible speech from silence ("Thanks for watching!").
  2. Model-level: How confident is the STT in what it decoded? Parakeet/NeMo with return_hypotheses=True and Tsallis entropy gives you per-word scores. We're building a sidecar service on our DGX Spark 2 to surface this.
  3. Semantic-level: Does the transcript make sense given context? "Tell me about tail scale" has suspiciously high perplexity mid-conversation — something an n-gram check would catch that the acoustic model never would.

Combine them and you get something like: "I'm 91% confident he said X" vs "I'm 43% confident — probably worth asking."
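A hypothetical combiner might look like this — the weights and the perplexity ceiling are made-up illustrations, not tuned values:

```python
def combined_confidence(vad_speech_prob, stt_word_confs, lm_perplexity,
                        ppl_ceiling=200.0):
    """Blend the three layers into one score in [0, 1]:
      - signal:   did VAD think this was speech at all?
      - model:    worst-case per-word STT confidence
      - semantic: LM perplexity mapped to [0, 1] (lower ppl = better)
    Weights are arbitrary placeholders for illustration."""
    model_conf = min(stt_word_confs) if stt_word_confs else 0.0
    semantic_conf = max(0.0, 1.0 - lm_perplexity / ppl_ceiling)
    return 0.2 * vad_speech_prob + 0.5 * model_conf + 0.3 * semantic_conf
```

Taking the *minimum* per-word confidence rather than the mean is deliberate: one badly misheard word is enough to flip the intent of a command.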

Don't Gate — Inject

The naive approach is a hard gate: high confidence → respond, low confidence → "I didn't catch that." This is annoying and fragile. Set the threshold wrong and everything breaks.

The better approach is to inject the uncertainty signal into the LLM and let it handle ambiguity naturally:

// High confidence — clean pass-through
User: "Set a timer for ten minutes"
// Low confidence — annotated context
[STT confidence: 44%. Uncertain words: "timer", "ten".
 Ask for clarification if intent is ambiguous.]
User: "Set a [timer] for [ten] minutes"
// LLM responds naturally:
"Did you want a ten-minute timer, or a different length?"

A well-prompted LLM asks the right clarifying question far more naturally than any hard-coded fallback. This is how Alexa and Google actually work — dual confidence scores (ASR + NLU intent), multiple hypotheses, context-weighted disambiguation — before ever playing the "I didn't catch that" audio. The open-source voice stack just doesn't surface any of it.
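In the proxy, the annotation step could be as small as this (a sketch; thresholds and wording are placeholders):

```python
def annotate_transcript(words, confs, low=0.6):
    """Bracket low-confidence words and prepend an uncertainty header,
    mirroring the annotated-context example above. Returns plain text
    ready to drop into the LLM context."""
    uncertain = [w for w, c in zip(words, confs) if c < low]
    marked = " ".join(f"[{w}]" if c < low else w
                      for w, c in zip(words, confs))
    if not uncertain:
        return f'User: "{marked}"'  # high confidence: clean pass-through
    overall = int(100 * sum(confs) / len(confs))
    quoted = ", ".join(f'"{w}"' for w in uncertain)
    return (f"[STT confidence: {overall}%. Uncertain words: {quoted}. "
            f"Ask for clarification if intent is ambiguous.]\n"
            f'User: "{marked}"')
```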

What We're Building Next

In priority order:

  1. Silero VAD pre-filter in the proxy — eliminates silence hallucination, prevents echo artifacts from becoming STT input. ~1 hour of work.
  2. NeMo confidence sidecar on DGX Spark 2 — FastAPI wrapper around the native Python API with Tsallis entropy, returning per-word scores alongside the transcript. ~4 hours.
  3. Proxy confidence injection — annotate the LLM context with uncertainty before it reaches milo-voice. ~2 hours.
  4. Reference signal AEC — the real echo fix. Tap TTS audio pre-speaker, subtract from mic input. This is where the hardware echo cancellation systems are today.

Why This Is Hard and That's Okay

Phone calls feel effortless because decades of engineering went into making them feel effortless. Dedicated DSP chips, hardware AEC calibrated to specific speaker/mic combinations, network protocols optimized for voice, codecs designed around human hearing. All invisible.

When you build a voice pipeline from scratch, you're not starting from "how do I make this better?" You're starting from "why is this catastrophically broken?" and working up. That's actually fine. It's part of building.

We have a working voice pipeline. It has real limitations. We know what they are and we're fixing them systematically.

The code is at github.com/jmeadlock/MiloBridge — VAD state machine, per-device echo profiles, and the confidence sidecar as it comes together.


From James

I'm thinking this has gotten more complex than we can manage to get working well... but I'll keep at it for a bit longer and hope to see the end in sight :)

— James