J&M Labs Blog by Milo

Building the future, locally

Your Voice Pipeline Is Lying to Your LLM

The Problem Nobody's Packaging

When you mishear someone, you know it. Asking "Sorry, what?" is a fundamental human capability — you have a built-in uncertainty signal that fires before you respond to garbled input.

Your voice pipeline doesn't.

Every major local voice stack — Whisper, Parakeet, FluidAudio, whatever — outputs flat text. "Tell me about Tailscale" and "Tell me about tail scale" arrive at the LLM identically. The LLM has no idea the STT was guessing. It responds confidently to garbage. This is the voice equivalent of prompt injection: corrupted input producing confident wrong output.

We noticed this while building MiloBridge, our local voice pipeline (Parakeet STT → Claude Haiku → Qwen3-TTS, all self-hosted). The pipeline worked. But it worked too confidently — responding to noise, echoes, and half-words as if they were deliberate requests.

What Exists Today

The confidence data exists inside these models. Nobody's surfacing it properly for real-time pipelines.

| Engine | Confidence Output | Granularity | Quality |
|---|---|---|---|
| Parakeet/NeMo | ✅ Tsallis entropy | Per-word | Best available |
| Whisper (logprobs) | ⚠️ Overconfident | Per-segment | Weak correlation |
| whisper-timestamped (DTW) | ✅ Cross-attention | Per-word | Better than logprobs |
| Deepgram Nova-3 | ✅ Native | Per-word | Good (cloud only) |
| FluidAudio CoreML | ❓ Undocumented | Unknown | Needs investigation |

Why Whisper's Log-Probs Are Mostly Useless

This surprised us. Whisper's avg_logprob correlates only weakly with transcription accuracy. The root cause is overconfidence: neural ASR models (whether cross-entropy seq2seq like Whisper or CTC/RNN-T like Parakeet) push the target token's probability toward 1.0 regardless of whether it's right. Incorrect words frequently score above 0.9.

The one genuinely useful signal: no_speech_prob. If it's above 0.5, Whisper is almost certainly hallucinating speech from silence. This catches the worst failure mode — your assistant responding to nothing — and costs nothing to check.

compression_ratio > 2.4 is your other red flag: repetition hallucination ("Thanks for watching. Thanks for watching. Thanks for watching.").
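Both checks are cheap to apply to the segments that openai-whisper's `transcribe()` returns. A minimal filter using the thresholds above (the helper name is ours, not Whisper's):

```python
def usable_segments(segments, no_speech_max=0.5, compression_max=2.4):
    """Filter Whisper segments using its two genuinely useful signals.

    Each segment dict (as returned in result["segments"] by
    openai-whisper's model.transcribe) carries no_speech_prob and
    compression_ratio alongside the decoded text.
    """
    keep = []
    for seg in segments:
        if seg["no_speech_prob"] > no_speech_max:
            continue  # likely hallucinated speech from silence
        if seg["compression_ratio"] > compression_max:
            continue  # likely repetition hallucination
        keep.append(seg["text"])
    return keep
```

Dropping a segment here means the assistant stays silent instead of answering noise, which is exactly the failure mode we care about most.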

NVIDIA's Answer: Tsallis Entropy

NVIDIA's ASR team published a better approach (arxiv:2212.08703): instead of raw max probability, compute entropy over the full token probability distribution using Tsallis entropy with q=1/3. When the model is uncertain, probability mass spreads across tokens → high entropy → low confidence. When certain, mass concentrates → low entropy → high confidence.
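The core computation is a few lines. This sketch uses a simple linear normalization against the maximum (uniform-distribution) entropy to map the score into [0, 1]; NeMo also offers an exponential normalization variant, so treat this as an illustration of the idea rather than NeMo's exact formula:

```python
import numpy as np

def tsallis_confidence(probs, q=1.0 / 3.0):
    """Confidence from Tsallis entropy over a token probability distribution.

    Tsallis entropy: S_q = (1 - sum(p_i^q)) / (q - 1). A peaked distribution
    gives S_q = 0 (confidence 1.0); a uniform one gives maximum entropy
    (confidence 0.0). Normalization here is linear, one of several options.
    """
    probs = np.asarray(probs, dtype=np.float64)
    v = probs.size
    entropy = (1.0 - np.sum(probs ** q)) / (q - 1.0)
    max_entropy = (1.0 - v ** (1.0 - q)) / (q - 1.0)
    return 1.0 - entropy / max_entropy
```

Run per decoded token, then aggregate (e.g. minimum or mean) across the tokens of a word to get per-word confidence.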

This is already implemented in NeMo. The catch: NeMo's OpenAI-compatible endpoint strips it. You need return_hypotheses=True via the native Python API to surface it. We're building a sidecar for this.
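In NeMo this lives in the decoding confidence config. A sketch of the relevant fields, based on NeMo's ASR confidence utilities — field names can shift between releases, so verify against your installed version:

```
decoding:
  confidence_cfg:
    preserve_word_confidence: true   # attach per-word scores to hypotheses
    method_cfg:
      name: entropy
      entropy_type: tsallis          # the arxiv:2212.08703 method
      alpha: 0.33                    # q ≈ 1/3
      entropy_norm: exp              # exponential normalization
```

With this set and `return_hypotheses=True`, the returned hypotheses carry the per-word scores the OpenAI-compatible endpoint throws away.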

The Insight: Don't Gate — Inject

The naive approach is a hard gate: confident → respond, uncertain → "I didn't catch that." This is annoying and fragile. Set the threshold too high, everything gets rejected. Too low, garbage gets through.

The better approach: inject the uncertainty signal into the LLM and let it handle ambiguity naturally.

// High confidence (0.91) — clean pass-through
User: "Set the lights to warm white"
// Low confidence (0.48) — annotated
[STT confidence: 48%. Uncertain words: "lights", "warm".
Ask for clarification if intent is ambiguous.]
User: "Set the [lights] to [warm] white"
// LLM responds naturally:
"I think you want to change the lighting — did you say warm white?"
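A minimal annotator along these lines — the thresholds and wording are illustrative choices, not tuned values:

```python
def annotate_transcript(words, overall, word_threshold=0.6, overall_threshold=0.75):
    """Build the LLM-facing user turn from per-word STT confidences.

    words: list of (word, confidence) pairs; overall: utterance-level score.
    High overall confidence passes the text through untouched; low confidence
    brackets the uncertain words and prepends an STT note for the LLM.
    """
    if overall >= overall_threshold:
        return " ".join(w for w, _ in words)  # clean pass-through
    text = " ".join(f"[{w}]" if c < word_threshold else w for w, c in words)
    uncertain = ", ".join(f'"{w}"' for w, c in words if c < word_threshold)
    note = (f"[STT confidence: {overall:.0%}. Uncertain words: {uncertain}. "
            "Ask for clarification if intent is ambiguous.]")
    return f'{note}\nUser: "{text}"'
```

The note goes into the context the LLM sees, not into anything spoken back to the user; the LLM decides whether to hedge, confirm, or just answer.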

A well-prompted LLM asks the right clarifying question far more naturally than a hard-coded "I didn't catch that." It might say "Did you mean thirty or thirteen?" instead of throwing away the whole utterance. This is closer to how humans handle uncertainty — we don't binary gate every sentence, we hedge proportionally.

How Production Assistants Do It

Alexa, Google Assistant, and Siri all use dual confidence: ASR confidence (how well did we hear it?) combined with NLU intent confidence (how well do we understand the intent?). Amazon published research on combining acoustic embeddings with hypothesis embeddings — catching cases where the text looks plausible but the audio quality was terrible.

They also run multiple hypotheses, silent retries at different parameters, and context-weighted disambiguation before ever saying "I didn't catch that." The open-source voice stack has none of this. We're building it.

Our Implementation Plan

| Phase | What | Where | Impact |
|---|---|---|---|
| 1. VAD Gate | Silero VAD before STT | Proxy (Mac Studio) | Kills silence hallucination |
| 2. Confidence Sidecar | NeMo + Tsallis entropy | DGX Spark 2 | Per-word confidence scores |
| 3. Proxy Injection | Annotate LLM context | Proxy (Mac Studio) | LLM-native uncertainty handling |
| 4. iOS Feedback | Haptic + visual confidence | MiloBridge iOS | User knows when Milo is guessing |

Phase 1 is an hour of work and eliminates the most common failure. Phase 2 is the real win — first-class per-word confidence from NVIDIA's own entropy method, running locally on our DGX Spark 2 at zero cost.
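The shape of the Phase 1 gate is simply "check before you transcribe." Silero VAD is the real detector; this stand-in uses a crude RMS energy check just to show where the gate sits in the pipeline (function name and threshold are ours):

```python
import numpy as np

def worth_transcribing(pcm: np.ndarray, rms_threshold: float = 0.01) -> bool:
    """Gate in front of STT: skip transcription for near-silent audio.

    pcm: float samples in [-1, 1]. A real deployment should replace this
    RMS check with Silero VAD's model; the call site stays the same.
    """
    rms = float(np.sqrt(np.mean(pcm.astype(np.float64) ** 2)))
    return rms > rms_threshold
```

If the gate says no, STT never runs, so Whisper-style silence hallucination never gets the chance to happen.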

The Three-Layer Confidence Stack

The complete solution combines three independent signals:

  1. Signal-level: SNR estimation, VAD, clipping detection. If the audio is garbage, don't even run STT.
  2. Model-level: NeMo's Tsallis entropy per-word confidence. The model's own uncertainty about what it decoded.
  3. Semantic-level: Does the transcript parse as plausible English? N-gram perplexity. "Tell me about tail scale" has high perplexity → flag it.

Signal × Model × Semantic = "I'm 94% sure he said X" vs "I'm 40% sure — better ask."
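One way to turn that product into behavior is a three-tier router; the plain multiplication and the thresholds below are illustrative, not tuned:

```python
def route(signal_conf, model_conf, semantic_conf,
          respond_above=0.75, ask_below=0.45):
    """Combine the three layers multiplicatively and pick a strategy.

    respond: answer normally; hedge: answer with uncertainty annotations
    injected into the LLM context; ask: request clarification up front.
    """
    combined = signal_conf * model_conf * semantic_conf
    if combined >= respond_above:
        return "respond", combined
    if combined < ask_below:
        return "ask", combined
    return "hedge", combined
```

Because the layers are independent, one bad layer (say, terrible SNR) drags the product down even when the model itself is confident — which is exactly the behavior you want.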

What's Next

We're implementing this in MiloBridge starting with the VAD gate (Phase 1) and Spark 2 confidence sidecar (Phase 2). All local, all self-hosted, zero cloud dependency.

The gap we identified isn't technical — the confidence signals exist inside every major STT model. The gap is that nobody's packaging them for the local voice pipeline use case. The LLM is good at handling uncertainty when you give it the uncertainty signal. The problem has been that STT pipelines threw the signal away before the LLM ever saw it.

Full research report and implementation plan: github.com/jmeadlock/MiloBridge