VoxBar Kyutai: Mimi Codec & Frame-by-Frame Streaming
How Kyutai's neural audio codec processes speech at 12.5 Hz, and why the delayed-streams architecture delivers lower latency than chunk-based models.
The Model: Kyutai STT 1B
Kyutai STT 1B was developed by the Kyutai research lab in Paris — the same team behind the Moshi multimodal model. It's a decoder-only transformer that processes audio through a neural codec, rather than the traditional waveform → spectrogram → encoder pipeline used by most ASR models.
This codec-first approach is what makes Kyutai fundamentally different from every other engine in the VoxBar lineup.
The Mimi Neural Audio Codec
At the heart of Kyutai STT is Mimi — a neural audio codec that compresses raw audio into discrete token streams at 12.5 Hz. Unlike traditional audio compression (which tries to preserve waveform fidelity), Mimi encodes speech into two parallel streams:
- Semantic tokens — meaning and word content, similar to text embeddings
- Acoustic tokens — speaker identity, prosody, background characteristics
This dual-stream approach means the model simultaneously understands what is being said (semantic) and how it's being said (acoustic). The semantic stream drives transcription, while the acoustic stream helps with speaker disambiguation and noise robustness.
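To make the 12.5 Hz frame rate concrete, here is a small arithmetic sketch. It assumes Mimi's documented 24 kHz mono input rate; the codebook layout in the comment is illustrative, not a claim about exact tensor shapes:

```python
SAMPLE_RATE = 24_000   # Mimi operates on 24 kHz mono audio
FRAME_RATE = 12.5      # token frames emitted per second

samples_per_frame = int(SAMPLE_RATE / FRAME_RATE)   # 1920 samples per frame
frame_duration_ms = 1000 / FRAME_RATE               # 80.0 ms per frame

# One second of speech becomes 12.5 frames. Each frame is a small
# column of discrete tokens (one semantic stream plus several
# acoustic streams) rather than a slice of raw waveform.
print(samples_per_frame, frame_duration_ms)  # 1920 80.0
```

So every token frame stands in for 1920 raw samples, which is why the codec representation is so much cheaper to feed through a transformer than a spectrogram.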
Zero-Delay Streaming Architecture
The Kyutai STT 1B model uses a zero-delay alignment between its 32 audio codebook streams and the text output stream. Unlike the multimodal Moshi model, which uses staggered delays between codebooks, the STT variant aligns all streams at offset zero, meaning text tokens can emerge from the very first audio frame with no pipeline fill time.
The result: the model produces text tokens almost immediately as audio arrives. It's not waiting for a 2–5 second chunk like Whisper or Canary. It processes audio frame-by-frame at 12.5 Hz, producing text with sub-80ms latency. The model uses greedy decoding (temperature 0.0) — no sampling randomness, just the highest-confidence prediction at every step.
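The frame-by-frame greedy loop described above can be sketched as follows. `model.step` is a hypothetical stateful wrapper standing in for the real inference call; the point is the shape of the loop, one argmax per 80 ms frame, with no chunk buffering:

```python
import numpy as np

def greedy_step(logits: np.ndarray) -> int:
    # Greedy decoding (temperature 0.0): always take the single
    # highest-confidence token, with no sampling randomness.
    return int(np.argmax(logits))

def stream_transcribe(frames, model):
    # `model` is a hypothetical stateful wrapper: model.step(frame)
    # consumes one 80 ms Mimi frame and returns text-token logits.
    text_tokens = []
    for frame in frames:            # one iteration every 80 ms
        logits = model.step(frame)
        text_tokens.append(greedy_step(logits))
    return text_tokens
```

Because a text token can be emitted on every iteration, latency is bounded by the frame duration rather than by a multi-second chunk window.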
Semantic VAD (Voice Activity Detection)
Traditional VAD systems detect whether there's audio energy present in a signal. Kyutai's VAD goes deeper — it operates on the semantic token stream, detecting whether the audio contains actual speech content rather than just noise or breath sounds.
This semantic-level VAD is remarkably resilient. It continues working during pauses, background noise, and even brief interruptions. When a true silence is detected, the staleness monitor kicks in and auto-recovers the session — you don't need to stop and restart manually.
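The hangover logic behind this behaviour can be sketched in a few lines. Everything here is an assumption for illustration: the threshold value, the `speech_tokens` set, and the idea of a simple membership test standing in for what is really a learned classifier over the semantic stream:

```python
SILENCE_FRAMES = 25  # assumed threshold: ~2 s of non-speech at 12.5 Hz

class SemanticVAD:
    """Toy sketch of a VAD driven by semantic tokens, not audio energy.
    The membership test below is a stand-in for a learned classifier."""

    def __init__(self, speech_tokens: set):
        self.speech_tokens = speech_tokens
        self.silent_frames = 0

    def update(self, semantic_token: int) -> bool:
        if semantic_token in self.speech_tokens:
            self.silent_frames = 0      # speech content resets the counter
        else:
            self.silent_frames += 1     # noise/breath frames accumulate
        # Only a sustained run of non-speech frames counts as true
        # silence, so brief pauses do not end the session.
        return self.silent_frames < SILENCE_FRAMES
```

Because the counter resets on any speech-bearing frame, short pauses and interruptions never trip the silence condition; only a sustained run of contentless frames does.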
Self-Recovery & Session Health
One of the more unusual features in VoxBar Kyutai is the staleness monitor. Because Kyutai is a streaming model, it can occasionally stall if the audio pipeline desynchronises. The staleness monitor detects this condition automatically and triggers a transparent reconnection — text output resumes within a second, often without the user even noticing.
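The detection side of such a monitor can be sketched as a watchdog timer: audio keeps arriving every 80 ms, and if no text token has appeared for some window while audio flows, the session is presumed desynchronised. The threshold value and the `reconnect` callback are assumptions for illustration:

```python
import time

STALE_AFTER_S = 2.0  # assumed threshold: no tokens for 2 s while audio flows

class StalenessMonitor:
    """Watchdog sketch: if audio frames keep arriving but no text has
    been produced for STALE_AFTER_S seconds, trigger a transparent
    reconnection. `reconnect` is a caller-supplied callback."""

    def __init__(self, reconnect, clock=time.monotonic):
        self.reconnect = reconnect
        self.clock = clock
        self.last_token_at = clock()

    def on_text_token(self):
        self.last_token_at = self.clock()   # healthy output resets the timer

    def on_audio_frame(self):
        # Called for every incoming audio frame (~every 80 ms).
        if self.clock() - self.last_token_at > STALE_AFTER_S:
            self.reconnect()                # rebuild the session silently
            self.last_token_at = self.clock()
```

Injecting the clock keeps the watchdog testable; in production it would simply default to a monotonic timer.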
Memory Stability — Runs Indefinitely
A streaming model that runs continuously needs rock-solid memory management. VoxBar Kyutai achieves this through two mechanisms:
- Fixed-capacity attention cache — the model's KV cache is designed as a circular ring buffer, pre-allocated at startup with a fixed capacity. As new audio frames arrive, they overwrite the oldest entries. The cache never grows beyond its initial allocation, regardless of how long the session runs.
- Inference-optimised GPU execution — following Kyutai's official inference pattern, all model operations run in a lightweight execution mode with no unnecessary overhead. This keeps GPU memory usage minimal and predictable throughout the entire session.
The result: VRAM usage locks at ~2.4 GB after model load and stays completely flat — verified across extended sessions with continuous audio processing. No memory leaks, no gradual slowdowns, no eventual crashes.
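The ring-buffer idea behind the fixed-capacity cache can be sketched as follows. The shapes and capacity here are illustrative, not the model's real dimensions; the property being demonstrated is that memory is allocated once and never grows, no matter how many frames are written:

```python
import numpy as np

class RingKVCache:
    """Fixed-capacity attention-cache sketch: pre-allocated at startup,
    new frames overwrite the oldest slots, total memory never grows."""

    def __init__(self, capacity: int, dim: int):
        self.keys = np.zeros((capacity, dim), dtype=np.float32)
        self.values = np.zeros((capacity, dim), dtype=np.float32)
        self.capacity = capacity
        self.count = 0          # total frames written so far

    def append(self, k: np.ndarray, v: np.ndarray):
        slot = self.count % self.capacity   # wrap around: overwrite oldest
        self.keys[slot] = k
        self.values[slot] = v
        self.count += 1

    def memory_bytes(self) -> int:
        # Constant regardless of how long the session runs.
        return self.keys.nbytes + self.values.nbytes
```

A session can append frames indefinitely: `memory_bytes()` returns the same number after the first frame and after the millionth, which is exactly the flat-VRAM behaviour described above.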
System Audio & Overlay Mode
VoxBar Kyutai supports the full VoxBar feature set: system audio capture (transcribe your PC's output directly), Overlay Mode with adjustable transparency and font sizes, mid-text insertion, and voice commands.
The frame-by-frame streaming makes Overlay Mode particularly satisfying — text appears character by character as you speak, with virtually no perceptible delay.
Limitations
Kyutai STT 1B has narrower language coverage than other VoxBar engines: it supports English and French only. It also requires an NVIDIA GPU — there is no AMD or CPU fallback. The CC-BY 4.0 license allows commercial use with attribution.
Technical Specifications
- Model: Kyutai STT 1B
- Architecture: Decoder-only transformer + Mimi neural audio codec
- Codec: Mimi — dual semantic + acoustic token streams at 12.5 Hz
- Technique: Zero-delay streaming with greedy decoding (temp 0.0)
- Languages: English, French
- Latency: <80ms
- VRAM: ~2.4GB (fixed, never grows) — NVIDIA only
- License: CC-BY 4.0
Who Is It For?
VoxBar Kyutai is the best choice if latency is everything. It's ideal for live streaming, live captioning, fast dictation, and scenarios where seeing your words appear instantly matters. If you need more languages or higher accuracy, consider VoxBar Pro or VoxBar AI.