VoxBar Kyutai: Mimi Codec & Frame-by-Frame Streaming
How Kyutai's neural audio codec processes speech at 12.5 Hz, and why the delayed-streams architecture delivers lower latency than chunk-based models.
The Model: Kyutai STT 1B
Kyutai STT 1B was developed by the Kyutai research lab in Paris — the same team behind the Moshi multimodal model. It's a decoder-only transformer that processes audio through a neural codec, rather than the traditional waveform → spectrogram → encoder pipeline used by most ASR models.
This codec-first approach is what makes Kyutai fundamentally different from every other engine in the VoxBar lineup.
The Mimi Neural Audio Codec
At the heart of Kyutai STT is Mimi — a neural audio codec that compresses raw audio into discrete token streams at 12.5 Hz. Unlike traditional audio compression (which tries to preserve waveform fidelity), Mimi encodes speech into two parallel streams:
- Semantic tokens — meaning and word content, similar to text embeddings
- Acoustic tokens — speaker identity, prosody, background characteristics
This dual-stream approach means the model simultaneously understands what is being said (semantic) and how it's being said (acoustic). The semantic stream drives transcription, while the acoustic stream helps with speaker disambiguation and noise robustness.
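To make the 12.5 Hz frame rate concrete, here is a small arithmetic sketch. It assumes Mimi's documented 24 kHz mono input rate; the codebook layout in the comment is illustrative, not a claim about exact tensor shapes:

```python
SAMPLE_RATE = 24_000   # Mimi operates on 24 kHz mono audio
FRAME_RATE = 12.5      # token frames emitted per second

samples_per_frame = int(SAMPLE_RATE / FRAME_RATE)   # 1920 samples per frame
frame_duration_ms = 1000 / FRAME_RATE               # 80.0 ms per frame

# One second of speech becomes 12.5 frames. Each frame is a small
# column of discrete tokens (one semantic stream plus several
# acoustic streams) rather than a slice of raw waveform.
print(samples_per_frame, frame_duration_ms)  # 1920 80.0
```

So every token frame stands in for 1920 raw samples, which is why the codec representation is so much cheaper to feed through a transformer than a spectrogram.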
Zero-Delay Streaming Architecture
The Kyutai STT 1B model uses a zero-delay alignment between its 32 audio codebook streams and the text output stream. Unlike the multimodal Moshi model, which uses staggered delays between codebooks, the STT variant aligns all streams at offset zero, meaning text tokens can emerge from the very first audio frame with no pipeline fill time.
The result: the model produces text tokens almost immediately as audio arrives. It's not waiting for a 2–5 second chunk like Whisper or Canary. It processes audio frame-by-frame at 12.5 Hz, producing text with sub-80ms latency. The model uses greedy decoding (temperature 0.0) — no sampling randomness, just the highest-confidence prediction at every step.
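The frame-by-frame greedy loop described above can be sketched as follows. `model.step` is a hypothetical stateful wrapper standing in for the real inference call; the point is the shape of the loop, one argmax per 80 ms frame, with no chunk buffering:

```python
import numpy as np

def greedy_step(logits: np.ndarray) -> int:
    # Greedy decoding (temperature 0.0): always take the single
    # highest-confidence token, with no sampling randomness.
    return int(np.argmax(logits))

def stream_transcribe(frames, model):
    # `model` is a hypothetical stateful wrapper: model.step(frame)
    # consumes one 80 ms Mimi frame and returns text-token logits.
    text_tokens = []
    for frame in frames:            # one iteration every 80 ms
        logits = model.step(frame)
        text_tokens.append(greedy_step(logits))
    return text_tokens
```

Because a text token can be emitted on every iteration, latency is bounded by the frame duration rather than by a multi-second chunk window.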
Semantic VAD (Voice Activity Detection)
Traditional VAD systems detect whether there's audio energy present in a signal. Kyutai's VAD goes deeper — it operates on the semantic token stream, detecting whether the audio contains actual speech content rather than just noise or breath sounds.
This semantic-level VAD is remarkably resilient. It continues working during pauses, background noise, and even brief interruptions. When a true silence is detected, the staleness monitor kicks in and auto-recovers the session — you don't need to stop and restart manually.
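The hangover logic behind this behaviour can be sketched in a few lines. Everything here is an assumption for illustration: the threshold value, the `speech_tokens` set, and the idea of a simple membership test standing in for what is really a learned classifier over the semantic stream:

```python
SILENCE_FRAMES = 25  # assumed threshold: ~2 s of non-speech at 12.5 Hz

class SemanticVAD:
    """Toy sketch of a VAD driven by semantic tokens, not audio energy.
    The membership test below is a stand-in for a learned classifier."""

    def __init__(self, speech_tokens: set):
        self.speech_tokens = speech_tokens
        self.silent_frames = 0

    def update(self, semantic_token: int) -> bool:
        if semantic_token in self.speech_tokens:
            self.silent_frames = 0      # speech content resets the counter
        else:
            self.silent_frames += 1     # noise/breath frames accumulate
        # Only a sustained run of non-speech frames counts as true
        # silence, so brief pauses do not end the session.
        return self.silent_frames < SILENCE_FRAMES
```

Because the counter resets on any speech-bearing frame, short pauses and interruptions never trip the silence condition; only a sustained run of contentless frames does.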
Self-Recovery & Session Health
One of the more unusual features in VoxBar Kyutai is the staleness monitor. Because Kyutai is a streaming model, it can occasionally stall if the audio pipeline desynchronises. The staleness monitor detects this condition automatically and triggers a transparent reconnection — text output resumes within a second, often without the user even noticing.
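The detection side of such a monitor can be sketched as a watchdog timer: audio keeps arriving every 80 ms, and if no text token has appeared for some window while audio flows, the session is presumed desynchronised. The threshold value and the `reconnect` callback are assumptions for illustration:

```python
import time

STALE_AFTER_S = 2.0  # assumed threshold: no tokens for 2 s while audio flows

class StalenessMonitor:
    """Watchdog sketch: if audio frames keep arriving but no text has
    been produced for STALE_AFTER_S seconds, trigger a transparent
    reconnection. `reconnect` is a caller-supplied callback."""

    def __init__(self, reconnect, clock=time.monotonic):
        self.reconnect = reconnect
        self.clock = clock
        self.last_token_at = clock()

    def on_text_token(self):
        self.last_token_at = self.clock()   # healthy output resets the timer

    def on_audio_frame(self):
        # Called for every incoming audio frame (~every 80 ms).
        if self.clock() - self.last_token_at > STALE_AFTER_S:
            self.reconnect()                # rebuild the session silently
            self.last_token_at = self.clock()
```

Injecting the clock keeps the watchdog testable; in production it would simply default to a monotonic timer.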
Memory Stability — Runs Indefinitely
A streaming model that runs continuously needs rock-solid memory management. VoxBar Kyutai achieves this through two mechanisms:
- Fixed-capacity attention cache — the model's KV cache is designed as a circular ring buffer, pre-allocated at startup with a fixed capacity. As new audio frames arrive, they overwrite the oldest entries. The cache never grows beyond its initial allocation, regardless of how long the session runs.
- Inference-optimised GPU execution — following Kyutai's official inference pattern, all model operations run in a lightweight execution mode with no unnecessary overhead. This keeps GPU memory usage minimal and predictable throughout the entire session.
The result: VRAM usage locks at ~2.4 GB after model load and stays completely flat — verified across extended sessions with continuous audio processing. No memory leaks, no gradual slowdowns, no eventual crashes.
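The ring-buffer idea behind the fixed-capacity cache can be sketched as follows. The shapes and capacity here are illustrative, not the model's real dimensions; the property being demonstrated is that memory is allocated once and never grows, no matter how many frames are written:

```python
import numpy as np

class RingKVCache:
    """Fixed-capacity attention-cache sketch: pre-allocated at startup,
    new frames overwrite the oldest slots, total memory never grows."""

    def __init__(self, capacity: int, dim: int):
        self.keys = np.zeros((capacity, dim), dtype=np.float32)
        self.values = np.zeros((capacity, dim), dtype=np.float32)
        self.capacity = capacity
        self.count = 0          # total frames written so far

    def append(self, k: np.ndarray, v: np.ndarray):
        slot = self.count % self.capacity   # wrap around: overwrite oldest
        self.keys[slot] = k
        self.values[slot] = v
        self.count += 1

    def memory_bytes(self) -> int:
        # Constant regardless of how long the session runs.
        return self.keys.nbytes + self.values.nbytes
```

A session can append frames indefinitely: `memory_bytes()` returns the same number after the first frame and after the millionth, which is exactly the flat-VRAM behaviour described above.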
System Audio & Overlay Mode
VoxBar Kyutai supports the full VoxBar feature set: system audio capture (transcribe your PC's output directly), Overlay Mode with adjustable transparency and font sizes, mid-text insertion, and voice commands.
The frame-by-frame streaming makes Overlay Mode particularly satisfying — text appears character by character as you speak, with virtually no perceptible delay.
Limitations
Kyutai STT 1B has narrower language coverage than other VoxBar engines: it supports English and French only. It also requires an NVIDIA GPU — there is no AMD or CPU fallback. The CC-BY 4.0 license allows commercial use with attribution.
Technical Specifications
- Model: Kyutai STT 1B
- Architecture: Decoder-only transformer + Mimi neural audio codec
- Codec: Mimi — dual semantic + acoustic token streams at 12.5 Hz
- Technique: Zero-delay streaming with greedy decoding (temp 0.0)
- Languages: English, French
- Latency: <80ms
- VRAM: ~2.4GB (fixed, never grows) — NVIDIA only
- License: CC-BY 4.0
Who Is It For?
VoxBar Kyutai is the best choice if latency is everything. It's ideal for live streaming, live captioning, fast dictation, and scenarios where seeing your words appear instantly matters. If you need more languages or higher accuracy, consider VoxBar Pro or VoxBar AI.