VoxBar Pro Deep Dive — Voxtral Mini 4B Realtime Architecture

The Model: Voxtral Mini 4B Realtime

Released by Mistral AI in early 2026, Voxtral Mini 4B Realtime is a speech-to-text model designed for streaming from the ground up. Unlike chunk-based systems that wait for a fixed window of audio, Voxtral processes speech as it arrives, producing text tokens in real time with latency under 500 ms.

The model's 4 billion parameters are split between two components:

0.6B causal audio encoder: converts raw audio into a stream of audio tokens using sliding-window attention. This is what keeps latency low: it operates frame by frame with no look-ahead, so it never has to wait for future audio before emitting output.
3.4B language model: a 26-layer transformer with grouped-query attention that converts audio tokens into text. This is where context awareness, punctuation, and capitalisation are handled.
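
To make "causal, sliding-window attention" concrete, here is a minimal sketch of the attention mask such an encoder could use. The function name and the window size of 3 are illustrative assumptions, not values taken from the Voxtral model; the point is only that each frame sees itself and a bounded number of earlier frames, and never a later one.

```python
def sliding_window_causal_mask(num_frames: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when frame i may attend to frame j.

    Causal: j <= i (no look-ahead).
    Sliding window: only the most recent `window` frames, including frame i.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Illustrative sizes only; the real model's window is not stated in this article.
mask = sliding_window_causal_mask(num_frames=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because every row depends only on current and past frames, the encoder can emit output for a frame as soon as it arrives, and the bounded window keeps the per-frame cost constant however long the session runs.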

How VoxBar Pro Uses It

VoxBar Pro runs Voxtral inside a Docker container on your machine. The app connects via a local WebSocket — audio goes in, text tokens come back. Nothing touches the internet.
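
The "audio goes in" half of that loop can be sketched as follows. This is an illustration under stated assumptions, not VoxBar Pro's actual protocol: the 16 kHz 16-bit mono format, the 80 ms chunk size, and the `send` callback (standing in for a WebSocket's binary send) are all hypothetical.

```python
from typing import Callable, Iterator

SAMPLE_RATE_HZ = 16_000  # assumed input format, not confirmed by the article
BYTES_PER_SAMPLE = 2     # 16-bit mono PCM, also an assumption


def pcm_chunks(pcm: bytes, chunk_ms: int = 80) -> Iterator[bytes]:
    """Split raw PCM into fixed-duration chunks; the last one may be shorter."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def stream_audio(pcm: bytes, send: Callable[[bytes], None]) -> int:
    """Push each chunk through `send` and return how many chunks were sent."""
    count = 0
    for chunk in pcm_chunks(pcm):
        send(chunk)
        count += 1
    return count


# One second of silence, "sent" into a list instead of a real socket.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
sent: list[bytes] = []
n = stream_audio(one_second, sent.append)
print(n, len(sent[0]), len(sent[-1]))
```

Keeping the transport behind a plain callback means the chunking logic can be exercised without a running container; in a real client the callback would wrap the WebSocket connection to the local server.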

Docker Auto-Management

On first launch, VoxBar Pro automatically detects your GPU vendor (NVIDIA or AMD), pulls the correct Docker image, and starts the container. You don't need to touch a terminal. The app monitors the container's health and restarts it if necessary.

On Windows with NVIDIA, the container uses the NVIDIA Container Toolkit for direct GPU passthrough. On AMD, it uses ROCm via Docker. On Mac, VoxBar Pro runs natively with Apple Metal — no Docker needed at all.
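
As a rough illustration of what this auto-management involves, the sketch below guesses the GPU vendor and assembles a matching `docker run` command. Everything specific here is an assumption: the detection heuristic, the container name, the port, and the image name `example/voxtral:cuda` are hypothetical, and the AMD device flags follow the usual ROCm-on-Linux convention, which may differ from what VoxBar Pro does on Windows.

```python
import shutil


def detect_gpu_vendor() -> str:
    """Guess the vendor from CLI tools on PATH (an assumed heuristic)."""
    if shutil.which("nvidia-smi"):
        return "nvidia"
    if shutil.which("rocm-smi"):
        return "amd"
    return "unknown"


def docker_run_command(vendor: str, image: str, port: int = 8000) -> list[str]:
    """Build a `docker run` argument list for the given GPU vendor."""
    cmd = [
        "docker", "run", "--detach",
        "--restart", "unless-stopped",            # let Docker restart a crashed container
        "--name", "voxtral",                      # hypothetical container name
        "--publish", f"127.0.0.1:{port}:{port}",  # bind to localhost only
    ]
    if vendor == "nvidia":
        cmd += ["--gpus", "all"]                  # requires the NVIDIA Container Toolkit
    elif vendor == "amd":
        cmd += ["--device", "/dev/kfd", "--device", "/dev/dri"]  # usual ROCm devices
    else:
        raise ValueError(f"unsupported GPU vendor: {vendor!r}")
    return cmd + [image]


print(detect_gpu_vendor())
print(" ".join(docker_run_command("nvidia", "example/voxtral:cuda")))
```

Publishing the port on 127.0.0.1 rather than on all interfaces matches the claim that nothing leaves the machine: the server is reachable only from local processes.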

Session Management

Streaming models accumulate state over time. If you talk for hours, the context window fills up and quality degrades. VoxBar Pro solves this with automatic session refresh — it monitors the session health and transparently reconnects before quality drops, giving you effectively unlimited session length with no manual intervention.
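
One simple way to implement this kind of refresh is to budget the amount of audio a session may absorb and reconnect once the budget is used up. The sketch below assumes that policy; the class name, the 480-second default, and the idea of measuring session health purely by streamed audio duration are illustrative guesses, not VoxBar Pro's documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class SessionMonitor:
    """Tracks streamed audio and signals when the session should be refreshed."""

    refresh_after_seconds: float = 480.0  # hypothetical budget per session
    audio_seconds: float = 0.0
    refreshes: int = 0

    def add_audio(self, seconds: float) -> bool:
        """Record streamed audio; return True if a refresh was triggered."""
        self.audio_seconds += seconds
        if self.audio_seconds >= self.refresh_after_seconds:
            # A real client would open a new connection here, then close the old one.
            self.audio_seconds = 0.0
            self.refreshes += 1
            return True
        return False


# Small numbers for illustration: refresh after every 8 seconds of audio.
monitor = SessionMonitor(refresh_after_seconds=8.0)
events = [monitor.add_audio(2.0) for _ in range(10)]
print(events, monitor.refreshes)
```

A production version would likely also wait for a pause in speech before swapping connections, so that no word is split across two sessions.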

System Audio Capture

This is a feature that very few local transcription apps offer: VoxBar Pro can capture your PC's system audio directly, without any virtual cables or third-party software. Switch to system audio mode and it transcribes whatever is playing — Zoom meetings, YouTube videos, podcasts, webinars.

The system audio is processed by the same Voxtral model, locally, with the same privacy guarantees as microphone input. This makes VoxBar Pro a genuine meeting transcription tool that works entirely offline.

Smart Autocorrect

When you stop speaking, VoxBar Pro's smart autocorrect kicks in. It cleans up double spaces, fixes capitalisation after sentence boundaries, removes filler words, and normalises formatting. This is done client-side in the textbox — no extra model inference required.
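
A cleanup pass of this kind needs nothing more than a few regular expressions. The following is a minimal sketch of the steps just listed, not the app's real rule set: the filler list and the exact patterns are assumptions chosen for illustration.

```python
import re

FILLERS = ("um", "uh", "erm")  # illustrative list, not VoxBar Pro's actual one

# A filler word, an optional trailing comma, and any whitespace after it.
_FILLER_RE = re.compile(r"\b(?:" + "|".join(FILLERS) + r")\b,?\s*", re.IGNORECASE)


def autocorrect(text: str) -> str:
    """Remove fillers, collapse whitespace, and capitalise sentence starts."""
    text = _FILLER_RE.sub("", text)
    text = re.sub(r"\s{2,}", " ", text).strip()  # collapse runs of whitespace
    text = re.sub(r"\s+([,.!?])", r"\1", text)   # no space before punctuation
    # Capitalise the first letter of the text and of each new sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


print(autocorrect("um, so  this works. uh it is  fast"))
```

The word boundaries in the filler pattern matter: they stop "um" from matching inside words such as "umbrella" or "drum".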

Overlay Mode

Overlay Mode turns VoxBar Pro into a transparent floating interface that sits on top of any application. Key features:

Adjustable transparency — from fully opaque to nearly invisible
Block and line modes — choose between a multi-line textbox or a single-line entry for different use cases
Mid-text insertion — click anywhere in your transcribed text and speak to insert new text at that position
Voice commands — highlight text and say "delete" to remove it; say punctuation marks to insert them
Font size control — zoom in/out for readability
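
Mid-text insertion and the voice commands can be modelled as operations on a text buffer with a cursor and an optional selection. The sketch below is a hypothetical model of that behaviour; the class, the command vocabulary, and the punctuation table are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative spoken-punctuation table; the app's real vocabulary is not documented here.
PUNCTUATION = {"comma": ",", "full stop": ".", "question mark": "?"}


@dataclass
class TextBuffer:
    text: str = ""
    cursor: int = 0
    selection: Optional[Tuple[int, int]] = None  # (start, end) of highlighted text

    def apply(self, spoken: str) -> None:
        """Apply one recognised utterance as a command or as inserted text."""
        word = spoken.strip().lower()
        if self.selection is not None and word == "delete":
            start, end = self.selection
            self.text = self.text[:start] + self.text[end:]
            self.cursor, self.selection = start, None
            return
        insert = PUNCTUATION.get(word, spoken)  # spoken punctuation becomes a symbol
        self.text = self.text[:self.cursor] + insert + self.text[self.cursor:]
        self.cursor += len(insert)


buf = TextBuffer(text="Hello world", cursor=5)
buf.apply(" brave new")   # inserted at the cursor, mid-text
buf.apply("comma")        # spoken punctuation
buf.selection = (5, 11)   # user highlights " brave"
buf.apply("delete")
print(buf.text)
```

Note that "delete" is only treated as a command while text is highlighted, so the word can still be dictated normally at other times.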

Technical Specifications

Model: Voxtral Mini 4B Realtime (Mistral AI)
Parameters: ~4B total (3.4B LM + 0.6B encoder)
Architecture: Causal audio encoder → streaming LM with grouped-query attention
WER: 4.9% on FLEURS (English, at 480ms delay)
Languages: 13+ with native multilingual support
VRAM: ~8–12GB recommended
License: Apache 2.0 (commercial use permitted; redistributions must retain the license text and copyright notices)
Platforms: Windows (NVIDIA/AMD via Docker), macOS (Apple Silicon native)
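
The VRAM figure can be sanity-checked with a back-of-the-envelope calculation. This sketch assumes the weights are held in a 16-bit format (2 bytes per parameter), which the specifications above do not state; if the container uses a different precision, the numbers change accordingly.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Parameter counts from the specifications; 2 bytes per parameter is an assumption.
encoder_gib = weight_memory_gib(0.6e9, 2)
lm_gib = weight_memory_gib(3.4e9, 2)
total_gib = encoder_gib + lm_gib
print(f"encoder {encoder_gib:.2f} GiB + LM {lm_gib:.2f} GiB = {total_gib:.2f} GiB")
```

Under that assumption the weights alone come to roughly 7.5 GiB, close to the lower end of the recommended range. The remaining headroom would plausibly go to the attention cache, activations, and runtime overhead.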

Who Is It For?

VoxBar Pro is the best choice if you want the highest possible accuracy with true real-time delivery. It's ideal for live presentations, meeting transcription, content creation, and professional dictation. The Docker requirement adds a small setup step on Windows, but the automatic management makes it painless.

If you want similar accuracy without Docker, consider VoxBar AI — it uses NVIDIA's Canary Qwen 2.5B and runs natively.
