💡 Product Philosophy

Why Six Models?

Most transcription apps ship one engine. VoxBar™ ships six. Here's why — and how to pick the right one.

Feb 22, 2026 · 5 min read

The One-Size-Fits-All Problem

Most speech-to-text products make a simple bet: pick one model, ship it, charge for it. Otter.ai uses a proprietary cloud model. Dragon uses a single on-device engine. Whisper wrappers use… Whisper.

The problem is that no single model is best at everything. A model that's great for accuracy needs a big GPU. A model that runs on any CPU sacrifices quality. A streaming model that gives you instant text might accumulate state over long sessions. A chunk-based model can run indefinitely, but it makes you wait for each chunk to close, so per-word latency is higher.
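The latency gap is easy to put rough numbers on. A back-of-envelope sketch — the chunk size, frame size, and inference times below are illustrative assumptions, not measurements of any VoxBar™ engine:

```python
import math

def chunked_latency(word_t: float, chunk_s: float = 5.0, infer_s: float = 0.5) -> float:
    """Seconds until a word spoken at word_t appears: its chunk must close, then inference runs."""
    chunk_end = (math.floor(word_t / chunk_s) + 1) * chunk_s
    return chunk_end - word_t + infer_s

def streaming_latency(frame_s: float = 0.08, infer_s: float = 0.02) -> float:
    """A streaming decoder emits per frame: roughly one frame of audio plus per-frame inference."""
    return frame_s + infer_s

# A word spoken 1s into a 5s chunk waits ~4.5s; a streaming frame lands in ~0.1s.
print(chunked_latency(1.0), streaming_latency())
```

The chunked number is worst-case-ish for a word early in a chunk; the trade is that the chunk model sees the whole chunk as context before committing to text.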

We tested six engines head-to-head on the same passages — everyday conversation, technical jargon, numbers, homophones, rapid speech, proper nouns. No single model won every test.

The Trade-Off Triangle

Every speech model makes trade-offs between three things:

  • Accuracy — how correctly it transcribes your words
  • Speed — how quickly text appears (latency)
  • Hardware — what GPU/CPU it needs to run

You can optimise for two, but the third always suffers. High accuracy + low latency? You need a big GPU. Low hardware requirements + good accuracy? Latency goes up. Instant results on any CPU? Accuracy drops.

The Lineup

🏆 VoxBar™ Pro — Voxtral 4B ($49)

The flagship. Mistral AI's 4-billion-parameter streaming model delivers the highest accuracy (4.9% WER) with true real-time delivery. It needs 8-12GB VRAM and Docker — the most demanding setup, but the best result.

⭐ VoxBar™ AI — Canary Qwen 2.5B ($29)

NVIDIA's Speech-Augmented Language Model, pairing a FastConformer encoder with Alibaba's Qwen backbone. Chunk-based instead of streaming, which means slightly higher latency per word — but each chunk is fully context-aware. Runs natively (no Docker) on ~6GB VRAM. The "set and forget" option: it runs indefinitely with zero quality degradation.

⚡ VoxBar™ Kyutai — STT 1B ($39)

Kyutai's neural codec model. Frame-by-frame streaming at sub-80ms latency — the fastest text delivery in the lineup. Uses the Mimi codec's dual token streams for a completely different approach to speech recognition. English and French only.

🚀 VoxBar™ Nemotron — 0.6B ($29)

NVIDIA's ultra-lightweight streaming model. The smallest engine at 600M parameters and ~2GB VRAM. Streaming-native via RNNT decoder. Ideal for running alongside games or other GPU-heavy applications. English only.

💪 VoxBar™ Whisper+ — distil-large-v3 ($19)

The best value. OpenAI's Whisper, distilled and tuned with a 3-layer anti-hallucination stack. Works on NVIDIA, AMD, and CPU — the broadest hardware support. 99 languages. The engine for everyone.

🆓 VoxBar™ Free — Whisper base ($0)

Zero barrier to entry. A basic Whisper engine that runs on any computer with no GPU required. Try VoxBar™, see if voice transcription fits your workflow, then upgrade when you're ready.

How We Tested

We ran the same seven test passages through all six engines — everyday conversation, numbers and dates, homophones, technical terms, rapid speech, proper nouns, and medical terminology. Each engine was graded on completeness, accuracy, and error severity.
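Accuracy figures like the 4.9% WER quoted for VoxBar™ Pro are word error rate: word-level edits (substitutions, deletions, insertions) divided by the number of reference words. A minimal sketch of the standard computation — not VoxBar™'s benchmark harness, just the metric itself:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Edit-distance DP table over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # → 0.25
```

One substituted word out of four reference words gives 25% WER — which is why a 4.9% figure means roughly one errored word in twenty.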

VoxBar™ Pro scored S tier (near-perfect across all tests). VoxBar™ AI scored A tier after tuning. Whisper+ scored A- tier. The others scored lower: acceptable trade-offs at their price points, not flagship accuracy.

Picking the Right One

  • Best accuracy, any language: VoxBar™ Pro ($49)
  • Best accuracy, no Docker: VoxBar™ AI ($29)
  • Fastest text delivery: VoxBar™ Kyutai ($39)
  • Lowest GPU usage: VoxBar™ Nemotron ($29)
  • Any hardware, best value: VoxBar™ Whisper+ ($19)
  • Free, no commitment: VoxBar™ Free ($0)

Six models isn't about complexity. It's about giving you the right tool for your hardware and your budget. One of these engines is the right fit for you — and you only pay once.

Find your engine

Compare all tiers side by side.
