How VoxBar built 15 engines, tested 8 head-to-head, and discovered what the benchmarks don't tell you.
Why We Did This
If you search for "best open-source speech-to-text model" today, you'll find benchmark papers, Hugging Face leaderboards, and scattered Reddit threads. What you won't find is someone who actually built working applications around every serious contender and tested them all under identical, real-world conditions.
That's because it's a ridiculous amount of work.
We know — we did it anyway.
VoxBar started as a single transcription app powered by one model. But we kept running into the same question our customers would ask: "Which model is actually best for me?" The benchmark papers couldn't answer that. Word Error Rate on LibriSpeech doesn't tell you what happens when a German philosopher says "simulacrum" at full speed, or when a British podcaster drops profanity mid-sentence, or when someone reads a list and the model can't count to seven.
So we decided to find out for ourselves.
The Search: 15 Models, 6 Architectures
We scoured every major model hub — Hugging Face, NVIDIA NGC, GitHub — for English-capable speech-to-text models with open or permissive licensing. Our criteria were simple:
- Must run locally — no cloud APIs, no data leaving the machine
- Must have a usable license — Apache 2.0, MIT, CC-BY-4.0, or equivalent
- Must be a serious contender — published by a real research lab, not a weekend project
We found 15 models worth building engines for. If there were more, we would have built them — but these are genuinely all the English STT models currently available that meet our bar.
| # | Model | Creator | Size | Architecture | License |
|---|-------|---------|------|--------------|---------|
| 1 | Voxtral Realtime 4B | Mistral AI | 4B | WebSocket Streaming | Apache 2.0 |
| 2 | Kyutai STT 2.6B | Kyutai (Paris) | 2.6B | Streaming Token-by-Token | CC-BY-4.0 |
| 3 | Canary Qwen 2.5B | NVIDIA | 2.5B | Chunked Generative (SALM) | CC-BY-4.0 |
| 4 | Qwen3-ASR 1.7B | Alibaba (Qwen) | 1.7B | Chunked Generative | Apache 2.0 |
| 5 | GLM-ASR-Nano 1.5B | Z.ai (Zhipu AI) | 1.5B | Chunked Generative | Apache 2.0 |
| 6 | Kyutai STT 1B | Kyutai (Paris) | 1B | Streaming Token-by-Token | MIT |
| 7 | Canary 1B Flash | NVIDIA | 1B | Chunked Encoder-Decoder | CC-BY-4.0 |
| 8 | Distil-Whisper V3 | OpenAI (distilled) | 756M | Chunked Encoder-Decoder | MIT |
| 9 | Faster-Whisper Large-V3 | OpenAI | 1.5B | Chunked Encoder-Decoder | MIT |
| 10 | Faster-Whisper (base) | OpenAI | ~244M | Chunked Encoder-Decoder | MIT |
| 11 | Parakeet TDT 0.6B v2 | NVIDIA | 0.6B | Chunked CTC | CC-BY-4.0 |
| 12 | Nemotron Speech 0.6B | NVIDIA | 0.6B | Chunked CTC | NVIDIA OML |
| 13 | Qwen3-ASR 0.6B | Alibaba (Qwen) | 0.6B | Chunked Generative | Apache 2.0 |
| 14 | SenseVoice-Small | Alibaba (FunAudioLLM) | 234M | Chunked CTC | Apache 2.0 |
| 15 | Moonshine v2 | Useful Sensors | ~50M | Event-Driven Streaming | MIT |
We built a full desktop application around every single one of them. Same UI, same audio pipeline, same start/stop behaviour. The only thing that changes between VoxBar Pro and VoxBar Moonshine is the engine file.
The Arena: Equal Conditions, Real Content
Academic benchmarks use clean, pre-recorded datasets like LibriSpeech — studio-quality audio, read from scripts, with no accents, no profanity, no interruptions. They measure Word Error Rate on audio that sounds nothing like what people actually transcribe.
We wanted something different. We picked five YouTube clips that represent what real users actually throw at a transcription tool:
| Test | Content | Why This Matters |
|------|---------|------------------|
| T1: Reading a List | Motivational — 7 numbered items | Can the model count? Handle structured content? |
| T2: Lecture | Yanis Varoufakis on economics | Numbers ($35 trillion, 85%, 1%), proper nouns, complex sentences |
| T3: Podcast | Richard Dawkins on Darwin | Nested philosophy, multiple proper nouns (Dennett, Wallace, South America) |
| T4: Accent (German) | Joscha Bach — AI philosophy | Dense vocabulary + accent: "simulacrum", "retina", "virtual reality" |
| T5: Accent (American) | Daniel Dennett — religion | Narrative, stream-of-consciousness, "hallucinatory presence" |
The Dual-Mode Rule
Every test was run twice:
System Audio (WASAPI loopback) — We played the exact same YouTube video through the system audio capture. Every model heard the identical audio signal. This is the "controlled experiment" — perfectly repeatable. Anyone can reproduce our results by playing the same clips.
Live Microphone — We played the same clip through speakers and captured it with a desk mic. Same room, same distance, same volume. This is the "real-world test" — it shows how the model handles room acoustics, speaker resonance, and the kind of audio quality most users actually deal with.
This dual-mode approach revealed something the benchmarks never show: some models are dramatically better on system audio than mic audio, and vice versa. GLM ASR, for example, scores 8.6 on system audio but drops to 7.4 on mic — a 1.2 point gap. Canary 2.5B scores 8.4 on both — perfectly balanced. That matters for real users.
The Hardware
Every test ran on the same machine:

- GPU: NVIDIA RTX 4080 SUPER (16GB VRAM)
- OS: Windows 11
- Audio Input A: WASAPI loopback (system audio capture)
- Audio Input B: Default desk microphone
- All tests: same session, same conditions
No cherry-picking. No "best of three." One run per mode, scored as-is.
The Surprise: Size Doesn't Predict Quality
Before testing, we assumed the rankings would roughly follow model size. A 4B model should beat a 2.6B model, which should beat a 1.5B model, and so on.
We were wrong.
Nemotron — 0.6 billion parameters — finished 4th overall, ahead of GLM at 1.5 billion and, on system audio, even ahead of Canary at 2.5 billion. It scored 8.7 on system audio, higher than models four times its size. At just ~2GB of VRAM, it offers the best accuracy-per-parameter ratio of anything we tested.
The reason lies in architecture.
Three Architecture Families
Through testing, we discovered that how a model processes audio matters far more than how many parameters it has. Every model we tested falls into one of three families:
1. Streaming (Real-Time): Models like Kyutai and Voxtral process audio frame-by-frame as it arrives. There are no chunks, no boundaries, no overlap problems. The model hears the audio and produces text continuously, like a human listener. The trade-off is that streaming models need to hold context in memory, which means they can hallucinate when the audio goes silent.

2. Chunked (Batch): Models like Whisper, Canary, Nemotron, GLM, and Qwen ASR receive audio in fixed-length segments — typically 2 to 8 seconds — and transcribe each chunk independently. This is simpler to implement but creates a unique class of errors: boundary artifacts. When a sentence spans two chunks, the model may stutter, repeat, or garble the words at the boundary.

3. Event-Driven: Moonshine uses a voice activity detector to decide when to transcribe, processing audio only when it detects speech. This works well on paper but struggles with continuous speech: if the VAD misses a section, that audio is lost forever.
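To make that failure mode concrete, here's a toy sketch of an event-driven gate. The energy threshold stands in for a real VAD (this is not Moonshine's actual detector): any frame the gate classifies as silence never reaches the transcriber at all.

```python
import math

def vad_gate(frames, threshold=0.02):
    """Toy energy-based VAD: keep only frames whose RMS exceeds the
    threshold. Quieter frames are dropped before they ever reach the
    transcriber, which is why a missed detection loses audio for good."""
    kept, dropped = [], []
    for frame in frames:
        energy = math.sqrt(sum(s * s for s in frame) / len(frame))
        (kept if energy > threshold else dropped).append(frame)
    return kept, dropped

# A loud frame passes; a quiet trailing word falls under the threshold
# and is never transcribed.
loud = [0.5] * 160
quiet = [0.005] * 160
kept, dropped = vad_gate([loud, quiet])
```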
What This Means In Practice
The architecture explains almost every error pattern we observed:
| Error Type | Which Architecture | Example |
|------------|--------------------|---------|
| Boundary stutters | Chunked | "the Supreme Court's The Supreme Court struck down" |
| Tail hallucinations | Chunked | "the city of the United States" (Nemotron, when chunk ends mid-word) |
| Silence hallucinations | Streaming | Kyutai producing phantom words during pauses |
| All-lowercase output | Streaming | Kyutai system audio — model doesn't capitalise |
| Repeated endings | Chunked generative | Canary repeating 20 words at end of a test |
| Sentence fragmentation | Chunked generative | GLM inserting periods mid-sentence everywhere |
The biggest insight: most "model errors" are actually architecture errors. When we fixed chunking strategy and post-processing, models jumped dramatically in score — without changing a single model weight.
The Tuning Breakthroughs
The first round of raw, untuned scores was sobering. Nemotron scored 5.4. GLM scored 5.4. Models that sounded impressive on paper were producing stuttering, fragmented, borderline-unusable transcripts.
But we noticed something. The content was usually right — the models were hearing the correct words. The mess was in how the output was assembled. Stutters, repeated phrases, sentences shattered into fragments. These weren't model failures. They were engineering failures.
So we started fixing them.
Fix 1: Smart Chunking (Nemotron — 5.4 → 8.35)
Nemotron's original setup used fixed 2-second chunks. The problem: speech doesn't pause every 2 seconds. Words were being cut in half, and the model would hallucinate endings for the half-word it received.
The fix was silence-aware chunking — instead of cutting every 2 seconds, we wait for a natural pause in speech (up to 8 seconds), then cut. This single change eliminated tail hallucinations like "the city of the United States" (which the model generated when a chunk ended on the word "Socrates").
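Here's a minimal sketch of the idea. The sample rate, frame size, and energy threshold are illustrative stand-ins, not our production values:

```python
import math

SR = 16_000          # sample rate (an assumption for this sketch)
MIN_CHUNK = 2 * SR   # never cut before 2 s of audio
MAX_CHUNK = 8 * SR   # always cut by 8 s, pause or no pause
FRAME = 320          # 20 ms analysis frames
SILENCE_RMS = 0.01   # toy energy threshold standing in for a real VAD

def rms(frame):
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def silence_aware_chunks(audio):
    """Cut each chunk at the first quiet frame after MIN_CHUNK samples,
    falling back to a hard cut at MAX_CHUNK."""
    chunks, start = [], 0
    while start < len(audio):
        end = min(start + MAX_CHUNK, len(audio))
        cut = end
        for i in range(start + MIN_CHUNK, end - FRAME, FRAME):
            if rms(audio[i:i + FRAME]) < SILENCE_RMS:
                cut = i + FRAME  # cut inside the pause, not mid-word
                break
        chunks.append(audio[start:cut])
        start = cut
    return chunks

# Example: 3 s of "speech", a 100 ms pause, then ~6.9 s more speech.
audio = [0.1] * (3 * SR) + [0.0] * (SR // 10) + [0.1] * (69 * SR // 10)
chunks = silence_aware_chunks(audio)
# The first cut lands inside the pause (~3 s in), not at a fixed 2 s mark.
```

The hard 8-second ceiling matters: on speech with no pauses at all, the fallback cut behaves exactly like the old fixed chunking, so the worst case is never worse than before.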
Fix 2: Sliding Window Dedup (Nemotron, GLM)
Chunked models with audio overlap will sometimes transcribe the same phrase twice — once at the end of chunk N, again at the start of chunk N+1. We built a sliding window that compares the last 12 words of the previous output against the first 12 words of the new output, stripping exact and near-exact duplicates.
This killed the "the Supreme Court's The Supreme Court struck down" pattern instantly.
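The core of the dedup is plain string logic. This sketch normalises case and trailing punctuation only; the near-exact matching described above is fuzzier in practice:

```python
def dedup_boundary(prev_text, new_text, window=12):
    """Strip from new_text any leading words that repeat the tail of
    prev_text, comparing up to `window` words, longest overlap first."""
    norm = lambda w: w.lower().strip(".,!?;:")
    tail = [norm(w) for w in prev_text.split()[-window:]]
    head = new_text.split()
    for n in range(min(window, len(head), len(tail)), 0, -1):
        if [norm(w) for w in head[:n]] == tail[len(tail) - n:]:
            return " ".join(head[n:])  # drop the duplicated prefix
    return new_text

# Chunk N ends with "...struck down the law"; chunk N+1 re-transcribes
# the overlap region before continuing.
prev = "and then the court struck down the law"
new = "struck down the law in a 5-4 decision"
merged = dedup_boundary(prev, new)
# merged == "in a 5-4 decision"
```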
Fix 3: Cross-Sentence Stutter Cleaning (Qwen ASR → GLM → Nemotron)
This was the discovery that changed everything. While fixing Qwen ASR's terrible initial output, we built a stutter cleaner that catches repeated words across sentence boundaries — things like "Pritzker. Pritzker" or "Foreign. Foreign policy".
The original basic pattern only caught adjacent "word word" repeats. The enhanced version also catches repeats separated by punctuation. When we ported this fix from Qwen ASR to GLM and Nemotron, both models immediately improved.
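In sketch form, a punctuation-aware pattern looks something like this (a simplified stand-in for our cleaner, not the shipped regex). Note the word class includes apostrophes, so contractions like "we're we're" are caught too:

```python
import re

# One word (apostrophes included), optional sentence punctuation,
# whitespace, then the same word again via a backreference.
STUTTER = re.compile(r"\b([\w']+)[.,!?]?\s+\1\b", re.IGNORECASE)

def clean_stutters(text):
    """Collapse repeated words, even across a sentence boundary.
    Applied repeatedly so triple repeats collapse too. Toy caveat:
    this also flattens legitimate doubles like 'had had'."""
    prev = None
    while prev != text:
        prev = text
        text = STUTTER.sub(r"\1", text)
    return text

clean_stutters("Pritzker. Pritzker took office")  # "Pritzker took office"
clean_stutters("we're we're done")                # "we're done"
```

The caveat in the docstring is real: any rule this blunt needs guardrails before it touches legitimate repetition in natural speech.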
Fix 4: The Loopback Lookahead Buffer (Kyutai 1B)
Kyutai 1B had a bizarre problem: system audio was worse than mic audio. Words like "loneliness" came out as "lonely bliss" and "confront" became "can troll."
The fix was counterintuitive — we added a 400ms buffer that delays processing by less than half a second. This gives the streaming model time to hear the full word before committing to text. System audio went from worst mode to best mode overnight. Every single mishearing was fixed.
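Conceptually, the lookahead buffer is just a delay queue: frames go in, and each frame comes out only after enough newer audio has arrived behind it. A sketch, assuming 20 ms frames (so 400 ms of lookahead is 20 frames):

```python
from collections import deque

class LookaheadBuffer:
    """Delay frames so the streaming decoder hears the end of a word
    before committing to text. Frame duration is an assumption here."""
    def __init__(self, lookahead_frames=20):  # 20 x 20 ms = 400 ms
        self.lookahead = lookahead_frames
        self.queue = deque()

    def push(self, frame):
        """Add a frame; return the frame that is now old enough to
        process, or None while the buffer is still filling."""
        self.queue.append(frame)
        if len(self.queue) > self.lookahead:
            return self.queue.popleft()
        return None

buf = LookaheadBuffer(lookahead_frames=20)
out = [buf.push(i) for i in range(25)]
# The first 20 pushes return None; frame 0 emerges on push 21.
```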
Fix 5: Hallucination Spiral Reset (Kyutai 1B)
When Kyutai's streaming model hallucinates, the hallucinated text poisons its own context window (KV cache), causing more hallucinations in a cascade. We built a circuit breaker: if the model hallucinates three times in 30 seconds, we automatically reset its streaming pipeline. The cascade breaks and the model starts fresh.
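The circuit breaker itself reduces to counting events inside a sliding time window. A sketch of that logic (detecting a hallucination in the first place is assumed to happen upstream):

```python
import time
from collections import deque

class HallucinationBreaker:
    """Trip after `limit` hallucination events within `window` seconds,
    signalling that the streaming pipeline should be reset."""
    def __init__(self, limit=3, window=30.0, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.events = deque()

    def record(self):
        """Log one hallucination; return True if a reset should fire."""
        now = self.clock()
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) >= self.limit:
            self.events.clear()  # count fresh after the reset
            return True
        return False

fake_clock = iter([0.0, 10.0, 20.0]).__next__  # deterministic demo clock
breaker = HallucinationBreaker(clock=fake_clock)
results = [breaker.record() for _ in range(3)]
# Third event within 30 s trips the breaker: [False, False, True]
```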
The Score Jumps
| Model | Before Tuning | After Tuning | Improvement |
|-------|:---:|:---:|:---:|
| Nemotron 0.6B | 5.4 | 8.35 | +2.95 |
| GLM 1.5B | 5.4 | 8.0 | +2.6 |
| Kyutai 1B | 7.5 | 8.1 | +0.6 |
Nemotron's jump was the largest — from joint last to 4th place, a whisker behind Canary at four times its size and ahead of it on system audio. And we didn't change a single model weight. Every fix was in the audio pipeline and post-processing.
The Final Rankings
After tuning every model to its best possible performance, here's where they landed.
System Audio (Controlled — Identical Audio Input)
| Rank | Model | T1: List | T2: Lecture | T3: Podcast | T4: German | T5: American | AVG |
|---|---|---|---|---|---|---|---|
| 🥇 | VoxBar Pro (4B) | 9.5 | 9.5 | 10.0 | 10.0 | 9.5 | 9.7 |
| 🥈 | Kyutai 2.6B | 9.5 | 9.5 | 9.5 | 10.0 | 8.5 | 9.4 |
| 🥉 | Nemotron (0.6B) | 8.5 | 8.0 | 9.5 | 9.0 | 8.5 | 8.7 |
| 4 | GLM ASR (1.5B) | 8.5 | 8.0 | 9.0 | 9.0 | 8.5 | 8.6 |
| 5 | Canary (2.5B) | 7.5 | 8.5 | 8.0 | 9.5 | 8.5 | 8.4 |
| 5 | Kyutai 1B | 9.0 | 7.5 | 8.5 | 9.0 | 8.0 | 8.4 |
| 7 | Whisper+ (Distil-V3) | 6.5 | 7.5 | 7.0 | 8.5 | 8.0 | 7.5 |
| 8 | Qwen ASR (1.7B) | 7.0 | 7.5 | 7.5 | 6.5 | 6.0 | 6.9 |
Microphone Audio (Real-World — Room Acoustics)
| Rank | Model | T1: List | T2: Lecture | T3: Podcast | T4: German | T5: American | AVG |
|---|---|---|---|---|---|---|---|
| 🥇 | VoxBar Pro (4B) | 9.5 | 9.5 | 10.0 | 10.0 | 8.5 | 9.5 |
| 🥈 | Kyutai 2.6B | 9.5 | 9.5 | 9.5 | 10.0 | 8.5 | 9.4 |
| 🥉 | Canary (2.5B) | 8.5 | 8.5 | 7.5 | 9.0 | 8.5 | 8.4 |
| 4 | Nemotron (0.6B) | 9.0 | 7.0 | 8.0 | 8.5 | 7.5 | 8.0 |
| 5 | Kyutai 1B | 9.5 | 7.0 | 7.5 | 7.0 | 8.0 | 7.8 |
| 6 | GLM ASR (1.5B) | 7.0 | 6.5 | 8.5 | 8.0 | 7.0 | 7.4 |
| 7 | Qwen ASR (1.7B) | 5.0 | 6.5 | 6.5 | 4.5 | 7.0 | 5.9 |
| 8 | Whisper+ (Distil-V3) | 6.0 | 5.5 | 5.5 | 6.0 | 6.0 | 5.8 |
Combined
| Rank | Model | System | Mic | Combined |
|---|---|---|---|---|
| 🥇 | VoxBar Pro (Voxtral 4B) | 9.7 | 9.5 | 9.6 |
| 🥈 | Kyutai 2.6B | 9.4 | 9.4 | 9.4 |
| 🥉 | Canary (2.5B) | 8.4 | 8.4 | 8.4 |
| 4 | Nemotron (0.6B) | 8.7 | 8.0 | 8.35 |
| 5 | Kyutai 1B | 8.4 | 7.8 | 8.1 |
| 6 | GLM ASR (1.5B) | 8.6 | 7.4 | 8.0 |
| 7 | Whisper+ (Distil-V3) | 7.5 | 5.8 | 6.65 |
| 8 | Qwen ASR (1.7B) | 6.9 | 5.9 | 6.4 |
So Which Model Should You Use?
The answer depends on your hardware and what you value most.
You have a modern NVIDIA GPU (8GB+ VRAM) and want the best: → VoxBar Pro (Voxtral 4B). It scored highest on every single test. It handles accents, numbers, philosophy, and stream-of-consciousness speech without breaking a sweat. If accuracy is everything, this is the one.
You want near-Pro quality with lower VRAM: → Kyutai 2.6B. Only 0.2 behind Pro in combined score. It tied Pro on accent tests. The 4-second delay is the only trade-off — if you can live with that, it's extraordinary value at ~6GB VRAM.
You have a budget GPU (4-6GB VRAM): → Nemotron 0.6B. The most surprising result of the entire benchmark. A 0.6B model that scores 8.7 on system audio shouldn't be possible, but NVIDIA's CTC architecture is remarkably efficient. Only needs ~2GB VRAM.
You have an AMD GPU or need cross-platform: → Whisper+ (Distil-V3). It's the only model in our lineup that runs on AMD GPUs via ONNX. System audio is respectable at 7.5, but mic audio (5.8) is a known weakness. Still, if you can't run NVIDIA CUDA, this is your best option.
You want real-time (sub-second latency): → Kyutai 1B. Text appears within 0.5 seconds of speaking. No other model in our suite comes close to that speed. Accuracy is solid at 8.1 combined, with lists and structured content being its sweet spot.
The Hall of Memorable Errors
Along the way, we collected some transcription failures that are too good not to share. Every speech model has bad days. These were the worst:
| Model | What It Heard | What Was Actually Said |
|-------|---------------|------------------------|
| Moonshine | "apples of tea Fear" | "Apple's approach to user experience" |
| Canary 1B Flash | "once a demonic" | "wants a demonstration" |
| Canary 1B Flash | "and video" | "NVIDIA" |
| Moonshine | "signal Point 4" | "560.94" |
| Moonshine | "server went down to the bottom corner" | "server went down at about three in the morning" |
| Kyutai 1B (pre-tune) | "lonely bliss" | "loneliness" |
| Kyutai 1B (pre-tune) | "can troll" | "confront" |
| Qwen ASR | "New York" | "a neuron" |
| Canary 2.5B | "archeology" | "an armchair" |
| Whisper+ | "we're we're" | "we're" (stutter not caught — apostrophe breaks regex) |
These errors are funny, but they're also useful. They tell you exactly where each model's limits are — and they're the reason we tune.
What's Next
This was Round 1 — our tuning and positioning benchmark. We used it to push every model to its limits, figure out our product lineup, and build the rankings for our website.
But we're not done.
Marathon Tests (Round 2) — Every model will be tested for 30+ minutes of continuous use. We'll measure memory stability, hallucination rates over time, VRAM drift, and whether the model degrades after extended sessions. A model that scores 9.5 for two minutes but crashes after twenty isn't ready for real work.
New Models — The open-source speech field is moving fast. When a new model lands on Hugging Face that meets our criteria, we'll build an engine around it, put it through the same five tests under the same conditions, and add it to the leaderboard. No favourites, no sponsorships. Just scores.
Language Testing — Several models in our suite support multiple languages (Whisper supports 99, Canary handles a handful, Voxtral speaks French). We plan to extend the benchmark with non-English tests for models that claim multilingual support.
Why This Matters
There is no independent, practical comparison of open-source speech-to-text models anywhere on the internet. The academic benchmarks use synthetic datasets that don't represent real usage. The model creators publish flattering numbers on their own test sets. Reddit threads are anecdotal.
We built 15 working applications. We tested 8 head-to-head under identical conditions. We tuned each one until we'd found its ceiling. And we published the results — every score, every error, every fix.
If you're building a product that needs speech-to-text, you shouldn't have to guess which model to use. Now you don't have to.
VoxBar is an independent product. We are not affiliated with, endorsed by, or sponsored by any of the model creators mentioned in this article. All models were tested using their publicly available open-source releases. Our rankings reflect real-world performance under our specific test conditions and may differ from results obtained under different hardware, software, or audio configurations.
All testing was performed in February 2026. Rankings will be updated as new models are released and tested.