Why We Did This
If you search for "best open-source speech-to-text model" today, you'll find benchmark papers, Hugging Face leaderboards, and scattered Reddit threads. What you won't find is someone who actually built working applications around every serious contender and tested them all under identical, real-world conditions.
That's because it's a ridiculous amount of work. We know – we did it anyway.
VoxBar™ started as a single transcription app powered by one model. But we kept running into the same question our customers would ask: "Which model is actually best for me?" The benchmark papers couldn't answer that. Word Error Rate on LibriSpeech doesn't tell you what happens when a German philosopher says "simulacrum" at full speed, or when a Greek economist rattles off "$35 trillion" mid-sentence.
So we decided to find out for ourselves.
The Search: 15 Models, 6 Architectures
We scoured every major model hub – Hugging Face, NVIDIA NGC, GitHub – for English-capable speech-to-text models with open or permissive licensing. Our criteria were simple:
- Must run locally – no cloud APIs, no data leaving the machine
- Must have a usable license – Apache 2.0, MIT, CC-BY-4.0, or equivalent
- Must be a serious contender – published by a real research lab
We found 15 models worth building around, all runnable on consumer hardware. There may be a handful of others out there – older models, or projects that didn't quite reach the frontier – but we focused on the ones genuinely pushing the boundaries of what's possible right now. With Voxtral and Kyutai leading the pack, and Qwen close behind, we believe we've captured the state of the art.
| # | Model | Creator | Size | License |
|---|---|---|---|---|
| 1 | Voxtral Realtime 4B | Mistral AI | 4B | Apache 2.0 |
| 2 | Kyutai STT 2.6B | Kyutai (Paris) | 2.6B | CC-BY-4.0 |
| 3 | Canary Qwen 2.5B | NVIDIA | 2.5B | CC-BY-4.0 |
| 4 | Qwen3-ASR 1.7B | Alibaba | 1.7B | Apache 2.0 |
| 5 | GLM-ASR-Nano 1.5B | Zhipu AI | 1.5B | Apache 2.0 |
| 6 | Kyutai STT 1B | Kyutai (Paris) | 1B | MIT |
| 7 | Canary 1B Flash | NVIDIA | 1B | CC-BY-4.0 |
| 8 | Distil-Whisper V3 | Hugging Face (distilled from OpenAI Whisper) | 756M | MIT |
| 9 | Faster-Whisper Large-V3 | SYSTRAN (OpenAI Whisper weights) | 1.5B | MIT |
| 10 | Faster-Whisper (base) | SYSTRAN (OpenAI Whisper weights) | 74M | MIT |
| 11 | Parakeet TDT 0.6B | NVIDIA | 0.6B | CC-BY-4.0 |
| 12 | Nemotron Speech 0.6B | NVIDIA | 0.6B | NVIDIA OML |
| 13 | Qwen3-ASR 0.6B | Alibaba | 0.6B | Apache 2.0 |
| 14 | SenseVoice-Small | Alibaba | 234M | Apache 2.0 |
| 15 | Moonshine v2 | Useful Sensors | 50M | MIT |
We wrapped each one of these engines in our VoxBar™ application – same UI, same audio pipeline, same start/stop behaviour – and put it through rigorous tuning and stress testing before ranking it in the arena.
The Arena: Equal Conditions, Real Content
Academic benchmarks use clean, pre-recorded datasets like LibriSpeech – studio-quality audio, read from scripts, with no heavy accents and no interruptions. They measure Word Error Rate on audio that sounds nothing like what people actually transcribe.
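For reference, Word Error Rate is nothing more exotic than word-level edit distance against a reference transcript. A minimal sketch (our own illustration, not any benchmark's official scorer):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)
```

The trouble is that a single number like this never tells you *which* words broke, or whether the break was a harmless article or a dollar figure.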
We wanted to test under real-world conditions – the same stuff that people actually throw at a transcription tool, captured through a standard desktop webcam mic and through system audio. We picked five YouTube clips:
| Test | Content | Why This Matters |
|---|---|---|
| T1 | Reading a List (Numbers & Words) | Can it count? Handle structured content? |
| T2 | Lecture (Varoufakis – Greek accent) | Numbers ($35 trillion, 85%), proper nouns, accent |
| T3 | Podcast (Dawkins on Darwin) | Nested philosophy, multiple proper nouns |
| T4 | Accent (Joscha Bach – German) | Dense vocab + accent: "simulacrum", "retina" |
| T5 | Accent (Daniel Dennett – American) | Narrative, stream-of-consciousness |
The accent diversity was deliberate. Across the five tests we covered British (T3, Dawkins), Greek (T2, Varoufakis), German (T4, Bach), and American (T5, Dennett) speakers. If a model can handle all four, it can handle your users.
The Dual-Mode Rule
Every test was run twice:
System Audio (WASAPI loopback) – We played the exact same YouTube video through system audio capture. Every model heard the identical audio signal. This is the "controlled experiment" – perfectly repeatable.
Live Microphone – Same clip played through speakers, captured with a desk mic. Same room, same distance, same volume. This is the "real-world test."
This dual-mode approach revealed something the benchmarks never show: some models are dramatically better on system audio than mic audio, and vice versa. GLM ASR scores 8.6 on system but drops to 7.4 on mic – a 1.2-point gap. Canary 2.5B scores 8.4 on both – perfectly balanced.
The Hardware
- GPU: NVIDIA RTX 4080 SUPER (16GB VRAM)
- OS: Windows 11
- Audio A: WASAPI loopback (system audio)
- Audio B: Default desk microphone
No cherry-picking. No "best of three." One run per mode, scored as-is.
The Surprise: Size Doesn't Predict Quality
Before testing, we assumed the rankings would roughly follow model size. A 4B model should beat a 2.6B model, which should beat a 1.5B model.
We were wrong.
Nemotron – 0.6 billion parameters – finished 4th overall, ahead of GLM at 1.5 billion. On system audio it scored 8.7, beating even Canary at 2.5 billion – higher than models four times its size. The reason lies in architecture.
Three Architecture Families
Through testing, we discovered that how a model processes audio matters far more than how many parameters it has:
1. Streaming (Real-Time) – Models like Kyutai and Voxtral process audio frame-by-frame as it arrives. No chunks, no boundaries. They work especially well when you speak deliberately in clear statements, and through system audio they can transcribe near-perfectly. Voxtral in particular produces flawless output at the top tier. The smaller streaming models can occasionally produce phantom words during long pauses, but this is manageable with good pipeline design.
2. Chunked (Batch) – Models like Whisper, Canary, Nemotron, GLM, and Qwen ASR receive audio in fixed segments (2–8 seconds). Simpler to implement, but creates boundary artifacts when sentences span two chunks.
3. Event-Driven – Moonshine uses a voice activity detector to decide when to transcribe. Works on paper, but missed speech is lost forever.
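The chunked failure mode is easy to reproduce in miniature. This toy sketch (our own illustration, not any model's actual decoder) naively concatenates the outputs of overlapping fixed windows, and the boundary words come out twice:

```python
def naive_chunked_transcribe(words, chunk_len=4, overlap=1):
    """Simulate a chunked pipeline: overlapping fixed windows whose raw
    outputs are concatenated with no deduplication at the boundaries."""
    out, start = [], 0
    while start < len(words):
        out.extend(words[start:start + chunk_len])
        start += chunk_len - overlap   # each window re-reads `overlap` words
    return " ".join(out)

words = "the Supreme Court struck down the mandate".split()
print(naive_chunked_transcribe(words))
# → "the Supreme Court struck struck down the mandate mandate"
```

Real engines overlap in audio samples rather than words, but the boundary stutter in the error table below is exactly this mechanism.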
| Error Type | Architecture | Example |
|---|---|---|
| Boundary stutters | Chunked | "the Supreme Court's The Supreme Court struck down" |
| Silence hallucinations | Streaming | Phantom words produced during pauses |
| Repeated endings | Chunked generative | Canary repeating 20 words at end of test |
| Sentence fragmentation | Chunked generative | GLM inserting periods mid-sentence |
| All-lowercase output | Streaming | Kyutai system audio – no capitalisation |
The biggest insight: most of the issues we encountered weren't really model errors at all – they were integration challenges. The models themselves were hearing the right words. The problems were in how the audio was chunked, how overlapping segments were merged, and how the output was assembled. Once we fixed the pipeline and post-processing, models jumped dramatically in score – without changing a single model weight.
The Tuning Breakthroughs
The first round of raw, untuned scores was sobering. Nemotron scored 5.4. GLM scored 5.4. Models that sounded impressive on paper were producing stuttering, fragmented, borderline-unusable transcripts.
But we noticed something. The content was usually right – the models were hearing the correct words. The mess was in how the output was assembled. These weren't model failures. They were engineering failures. So we started fixing them.
A note on honesty: none of these techniques are novel research contributions. Smart chunking, overlap deduplication, and stutter cleaning are well-understood audio engineering practices. What's genuinely new is that nobody – as far as we can find – has applied them systematically across this many models and published the comparative results. We're not claiming to have advanced the science. We did the unglamorous engineering work that makes these models production-ready.
Fix 1: Smart Chunking (Nemotron: 5.4 → 8.35)
Nemotron's original setup used fixed 2-second chunks. Words were being cut in half, and the model would hallucinate endings. The fix: silence-aware chunking. Instead of cutting every 2 seconds, we wait for a natural pause (up to 8 seconds). This single change eliminated tail hallucinations.
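A minimal sketch of silence-aware chunking, assuming 16 kHz float samples and a simple RMS energy threshold; the function name and thresholds are illustrative, not VoxBar's actual code:

```python
import numpy as np

def silence_aware_chunks(samples, rate=16000, frame_ms=20,
                         silence_rms=0.01, min_chunk_s=1.0, max_chunk_s=8.0):
    """Yield chunks cut at quiet frames instead of on a fixed timer."""
    frame = int(rate * frame_ms / 1000)
    min_len, max_len = int(rate * min_chunk_s), int(rate * max_chunk_s)
    buf = []
    for start in range(0, len(samples), frame):
        f = samples[start:start + frame]
        buf.extend(f)
        quiet = float(np.sqrt(np.mean(np.square(f)))) < silence_rms
        # Cut on a natural pause (but not too early), or at the hard ceiling.
        if (quiet and len(buf) >= min_len) or len(buf) >= max_len:
            yield np.asarray(buf)
            buf = []
    if buf:  # flush whatever remains at end of stream
        yield np.asarray(buf)
```

Because cuts land in pauses, the model never receives half a word, which is what removes the hallucinated endings.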
Fix 2: Sliding Window Dedup (Nemotron, GLM)
Chunked models with audio overlap transcribe the same phrase twice. We built a 12-word sliding window that compares the end of each chunk against the start of the next, stripping duplicates.
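A sketch of the idea – compare the tail of the previous chunk's words against the head of the next and strip the longest match (illustrative, not our production code):

```python
def dedup_overlap(prev_text: str, next_text: str, window: int = 12) -> str:
    """Strip from `next_text` the longest run of words (up to `window`)
    that the end of `prev_text` already emitted."""
    prev = prev_text.split()
    nxt = next_text.split()
    for n in range(min(window, len(prev), len(nxt)), 0, -1):
        if [w.lower() for w in prev[-n:]] == [w.lower() for w in nxt[:n]]:
            return " ".join(nxt[n:])
    return next_text
```

For example, `dedup_overlap("the server went down", "went down at about three")` returns `"at about three"`, so the assembled transcript reads through the chunk boundary cleanly.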
Fix 3: Cross-Sentence Stutter Cleaning (Qwen → GLM → Nemotron)
The discovery that changed everything. While fixing Qwen ASR, we built a stutter cleaner that catches repeated words across sentence boundaries – "Pritzker. Pritzker" or "Foreign. Foreign policy". When we ported this to GLM and Nemotron, both improved immediately.
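The core of such a cleaner fits in a few lines – a sketch using a case-insensitive regex backreference (an illustrative version, not the exact rules we ship):

```python
import re

# A word, a sentence-ending mark, then the same word again (case-insensitive).
_STUTTER = re.compile(r"\b(\w+)[.!?]\s+(\1)\b", re.IGNORECASE)

def clean_stutter(text: str) -> str:
    """Drop the truncated first word and its phantom sentence break,
    keeping the restarted sentence intact."""
    prev = None
    while prev != text:              # repeat until no stutter remains
        prev = text
        text = _STUTTER.sub(r"\2", text)
    return text
```

So `clean_stutter("Foreign. Foreign policy")` yields `"Foreign policy"`: the spurious sentence break and its orphaned word disappear, and the restarted sentence survives intact.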
Fix 4: Loopback Lookahead Buffer (Kyutai 1B)
Kyutai 1B had a bizarre problem: system audio was worse than mic. "Loneliness" became "lonely bliss." The fix was counterintuitive – a 400ms buffer that gives the streaming model time to hear the full word before committing. Every mishearing was fixed.
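The lookahead can be modelled as a simple delay line: samples are released to the decoder only once newer audio has arrived behind them. A sketch, with the class name and defaults as illustrative assumptions:

```python
from collections import deque

class LookaheadBuffer:
    """Delay audio by `lookahead_ms` so the streaming decoder hears the
    tail of a word before its transcription is committed (illustrative)."""
    def __init__(self, rate=24000, lookahead_ms=400):
        self.delay = int(rate * lookahead_ms / 1000)
        self.buf = deque()

    def push(self, samples):
        """Queue new samples; release only those older than the lookahead."""
        self.buf.extend(samples)
        ready = []
        while len(self.buf) > self.delay:
            ready.append(self.buf.popleft())
        return ready

    def flush(self):
        """Release everything at end of stream."""
        out = list(self.buf)
        self.buf.clear()
        return out
```

The cost is 400ms of extra latency; the payoff is that the decoder never commits to "lonely" before the "-ness" arrives.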
Fix 5: Hallucination Spiral Reset (Kyutai 1B)
When Kyutai hallucinates, the hallucinated text poisons its own context, causing more hallucinations. We built a circuit breaker: three hallucinations in 30 seconds triggers a pipeline reset. The cascade breaks and the model starts fresh.
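The circuit breaker itself is just a sliding timestamp window – a sketch (detecting that an output *is* a hallucination is assumed to happen upstream):

```python
import time
from collections import deque

class HallucinationBreaker:
    """Trip after `limit` flagged hallucinations inside `window_s` seconds,
    signalling the caller to reset the streaming pipeline (illustrative)."""
    def __init__(self, limit=3, window_s=30.0, clock=time.monotonic):
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self.events = deque()

    def record(self, now=None):
        """Log one detected hallucination; return True if a reset is due."""
        now = self.clock() if now is None else now
        self.events.append(now)
        # Forget events that have aged out of the window.
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        if len(self.events) >= self.limit:
            self.events.clear()   # fresh context after the reset
            return True
        return False
```

When `record()` returns True, the caller tears down and rebuilds the model's streaming context, which is what breaks the self-poisoning cascade.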
The Score Jumps
| Model | Before | After | Improvement |
|---|---|---|---|
| Nemotron 0.6B | 5.4 | 8.35 | +2.95 |
| GLM 1.5B | 5.4 | 8.0 | +2.6 |
| Kyutai 1B | 7.5 | 8.1 | +0.6 |
Nemotron jumped from dead last to 4th place, beating models four times its size. Not a single model weight was changed. Every fix was in the audio pipeline and post-processing.
The Final Rankings
After tuning every model to its best possible performance, here's where they landed.
🔊 System Audio (Controlled – Identical Audio Input)
| Rank | Model | T1 | T2 | T3 | T4 | T5 | AVG |
|---|---|---|---|---|---|---|---|
| 🥇 | VoxBar Pro (4B) | 9.5 | 9.5 | 10.0 | 10.0 | 9.5 | 9.7 |
| 🥈 | Kyutai 2.6B | 9.5 | 9.5 | 9.5 | 10.0 | 8.5 | 9.4 |
| 🥉 | Nemotron (0.6B) | 8.5 | 8.0 | 9.5 | 9.0 | 8.5 | 8.7 |
| 4 | GLM ASR (1.5B) | 8.5 | 8.0 | 9.0 | 9.0 | 8.5 | 8.6 |
| 5 | Canary (2.5B) | 7.5 | 8.5 | 8.0 | 9.5 | 8.5 | 8.4 |
| 5 | Kyutai 1B | 9.0 | 7.5 | 8.5 | 9.0 | 8.0 | 8.4 |
| 7 | Whisper+ (Distil-V3) | 6.5 | 7.5 | 7.0 | 8.5 | 8.0 | 7.5 |
| 8 | Qwen ASR (1.7B) | 7.0 | 7.5 | 7.5 | 6.5 | 6.0 | 6.9 |
🎤 Microphone Audio (Real-World – Room Acoustics)
| Rank | Model | T1 | T2 | T3 | T4 | T5 | AVG |
|---|---|---|---|---|---|---|---|
| 🥇 | VoxBar Pro (4B) | 9.5 | 9.5 | 10.0 | 10.0 | 8.5 | 9.5 |
| 🥈 | Kyutai 2.6B | 9.5 | 9.5 | 9.5 | 10.0 | 8.5 | 9.4 |
| 🥉 | Canary (2.5B) | 8.5 | 8.5 | 7.5 | 9.0 | 8.5 | 8.4 |
| 4 | Nemotron (0.6B) | 9.0 | 7.0 | 8.0 | 8.5 | 7.5 | 8.0 |
| 5 | Kyutai 1B | 9.5 | 7.0 | 7.5 | 7.0 | 8.0 | 7.8 |
| 6 | GLM ASR (1.5B) | 7.0 | 6.5 | 8.5 | 8.0 | 7.0 | 7.4 |
| 7 | Qwen ASR (1.7B) | 5.0 | 6.5 | 6.5 | 4.5 | 7.0 | 5.9 |
| 8 | Whisper+ (Distil-V3) | 6.0 | 5.5 | 5.5 | 6.0 | 6.0 | 5.8 |
🏆 Combined Leaderboard
| Rank | Model | System | Mic | Combined |
|---|---|---|---|---|
| 🥇 | VoxBar Pro (4B) | 9.7 | 9.5 | 9.6 |
| 🥈 | Kyutai 2.6B | 9.4 | 9.4 | 9.4 |
| 🥉 | Canary (2.5B) | 8.4 | 8.4 | 8.4 |
| 4 | Nemotron (0.6B) | 8.7 | 8.0 | 8.35 |
| 5 | Kyutai 1B | 8.4 | 7.8 | 8.1 |
| 6 | GLM ASR (1.5B) | 8.6 | 7.4 | 8.0 |
| 7 | Whisper+ (Distil-V3) | 7.5 | 5.8 | 6.65 |
| 8 | Qwen ASR (1.7B) | 6.9 | 5.9 | 6.4 |
So Which Model Should You Use?
You have a modern NVIDIA GPU (8GB+) and want the best:
→ VoxBar™ Pro (Voxtral 4B). Highest score on every test. Accents, numbers, philosophy, stream-of-consciousness – nothing fazes it. If this is what local speech-to-text looks like now, we'll all be talking to our computers soon. Voxtral's arrival on the scene is genuinely a step change for the entire field.
You want near-Pro quality with lower VRAM:
→ Kyutai 2.6B. Only 0.2 behind Pro, tied on the accent tests, and runs in ~6GB VRAM. Remarkable for the size.
You have a budget GPU (4–6GB):
→ Nemotron 0.6B. The most surprising result of the entire benchmark. A 600-million-parameter model scoring 8.7 on system audio – outperforming models four times its size – shouldn't be possible. But NVIDIA's CTC architecture is ruthlessly efficient. At ~2GB VRAM, it could run alongside a game, a browser, and a video call simultaneously. This is the future of lightweight AI.
You have an AMD GPU or need cross-platform:
→ Whisper+ (Distil-V3). The only model in our suite that runs on AMD GPUs via ONNX.
You want real-time (sub-second latency):
→ Kyutai 1B. Text appears within 0.5 seconds of speaking.
The Hall of Memorable Errors
Every model in this table is impressive technology, built by brilliant researchers and released for the world to use. But no model is perfect, and the smaller models (Moonshine at 50M, Canary 1B) are punching enormously above their weight class. These errors aren't failures – they're the trade-offs of running powerful AI on tiny hardware budgets. We share them because they're genuinely funny, and because they show exactly where the engineering challenges remain:
| Model | What It Heard | What Was Actually Said |
|---|---|---|
| Moonshine | "apples of tea Fear" | Apple's approach to user experience |
| Canary 1B | "once a demonic" | wants a demonstration |
| Canary 1B | "and video" | NVIDIA |
| Moonshine | "signal Point 4" | 560.94 |
| Moonshine | "server went down to the bottom corner" | server went down at about three in the morning |
| Kyutai 1B | "lonely bliss" | loneliness |
| Kyutai 1B | "can troll" | confront |
| Qwen ASR | "New York" | a neuron |
| Canary 2.5B | "archeology" | an armchair |
| Whisper+ | "we're we're" | we're (apostrophe breaks the regex) |
These errors are funny, but they're also useful. They tell you exactly where each model's limits are – and they're the reason we tune.
What's Next
This was Round 1 – our tuning and positioning benchmark. We used it to push every model to its limits, figure out our product lineup, and build the rankings for our website. But we're not done.
Marathon Tests (Round 2) – Every model will be tested for 30+ minutes of continuous use: memory stability, hallucination rates, VRAM drift, and degradation over time.
New Models – When a new model lands that meets our criteria, we'll build an engine around it and put it through the same five tests. No favourites, no sponsorships. Just scores.
Language Testing – Several models support multiple languages. We plan to extend the benchmark with non-English tests.
Why This Matters
There is no independent, practical comparison of open-source speech-to-text models anywhere on the internet. Academic benchmarks use synthetic datasets. Model creators publish flattering numbers on their own test sets. Reddit threads are anecdotal.
We built 15 working applications. We tested 8 head-to-head under identical conditions. We tuned each one until we'd found its ceiling. And we published the results – every score, every error, every fix.
If you're building a product that needs speech-to-text, you shouldn't have to guess which model to use. Now you don't have to.
Credit Where It's Due
We want to be clear: we didn't build these models. We built applications around them. The real breakthroughs belong to the research teams who created them.
One thing we couldn't help noticing: our top two models – Voxtral (Mistral AI) and Kyutai STT (Kyutai) – are both built in Paris. Two labs, same city, building the best English speech-to-text models on Earth. Whatever is happening in the Paris AI scene right now, the rest of the world should be paying attention. The French capital is quietly leading this frontier.
NVIDIA deserves special mention too. They contributed four models to our suite (Canary 2.5B, Canary 1B Flash, Nemotron 0.6B, and Parakeet TDT), and their NeMo framework made integration remarkably smooth. The Nemotron result – a 0.6B model in 4th place – is a testament to their efficiency-first engineering philosophy.
Alibaba's Qwen team, Zhipu AI (GLM), and Useful Sensors (Moonshine) all contributed valuable models with genuinely different approaches. OpenAI's Whisper remains the most broadly compatible model available β it runs on everything. Every one of these teams is pushing the boundaries and we respect the work enormously.
We're just the ones who wired it all up, put them head-to-head, and told you who won.
VoxBar™ is an independent product. We are not affiliated with, endorsed by, or sponsored by any of the model creators mentioned in this article.
All testing was performed in February 2026. Rankings will be updated as new models are released and tested.