Why We Did This
If you search for "best open-source speech-to-text model" today, you'll find benchmark papers, Hugging Face leaderboards, and scattered Reddit threads. What you won't find is someone who actually built working applications around every serious contender and tested them all under identical, real-world conditions.
That's because it's a ridiculous amount of work. We know – we did it anyway.
VoxBar™ started as a single transcription app powered by one model. But we kept running into the same question our customers would ask: "Which model is actually best for me?" The benchmark papers couldn't answer that. Word Error Rate on LibriSpeech doesn't tell you what happens when a German philosopher says "simulacrum" at full speed, or when a Greek economist rattles off "$35 trillion" mid-sentence.
So we decided to find out for ourselves.
The Search: 15 Models, 6 Architectures
We scoured every major model hub – Hugging Face, NVIDIA NGC, GitHub – for English-capable speech-to-text models with open or permissive licensing. Our criteria were simple:
- Must run locally – no cloud APIs, no data leaving the machine
- Must have a usable license – Apache 2.0, MIT, CC-BY-4.0, or equivalent
- Must be a serious contender – published by a real research lab
We found 15 models worth building around, all runnable on consumer hardware. There may be a handful of others out there – older models, or projects that didn't quite reach the frontier – but we focused on the ones genuinely pushing the boundaries of what's possible right now. With Voxtral and Kyutai leading the pack, and Qwen close behind, we believe we've captured the state of the art.
| # | Model | Creator | Size | License |
|---|---|---|---|---|
| 1 | Voxtral Realtime 4B | Mistral AI | 4B | Apache 2.0 |
| 2 | Kyutai STT 2.6B | Kyutai (Paris) | 2.6B | CC-BY-4.0 |
| 3 | Canary Qwen 2.5B | NVIDIA | 2.5B | CC-BY-4.0 |
| 4 | Qwen3-ASR 1.7B | Alibaba | 1.7B | Apache 2.0 |
| 5 | GLM-ASR-Nano 1.5B | Zhipu AI | 1.5B | Apache 2.0 |
| 6 | Kyutai STT 1B | Kyutai (Paris) | 1B | MIT |
| 7 | Canary 1B Flash | NVIDIA | 1B | CC-BY-4.0 |
| 8 | Distil-Whisper V3 | Hugging Face (distilled from OpenAI Whisper) | 756M | MIT |
| 9 | Faster-Whisper Large-V3 | SYSTRAN (OpenAI Whisper weights) | 1.5B | MIT |
| 10 | Faster-Whisper (base) | SYSTRAN (OpenAI Whisper weights) | 74M | MIT |
| 11 | Parakeet TDT 0.6B | NVIDIA | 0.6B | CC-BY-4.0 |
| 12 | Nemotron Speech 0.6B | NVIDIA | 0.6B | NVIDIA OML |
| 13 | Qwen3-ASR 0.6B | Alibaba | 0.6B | Apache 2.0 |
| 14 | SenseVoice-Small | Alibaba | 234M | Apache 2.0 |
| 15 | Moonshine v2 | Useful Sensors | 50M | MIT |
We wrapped each one of these engines in our VoxBar™ application – same UI, same audio pipeline, same start/stop behaviour – and put it through rigorous tuning and stress testing before ranking it in the arena.
The Arena: Equal Conditions, Real Content
Academic benchmarks use clean, pre-recorded datasets like LibriSpeech – studio-quality audio, read from scripts, with no heavy accents and no interruptions. They measure Word Error Rate on audio that sounds nothing like what people actually transcribe.
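For reference, Word Error Rate is nothing more exotic than word-level edit distance against a reference transcript. A minimal sketch (our own illustration, not any benchmark's official scorer):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)
```

The trouble is that a single number like this never tells you *which* words broke, or whether the break was a harmless article or a dollar figure.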
We wanted to test under real-world conditions – the same stuff that people actually throw at a transcription tool, captured through a standard desktop webcam mic and through system audio. We picked five YouTube clips:
| Test | Content | Why This Matters |
|---|---|---|
| T1 | Reading a List (Numbers & Words) | Can it count? Handle structured content? |
| T2 | Lecture (Varoufakis – Greek accent) | Numbers ($35 trillion, 85%), proper nouns, accent |
| T3 | Podcast (Dawkins on Darwin) | Nested philosophy, multiple proper nouns |
| T4 | Accent (Joscha Bach – German) | Dense vocab + accent: "simulacrum", "retina" |
| T5 | Accent (Daniel Dennett – American) | Narrative, stream-of-consciousness |
The accent diversity was deliberate. Across the five tests we covered British (T3, Dawkins), Greek (T2, Varoufakis), German (T4, Bach), and American (T5, Dennett) speakers. If a model can handle all four, it can handle your users.
The Dual-Mode Rule
Every test was run twice:
System Audio (WASAPI loopback) – We played the exact same YouTube video through system audio capture. Every model heard the identical audio signal. This is the "controlled experiment" – perfectly repeatable.
Live Microphone – Same clip played through speakers, captured with a desk mic. Same room, same distance, same volume. This is the "real-world test."
This dual-mode approach revealed something the benchmarks never show: some models are dramatically better on system audio than mic audio, and vice versa. GLM ASR scores 8.6 on system but drops to 7.4 on mic – a 1.2-point gap. Canary 2.5B scores 8.4 on both – perfectly balanced.
The Hardware
- GPU: NVIDIA RTX 4080 SUPER (16GB VRAM)
- OS: Windows 11
- Audio A: WASAPI loopback (system audio)
- Audio B: Default desk microphone
No cherry-picking. No "best of three." One run per mode, scored as-is.
The Surprise: Size Doesn't Predict Quality
Before testing, we assumed the rankings would roughly follow model size. A 4B model should beat a 2.6B model, which should beat a 1.5B model.
We were wrong.
Nemotron – 0.6 billion parameters – finished 4th overall, ahead of GLM at 1.5 billion. On system audio it scored 8.7, beating even Canary at 2.5 billion – higher than models four times its size. The reason lies in architecture.
Three Architecture Families
Through testing, we discovered that how a model processes audio matters far more than how many parameters it has:
1. Streaming (Real-Time) – Models like Kyutai and Voxtral process audio frame-by-frame as it arrives. No chunks, no boundaries. They work especially well when you speak deliberately in clear statements, and through system audio they can transcribe near-perfectly. Voxtral in particular produces flawless output at the top tier. The smaller streaming models can occasionally produce phantom words during long pauses, but this is manageable with good pipeline design.
2. Chunked (Batch) – Models like Whisper, Canary, Nemotron, GLM, and Qwen ASR receive audio in fixed segments (2–8 seconds). Simpler to implement, but creates boundary artifacts when sentences span two chunks.
3. Event-Driven – Moonshine uses a voice activity detector to decide when to transcribe. Works on paper, but missed speech is lost forever.
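The chunked failure mode is easy to reproduce in miniature. This toy sketch (our own illustration, not any model's actual decoder) naively concatenates the outputs of overlapping fixed windows, and the boundary words come out twice:

```python
def naive_chunked_transcribe(words, chunk_len=4, overlap=1):
    """Simulate a chunked pipeline: overlapping fixed windows whose raw
    outputs are concatenated with no deduplication at the boundaries."""
    out, start = [], 0
    while start < len(words):
        out.extend(words[start:start + chunk_len])
        start += chunk_len - overlap   # each window re-reads `overlap` words
    return " ".join(out)

words = "the Supreme Court struck down the mandate".split()
print(naive_chunked_transcribe(words))
# → "the Supreme Court struck struck down the mandate mandate"
```

Real engines overlap in audio samples rather than words, but the boundary stutter in the error table below is exactly this mechanism.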
| Error Type | Architecture | Example |
|---|---|---|
| Boundary stutters | Chunked | "the Supreme Court's The Supreme Court struck down" |
| Silence hallucinations | Streaming | Phantom words produced during pauses |
| Repeated endings | Chunked generative | Canary repeating 20 words at end of test |
| Sentence fragmentation | Chunked generative | GLM inserting periods mid-sentence |
| All-lowercase output | Streaming | Kyutai system audio – no capitalisation |
The biggest insight: most of the issues we encountered weren't really model errors at all – they were integration challenges. The models themselves were hearing the right words. The problems were in how the audio was chunked, how overlapping segments were merged, and how the output was assembled. Once we fixed the pipeline and post-processing, models jumped dramatically in score – without changing a single model weight.
The Tuning Breakthroughs
The first round of raw, untuned scores was sobering. Nemotron scored 5.4. GLM scored 5.4. Models that sounded impressive on paper were producing stuttering, fragmented, borderline-unusable transcripts.
But we noticed something. The content was usually right – the models were hearing the correct words. The mess was in how the output was assembled. These weren't model failures. They were engineering failures. So we started fixing them.
A note on honesty: none of these techniques are novel research contributions. Smart chunking, overlap deduplication, and stutter cleaning are well-understood audio engineering practices. What's genuinely new is that nobody – as far as we can find – has applied them systematically across this many models and published the comparative results. We're not claiming to have advanced the science. We did the unglamorous engineering work that makes these models production-ready.
Fix 1: Smart Chunking (Nemotron: 5.4 → 8.35)
Nemotron's original setup used fixed 2-second chunks. Words were being cut in half, and the model would hallucinate endings. The fix: silence-aware chunking. Instead of cutting every 2 seconds, we wait for a natural pause (up to 8 seconds). This single change eliminated tail hallucinations.
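A minimal sketch of silence-aware chunking, assuming 16 kHz float samples and a simple RMS energy threshold; the function name and thresholds are illustrative, not VoxBar's actual code:

```python
import numpy as np

def silence_aware_chunks(samples, rate=16000, frame_ms=20,
                         silence_rms=0.01, min_chunk_s=1.0, max_chunk_s=8.0):
    """Yield chunks cut at quiet frames instead of on a fixed timer."""
    frame = int(rate * frame_ms / 1000)
    min_len, max_len = int(rate * min_chunk_s), int(rate * max_chunk_s)
    buf = []
    for start in range(0, len(samples), frame):
        f = samples[start:start + frame]
        buf.extend(f)
        quiet = float(np.sqrt(np.mean(np.square(f)))) < silence_rms
        # Cut on a natural pause (but not too early), or at the hard ceiling.
        if (quiet and len(buf) >= min_len) or len(buf) >= max_len:
            yield np.asarray(buf)
            buf = []
    if buf:  # flush whatever remains at end of stream
        yield np.asarray(buf)
```

Because cuts land in pauses, the model never receives half a word, which is what removes the hallucinated endings.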
Fix 2: Sliding Window Dedup (Nemotron, GLM)
Chunked models with audio overlap transcribe the same phrase twice. We built a 12-word sliding window that compares the end of each chunk against the start of the next, stripping duplicates.
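A sketch of the idea – compare the tail of the previous chunk's words against the head of the next and strip the longest match (illustrative, not our production code):

```python
def dedup_overlap(prev_text: str, next_text: str, window: int = 12) -> str:
    """Strip from `next_text` the longest run of words (up to `window`)
    that the end of `prev_text` already emitted."""
    prev = prev_text.split()
    nxt = next_text.split()
    for n in range(min(window, len(prev), len(nxt)), 0, -1):
        if [w.lower() for w in prev[-n:]] == [w.lower() for w in nxt[:n]]:
            return " ".join(nxt[n:])
    return next_text
```

For example, `dedup_overlap("the server went down", "went down at about three")` returns `"at about three"`, so the assembled transcript reads through the chunk boundary cleanly.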
Fix 3: Cross-Sentence Stutter Cleaning (Qwen → GLM → Nemotron)
The discovery that changed everything. While fixing Qwen ASR, we built a stutter cleaner that catches repeated words across sentence boundaries – "Pritzker. Pritzker" or "Foreign. Foreign policy". When we ported this to GLM and Nemotron, both improved immediately.
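The core of such a cleaner fits in a few lines – a sketch using a case-insensitive regex backreference (an illustrative version, not the exact rules we ship):

```python
import re

# A word, a sentence-ending mark, then the same word again (case-insensitive).
_STUTTER = re.compile(r"\b(\w+)[.!?]\s+(\1)\b", re.IGNORECASE)

def clean_stutter(text: str) -> str:
    """Drop the truncated first word and its phantom sentence break,
    keeping the restarted sentence intact."""
    prev = None
    while prev != text:              # repeat until no stutter remains
        prev = text
        text = _STUTTER.sub(r"\2", text)
    return text
```

So `clean_stutter("Foreign. Foreign policy")` yields `"Foreign policy"`: the spurious sentence break and its orphaned word disappear, and the restarted sentence survives intact.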
Fix 4: Loopback Lookahead Buffer (Kyutai 1B)
Kyutai 1B had a bizarre problem: system audio was worse than mic. "Loneliness" became "lonely bliss." The fix was counterintuitive – a 400ms buffer that gives the streaming model time to hear the full word before committing. Every mishearing was fixed.
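The lookahead can be modelled as a simple delay line: samples are released to the decoder only once newer audio has arrived behind them. A sketch, with the class name and defaults as illustrative assumptions:

```python
from collections import deque

class LookaheadBuffer:
    """Delay audio by `lookahead_ms` so the streaming decoder hears the
    tail of a word before its transcription is committed (illustrative)."""
    def __init__(self, rate=24000, lookahead_ms=400):
        self.delay = int(rate * lookahead_ms / 1000)
        self.buf = deque()

    def push(self, samples):
        """Queue new samples; release only those older than the lookahead."""
        self.buf.extend(samples)
        ready = []
        while len(self.buf) > self.delay:
            ready.append(self.buf.popleft())
        return ready

    def flush(self):
        """Release everything at end of stream."""
        out = list(self.buf)
        self.buf.clear()
        return out
```

The cost is 400ms of extra latency; the payoff is that the decoder never commits to "lonely" before the "-ness" arrives.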
Fix 5: Hallucination Spiral Reset (Kyutai 1B)
When Kyutai hallucinates, the hallucinated text poisons its own context, causing more hallucinations. We built a circuit breaker: three hallucinations in 30 seconds triggers a pipeline reset. The cascade breaks and the model starts fresh.
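The circuit breaker itself is just a sliding timestamp window – a sketch (detecting that an output *is* a hallucination is assumed to happen upstream):

```python
import time
from collections import deque

class HallucinationBreaker:
    """Trip after `limit` flagged hallucinations inside `window_s` seconds,
    signalling the caller to reset the streaming pipeline (illustrative)."""
    def __init__(self, limit=3, window_s=30.0, clock=time.monotonic):
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self.events = deque()

    def record(self, now=None):
        """Log one detected hallucination; return True if a reset is due."""
        now = self.clock() if now is None else now
        self.events.append(now)
        # Forget events that have aged out of the window.
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        if len(self.events) >= self.limit:
            self.events.clear()   # fresh context after the reset
            return True
        return False
```

When `record()` returns True, the caller tears down and rebuilds the model's streaming context, which is what breaks the self-poisoning cascade.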
The Score Jumps
| Model | Before | After | Improvement |
|---|---|---|---|
| Nemotron 0.6B | 5.4 | 8.35 | +2.95 |
| GLM 1.5B | 5.4 | 8.0 | +2.6 |
| Kyutai 1B | 7.5 | 8.1 | +0.6 |
Nemotron jumped from dead last to 4th place, beating models four times its size. Not a single model weight was changed. Every fix was in the audio pipeline and post-processing.
The Final Rankings
After tuning every model to its best possible performance, here's where they landed.
🔊 System Audio (Controlled – Identical Audio Input)
| Rank | Model | T1 | T2 | T3 | T4 | T5 | AVG |
|---|---|---|---|---|---|---|---|
| 🥇 | VoxBar Pro (4B) | 9.5 | 9.5 | 10.0 | 10.0 | 9.5 | 9.7 |
| 🥈 | Kyutai 2.6B | 9.5 | 9.5 | 9.5 | 10.0 | 8.5 | 9.4 |
| 🥉 | Nemotron (0.6B) | 8.5 | 8.0 | 9.5 | 9.0 | 8.5 | 8.7 |
| 4 | GLM ASR (1.5B) | 8.5 | 8.0 | 9.0 | 9.0 | 8.5 | 8.6 |
| 5 | Canary (2.5B) | 7.5 | 8.5 | 8.0 | 9.5 | 8.5 | 8.4 |
| 5 | Kyutai 1B | 9.0 | 7.5 | 8.5 | 9.0 | 8.0 | 8.4 |
| 7 | Whisper+ (Distil-V3) | 6.5 | 7.5 | 7.0 | 8.5 | 8.0 | 7.5 |
| 8 | Qwen ASR (1.7B) | 7.0 | 7.5 | 7.5 | 6.5 | 6.0 | 6.9 |
🎤 Microphone Audio (Real-World – Room Acoustics)
| Rank | Model | T1 | T2 | T3 | T4 | T5 | AVG |
|---|---|---|---|---|---|---|---|
| 🥇 | VoxBar Pro (4B) | 9.5 | 9.5 | 10.0 | 10.0 | 8.5 | 9.5 |
| 🥈 | Kyutai 2.6B | 9.5 | 9.5 | 9.5 | 10.0 | 8.5 | 9.4 |
| 🥉 | Canary (2.5B) | 8.5 | 8.5 | 7.5 | 9.0 | 8.5 | 8.4 |
| 4 | Nemotron (0.6B) | 9.0 | 7.0 | 8.0 | 8.5 | 7.5 | 8.0 |
| 5 | Kyutai 1B | 9.5 | 7.0 | 7.5 | 7.0 | 8.0 | 7.8 |
| 6 | GLM ASR (1.5B) | 7.0 | 6.5 | 8.5 | 8.0 | 7.0 | 7.4 |
| 7 | Qwen ASR (1.7B) | 5.0 | 6.5 | 6.5 | 4.5 | 7.0 | 5.9 |
| 8 | Whisper+ (Distil-V3) | 6.0 | 5.5 | 5.5 | 6.0 | 6.0 | 5.8 |
🏆 Combined Leaderboard
| Rank | Model | System | Mic | Combined |
|---|---|---|---|---|
| 🥇 | VoxBar Pro (4B) | 9.7 | 9.5 | 9.6 |
| 🥈 | Kyutai 2.6B | 9.4 | 9.4 | 9.4 |
| 🥉 | Canary (2.5B) | 8.4 | 8.4 | 8.4 |
| 4 | Nemotron (0.6B) | 8.7 | 8.0 | 8.35 |
| 5 | Kyutai 1B | 8.4 | 7.8 | 8.1 |
| 6 | GLM ASR (1.5B) | 8.6 | 7.4 | 8.0 |
| 7 | Whisper+ (Distil-V3) | 7.5 | 5.8 | 6.65 |
| 8 | Qwen ASR (1.7B) | 6.9 | 5.9 | 6.4 |
So Which Model Should You Use?
You have a modern NVIDIA GPU (8GB+) and want the best:
→ VoxBar™ Pro (Voxtral 4B). Highest score on every test. Accents, numbers, philosophy, stream-of-consciousness – nothing fazes it. If this is what local speech-to-text looks like now, we'll all be talking to our computers soon. Voxtral's arrival on the scene is genuinely a step change for the entire field.
You want near-Pro quality with lower VRAM:
→ Kyutai 2.6B. Only 0.2 behind Pro, tied on the accent tests, and runs in ~6GB VRAM. Remarkable for the size.
You have a budget GPU (4–6GB):
→ Nemotron 0.6B. The most surprising result of the entire benchmark. A 600-million-parameter model scoring 8.7 on system audio – outperforming models four times its size – shouldn't be possible. But NVIDIA's CTC architecture is ruthlessly efficient. At ~2GB VRAM, it could run alongside a game, a browser, and a video call simultaneously. This is the future of lightweight AI.
You have an AMD GPU or need cross-platform:
→ Whisper+ (Distil-V3). The only model in our suite that runs on AMD GPUs via ONNX.
You want real-time (sub-second latency):
→ Kyutai 1B. Text appears within 0.5 seconds of speaking.
The Hall of Memorable Errors
Every model in this table is impressive technology, built by brilliant researchers and released for the world to use. But no model is perfect, and the smaller models (Moonshine at 50M, Canary 1B) are punching enormously above their weight class. These errors aren't failures – they're the trade-offs of running powerful AI on tiny hardware budgets. We share them because they're genuinely funny, and because they show exactly where the engineering challenges remain:
| Model | What It Heard | What Was Actually Said |
|---|---|---|
| Moonshine | "apples of tea Fear" | Apple's approach to user experience |
| Canary 1B | "once a demonic" | wants a demonstration |
| Canary 1B | "and video" | NVIDIA |
| Moonshine | "signal Point 4" | 560.94 |
| Moonshine | "server went down to the bottom corner" | server went down at about three in the morning |
| Kyutai 1B | "lonely bliss" | loneliness |
| Kyutai 1B | "can troll" | confront |
| Qwen ASR | "New York" | a neuron |
| Canary 2.5B | "archeology" | an armchair |
| Whisper+ | "we're we're" | we're (apostrophe breaks the regex) |
These errors are funny, but they're also useful. They tell you exactly where each model's limits are – and they're the reason we tune.
What's Next
This was Round 1 – our tuning and positioning benchmark. We used it to push every model to its limits, figure out our product lineup, and build the rankings for our website. But we're not done.
Marathon Tests (Round 2) – Every model will be tested for 30+ minutes of continuous use: memory stability, hallucination rates, VRAM drift, and degradation over time.
New Models – When a new model lands that meets our criteria, we'll build an engine around it and put it through the same five tests. No favourites, no sponsorships. Just scores.
Language Testing – Several models support multiple languages. We plan to extend the benchmark with non-English tests.
Why This Matters
There is no independent, practical comparison of open-source speech-to-text models anywhere on the internet. Academic benchmarks use synthetic datasets. Model creators publish flattering numbers on their own test sets. Reddit threads are anecdotal.
We built 15 working applications. We tested 8 head-to-head under identical conditions. We tuned each one until we'd found its ceiling. And we published the results – every score, every error, every fix.
If you're building a product that needs speech-to-text, you shouldn't have to guess which model to use. Now you don't have to.
Credit Where It's Due
We want to be clear: we didn't build these models. We built applications around them. The real breakthroughs belong to the research teams who created them.
One thing we couldn't help noticing: our top two models – Voxtral (Mistral AI) and Kyutai STT (Kyutai) – are both built in Paris. Two labs, same city, building the best English speech-to-text models on Earth. Whatever is happening in the Paris AI scene right now, the rest of the world should be paying attention. The French capital is quietly leading this frontier.
NVIDIA deserves special mention too. They contributed four models to our suite (Canary 2.5B, Canary 1B Flash, Nemotron 0.6B, and Parakeet TDT), and their NeMo framework made integration remarkably smooth. The Nemotron result – a 0.6B model in 4th place – is a testament to their efficiency-first engineering philosophy.
Alibaba's Qwen team, Zhipu AI (GLM), and Useful Sensors (Moonshine) all contributed valuable models with genuinely different approaches. OpenAI's Whisper remains the most broadly compatible model available β it runs on everything. Every one of these teams is pushing the boundaries and we respect the work enormously.
We're just the ones who wired it all up, put them head-to-head, and told you who won.
VoxBar™ is an independent product. We are not affiliated with, endorsed by, or sponsored by any of the model creators mentioned in this article.
All testing was performed in February 2026. Rankings will be updated as new models are released and tested.