⚗️ Engineering Deep Dive

Quantizing Voxtral for Real‑Time Transcription: What Worked, What Didn't, and What We Learned

After a lot of testing, tuning, and a few dead ends, we ended up with something better than we expected: a native Voxtral 4B F16 build that comes very close to our Docker gold standard, while also identifying a genuinely strong low‑VRAM quantized option for smaller GPUs.

Mar 9, 2026

Real-time transcription is easy to underestimate.

Getting words onto a screen is one thing. Getting them there quickly, accurately, and smoothly enough that the experience feels natural is something else entirely.

Over the last phase of development, we spent a lot of time testing Voxtral across different configurations, quantization methods, and delivery paths to answer a practical question: what actually gives the best balance of streaming feel, accuracy, VRAM usage, and deployability on real hardware?

Some experiments worked. Some did not. Some looked promising until we pushed them properly. But by the end of it, we came away with a much clearer understanding of where Voxtral performs best, where quantization helps, and where it simply is not worth the trade‑off.

The real challenge was not just accuracy

One of the first things we learned is that transcription quality on its own is not enough.

A model can be accurate and still feel wrong.

In real-time speech products, users notice timing issues immediately. They notice when the last word of a sentence is held back. They notice when text arrives in uneven drops instead of flowing naturally. They notice when the output technically makes sense but does not match the rhythm of speech.

That is where chunk-based delivery started to show its limits. With chunking, each audio segment is effectively its own inference event. Even with overlap and tuning, the model loses continuity at boundaries. That creates weakness around sentence endings, partial words, and transitions between chunks.

In practice, that led to the kinds of problems many teams run into:

  • words going missing at chunk edges
  • duplicate or conflicting text when overlap was introduced
  • the final word or two of an utterance being held back
  • a delayed, stepped feel that never quite matched true streaming
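To make the boundary problem concrete, here is a minimal sketch of chunk-based delivery in Python. The chunk and overlap sizes are illustrative, not the values we used; the point is that each chunk is an independent inference event, so a word straddling a boundary is only partially visible to either call.

```python
def make_chunks(samples, chunk_len, overlap):
    """Split an audio buffer into fixed-size chunks that overlap by
    `overlap` samples. Each chunk becomes its own inference event, so
    the model sees no context beyond the overlap region."""
    if overlap >= chunk_len:
        raise ValueError("overlap must be smaller than chunk_len")
    step = chunk_len - overlap
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk_len])
        if start + chunk_len >= len(samples):
            break
    return chunks

# Toy buffer of 10 samples, chunks of 4 with 1 sample of overlap:
chunks = make_chunks(list(range(10)), chunk_len=4, overlap=1)
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap region (sample 3, sample 6 above) is exactly where duplicate or conflicting text appears: both neighbouring inference calls decode it, and their outputs must be reconciled.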

We improved some of this through testing and tuning, but it eventually became clear that some of the issue was structural rather than parametric. That pushed us toward two separate goals: finding the best native streaming experience, and finding the best low‑VRAM quantized option.

The strongest result: native Voxtral 4B in F16

Our best overall result came from Voxtral 4B in F16, running natively. This setup gave us what we had really been aiming for:

  • smooth streaming
  • strong accuracy
  • no Docker, no WSL, no WebSocket bridge
  • single-installer deployment
  • around 8.5 GB of dedicated VRAM

That was the breakthrough. Getting a similarly strong result from a native build meant we could keep the responsiveness and quality we wanted without the extra deployment overhead.

Just as importantly, we benchmarked it properly against our earlier Docker gold standard:

Configuration            Score
Docker reference         9.6
Native Voxtral 4B F16    9.5

That is near‑match territory in both quality and UX. This was not just "good for a native build." It was close enough to the benchmark we trusted most that we now consider it production ready.

It is worth saying that we did not get there instantly. A lot of our effort went into solving one specific and frustrating UX issue: the model would sometimes hold back the final word, or occasionally the final two words, and only release them once the next bit of audio arrived. We improved that behaviour significantly, and that made a major difference to the final user experience.

Once that was largely under control, the native F16 path stopped feeling like an experiment and started feeling like a product.
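We are not detailing the exact fix here, but one common pattern for this class of problem is a silence-timeout flush: hold the most recent words briefly in case the decoder revises them, then force-emit once no new audio has arrived for a short window. The sketch below is an illustrative heuristic with an assumed timeout value, not our production logic.

```python
import time

class PendingWordFlusher:
    """Buffer the newest word(s) in case the decoder revises them, but
    force-emit after `timeout_s` with no new input so the final word of
    an utterance is not stuck waiting for the next chunk of audio.

    Illustrative heuristic only; timeout and structure are assumptions.
    """

    def __init__(self, timeout_s=0.4, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable for testing
        self.pending = []
        self.last_update = clock()

    def push(self, word):
        self.pending.append(word)
        self.last_update = self.clock()

    def poll(self):
        """Return words to emit now: everything pending once the
        silence timeout has elapsed, otherwise nothing yet."""
        if self.pending and self.clock() - self.last_update >= self.timeout_s:
            out, self.pending = self.pending, []
            return out
        return []
```

The injectable clock keeps the heuristic testable without real sleeps; in a live pipeline `poll()` would run on the audio-callback cadence.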

Where Hugging Face INT8 fell short

Once the F16 build was performing well, the next obvious question was whether quantization could push the memory requirement down even further without compromising the experience. Naturally, we looked at a Hugging Face INT8 path.

On paper, that seemed promising. In practice, it was not.

Our tests showed a very clear split:

Configuration                     Streaming text?   VRAM      Latency
F16                               ✓ Works           8.5 GB    7.4 s
Full INT8                         ✗ Pad-only        7.7 GB    29 s
INT8 (audio in full precision)    ✓ Works           7.8 GB    19.5 s

Instead of useful streaming text, the full INT8 setup collapsed into repeated pad‑style output. We investigated this carefully and tried selective skip strategies that kept potentially sensitive modules in higher precision.

That process was useful because it showed that the problem was not a basic loading or implementation error. The failure was tied to precision‑sensitive behaviour inside a realtime architecture that relies on special token transitions and decoder decisions to move from streaming control states into actual text generation.

We eventually got hybrid quantized setups to function by leaving the audio encoder, the multi‑modal projector, and the output head in higher precision. But once we did that, most of the expected VRAM savings disappeared, and latency got worse.
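The skip logic itself is simple; the hard part was discovering which modules had to stay in higher precision. Here is a sketch of the selection step. The name fragments are illustrative and the real module names depend on the checkpoint you load.

```python
# Modules we found too precision-sensitive to quantize. These name
# fragments are illustrative; match them against your own checkpoint.
SKIP_PATTERNS = ("audio_encoder", "multi_modal_projector", "lm_head")

def split_modules(module_names):
    """Partition module names into those safe to quantize to INT8 and
    those to keep in higher precision (F16)."""
    keep_f16, quantize = [], []
    for name in module_names:
        if any(pat in name for pat in SKIP_PATTERNS):
            keep_f16.append(name)
        else:
            quantize.append(name)
    return keep_f16, quantize

names = ["audio_encoder.layers.0", "decoder.layers.0.mlp",
         "multi_modal_projector.linear", "lm_head"]
keep, quant = split_modules(names)
# keep  -> the audio encoder, projector, and output head stay in F16
# quant -> the decoder blocks get quantized
```

If you load through Hugging Face, this kind of partition typically maps onto something like `BitsAndBytesConfig(llm_int8_skip_modules=[...])`; either way, the trade-off we describe above applies: the more you skip, the less VRAM you save.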

For this realtime Voxtral setup, F16 remained the best production configuration. It was simpler, faster, more stable, and more memory‑efficient in practice than any of the quantized alternatives we tested.

The more interesting quantized result: Q8 on chunked delivery

That does not mean the quantization work was wasted. One of the more useful outcomes came from a Q8 quantized Voxtral build using chunk‑based delivery.

This version had a different character from the native F16 streamer. Its biggest weakness was the way text arrived: even when the transcription quality was strong, delivery felt more delayed than the native realtime path, coming through in obvious drops rather than a smooth rolling flow.

Still, it had some very real strengths:

  • around 5 GB GPU usage
  • strong practical accuracy
  • viable performance on smaller graphics cards

What made it more meaningful was how it compared to our next‑best smaller‑GPU model: Kyutai 2.6B. In practical testing, the quantized Voxtral path ended up in very similar territory on VRAM usage, overall accuracy, and general live‑transcription feel.

That is not a small result. On smaller cards, being able to stand alongside one of the stronger competitors in that class is already significant.

What these results mean

The biggest lesson from all of this is simple: there is no single "best" model without context.

Priority                 Best option              Why
Best overall UX          Native Voxtral 4B F16    Near‑gold‑standard quality, simple deployment, ~8.5 GB VRAM
Smaller GPUs             Q8 chunked Voxtral       Strong accuracy at ~5 GB, competes with the Kyutai 2.6B tier
HF INT8 in production    Not recommended          Marginal VRAM savings (~7.8 GB), but 2.5× slower than F16
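The recommendations above reduce to a simple selection rule. The thresholds below are assumptions derived from our approximate measured footprints (~8.5 GB for F16, ~5 GB for Q8 chunked), with a little headroom added; tune them for your own hardware.

```python
def pick_config(vram_gb):
    """Recommend a configuration for a given VRAM budget, based on the
    approximate footprints measured in our tests. Thresholds include
    some headroom and are illustrative, not hard limits."""
    if vram_gb >= 9:    # headroom over the ~8.5 GB F16 footprint
        return "native Voxtral 4B F16"
    if vram_gb >= 6:    # headroom over the ~5 GB Q8 footprint
        return "Q8 chunked Voxtral"
    return "no tested configuration fits"
```

For example, a 16 GB card such as our test GPU lands on the native F16 build, while an 8 GB card falls back to the Q8 chunked path.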

Where we go from here

We do not see this as the end of the work, but it is a strong milestone. The native F16 build is now close enough to our Docker gold standard that we can treat it as a serious production baseline rather than just an internal success story.

There is still room to improve around:

  • microphone behaviour and input levels
  • numbers and percentages
  • longer benchmark runs
  • tougher real‑world edge cases
  • continued polish of transcription behaviour under varied speech conditions

But those are the kinds of improvements you make when the foundation is already working. And that is a much better place to be.

Final thoughts

Not every experiment delivered what we hoped for. Some of the most attractive‑looking quantization paths turned out not to be worth the trade‑offs. Some of the most useful progress came from solving smaller, more practical UX issues rather than chasing bigger‑sounding technical shortcuts.

That is usually how this kind of work goes. You test honestly. You compare against reality instead of assumptions. You keep the approaches that hold up, and you move on from the ones that do not.

Where we have landed is a much clearer picture of the stack:

  • Voxtral 4B F16 gives us a near‑match to our Docker gold standard at around 8.5 GB VRAM
  • Hugging Face INT8 did not produce a worthwhile production path for this setup
  • Quantized Q8 Voxtral at around 5 GB VRAM remains a strong low‑VRAM option and compares well against serious competition on smaller cards

For us, that feels like real progress.

If you are working on similar realtime transcription problems, or just following along with where these models are heading, we will be sharing more as we continue testing and refining the production path.

VoxBar™ is an independent product. We are not affiliated with, endorsed by, or sponsored by Mistral AI or any other model creator mentioned in this article.

All testing was performed in March 2026 using an NVIDIA RTX 4080 SUPER (16 GB VRAM) on Windows 11. Results may vary on different hardware.

Try VoxBar for yourself

See how native F16 streaming feels in practice.

Compare Products