How It Works
VoxBar Ultra uses NVIDIA's Parakeet TDT 0.6B v2 — a compact but devastatingly fast ASR model built on the FastConformer architecture with a Token-and-Duration Transducer. This is a single-pass, non-autoregressive model — it doesn't generate text one word at a time like an LLM. It processes the entire audio chunk in one forward pass and outputs the full transcription instantly.
Here's what happens, step by step:
- Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
- Accumulates 2 seconds of audio in a small in-memory buffer
- Checks for silence — if the RMS energy is below 0.01, the chunk is skipped
- Writes a tiny temp WAV file to your system temp folder
- Feeds the WAV to Parakeet TDT via NeMo's `model.transcribe()` API
- The model processes the entire chunk in a single forward pass — no autoregressive token generation, no beam search delays
- Complete transcription is returned instantly — with punctuation and capitalisation included
- Temp file is immediately deleted — nothing accumulates on disk
- Text is appended to your textbox
- Repeats forever — each chunk is completely independent
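The per-chunk loop above can be sketched in plain Python. This is a minimal illustration, not VoxBar's actual source: the `transcribe` callback stands in for NeMo's `model.transcribe()` call, all function names are hypothetical, and only the standard library is used.

```python
import math
import os
import struct
import tempfile
import wave

SILENCE_RMS = 0.01  # threshold from the silence check above


def is_silence(samples):
    """True if the RMS energy of float samples in [-1, 1] is below threshold."""
    if not samples:
        return True
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < SILENCE_RMS


def write_temp_wav(samples, rate=16000):
    """Write mono 16-bit PCM to a temp WAV file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    with os.fdopen(fd, "wb") as f:
        with wave.open(f, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(rate)
            pcm = b"".join(
                struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
                for s in samples
            )
            w.writeframes(pcm)
    return path


def process_chunk(samples, transcribe):
    """One loop iteration: silence gate -> temp WAV -> ASR -> immediate cleanup."""
    if is_silence(samples):
        return None  # chunk skipped, nothing written to disk
    path = write_temp_wav(samples)
    try:
        return transcribe(path)  # with NeMo: model.transcribe([path])[0]
    finally:
        os.remove(path)  # temp file deleted immediately, nothing accumulates
```

In the real app, `samples` would come from a sounddevice input stream (16kHz, 1024-sample blocks) and `transcribe` would wrap the loaded Parakeet model; each call is fully independent, which is why the loop can run forever.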
The key difference from VoxBar AI: Parakeet is a dedicated ASR model, not an LLM. It doesn't "understand" language — it just maps audio to text with extraordinary precision and speed. This makes it significantly faster per chunk, at the cost of less contextual intelligence.
Recording Limits
VoxBar Ultra Has No Recording Limit
Like VoxBar AI, VoxBar Ultra runs natively on your machine with no Docker, no WebSocket, and no server process. Each 2-second chunk is completely independent — the model processes it, the temp file is deleted, and it moves on.
Why It Runs Forever
- Each chunk is self-contained — no state carries between chunks
- GPU memory is fixed at ~2GB — the smallest footprint of any GPU-accelerated VoxBar model
- No network connections, no Docker, no server processes
- The 0.6B parameter model is tiny — it never stresses your GPU
Auto-Stop Behaviour
- Silence timeout: 60 seconds of no detected speech
- Check interval: Every 5 seconds
- Designed for active dictation sessions rather than passive meeting recording
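The auto-stop rules above amount to a small watchdog. Here is one way it could be structured (the class and method names are hypothetical, not VoxBar's actual code): the recording loop calls `on_speech()` whenever a chunk passes the silence check, and polls `should_stop()` every check interval.

```python
import time

SILENCE_TIMEOUT = 60.0  # seconds of no detected speech before auto-stop
CHECK_INTERVAL = 5.0    # how often the caller polls should_stop()


class SilenceWatchdog:
    """Stops a dictation session after a fixed period with no speech."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock          # injectable clock, handy for testing
        self.last_speech = clock()  # treat session start as "speech"

    def on_speech(self):
        """Call whenever a chunk passes the silence (RMS) check."""
        self.last_speech = self.clock()

    def should_stop(self):
        """Poll every CHECK_INTERVAL seconds; True after 60 s of silence."""
        return self.clock() - self.last_speech >= SILENCE_TIMEOUT
```

The injectable clock is just a convenience so the timeout logic can be exercised without waiting a real minute.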
Memory & Resource Footprint
| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~2GB fixed | ✅ Never grows — smallest GPU footprint in the suite |
| RAM | ~400MB (Python process + NeMo) | ✅ Stable |
| Disk | Zero accumulation | ✅ Temp WAV files deleted immediately after each chunk |
| Network | None | ✅ Completely offline |
VoxBar Ultra is the most resource-efficient GPU model in the entire suite. At just 2GB VRAM, it runs comfortably on entry-level NVIDIA GPUs (GTX 1650, RTX 3050, etc.) that can't fit the larger models. You can run VoxBar Ultra alongside games, video editing, or other GPU-intensive tasks without worrying about VRAM pressure.
Architecture Advantage
What makes VoxBar Ultra special: It holds the #1 accuracy benchmark on LibriSpeech at just 1.69% Word Error Rate — better than models 10x its size. The FastConformer + TDT architecture is purpose-built for speech recognition:
- Single-pass inference — no autoregressive generation, no beam search. One forward pass = complete transcription
- 3,386x real-time speed — it transcribes audio 3,386 times faster than you can speak it
- Built-in punctuation and capitalisation — no post-processing needed
- Word-level timestamps — every word is tagged with its exact position in time
- Token-and-Duration Transducer — predicts both the text tokens AND their durations simultaneously, making it more accurate than standard CTC models
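Loading the model and requesting word-level timestamps follows the standard NeMo pattern published on the model card. A sketch, assuming an NVIDIA GPU with CUDA, the NeMo ASR toolkit installed, and `clip.wav` as a placeholder path; the first call triggers the ~1.2GB model download, cached afterwards.

```python
import nemo.collections.asr as nemo_asr

# Downloads and caches nvidia/parakeet-tdt-0.6b-v2 on first use
model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

# Single-pass transcription with built-in punctuation/capitalisation
output = model.transcribe(["clip.wav"], timestamps=True)
print(output[0].text)

# Word-level timestamps: each entry carries the word and its time span
for stamp in output[0].timestamp["word"]:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['word']}")
```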
What users DON'T have to worry about:
- ❌ No Docker required — runs natively
- ❌ No internet connection — completely offline
- ❌ No large VRAM requirements — just 2GB
- ❌ No cloud processing — your voice stays on your machine
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever
What users DO need to know:
- ⚠️ Text arrives in chunks (every ~2 seconds)
- ⚠️ NVIDIA GPU required — needs CUDA (no AMD or Apple support)
- ⚠️ English-focused — Parakeet TDT is optimised for English; multilingual support is limited
- ⚠️ First launch downloads ~1.2GB model files (cached after that)
Accuracy & Speed
| Metric | Value |
|---|---|
| Delivery | Chunked — text appears every ~2 seconds |
| Latency | ~0.5-1 second processing time per chunk (extremely fast) |
| Word Error Rate | 1.69% (LibriSpeech benchmark — best in class) |
| Inference Speed | 3,386x real-time |
| Punctuation | Yes — built-in, automatic |
| Capitalisation | Yes — built-in, automatic |
| Languages | English (primary), limited multilingual |
| Timestamps | Word-level timestamps available |
The Speed Advantage
Parakeet TDT processes each audio chunk in a fraction of a second. At the quoted 3,386x real-time factor, the model's forward pass on a 2-second clip takes roughly 0.6 milliseconds; the ~0.5-1 second per-chunk latency in the table above is almost entirely audio buffering and file I/O, not the model. The bottleneck isn't the model — it's how fast audio arrives. VoxBar Ultra feels snappier than VoxBar AI because there's almost zero processing delay once a chunk is ready.
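A quick back-of-the-envelope check of the figures above, assuming the quoted real-time factor applies to the forward pass alone:

```python
chunk_seconds = 2.0
speedup = 3386  # the 3,386x real-time factor quoted above

# Model time per chunk = chunk duration / speedup, converted to milliseconds
inference_ms = chunk_seconds / speedup * 1000
print(f"{inference_ms:.2f} ms of model time per 2-second chunk")  # ~0.59 ms
```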
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 2GB VRAM | NVIDIA with 4GB+ VRAM |
| GPU (AMD) | ❌ Not supported | — |
| GPU (Apple) | ❌ Not supported | — |
| RAM | 8GB | 16GB |
| Disk | ~1.2GB for model (cached in ~/.cache) | SSD recommended |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.10+, NeMo toolkit | pip install "nemo_toolkit[asr]" |
| Docker | ❌ Not required | — |
License & Attribution
| Detail | Value |
|---|---|
| Model | nvidia/parakeet-tdt-0.6b-v2 |
| Creator | NVIDIA |
| License | CC-BY-4.0 (commercially usable with attribution) |
| Attribution | Required — credit NVIDIA in product documentation |
| Distribution | Can be bundled and sold commercially |
Where It Fits in the Suite
| Feature | VoxBar Pro | VoxBar AI | VoxBar Ultra |
|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ (1.69% WER — best benchmark) |
| Text delivery | Real-time | Every 1.5s | Every 2s |
| Processing speed | Streaming | 418x real-time | 3,386x real-time |
| VRAM | ~8-10GB | ~6-8GB | ~2GB |
| Docker | Yes | No | No |
| Languages | Multi | Multi | English-focused |
| Best for | Live presentations | Long dictation | Fast English transcription on any NVIDIA GPU |
Bottom line: VoxBar Ultra is the speed and efficiency king. If you have an NVIDIA GPU — even a modest one — and you primarily work in English, VoxBar Ultra gives you benchmark-leading accuracy at a fraction of the resource cost. It's the model that punches way above its weight.