How It Works
VoxBar Ultra uses NVIDIA's Parakeet TDT 0.6B v2 — a compact but devastatingly fast ASR model built on the FastConformer architecture with a Token-and-Duration Transducer. This is a single-pass, non-autoregressive model — it doesn't generate text one word at a time like an LLM. It processes the entire audio chunk in one forward pass and outputs the full transcription instantly.
Here's what happens, step by step:
- Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
- Accumulates 2 seconds of audio in a small in-memory buffer
- Checks for silence — if the RMS energy is below 0.01, the chunk is skipped
- Writes a tiny temp WAV file to your system temp folder
- Feeds the WAV to Parakeet TDT via NeMo's `model.transcribe()` API
- The model processes the entire chunk in a single forward pass — no autoregressive token generation, no beam search delays
- Complete transcription is returned instantly — with punctuation and capitalisation included
- Temp file is immediately deleted — nothing accumulates on disk
- Text is appended to your textbox
- Repeats forever — each chunk is completely independent
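The per-chunk loop above can be sketched in plain Python. This is a minimal illustration, not VoxBar's actual source: the `transcribe` callback stands in for NeMo's `model.transcribe()` call, all function names are hypothetical, and only the standard library is used.

```python
import math
import os
import struct
import tempfile
import wave

SILENCE_RMS = 0.01  # threshold from the silence check above


def is_silence(samples):
    """True if the RMS energy of float samples in [-1, 1] is below threshold."""
    if not samples:
        return True
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < SILENCE_RMS


def write_temp_wav(samples, rate=16000):
    """Write mono 16-bit PCM to a temp WAV file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    with os.fdopen(fd, "wb") as f:
        with wave.open(f, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(rate)
            pcm = b"".join(
                struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
                for s in samples
            )
            w.writeframes(pcm)
    return path


def process_chunk(samples, transcribe):
    """One loop iteration: silence gate -> temp WAV -> ASR -> immediate cleanup."""
    if is_silence(samples):
        return None  # chunk skipped, nothing written to disk
    path = write_temp_wav(samples)
    try:
        return transcribe(path)  # with NeMo: model.transcribe([path])[0]
    finally:
        os.remove(path)  # temp file deleted immediately, nothing accumulates
```

In the real app, `samples` would come from a sounddevice input stream (16kHz, 1024-sample blocks) and `transcribe` would wrap the loaded Parakeet model; each call is fully independent, which is why the loop can run forever.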
The key difference from VoxBar AI: Parakeet is a dedicated ASR model, not an LLM. It doesn't "understand" language — it just maps audio to text with extraordinary precision and speed. This makes it significantly faster per chunk, at the cost of less contextual intelligence.
Recording Limits
VoxBar Ultra Has No Recording Limit
Like VoxBar AI, VoxBar Ultra runs natively on your machine with no Docker, no WebSocket, and no server process. Each 2-second chunk is completely independent — the model processes it, the temp file is deleted, and it moves on.
Why It Runs Forever
- Each chunk is self-contained — no state carries between chunks
- GPU memory is fixed at ~2GB — the smallest footprint of any GPU-accelerated VoxBar model
- No network connections, no Docker, no server processes
- The 0.6B parameter model is tiny — it never stresses your GPU
Auto-Stop Behaviour
- Silence timeout: 60 seconds of no detected speech
- Check interval: Every 5 seconds
- Designed for active dictation sessions rather than passive meeting recording
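The auto-stop rules above amount to a small watchdog. Here is one way it could be structured (the class and method names are hypothetical, not VoxBar's actual code): the recording loop calls `on_speech()` whenever a chunk passes the silence check, and polls `should_stop()` every check interval.

```python
import time

SILENCE_TIMEOUT = 60.0  # seconds of no detected speech before auto-stop
CHECK_INTERVAL = 5.0    # how often the caller polls should_stop()


class SilenceWatchdog:
    """Stops a dictation session after a fixed period with no speech."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock          # injectable clock, handy for testing
        self.last_speech = clock()  # treat session start as "speech"

    def on_speech(self):
        """Call whenever a chunk passes the silence (RMS) check."""
        self.last_speech = self.clock()

    def should_stop(self):
        """Poll every CHECK_INTERVAL seconds; True after 60 s of silence."""
        return self.clock() - self.last_speech >= SILENCE_TIMEOUT
```

The injectable clock is just a convenience so the timeout logic can be exercised without waiting a real minute.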
Memory & Resource Footprint
| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~2GB fixed | ✅ Never grows — smallest GPU footprint in the suite |
| RAM | ~400MB (Python process + NeMo) | ✅ Stable |
| Disk | Zero accumulation | ✅ Temp WAV files deleted immediately after each chunk |
| Network | None | ✅ Completely offline |
VoxBar Ultra is the most resource-efficient GPU model in the entire suite. At just 2GB VRAM, it runs comfortably on entry-level NVIDIA GPUs (GTX 1650, RTX 3050, etc.) that can't fit the larger models. You can run VoxBar Ultra alongside games, video editing, or other GPU-intensive tasks without worrying about VRAM pressure.
Architecture Advantage
What makes VoxBar Ultra special: It holds the #1 accuracy benchmark on LibriSpeech at just 1.69% Word Error Rate — better than models 10x its size. The FastConformer + TDT architecture is purpose-built for speech recognition:
- Single-pass inference — no autoregressive generation, no beam search. One forward pass = complete transcription
- 3,386x real-time speed — it transcribes audio 3,386 times faster than you can speak it
- Built-in punctuation and capitalisation — no post-processing needed
- Word-level timestamps — every word is tagged with its exact position in time
- Token-and-Duration Transducer — predicts both the text tokens AND their durations simultaneously, making it more accurate than standard CTC models
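Loading the model and requesting word-level timestamps follows the standard NeMo pattern published on the model card. A sketch, assuming an NVIDIA GPU with CUDA, the NeMo ASR toolkit installed, and `clip.wav` as a placeholder path; the first call triggers the ~1.2GB model download, cached afterwards.

```python
import nemo.collections.asr as nemo_asr

# Downloads and caches nvidia/parakeet-tdt-0.6b-v2 on first use
model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

# Single-pass transcription with built-in punctuation/capitalisation
output = model.transcribe(["clip.wav"], timestamps=True)
print(output[0].text)

# Word-level timestamps: each entry carries the word and its time span
for stamp in output[0].timestamp["word"]:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['word']}")
```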
What users DON'T have to worry about:
- ❌ No Docker required — runs natively
- ❌ No internet connection — completely offline
- ❌ No large VRAM requirements — just 2GB
- ❌ No cloud processing — your voice stays on your machine
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever
What users DO need to know:
- ⚠️ Text arrives in chunks (every ~2 seconds)
- ⚠️ NVIDIA GPU required — needs CUDA (no AMD or Apple support)
- ⚠️ English-focused — Parakeet TDT is optimised for English; multilingual support is limited
- ⚠️ First launch downloads ~1.2GB model files (cached after that)
Accuracy & Speed
| Metric | Value |
|---|---|
| Delivery | Chunked — text appears every ~2 seconds |
| Latency | ~0.5-1 second processing time per chunk (extremely fast) |
| Word Error Rate | 1.69% (LibriSpeech benchmark — best in class) |
| Inference Speed | 3,386x real-time |
| Punctuation | Yes — built-in, automatic |
| Capitalisation | Yes — built-in, automatic |
| Languages | English (primary), limited multilingual |
| Timestamps | Word-level timestamps available |
The Speed Advantage
Parakeet TDT processes each audio chunk in a fraction of a second. At the quoted 3,386x real-time factor, the model's forward pass on a 2-second clip takes roughly 0.6 milliseconds; the ~0.5-1 second per-chunk latency in the table above is almost entirely audio buffering and file I/O, not the model. The bottleneck isn't the model — it's how fast audio arrives. VoxBar Ultra feels snappier than VoxBar AI because there's almost zero processing delay once a chunk is ready.
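A quick back-of-the-envelope check of the figures above, assuming the quoted real-time factor applies to the forward pass alone:

```python
chunk_seconds = 2.0
speedup = 3386  # the 3,386x real-time factor quoted above

# Model time per chunk = chunk duration / speedup, converted to milliseconds
inference_ms = chunk_seconds / speedup * 1000
print(f"{inference_ms:.2f} ms of model time per 2-second chunk")  # ~0.59 ms
```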
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 2GB VRAM | NVIDIA with 4GB+ VRAM |
| GPU (AMD) | ❌ Not supported | — |
| GPU (Apple) | ❌ Not supported | — |
| RAM | 8GB | 16GB |
| Disk | ~1.2GB for model (cached in ~/.cache) | SSD recommended |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.10+, NeMo toolkit | pip install "nemo_toolkit[asr]" |
| Docker | ❌ Not required | — |
License & Attribution
| Detail | Value |
|---|---|
| Model | nvidia/parakeet-tdt-0.6b-v2 |
| Creator | NVIDIA |
| License | CC-BY-4.0 (commercially usable with attribution) |
| Attribution | Required — credit NVIDIA in product documentation |
| Distribution | Can be bundled and sold commercially |
Where It Fits in the Suite
| Feature | VoxBar Pro | VoxBar AI | VoxBar Ultra |
|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ (1.69% WER — best benchmark) |
| Text delivery | Real-time | Every 1.5s | Every 2s |
| Processing speed | Streaming | 418x real-time | 3,386x real-time |
| VRAM | ~8-10GB | ~6-8GB | ~2GB |
| Docker | Yes | No | No |
| Languages | Multi | Multi | English-focused |
| Best for | Live presentations | Long dictation | Fast English transcription on any NVIDIA GPU |
Bottom line: VoxBar Ultra is the speed and efficiency king. If you have an NVIDIA GPU — even a modest one — and you primarily work in English, VoxBar Ultra gives you benchmark-leading accuracy at a fraction of the resource cost. It's the model that punches way above its weight.