How It Works
VoxBar Flash uses NVIDIA's Canary 1B Flash — a compact, fast ASR model from the same Canary family as VoxBar AI's 2.5B powerhouse, but at a fraction of the resource cost. While VoxBar AI uses the SALM (Speech-Augmented Language Model) architecture with a full LLM backbone, Canary 1B Flash uses the standard NeMo ASR pipeline — the same proven architecture as VoxBar Ultra's Parakeet TDT.
Here's what happens, step by step:
- Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
- Buffers 2 seconds of audio into a small in-memory buffer
- Checks for silence — if the RMS energy is below 0.01, the chunk is skipped
- Writes a tiny temp WAV file to your system temp folder
- Feeds the WAV to Canary 1B Flash via NeMo's standard model.transcribe() API
- Single-pass transcription — the model processes the full chunk and returns complete text
- Temp file is immediately deleted — nothing accumulates on disk
- Text is appended to your textbox
- Repeats forever — each chunk is completely independent
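The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual VoxBar source: the helper names (is_silent, transcribe_chunk, run_mic_loop) and the use of soundfile as the WAV writer are assumptions; the constants mirror the figures in the list.

```python
import os
import tempfile

import numpy as np

SAMPLE_RATE = 16_000   # 16 kHz capture
BLOCK_SIZE = 1024      # samples per sounddevice block
CHUNK_SECONDS = 2      # buffer 2 seconds before transcribing
SILENCE_RMS = 0.01     # chunks quieter than this are skipped

def is_silent(chunk: np.ndarray, threshold: float = SILENCE_RMS) -> bool:
    """RMS energy gate: True when the chunk carries no usable speech."""
    return float(np.sqrt(np.mean(chunk ** 2))) < threshold

def transcribe_chunk(model, chunk: np.ndarray) -> str:
    """Write the chunk to a temp WAV, run one transcribe() pass, delete the file."""
    import soundfile as sf  # assumed WAV writer; common in NeMo setups
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        sf.write(path, chunk, SAMPLE_RATE)
        out = model.transcribe([path])[0]  # single-pass NeMo ASR
        return out.text if hasattr(out, "text") else str(out)
    finally:
        os.remove(path)  # nothing accumulates on disk

def run_mic_loop(model) -> None:
    """Capture -> buffer -> gate -> transcribe, repeated forever."""
    import sounddevice as sd
    frames = CHUNK_SECONDS * SAMPLE_RATE
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        blocksize=BLOCK_SIZE) as stream:
        while True:
            chunk, _overflowed = stream.read(frames)
            chunk = chunk[:, 0]   # mono
            if is_silent(chunk):
                continue          # silent chunk: skip, transcribe nothing
            print(transcribe_chunk(model, chunk))
```

Because every iteration writes, transcribes, and deletes one independent chunk, nothing carries over between loops — which is why memory and disk usage stay flat.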
The "Flash" Advantage
The name says it all — Canary 1B Flash is designed for speed. At just 1 billion parameters (vs 2.5B for VoxBar AI), it loads faster, processes faster, and uses roughly half the VRAM. It's the sweet spot between VoxBar Ultra's ultra-lightweight 0.6B Parakeet and VoxBar AI's heavyweight 2.5B SALM.
Recording Limits
VoxBar Flash Has No Recording Limit
Like all non-Docker VoxBar models, Flash runs natively on your machine with no container, no server, and no WebSocket. Each 2-second chunk is completely independent — the model sees the same input size every time, so GPU memory use stays flat no matter how long you record.
Flush-on-Stop
When you press Stop, VoxBar Flash transcribes any remaining audio still in the buffer before shutting down. You never lose the last words of a sentence — even if you stop mid-speech.
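A sketch of that flush step, assuming the buffer is held as a list of NumPy blocks and the same 0.01 RMS gate as the main loop applies (the helper name is hypothetical):

```python
import numpy as np

SILENCE_RMS = 0.01  # same energy gate as the main capture loop

def flush_on_stop(buffered_blocks: list) -> "np.ndarray | None":
    """Join whatever audio is still buffered when Stop is pressed.

    Returns the partial chunk for one final transcribe() pass, or None
    when the buffer is empty or holds only silence.
    """
    if not buffered_blocks:
        return None
    tail = np.concatenate(buffered_blocks)
    if float(np.sqrt(np.mean(tail ** 2))) < SILENCE_RMS:
        return None  # nothing worth transcribing
    return tail
```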
Auto-Stop Behaviour
- Silence timeout: 60 seconds of no detected speech
- Check interval: Every 5 seconds
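The watchdog logic reduces to a timestamp comparison. A sketch assuming monotonic timestamps (the function name is hypothetical; a background timer would call it every check interval):

```python
import time

SILENCE_TIMEOUT_S = 60.0  # stop after a minute of no detected speech
CHECK_INTERVAL_S = 5.0    # how often the watchdog wakes up and checks

def should_auto_stop(last_speech_time: float, now: float = None) -> bool:
    """True once no speech has been detected for SILENCE_TIMEOUT_S seconds.

    last_speech_time is refreshed whenever a chunk passes the silence gate.
    """
    if now is None:
        now = time.monotonic()
    return (now - last_speech_time) >= SILENCE_TIMEOUT_S
```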
Memory & Resource Footprint
| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~3-4GB fixed | ✅ Never grows — same model, same chunk size, forever |
| RAM | ~400MB | ✅ Stable |
| Disk | Zero accumulation | ✅ Temp WAV files deleted immediately after each chunk |
| Network | None | ✅ Completely offline |
The VRAM Sweet Spot
VoxBar Flash sits perfectly in the VRAM gap between the other models:
| Model | VRAM |
|---|---|
| VoxBar Pro (Voxtral 4B) | ~8-10GB |
| VoxBar AI (Canary Qwen 2.5B) | ~6-8GB |
| VoxBar Flash (Canary 1B Flash) | ~3-4GB |
| VoxBar Ultra (Parakeet TDT 0.6B) | ~2GB |
This makes VoxBar Flash ideal for users with mid-range NVIDIA GPUs — the RTX 3060 (12GB), GTX 1660 (6GB), or even the RTX 4060 (8GB) where you want to leave headroom for other applications.
Architecture Advantage
What makes VoxBar Flash special: It brings NVIDIA's Canary-family accuracy to a lower VRAM budget without requiring the cutting-edge NeMo trunk, PyTorch 2.6+, or SALM architecture that VoxBar AI demands.
Technical difference from VoxBar AI:
| Aspect | VoxBar AI (2.5B) | VoxBar Flash (1B) |
|---|---|---|
| Architecture | SALM (Speech + LLM) | FastConformer ASR |
| API | SALM.generate() | ASRModel.transcribe() |
| NeMo version | Trunk (bleeding edge) | Stable (≥2.0) |
| PyTorch | 2.6+ required | 2.1+ works |
| Model loading | speechlm2.models.SALM | asr.models.ASRModel |
| Inference | LLM token generation | Single-pass ASR |
This means VoxBar Flash is easier to install, more stable, and more compatible with existing NeMo setups. It doesn't need the experimental SALM codebase.
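The loading difference comes down to a few lines. The sketch below is illustrative, not the VoxBar source: the Flash model ID comes from the license table in this document, while the SALM import path and the 2.5B model ID shown in comments are assumptions based on the table above.

```python
FLASH_MODEL_ID = "nvidia/canary-1b-flash"  # CC-BY-4.0, per the license table

def load_flash_model(model_id: str = FLASH_MODEL_ID):
    """Stable-NeMo path used by the Flash approach: one import, one call."""
    import nemo.collections.asr as nemo_asr  # nemo_toolkit[asr], stable >= 2.0
    return nemo_asr.models.ASRModel.from_pretrained(model_id)

# The VoxBar AI path, by contrast, needs the experimental SALM stack
# (NeMo trunk, PyTorch 2.6+); roughly, per the table above:
#   from nemo.collections.speechlm2.models import SALM
#   salm = SALM.from_pretrained("nvidia/canary-qwen-2.5b")  # assumed model ID
#   text = salm.generate(...)  # LLM token-generation loop, not single-pass ASR
```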
What users DON'T have to worry about:
- ❌ No Docker required — runs natively
- ❌ No internet connection — completely offline
- ❌ No bleeding-edge dependencies — works with stable NeMo
- ❌ No special virtual environment — standard pip install
- ❌ No cloud processing — your voice stays on your machine
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever
What users DO need to know:
- ⚠️ Text arrives in chunks (every ~2 seconds)
- ⚠️ NVIDIA GPU required — needs CUDA (no AMD or Apple support)
- ⚠️ 3-4GB VRAM — needs at least a mid-range NVIDIA GPU
- ⚠️ First launch downloads ~2GB model files (cached after that)
Accuracy & Speed
| Metric | Value |
|---|---|
| Delivery | Chunked — text appears every ~2 seconds |
| Latency | ~1 second processing time per chunk |
| Word Error Rate | ~5-7% (between Parakeet's 1.69% and Whisper base's 8-10%) |
| Punctuation | Yes — built-in, automatic |
| Capitalisation | Yes — built-in, automatic |
| Languages | English (primary), multilingual potential |
Where It Sits on Accuracy
Canary 1B Flash delivers clearly better accuracy than Whisper base and, in real-world use, accuracy approaching Parakeet's, while using more VRAM than Parakeet but less than VoxBar AI. It's the balanced middle ground.
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 3GB VRAM | NVIDIA with 4GB+ VRAM |
| GPU (AMD) | ❌ Not supported | — |
| GPU (Apple) | ❌ Not supported | — |
| RAM | 8GB | 16GB |
| Disk | ~2GB for model (cached in ~/.cache) | SSD recommended |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.10+, NeMo ≥2.0 | pip install "nemo_toolkit[asr]" |
| Docker | ❌ Not required | — |
| PyTorch | 2.1+ (standard, not bleeding edge) | Standard CUDA install |
License & Attribution
| Detail | Value |
|---|---|
| Model | nvidia/canary-1b-flash |
| Creator | NVIDIA |
| License | CC-BY-4.0 (commercially usable with attribution) |
| Attribution | Required — credit NVIDIA in product documentation |
| Distribution | Can be bundled and sold commercially |
Where It Fits in the Suite
| Feature | VoxBar Pro | VoxBar AI | VoxBar Ultra | VoxBar Flash | VoxBar Lite | VoxBar Whisper |
|---|---|---|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★★★☆ |
| VRAM | ~8-10GB | ~6-8GB | ~2GB | ~3-4GB | 0GB | 0-2GB |
| Docker | Yes | No | No | No | No | No |
| NeMo | N/A | Trunk | Stable | Stable | N/A | N/A |
| Languages | Multi | Multi | English | English+ | English | 99 |
| Model family | Mistral | NVIDIA Canary | NVIDIA Parakeet | NVIDIA Canary | Useful Sensors | OpenAI |
| Best for | Live streaming | Long sessions | Minimal VRAM | Budget NVIDIA | Any hardware | Multilingual |
Bottom line: VoxBar Flash is the budget NVIDIA option — it brings Canary-family intelligence to users who can't spare 6-8GB of VRAM for the full 2.5B model but still want NVIDIA's accuracy advantage over Whisper and Moonshine. It's easier to install than VoxBar AI (no trunk, no special venv), more accurate than Whisper, and sits in the sweet spot for users with 4-6GB GPUs.