VoxBar Nemotron — Ultra-Lightweight Streaming Transcription

How It Works

NVIDIA's FastConformer architecture — designed for streaming speech recognition from the ground up.

🎤

Captures your audio

Audio is captured at 16kHz from your microphone — or switch to system audio mode to capture anything playing on your PC (meetings, videos, podcasts). No virtual cables needed.

⚡

FastConformer encoder

NVIDIA's FastConformer architecture processes audio with attention-based encoding — understanding context, not just sounds.

🧠

RNNT decoder streams text

The Recurrent Neural Network Transducer (RNNT) decoder produces text tokens in real-time — designed for streaming from day one.
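
For the curious, the streaming idea behind a transducer decoder can be sketched in a few lines. This is a simplified illustration of greedy transducer decoding, not VoxBar's or NeMo's actual code: `toy_joint`, the integer token ids, and the choice of 0 as the blank id are all made up for the example, and a real joint network scores every token from encoder and predictor outputs rather than following a fixed rule.

```python
from typing import Callable, List, Sequence

BLANK = 0  # assumed blank token id for this sketch


def greedy_transducer_decode(
    frames: Sequence[int],
    joint: Callable[[int, List[int]], int],
    max_symbols_per_frame: int = 5,
) -> List[int]:
    """For each audio frame, keep emitting tokens until the joint returns blank."""
    emitted: List[int] = []
    for frame in frames:
        for _ in range(max_symbols_per_frame):
            token = joint(frame, emitted)
            if token == BLANK:
                break  # blank means: nothing more here, move to the next frame
            emitted.append(token)  # text is available as soon as it is emitted
    return emitted


def toy_joint(frame: int, history: List[int]) -> int:
    """Hypothetical stand-in for a joint network.

    Emits the frame's label once, then blank. A frame value of 0 is silence.
    """
    if frame != BLANK and (not history or history[-1] != frame):
        return frame
    return BLANK


tokens = greedy_transducer_decode([0, 3, 3, 0, 7, 0], toy_joint)
print(tokens)
```

Because tokens are appended frame by frame, the caller never has to wait for the end of the recording before showing text.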

✍️

Words appear as you speak

Transcription flows directly into your textbox with low latency, governed by the chunk size you choose (80ms at the smallest setting). Smart autocorrect cleans up spacing and formatting during natural pauses. The smallest, fastest engine in the VoxBar lineup.

Accuracy & Speed

Metric	Value
Arena Score	8.35 combined (system audio 8.7 / microphone 8.0)
Architecture	FastConformer (24-layer encoder) + RNNT decoder
Chunk Sizes	Configurable — 80ms, 160ms, 560ms, 1120ms
Language	English
Punctuation	Automatic — natively generated by the model
Capitalisation	Automatic, intelligent
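
To put the chunk sizes in context, this small sketch converts each one into a sample count at the 16kHz capture rate. The function and constant names are illustrative only and are not part of VoxBar.

```python
SAMPLE_RATE_HZ = 16_000  # capture rate stated on this page
CHUNK_SIZES_MS = (80, 160, 560, 1120)  # chunk sizes from the table above


def samples_per_chunk(chunk_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in one chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000


for ms in CHUNK_SIZES_MS:
    print(f"{ms:>4} ms -> {samples_per_chunk(ms):>5} samples")
```

An 80ms chunk holds 1,280 samples and a 1120ms chunk holds 17,920. Smaller chunks mean text can appear sooner, while larger chunks give the model more audio to work with per step.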

Memory & Resource Footprint

Resource	Usage	Behaviour Over Time
GPU VRAM	~4.8GB (Nemotron 0.6B)	Stable — lightweight, barely touches your GPU
RAM	~1-2GB (Python process)	Stable
Disk	Zero temp files	Audio processed in memory, never written to disk
Network	None	Fully offline — no internet required

Recording Limits

♾️

No Recording Limit

VoxBar Nemotron processes each audio chunk independently — no state accumulates, no context window fills up. GPU memory stays fixed at ~4.8GB. Record for hours without interruption.
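
As a rough illustration of why chunked processing keeps memory flat, the sketch below regroups incoming audio blocks of any size into fixed-size chunks. It is an assumption about how such a pipeline could be written, not VoxBar's source: `fixed_chunks` is a hypothetical helper, and plain Python lists stand in for real audio buffers.

```python
from typing import Iterable, Iterator, List


def fixed_chunks(
    blocks: Iterable[List[float]], chunk_samples: int
) -> Iterator[List[float]]:
    """Regroup arbitrary-size audio blocks into fixed-size chunks.

    Only the samples not yet handed off are kept, so the pending buffer
    stays bounded no matter how long the recording runs.
    """
    pending: List[float] = []
    for block in blocks:
        pending.extend(block)
        while len(pending) >= chunk_samples:
            yield pending[:chunk_samples]
            pending = pending[chunk_samples:]  # drop what was handed off


# Three 1000-sample blocks regrouped into 1280-sample (80ms at 16kHz) chunks.
blocks = [[0.0] * 1000 for _ in range(3)]
chunks = list(fixed_chunks(blocks, 1280))
print(len(chunks), [len(c) for c in chunks])
```

In this sketch the trailing 440 samples stay pending until more audio arrives; a real application would decide whether to pad or flush them when recording stops.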

⏱️

Auto-Stop Behaviour

Silence timeout: 15 minutes (900 seconds) of no detected speech triggers auto-stop.
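
The timeout rule can be pictured as a timer that resets whenever speech is detected. The sketch below is a hypothetical illustration, not VoxBar's implementation: the `SilenceTimer` class is invented for the example, and timestamps are passed in explicitly so the logic is easy to follow.

```python
SILENCE_TIMEOUT_S = 900.0  # 15 minutes, as stated above


class SilenceTimer:
    """Tracks time since the last detected speech."""

    def __init__(self, timeout_s: float = SILENCE_TIMEOUT_S, start: float = 0.0):
        self.timeout_s = timeout_s
        self.last_speech = start

    def update(self, now: float, speech_detected: bool) -> bool:
        """Record the latest observation; return True when auto-stop should fire."""
        if speech_detected:
            self.last_speech = now  # any speech resets the countdown
        return (now - self.last_speech) >= self.timeout_s


timer = SilenceTimer()
spoke = timer.update(now=100.0, speech_detected=True)         # speech at t=100s
still_quiet = timer.update(now=999.9, speech_detected=False)  # 899.9s of silence
timed_out = timer.update(now=1000.0, speech_detected=False)   # 900s of silence
print(spoke, still_quiet, timed_out)
```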

Why VoxBar Nemotron Is Different

What you DON'T need

✖ No internet connection: everything runs locally

✖ No cloud processing: your voice never leaves your machine

✖ No Docker required: runs natively with Python + CUDA

✖ No usage limits: unlimited transcription, forever

✖ No subscriptions: one-time purchase, lifetime license

What makes it unique

✔ NVIDIA-engineered model: created by the NVIDIA NeMo team

✔ System audio capture: transcribe meetings, YouTube, podcasts directly from your PC's audio output

✔ Ultra-lightweight: just 600M parameters, runs on 4.8GB VRAM

✔ Streaming-native: FastConformer-RNNT was designed for real-time from day one

✔ Low resource usage: barely touches your GPU, great for multitasking

✔ Overlay Mode: transparent overlay sits on top of any app with adjustable transparency and font sizes

✔ Mid-text editing: click anywhere in your text to insert new speech at that position

✔ Voice commands: say "delete" to remove highlighted text, use voice punctuation and formatting

Hardware Requirements

Requirement	Minimum	Recommended
GPU (NVIDIA)	5GB VRAM	6GB+ VRAM
RAM	8GB	16GB
Disk	~2GB (model + app)	SSD
OS	Windows 10/11	Windows 11
Software	Python 3.10+ / CUDA	Included in installer

Note: VoxBar Nemotron requires an NVIDIA GPU with CUDA support. AMD and Apple Silicon are not currently supported.
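
If you want to check your GPU before buying, one way is to ask `nvidia-smi` (installed with the NVIDIA driver) for total memory. The sketch below is not part of VoxBar. It assumes the 5GB minimum can be read as 5 x 1024 MiB, and that `nvidia-smi` prints one plain number per GPU, which may not hold on every system.

```python
import subprocess
from typing import List

MIN_VRAM_MIB = 5 * 1024  # assumption: the 5GB minimum read as 5 x 1024 MiB


def parse_vram_mib(smi_output: str) -> List[int]:
    """Parse one total-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_vram_mib() -> List[int]:
    """Ask nvidia-smi for each GPU's total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


def meets_minimum(vram_mib: List[int], minimum: int = MIN_VRAM_MIB) -> bool:
    """True if at least one GPU has the minimum amount of VRAM."""
    return any(v >= minimum for v in vram_mib)
```

On a machine with an NVIDIA driver installed, `meets_minimum(query_vram_mib())` gives a quick yes or no. Without a driver, `query_vram_mib()` raises an error because `nvidia-smi` cannot be found.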

License & Attribution

VoxBar™ Nemotron is powered by Nemotron Speech Streaming EN 0.6B, created by the NVIDIA NeMo team and released under the NVIDIA Open Model License.

VoxBar™ is an independent product by Conjure Labs Limited and is not affiliated with, endorsed by, or sponsored by NVIDIA Corporation.

Nemotron vs GLM

Feature	Nemotron	GLM
Arena Score	8.35 combined	8.0 combined
VRAM	~4.8GB	~4GB
Architecture	FastConformer + RNNT	LLM-based (generative)
Languages	English	17 languages
Best for	Highest mid-tier accuracy, NVIDIA-optimised	Multilingual needs, lightest GPU footprint

One of the lightest engines in the VoxBar lineup.

One-time purchase. Lifetime license. 2 machines. Zero cloud.

Coming Soon

Secure checkout via Lemon Squeezy / Stripe