⚡ 8.0 Arena — Multilingual Specialist

VoxBar GLM

Whisper-beating accuracy across 17 languages. Hears what other models miss.

$39 one-time · 🔥 LAUNCH: $19.50 with code EARLYBIRD

🎤 Transcribe your mic — or 🔊 listen to your system audio (meetings, podcasts, videos). All 100% local. Nothing leaves your machine.

How It Works

VoxBar GLM uses Z.ai's GLM-ASR-Nano-2512 — a 1.5 billion parameter generative speech recognition model built on the GLM (General Language Model) architecture. Unlike traditional CTC or encoder-only ASR models, GLM-ASR uses a multimodal chat-template API — it processes audio like a conversation, producing remarkably natural and contextually aware transcriptions.

Here's what happens, step by step:

  1. Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
  2. Buffers 5 seconds of audio into a small in-memory buffer
  3. Checks for silence — if the RMS energy is below 0.03, the chunk is skipped (tuned to filter ambient noise)
  4. Writes a tiny temp WAV file to your system temp folder
  5. Sends the WAV to the GLM-ASR subprocess via a multimodal chat message: {"type": "audio", "url": "file.wav"}
  6. The model generates a transcription using model.generate() — like an LLM, it produces text token by token
  7. Post-processing pipeline cleans the output: stutter removal → boundary dedup → hallucination filter → cross-chunk repeat detection
  8. Temp file is immediately deleted — nothing accumulates on disk
  9. Text is appended to your textbox
  10. Repeats forever — each chunk is independent
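The loop above can be sketched in Python. This is a minimal sketch, not the shipped implementation: `transcribe_chunk` is a hypothetical stub standing in for the GLM-ASR subprocess call, and microphone capture (step 1, via sounddevice) is replaced by an iterable of audio chunks so the control flow stays self-contained.

```python
import os
import tempfile
import wave

import numpy as np

SAMPLE_RATE = 16_000   # 16kHz capture, as in step 1
CHUNK_SECONDS = 5      # 5-second buffers, as in step 2
SILENCE_RMS = 0.03     # RMS gate from step 3

def is_silent(chunk: np.ndarray, threshold: float = SILENCE_RMS) -> bool:
    """Step 3: skip chunks whose RMS energy falls below the gate."""
    rms = float(np.sqrt(np.mean(np.square(chunk))))
    return rms < threshold

def write_temp_wav(chunk: np.ndarray) -> str:
    """Step 4: dump the float chunk as 16-bit mono WAV in the temp folder."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes((chunk * 32767).astype(np.int16).tobytes())
    return path

def transcribe_chunk(wav_path: str) -> str:
    """Steps 5-6 placeholder: hand the WAV to the model subprocess as a
    multimodal chat message and return the generated text."""
    message = [{"role": "user",
                "content": [{"type": "audio", "url": wav_path}]}]
    # ... send `message` to the GLM-ASR subprocess, collect model.generate() output
    return ""

def run(chunks):
    """Steps 3-10: process each independent chunk, clean up, repeat."""
    for chunk in chunks:
        if is_silent(chunk):
            continue                 # step 3: drop ambient noise
        path = write_temp_wav(chunk)
        try:
            text = transcribe_chunk(path)
        finally:
            os.remove(path)          # step 8: nothing accumulates on disk
        if text:
            yield text               # step 9: append to the textbox
```

Because every iteration writes, transcribes, and deletes its own temp file, no state survives from one chunk to the next.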

The key difference from traditional ASR models: GLM-ASR is a generative model. It doesn't just map audio to text — it "understands" the audio and produces natural language output. This gives it better handling of context, punctuation, and natural speech patterns, but it also means it can occasionally hallucinate creative descriptions of background noise (which our tuning filters address).
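Two of the cleanup stages named in step 7 (stutter removal and cross-chunk repeat detection) can be illustrated with deliberately simple, hypothetical implementations; the shipped filters are presumably more sophisticated:

```python
def strip_stutter(text: str) -> str:
    """Stutter removal: collapse immediately repeated words ("the the" -> "the")."""
    out: list[str] = []
    for word in text.split():
        if not out or word.lower() != out[-1].lower():
            out.append(word)
    return " ".join(out)

def drop_cross_chunk_repeat(previous: str, current: str) -> str:
    """Cross-chunk repeat detection: if the new chunk starts by re-emitting
    the tail of the previous chunk, trim the overlapping words."""
    prev_words = previous.lower().split()
    cur_words = current.split()
    for overlap in range(min(len(prev_words), len(cur_words)), 0, -1):
        if prev_words[-overlap:] == [w.lower() for w in cur_words[:overlap]]:
            return " ".join(cur_words[overlap:])
    return current
```

Filters like these matter more for a generative model than for a CTC one, since generation can repeat itself at chunk boundaries.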

Recording Limits

VoxBar GLM Has No Recording Limit

VoxBar GLM runs natively on your machine with no Docker, no WebSocket, and no server process. Each 5-second chunk is completely independent — the model processes it, the temp file is deleted, and it moves on.

Why It Runs Forever

  • Each chunk is self-contained — no state carries between chunks
  • GPU memory is fixed at ~4GB — the model loads once and stays resident
  • No network connections, no Docker, no server processes
  • The subprocess architecture runs inference in a separate process, so the Python GIL never blocks the UI and it stays responsive
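The separate-process design can be sketched with Python's standard multiprocessing module: a worker process would load the model once and then serve requests over queues, so the UI process only ever waits on a queue. The worker below stubs out the actual model load and `model.generate()` call.

```python
import multiprocessing as mp

def asr_worker(requests: mp.Queue, results: mp.Queue) -> None:
    """Runs in its own process: load the model once, then serve WAV paths
    until a None sentinel arrives. Real inference is stubbed out here."""
    # model = load_glm_asr()   # hypothetical one-time ~4GB load
    while True:
        wav_path = requests.get()
        if wav_path is None:     # sentinel: shut down cleanly
            break
        results.put(f"transcript for {wav_path}")  # stubbed model.generate()

if __name__ == "__main__":
    requests, results = mp.Queue(), mp.Queue()
    worker = mp.Process(target=asr_worker, args=(requests, results))
    worker.start()
    # The UI side only touches queues, so it never waits on inference itself.
    requests.put("chunk_001.wav")
    print(results.get())
    requests.put(None)           # tell the worker to exit
    worker.join()
```

Because the worker owns the model, a crash or restart of the worker never takes the UI down with it.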

Auto-Stop Behaviour

  • Silence timeout: 15 minutes of no detected speech
  • Check interval: Every 5 seconds
  • Designed for long sessions — meetings, lectures, extended dictation
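The timeout behaviour above can be sketched as a small tracker polled once per chunk. The injectable clock (`now`) is only there to make the sketch testable; the real app presumably just reads the system clock.

```python
import time

SILENCE_TIMEOUT = 15 * 60   # auto-stop after 15 minutes without speech
CHECK_INTERVAL = 5          # one check per 5-second chunk

class AutoStop:
    """Remembers when speech was last detected and signals shutdown
    once the silence timeout has elapsed."""

    def __init__(self, timeout: float = SILENCE_TIMEOUT, now=time.monotonic):
        self._timeout = timeout
        self._now = now
        self._last_speech = now()

    def heard_speech(self) -> None:
        """Call whenever a chunk passes the silence gate."""
        self._last_speech = self._now()

    def should_stop(self) -> bool:
        """Poll every CHECK_INTERVAL seconds (i.e. once per chunk)."""
        return self._now() - self._last_speech >= self._timeout
```

Any chunk that contains speech resets the 15-minute window, which is what makes the model safe to leave running through long meetings or lectures.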

Memory & Resource Footprint

| Resource | Usage | Behaviour over time |
|---|---|---|
| GPU VRAM | ~4GB fixed | ✅ Never grows — loads once at startup |
| RAM | ~500MB (Python process + transformers) | ✅ Stable |
| Disk | Zero accumulation | ✅ Temp WAV files deleted immediately after each chunk |
| Network | None (after first model download) | ✅ Completely offline |

First launch downloads ~3GB of model weights from HuggingFace. After that, the model is cached locally and VoxBar GLM runs entirely offline.

Architecture Advantage

What makes VoxBar GLM special: It's a generative multimodal model, not a traditional speech recogniser. The same architecture that powers modern LLMs is applied to speech:

  • Generative transcription — produces natural, contextually aware text with proper sentence structure
  • 17-language support — English, Mandarin, Cantonese, Japanese, Korean, German, French, Spanish, Italian, Portuguese, Dutch, Russian, Arabic, Hindi, Thai, Vietnamese, Indonesian
  • Exceptional dialect handling — particularly strong on Cantonese and other Chinese dialects
  • Low-volume speech robustness — specifically trained to capture quiet/whispered speech that other models miss
  • Beats Whisper V3 — 4.10 average error rate across benchmarks
  • BF16 precision — memory-efficient inference using bfloat16

What users DON'T have to worry about:
- ❌ No Docker required — runs natively
- ❌ No internet connection — completely offline (after first download)
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever

What users DO need to know:
- ⚠️ Text arrives in chunks (every ~5 seconds)
- ⚠️ 4GB VRAM required — runs on most NVIDIA GPUs (GTX 1060+)
- ⚠️ Generative model — slightly higher latency per chunk than CTC models, but more accurate
- ⚠️ First launch downloads ~3GB model files (cached after that)

Accuracy & Speed

| Metric | Value |
|---|---|
| Delivery | Chunked — text appears every ~5 seconds |
| Latency | ~2-4 seconds processing time per chunk |
| Average error rate | 4.10 (across benchmarks — beats Whisper V3) |
| Punctuation | Yes — generated naturally by the model |
| Capitalisation | Yes — generated naturally by the model |
| Languages | 17 (English, Mandarin, Cantonese + 14 more) |
| Dialect support | Exceptional — Cantonese, regional Chinese dialects |
| Low-volume speech | Specifically optimised — captures whispered audio |

Unique Strength: Whispered/Quiet Speech

GLM-ASR-Nano was specifically trained for low-volume audio scenarios. In testing, it reliably transcribes speech that other models (including Whisper V3) fail to detect. This makes it ideal for:
- Quiet office environments
- Soft-spoken users
- Distant microphone setups
- Meeting recordings with varying speaker volumes

Hardware Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 4GB VRAM | NVIDIA with 6GB+ VRAM |
| GPU (AMD) | ❌ Not supported | ❌ Not supported |
| GPU (Apple) | ❌ Not supported | ❌ Not supported |
| RAM | 16GB | 16GB+ |
| Disk | ~3GB for model (cached in ~/.cache) | SSD recommended |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.10+ / CUDA | Included in installer |
| Docker | ❌ Not required | ❌ Not required |

License & Attribution

| Detail | Value |
|---|---|
| Model | zai-org/GLM-ASR-Nano-2512 |
| Creator | Z.ai (Zhipu AI) |
| License | Apache 2.0 (commercially usable, no attribution required) |
| Attribution | Not required |
| Distribution | Can be bundled and sold commercially |

Where It Fits in the Suite

| Feature | VoxBar Pro | VoxBar Kyutai 2.6B | VoxBar GLM |
|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ |
| Text delivery | Real-time | Every ~4s | Every ~5s |
| VRAM | ~8-10GB | ~5.4GB | ~4GB |
| Docker | Yes | No | No |
| Languages | Multi | English only | 17 languages |
| Dialect support | Limited | None | Exceptional (Cantonese, Chinese dialects) |
| Quiet speech | Standard | Standard | Specifically optimised |
| Best for | Live presentations | Meeting transcription | Multilingual / dialect / quiet speech |

Bottom line: VoxBar GLM is the multilingual accuracy champion. If you work across languages — especially Chinese, Cantonese, Japanese, Korean, or any of the 17 supported languages — VoxBar GLM delivers Whisper-beating accuracy with exceptional dialect handling and whispered speech capture. It's the model you reach for when standard ASR can't hear what was said.