How It Works
VoxBar GLM uses Z.ai's GLM-ASR-Nano-2512 — a 1.5 billion parameter generative speech recognition model built on the GLM (General Language Model) architecture. Unlike traditional CTC or encoder-only ASR models, GLM-ASR uses a multimodal chat-template API — it processes audio like a conversation, producing remarkably natural and contextually aware transcriptions.
Here's what happens, step by step:
- Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
- Buffers 5 seconds of audio into a small in-memory buffer
- Checks for silence — if the RMS energy is below 0.03, the chunk is skipped (tuned to filter ambient noise)
- Writes a tiny temp WAV file to your system temp folder
- Sends the WAV to the GLM-ASR subprocess via a multimodal chat message:
{"type": "audio", "url": "file.wav"} - The model generates a transcription using
model.generate()— like an LLM, it produces text token by token - Post-processing pipeline cleans the output: stutter removal → boundary dedup → hallucination filter → cross-chunk repeat detection
- Temp file is immediately deleted — nothing accumulates on disk
- Text is appended to your textbox
- Repeats forever — each chunk is independent
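The capture-and-chunk loop above can be sketched with standard-library pieces. The real app captures via sounddevice; the helper names, the synthetic audio, and the exact WAV-writing code here are illustrative, with only the 16kHz rate and the 0.03 RMS threshold taken from the steps above:

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16_000     # matches the 16kHz capture rate above
SILENCE_RMS = 0.03       # chunks below this RMS energy are skipped

def rms(samples):
    """Root-mean-square energy of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def write_temp_wav(samples):
    """Write one chunk to a temp WAV (16-bit mono); the caller deletes it."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        ints = (max(-32768, min(32767, int(s * 32767))) for s in samples)
        wav.writeframes(b"".join(struct.pack("<h", i) for i in ints))
    return path

# Synthetic stand-ins for a real 5-second microphone chunk:
quiet = [0.001] * SAMPLE_RATE    # ambient noise -> skipped by the silence check
speech = [0.2] * SAMPLE_RATE     # loud enough -> sent for transcription

skipped = rms(quiet) < SILENCE_RMS
path = write_temp_wav(speech)
# ...hand `path` to the transcription step, then clean up immediately:
os.remove(path)
```

Deleting the temp file as soon as the chunk is transcribed is what keeps disk usage at zero accumulation over long sessions.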
The key difference from traditional ASR models: GLM-ASR is a generative model. It doesn't just map audio to text — it "understands" the audio and produces natural language output. This gives it better handling of context, punctuation, and natural speech patterns, but it also means it can occasionally hallucinate creative descriptions of background noise (which our tuning filters address).
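Two of the post-processing stages named above, stutter removal and boundary dedup, could look roughly like this. The function names and heuristics are illustrative sketches, not VoxBar's actual code:

```python
def remove_stutters(text):
    """Collapse immediate word repeats ('the the the' -> 'the')."""
    out = []
    for word in text.split():
        if not out or word.lower() != out[-1].lower():
            out.append(word)
    return " ".join(out)

def dedup_boundary(prev_tail, text, max_overlap=5):
    """Drop words at the start of a chunk that repeat the previous chunk's tail."""
    prev = prev_tail.lower().split()
    words = text.split()
    for n in range(min(max_overlap, len(prev), len(words)), 0, -1):
        if [w.lower() for w in words[:n]] == prev[-n:]:
            return " ".join(words[n:])
    return text

cleaned = remove_stutters("I I I think think this works")
merged = dedup_boundary("so that is the plan", "the plan is simple")
```

Boundary dedup matters because fixed 5-second chunks can cut mid-phrase, so the model sometimes re-emits words it already produced at the end of the previous chunk.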
Recording Limits
VoxBar GLM Has No Recording Limit
VoxBar GLM runs natively on your machine with no Docker, no WebSocket, and no server process. Each 5-second chunk is completely independent — the model processes it, the temp file is deleted, and it moves on.
Why It Runs Forever
- Each chunk is self-contained — no state carries between chunks
- GPU memory is fixed at ~4GB — the model loads once and stays resident
- No network connections, no Docker, no server processes
- The subprocess architecture releases the GIL, so the UI stays responsive
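The subprocess split can be sketched as a simple line-based protocol: the UI process writes a WAV path to the worker's stdin and reads one transcript line back, so model inference never blocks the UI's Python interpreter. The protocol and the echo worker below are hypothetical; a real worker would load the model once and transcribe each path:

```python
import subprocess
import sys

WORKER = r"""
import sys
for line in sys.stdin:
    path = line.strip()
    # a real worker would run the model here; we echo to show the protocol
    print(f"transcript-for:{path}", flush=True)
"""

proc = subprocess.Popen(
    [sys.executable, "-c", WORKER],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write("chunk_0001.wav\n")
proc.stdin.flush()
result = proc.stdout.readline().strip()
proc.stdin.close()
proc.wait()
```

Because the worker is a separate OS process, it has its own GIL, so heavy inference work cannot freeze the UI thread.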
Auto-Stop Behaviour
- Silence timeout: 15 minutes of no detected speech
- Check interval: Every 5 seconds
- Designed for long sessions — meetings, lectures, extended dictation
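The auto-stop logic above reduces to a simple timer check: record when speech was last detected, and stop once the silence timeout elapses. A minimal sketch with a simulated clock (the function name and loop are illustrative):

```python
SILENCE_TIMEOUT = 15 * 60   # 15 minutes of no detected speech
CHECK_INTERVAL = 5          # checked every 5 seconds, matching the chunk cadence

def should_stop(last_speech_time, now):
    """True once no speech has been detected for the full timeout."""
    return (now - last_speech_time) >= SILENCE_TIMEOUT

# Simulated clock: speech last heard at t=0, then total silence.
now = 0
last_speech = 0
while not should_stop(last_speech, now):
    now += CHECK_INTERVAL
elapsed = now   # seconds of silence before auto-stop fires
```

Any chunk that passes the silence check would reset `last_speech`, so normal pauses in a meeting never trigger the stop.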
Memory & Resource Footprint
| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~4GB fixed | ✅ Never grows — loads once at startup |
| RAM | ~500MB (Python process + transformers) | ✅ Stable |
| Disk | Zero accumulation | ✅ Temp WAV files deleted immediately after each chunk |
| Network | None (after first model download) | ✅ Completely offline |
First launch downloads ~3GB of model weights from HuggingFace. After that, the model is cached locally and VoxBar GLM runs entirely offline.
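One way to make the offline behaviour explicit is to check for cached weights and then force offline mode. `HF_HOME` and `HF_HUB_OFFLINE` are real Hugging Face environment variables, but the cache-probe pattern below is a hypothetical sketch; VoxBar's installer may manage the cache differently:

```python
import os
from pathlib import Path

# Default Hugging Face cache location, overridable via HF_HOME
cache = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))

# Probe for the GLM-ASR weights anywhere under the cache (pattern is illustrative)
model_cached = cache.exists() and any(cache.rglob("*GLM-ASR*"))

if model_cached:
    # Weights are present: forbid all network calls to the Hub
    os.environ["HF_HUB_OFFLINE"] = "1"
```

With `HF_HUB_OFFLINE=1` set, the transformers loader raises immediately instead of attempting a download, which guarantees the app never silently goes online.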
Architecture Advantage
What makes VoxBar GLM special: It's a generative multimodal model, not a traditional speech recogniser. The same architecture that powers modern LLMs is applied to speech:
- Generative transcription — produces natural, contextually aware text with proper sentence structure
- 17-language support — English, Mandarin, Cantonese, Japanese, Korean, German, French, Spanish, Italian, Portuguese, Dutch, Russian, Arabic, Hindi, Thai, Vietnamese, Indonesian
- Exceptional dialect handling — particularly strong on Cantonese and other Chinese dialects
- Low-volume speech robustness — specifically trained to capture quiet/whispered speech that other models miss
- Beats Whisper V3 — 4.10 average error rate across benchmarks
- BF16 precision — memory-efficient inference using bfloat16
What users DON'T have to worry about:
- ❌ No Docker required — runs natively
- ❌ No internet connection — completely offline (after first download)
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever
What users DO need to know:
- ⚠️ Text arrives in chunks (every ~5 seconds)
- ⚠️ 4GB VRAM required — runs on most NVIDIA GPUs (GTX 1060+)
- ⚠️ Generative model — slightly higher latency per chunk than CTC models, but more accurate
- ⚠️ First launch downloads ~3GB model files (cached after that)
Accuracy & Speed
| Metric | Value |
|---|---|
| Delivery | Chunked — text appears every ~5 seconds |
| Latency | ~2-4 seconds processing time per chunk |
| Average Error Rate | 4.10 (across benchmarks — beats Whisper V3) |
| Punctuation | Yes — generated naturally by the model |
| Capitalisation | Yes — generated naturally by the model |
| Languages | 17 (English, Mandarin, Cantonese + 14 more) |
| Dialect Support | Exceptional — Cantonese, regional Chinese dialects |
| Low-Volume Speech | Specifically optimised — captures whispered audio |
Unique Strength: Whisper/Quiet Speech
GLM-ASR-Nano was specifically trained for low-volume audio scenarios. In testing, it reliably transcribes speech that other models (including Whisper V3) fail to detect. This makes it ideal for:
- Quiet office environments
- Soft-spoken users
- Distant microphone setups
- Meeting recordings with varying speaker volumes
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 4GB VRAM | NVIDIA with 6GB+ VRAM |
| GPU (AMD) | ❌ Not supported | — |
| GPU (Apple) | ❌ Not supported | — |
| RAM | 16GB | 16GB+ |
| Disk | ~3GB for model (cached in ~/.cache) | SSD recommended |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.10+ / CUDA | Included in installer |
| Docker | ❌ Not required | — |
License & Attribution
| Detail | Value |
|---|---|
| Model | zai-org/GLM-ASR-Nano-2512 |
| Creator | Z.ai (Zhipu AI) |
| License | Apache 2.0 (commercially usable, no attribution required) |
| Attribution | Not required |
| Distribution | Can be bundled and sold commercially |
Where It Fits in the Suite
| Feature | VoxBar Pro | VoxBar Kyutai 2.6B | VoxBar GLM |
|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ |
| Text delivery | Real-time | Every ~4s | Every ~5s |
| VRAM | ~8-10GB | ~5.4GB | ~4GB |
| Docker | Yes | No | No |
| Languages | Multi | English only | 17 languages |
| Dialect support | Limited | None | Exceptional (Cantonese, Chinese dialects) |
| Quiet speech | Standard | Standard | Specifically optimised |
| Best for | Live presentations | Meeting transcription | Multilingual / dialect / quiet speech |
Bottom line: VoxBar GLM is the multilingual accuracy champion. If you work across languages — especially Chinese, Cantonese, Japanese, Korean, or any of the 17 supported languages — VoxBar GLM delivers Whisper-beating accuracy with exceptional dialect handling and whispered speech capture. It's the model you reach for when standard ASR can't hear what was said.