How It Works
VoxBar GLM uses Z.ai's GLM-ASR-Nano-2512 — a 1.5 billion parameter generative speech recognition model built on the GLM (General Language Model) architecture. Unlike traditional CTC or encoder-only ASR models, GLM-ASR uses a multimodal chat-template API — it processes audio like a conversation, producing remarkably natural and contextually aware transcriptions.
Here's what happens, step by step:
- Opens your microphone via sounddevice — captures audio at 16kHz, 1024-sample blocks
- Buffers 5 seconds of audio into a small in-memory buffer
- Checks for silence — if the RMS energy is below 0.03, the chunk is skipped (tuned to filter ambient noise)
- Writes a tiny temp WAV file to your system temp folder
- Sends the WAV to the GLM-ASR subprocess via a multimodal chat message:
{"type": "audio", "url": "file.wav"} - The model generates a transcription using
model.generate()— like an LLM, it produces text token by token - Post-processing pipeline cleans the output: stutter removal → boundary dedup → hallucination filter → cross-chunk repeat detection
- Temp file is immediately deleted — nothing accumulates on disk
- Text is appended to your textbox
- Repeats forever — each chunk is independent
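The capture-and-chunk loop above can be sketched with standard-library pieces. The real app captures via sounddevice; the helper names, the synthetic audio, and the exact WAV-writing code here are illustrative, with only the 16kHz rate and the 0.03 RMS threshold taken from the steps above:

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16_000     # matches the 16kHz capture rate above
SILENCE_RMS = 0.03       # chunks below this RMS energy are skipped

def rms(samples):
    """Root-mean-square energy of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def write_temp_wav(samples):
    """Write one chunk to a temp WAV (16-bit mono); the caller deletes it."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        ints = (max(-32768, min(32767, int(s * 32767))) for s in samples)
        wav.writeframes(b"".join(struct.pack("<h", i) for i in ints))
    return path

# Synthetic stand-ins for a real 5-second microphone chunk:
quiet = [0.001] * SAMPLE_RATE    # ambient noise -> skipped by the silence check
speech = [0.2] * SAMPLE_RATE     # loud enough -> sent for transcription

skipped = rms(quiet) < SILENCE_RMS
path = write_temp_wav(speech)
# ...hand `path` to the transcription step, then clean up immediately:
os.remove(path)
```

Deleting the temp file as soon as the chunk is transcribed is what keeps disk usage at zero accumulation over long sessions.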
The key difference from traditional ASR models: GLM-ASR is a generative model. It doesn't just map audio to text — it "understands" the audio and produces natural language output. This gives it better handling of context, punctuation, and natural speech patterns, but it also means it can occasionally hallucinate creative descriptions of background noise (which our tuning filters address).
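Two of the post-processing stages named above, stutter removal and boundary dedup, could look roughly like this. The function names and heuristics are illustrative sketches, not VoxBar's actual code:

```python
def remove_stutters(text):
    """Collapse immediate word repeats ('the the the' -> 'the')."""
    out = []
    for word in text.split():
        if not out or word.lower() != out[-1].lower():
            out.append(word)
    return " ".join(out)

def dedup_boundary(prev_tail, text, max_overlap=5):
    """Drop words at the start of a chunk that repeat the previous chunk's tail."""
    prev = prev_tail.lower().split()
    words = text.split()
    for n in range(min(max_overlap, len(prev), len(words)), 0, -1):
        if [w.lower() for w in words[:n]] == prev[-n:]:
            return " ".join(words[n:])
    return text

cleaned = remove_stutters("I I I think think this works")
merged = dedup_boundary("so that is the plan", "the plan is simple")
```

Boundary dedup matters because fixed 5-second chunks can cut mid-phrase, so the model sometimes re-emits words it already produced at the end of the previous chunk.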
Recording Limits
VoxBar GLM Has No Recording Limit
VoxBar GLM runs natively on your machine with no Docker, no WebSocket, and no server process. Each 5-second chunk is completely independent — the model processes it, the temp file is deleted, and it moves on.
Why It Runs Forever
- Each chunk is self-contained — no state carries between chunks
- GPU memory is fixed at ~4GB — the model loads once and stays resident
- No network connections, no Docker, no server processes
- The subprocess architecture releases the GIL, so the UI stays responsive
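The subprocess split can be sketched as a simple line-based protocol: the UI process writes a WAV path to the worker's stdin and reads one transcript line back, so model inference never blocks the UI's Python interpreter. The protocol and the echo worker below are hypothetical; a real worker would load the model once and transcribe each path:

```python
import subprocess
import sys

WORKER = r"""
import sys
for line in sys.stdin:
    path = line.strip()
    # a real worker would run the model here; we echo to show the protocol
    print(f"transcript-for:{path}", flush=True)
"""

proc = subprocess.Popen(
    [sys.executable, "-c", WORKER],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write("chunk_0001.wav\n")
proc.stdin.flush()
result = proc.stdout.readline().strip()
proc.stdin.close()
proc.wait()
```

Because the worker is a separate OS process, it has its own GIL, so heavy inference work cannot freeze the UI thread.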
Auto-Stop Behaviour
- Silence timeout: 15 minutes of no detected speech
- Check interval: Every 5 seconds
- Designed for long sessions — meetings, lectures, extended dictation
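The auto-stop logic above reduces to a simple timer check: record when speech was last detected, and stop once the silence timeout elapses. A minimal sketch with a simulated clock (the function name and loop are illustrative):

```python
SILENCE_TIMEOUT = 15 * 60   # 15 minutes of no detected speech
CHECK_INTERVAL = 5          # checked every 5 seconds, matching the chunk cadence

def should_stop(last_speech_time, now):
    """True once no speech has been detected for the full timeout."""
    return (now - last_speech_time) >= SILENCE_TIMEOUT

# Simulated clock: speech last heard at t=0, then total silence.
now = 0
last_speech = 0
while not should_stop(last_speech, now):
    now += CHECK_INTERVAL
elapsed = now   # seconds of silence before auto-stop fires
```

Any chunk that passes the silence check would reset `last_speech`, so normal pauses in a meeting never trigger the stop.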
Memory & Resource Footprint
| Resource | Usage | Behaviour Over Time |
|---|---|---|
| GPU VRAM | ~4GB fixed | ✅ Never grows — loads once at startup |
| RAM | ~500MB (Python process + transformers) | ✅ Stable |
| Disk | Zero accumulation | ✅ Temp WAV files deleted immediately after each chunk |
| Network | None (after first model download) | ✅ Completely offline |
First launch downloads ~3GB of model weights from HuggingFace. After that, the model is cached locally and VoxBar GLM runs entirely offline.
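One way to make the offline behaviour explicit is to check for cached weights and then force offline mode. `HF_HOME` and `HF_HUB_OFFLINE` are real Hugging Face environment variables, but the cache-probe pattern below is a hypothetical sketch; VoxBar's installer may manage the cache differently:

```python
import os
from pathlib import Path

# Default Hugging Face cache location, overridable via HF_HOME
cache = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))

# Probe for the GLM-ASR weights anywhere under the cache (pattern is illustrative)
model_cached = cache.exists() and any(cache.rglob("*GLM-ASR*"))

if model_cached:
    # Weights are present: forbid all network calls to the Hub
    os.environ["HF_HUB_OFFLINE"] = "1"
```

With `HF_HUB_OFFLINE=1` set, the transformers loader raises immediately instead of attempting a download, which guarantees the app never silently goes online.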
Architecture Advantage
What makes VoxBar GLM special: It's a generative multimodal model, not a traditional speech recogniser. The same architecture that powers modern LLMs is applied to speech:
- Generative transcription — produces natural, contextually aware text with proper sentence structure
- 17-language support — English, Mandarin, Cantonese, Japanese, Korean, German, French, Spanish, Italian, Portuguese, Dutch, Russian, Arabic, Hindi, Thai, Vietnamese, Indonesian
- Exceptional dialect handling — particularly strong on Cantonese and other Chinese dialects
- Low-volume speech robustness — specifically trained to capture quiet/whispered speech that other models miss
- Beats Whisper V3 — 4.10 average error rate across benchmarks
- BF16 precision — memory-efficient inference using bfloat16
What users DON'T have to worry about:
- ❌ No Docker required — runs natively
- ❌ No internet connection — completely offline (after first download)
- ❌ No API keys — the model runs locally
- ❌ No usage limits — unlimited transcription, forever
What users DO need to know:
- ⚠️ Text arrives in chunks (every ~5 seconds)
- ⚠️ 4GB VRAM required — runs on most NVIDIA GPUs (GTX 1060+)
- ⚠️ Generative model — slightly higher latency per chunk than CTC models, but more accurate
- ⚠️ First launch downloads ~3GB model files (cached after that)
Accuracy & Speed
| Metric | Value |
|---|---|
| Delivery | Chunked — text appears every ~5 seconds |
| Latency | ~2-4 seconds processing time per chunk |
| Average Error Rate | 4.10 (across benchmarks — beats Whisper V3) |
| Punctuation | Yes — generated naturally by the model |
| Capitalisation | Yes — generated naturally by the model |
| Languages | 17 (English, Mandarin, Cantonese + 14 more) |
| Dialect Support | Exceptional — Cantonese, regional Chinese dialects |
| Low-Volume Speech | Specifically optimised — captures whispered audio |
Unique Strength: Whisper/Quiet Speech
GLM-ASR-Nano was specifically trained for low-volume audio scenarios. In testing, it reliably transcribes speech that other models (including Whisper V3) fail to detect. This makes it ideal for:
- Quiet office environments
- Soft-spoken users
- Distant microphone setups
- Meeting recordings with varying speaker volumes
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 4GB VRAM | NVIDIA with 6GB+ VRAM |
| GPU (AMD) | ❌ Not supported | — |
| GPU (Apple) | ❌ Not supported | — |
| RAM | 16GB | 16GB+ |
| Disk | ~3GB for model (cached in ~/.cache) | SSD recommended |
| OS | Windows 10/11 | Windows 11 |
| Software | Python 3.10+ / CUDA | Included in installer |
| Docker | ❌ Not required | — |
License & Attribution
| Detail | Value |
|---|---|
| Model | zai-org/GLM-ASR-Nano-2512 |
| Creator | Z.ai (Zhipu AI) |
| License | Apache 2.0 (commercially usable, no attribution required) |
| Attribution | Not required |
| Distribution | Can be bundled and sold commercially |
Where It Fits in the Suite
| Feature | VoxBar Pro | VoxBar Kyutai 2.6B | VoxBar GLM |
|---|---|---|---|
| Accuracy | ★★★★★ | ★★★★★ | ★★★★★ |
| Text delivery | Real-time | Every ~4s | Every ~5s |
| VRAM | ~8-10GB | ~5.4GB | ~4GB |
| Docker | Yes | No | No |
| Languages | Multi | English only | 17 languages |
| Dialect support | Limited | None | Exceptional (Cantonese, Chinese dialects) |
| Quiet speech | Standard | Standard | Specifically optimised |
| Best for | Live presentations | Meeting transcription | Multilingual / dialect / quiet speech |
Bottom line: VoxBar GLM is the multilingual accuracy champion. If you work across languages — especially Chinese, Cantonese, Japanese, Korean, or any of the 17 supported languages — VoxBar GLM delivers Whisper-beating accuracy with exceptional dialect handling and whispered speech capture. It's the model you reach for when standard ASR can't hear what was said.