The Rise of On-Device AI: Models Built to Run on Your Hardware
The smartest AI doesn't have to live in a data centre. A new generation of models is designed from the ground up to run on your laptop, your desktop, even your phone — privately, offline, and without permission from anyone.
The Shift Nobody Predicted
Two years ago, the AI industry was unanimous: bigger is better. More parameters, more GPUs, more data centres. The assumption was that useful AI would always live in the cloud, accessed through subscriptions, processed on hardware you'd never see.
That assumption is crumbling. In February 2026, some of the most popular models on Ollama are specifically designed to run on consumer hardware — your laptop, your desktop, your workstation. Not as a compromise. By design.
How Models Got Small Enough
Three breakthroughs made on-device AI possible:
- Mixture of Experts (MoE) — a decades-old idea recently pushed to scale by open models like Mixtral and DeepSeek. An MoE model can have hundreds of billions of parameters but activates only a fraction of them for each token. Big brain, small footprint (a toy routing sketch appears below).
- Quantisation — reducing the precision of model weights from 16- or 32-bit floats down to 4-bit or even 2-bit. The model takes up 4-8x less memory with minimal quality loss: a 14B model that needs about 28 GB at 16-bit precision fits in roughly 7 GB at 4-bit. The arithmetic is worked through just after this list.
- Distillation — training small models to mimic the behaviour of large ones. The "student" model learns the "teacher's" reasoning patterns, capturing most of the intelligence in a fraction of the size.
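To make those quantisation numbers concrete, here is a back-of-the-envelope sketch in Python. It counts weight storage only (real runtimes add KV cache and framework overhead); the 14B figure is the example from the list above:

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage: parameters x bits per weight, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 14e9  # the 14B-parameter example

for bits in (32, 16, 8, 4, 2):
    print(f"{bits:>2}-bit: {model_memory_gb(n_params, bits):5.1f} GB")

# 32-bit:  56.0 GB
# 16-bit:  28.0 GB   <- the "needs about 28 GB" figure
#  8-bit:  14.0 GB
#  4-bit:   7.0 GB   <- fits on a mid-range consumer GPU
#  2-bit:   3.5 GB
```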
Combined, these techniques mean that a 1-7 billion parameter model in 2026 can outperform models ten times its size from 2024. The intelligence hasn't shrunk — the packaging has.
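The MoE trick is easiest to see in code. Here is a deliberately tiny, hypothetical sketch (scalar "experts" standing in for real feed-forward networks, made-up router weights) showing the core mechanism: a router scores every expert, but only the top-k are ever executed, so compute scales with k rather than with the total expert count.

```python
import math
import random

NUM_EXPERTS = 8  # total experts in the layer
TOP_K = 2        # experts actually executed per input

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def expert(i, x):
    # Stand-in for a full feed-forward network; each expert is a different function.
    return (i + 1) * 0.1 * x

def moe_layer(x, router_weights):
    # The router scores all experts cheaply...
    gates = softmax([w * x for w in router_weights])
    # ...but only the TOP_K highest-scoring experts are run.
    top = sorted(range(NUM_EXPERTS), key=lambda i: gates[i], reverse=True)[:TOP_K]
    norm = sum(gates[i] for i in top)  # renormalise over the chosen experts
    return sum(gates[i] / norm * expert(i, x) for i in top)

random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(NUM_EXPERTS)]
print(moe_layer(0.5, weights))  # only 2 of 8 experts did any work
```

In a real MoE transformer each expert is a full feed-forward block and routing happens per token, which is how a model can keep hundreds of billions of parameters on disk while activating only a few billion at a time.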
Models Leading the Charge
Several models are purpose-built for on-device use:
- LFM2.5-Thinking (Liquid AI) — a hybrid model family starting at just 1.2B parameters. Supports tool use and step-by-step reasoning on almost any hardware. Already has 65K+ pulls on Ollama.
- GLM-4.7-Flash (Z.ai) — the strongest model in the 30B class, with 227K+ pulls. Runs on 8-12 GB GPUs and supports both thinking and tool modes.
- Phi-3 (Microsoft) — a small, efficient model designed for laptops and edge devices. Runs comfortably on 4 GB of VRAM.
- Gemma 2 (Google) — lightweight models from 2B to 9B parameters, optimised for consumer GPUs and on-device deployment.
These aren't stripped-down versions of cloud models. They're architected from the start to deliver maximum capability within the constraints of hardware you already own.
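Trying any of these is straightforward once Ollama is installed. As a minimal sketch, assuming the Ollama server is running on its default local port and you've already fetched a model (e.g. `ollama pull phi3`), a few lines of Python query it over the local REST API; the prompt here is just a placeholder:

```python
import requests

# Everything below talks to localhost only; nothing leaves the machine.
resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "phi3",  # swap in any locally pulled model tag
        "prompt": "Explain quantisation in one sentence.",
        "stream": False,  # ask for a single JSON reply instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

The same call works for every model listed above; change the tag and nothing else.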
Why On-Device Changes Everything
Running AI on your own hardware isn't just a technical flex. It fundamentally changes the relationship between you and your tools:
- Privacy by architecture — your data never leaves your machine. Not because a company promises it won't, but because there's physically nowhere for it to go. No network connection, no upload, no server.
- No network latency — no waiting for a server across the world to process your request. The model runs on your GPU, inches from your screen, and responses start immediately.
- No ongoing cost — the model is a file on your hard drive. You download it once, it's yours forever. No subscription, no usage limits, no price increases.
- Works offline — on a plane, in a dead zone, behind a corporate firewall. Your AI works wherever you do, because it lives on your device.
- No gatekeeping — nobody can revoke your access, change the model's behaviour, or decide what topics it can discuss. The file on your computer is yours.
Vox Bar: On-Device AI in Practice
This isn't a future trend — it's already happening. Vox Bar is a real-world example of on-device AI: the Voxtral transcription model runs entirely on your GPU. Your voice is processed locally, the text appears on screen, and the audio never touches a network connection.
It works with any application through Overlay — a floating interface that brings voice input to code editors, Office apps, and anything else on your desktop. No plugin required, no cloud dependency, no subscription.
The Bottom Line
The AI industry split into two paths: cloud companies building ever-larger models funded by your subscription, and open-source teams building ever-more-efficient models that run on hardware you already own. Both paths deliver genuinely useful AI — but only one lets you own it.
On-device AI isn't a compromise. It's a choice. And with each month, the models get smaller, smarter, and more capable. The question isn't whether AI will run locally — it's how long until that's the default.
On-device AI, right now
Vox Bar: transcription that runs on your GPU. No cloud. No subscription. Just speak.