The case for a 3B model on your desk
Most teams reach for a frontier API when a small model — running on the laptop of the person who actually uses the feature — would have shipped six months sooner and cost nothing per request.
There is a quiet category of problem that does not need a trillion parameters. Classifying support tickets. Rewriting a sentence in a house style. Pulling structured data out of a PDF. We keep watching teams burn a quarter wiring an inference platform around a frontier model for jobs that a 3B parameter model, quantised to fit in 4GB, would do in 40 milliseconds on the user's own machine.
What changes when latency is free
If the model lives on the laptop, you stop budgeting for tokens and start budgeting for taste. You can call the model on every keystroke. You can run three prompts in parallel and pick the best one. You can let the user disagree with it, in real time, without an awkward spinner.
The interesting question stops being 'is the model smart enough?' and becomes 'what would you build if it were free?'
Where it breaks
- Anything that benefits from world knowledge the model was never told.
- Agentic loops where the model has to reason over many novel tools.
- Compliance regimes that require the model to live in a specific region.
Outside those three, the calculus is different than it was a year ago. Try the small one first. It is almost never the model that's the bottleneck.
