A Practical Guide to Cheap and Local AI: OpenRouter, Gateways, and Local Models

Part 1 was the argument: price is a feature, most tasks don't need a frontier model, and the skill is
right-sizing the model to the job. This part is the how: the practical stack for living that way
without re-plumbing your workflow every week. None of this is exotic; it's the setup a lot of us have
quietly converged on.
Think of it as three tiers, cheapest-effort to most-control: aggregators, a gateway, and
fully local. You'll probably use all three.
Tier 1: OpenRouter (and friends) — one key, every model
The fastest way to get off a single expensive default is an aggregator. OpenRouter is the obvious
one: a single API key and a single OpenAI-compatible endpoint that fronts hundreds of models — open
and closed, cheap and frontier — so you can switch models by changing a string instead of rewriting
integration code.
Two things make it a cheapskate's best friend:
- Free and near-free models. At any given time OpenRouter carries a rotating set of models you can
call for $0 (often the big new open-weight releases, like a 120B-class model, offered free while
they're fresh), plus a long list of small models priced in cents per million tokens.
- It's OpenAI-compatible. Anything that speaks the OpenAI API speaks OpenRouter. That's the whole
trick: you point your existing client at a different base URL.
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3-coder",
    "messages": [{"role": "user", "content": "Refactor this function to be pure."}]
  }'
Fireworks, Together, Baseten and similar providers play the same role with different tradeoffs (some let
you tune weights or host specific open models), but the pattern is identical: one compatible endpoint,
many models, pay per token.
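The same request from Python, as a minimal sketch using only the standard library. The helper names `build_chat_request` and `send_chat` are made up for illustration; the URL, model ID, and payload shape come from the curl example above. Building and sending are kept separate so you can inspect the request before spending anything.

```python
import json
import urllib.request

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,  # switching models means changing this string
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat(request):
    """Send the request and return the first choice's text.

    Assumes the usual OpenAI-style response shape (choices -> message -> content).
    """
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

To use it, pass `OPENROUTER_BASE_URL`, your key, and a model ID such as `qwen/qwen3-coder` to `build_chat_request`, then hand the result to `send_chat`. Pointing at another OpenAI-compatible provider should only require a different `base_url`.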
Point your coding tools at it (BYOK)
The reason this matters for daily work: most agentic coding tools now support bring-your-own-key
and custom model endpoints. So you can keep your harness — the IDE integration, the agent loop, the
slash commands — and swap the brain underneath for something cheap. Wire the tool's "custom/OpenAI-
compatible provider" setting to OpenRouter's base URL and your key, and suddenly your premium coding
assistant is driving a free 120B model or a two-cents-per-task Flash model. That single move is where
most of the savings live.
Tier 2: A self-hosted gateway — one front door for everything
Once you're juggling more than one provider and more than one tool, you'll want a gateway: a small
proxy that sits in front of all your model providers and exposes one OpenAI-compatible endpoint to all
your apps. LiteLLM is the common choice (an open-source proxy), but the concept matters more than the
specific tool.
Why bother, when OpenRouter already aggregates? Because a gateway you control gives you things an
external aggregator can't:
- One key to rotate, one place to revoke. Your apps hold a key to your gateway, not to five
upstream vendors. Leak one, rotate one.
- Central routing, budgets, and logging. Set spend limits, see exactly what each app costs, and
change routing rules in one config instead of redeploying every client.
- A single egress point. All your AI traffic leaves through one door, which is easier to secure,
meter, and reason about.
- Mix local and hosted behind the same URL. Your gateway can route some model names to OpenRouter
and others to a local runner, and your apps never know the difference.
A minimal LiteLLM config that blends a hosted cheap model, a frontier model, and a local one behind
one endpoint looks roughly like this:
model_list:
  - model_name: cheap   # the daily driver
    litellm_params:
      model: openrouter/qwen/qwen3-coder
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: smart   # break glass for hard problems
    litellm_params:
      model: anthropic/claude-opus-4-8
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: local   # runs on your machine, $0/token
    litellm_params:
      model: ollama/qwen3-coder:30b
      api_base: http://localhost:11434
Now every app points at the gateway and just asks for cheap, smart, or local. Swapping what those
names mean is a one-line config change, not a code change.
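To make the alias idea concrete, here is a toy version of the lookup a gateway performs. This is a conceptual sketch, not how LiteLLM works internally: the `ROUTES` table, `resolve`, and `repoint` are hypothetical names, and the table simply mirrors the three entries from the config above.

```python
# Alias -> upstream target, mirroring the config above (illustrative only).
ROUTES = {
    "cheap": {"model": "openrouter/qwen/qwen3-coder"},
    "smart": {"model": "anthropic/claude-opus-4-8"},
    "local": {
        "model": "ollama/qwen3-coder:30b",
        "api_base": "http://localhost:11434",
    },
}


def resolve(alias, routes=ROUTES):
    """Return the upstream target for an alias, failing loudly if unknown."""
    try:
        return routes[alias]
    except KeyError:
        raise ValueError(f"unknown model alias: {alias!r}") from None


def repoint(alias, new_target, routes=ROUTES):
    """Return a copy of the table with one alias pointing somewhere new."""
    updated = dict(routes)
    updated[alias] = new_target
    return updated
```

Apps only ever say `cheap`; calling `repoint("cheap", {...})` on the gateway side changes what that means without any client noticing.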
Tier 3: Fully local — $0 per token, your hardware
The deepest tier is running the model yourself. No per-token bill, no network round-trip, no data
leaving the machine — which is exactly why the obsession exists. The two easiest on-ramps:
- Ollama: the simplest. Run ollama pull qwen3-coder:30b, then ollama run qwen3-coder:30b. It exposes
an OpenAI-compatible endpoint on localhost:11434, so everything above just works against it.
- LM Studio: a GUI for discovering, downloading, and serving models, also with an OpenAI-compatible
local server. Nice for browsing what fits your RAM.
- On Apple Silicon, MLX gives you the fastest native path for the open-weight models; on a beefy
NVIDIA box, llama.cpp / vLLM cover the same ground.
The gate is memory. A ~30B coding model is very usable on a machine with enough unified memory; the
100B-plus models that rival hosted flagships need the high-end unified-memory hardware (128GB-class
Apple Silicon or the new NVIDIA mini-machines). Match the model to your RAM, not your ambitions —
a quantized model that fits comfortably beats a bigger one that swaps itself to death.
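As a rough sanity check on the memory gate: the weights alone take about parameters times bits-per-weight divided by 8 bytes. The sketch below is a rule of thumb, not a sizing tool. It ignores the KV cache and runtime overhead, and the 70% headroom threshold is an assumption, not a measured figure.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_comfortably(params_billion, bits_per_weight, ram_gb, headroom=0.7):
    """True if the weights use at most `headroom` of total memory.

    The 0.7 default is an assumed rule of thumb that leaves the rest for
    the KV cache, the OS, and whatever else is running.
    """
    return weight_memory_gb(params_billion, bits_per_weight) <= ram_gb * headroom
```

By this estimate a 30B model at 4 bits per weight is about 15 GB of weights, while a 120B model at the same quantization is about 60 GB, which is why it wants 128GB-class hardware.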
Routing: send each task to the cheapest model that can do it
The tiers are plumbing. The strategy is routing — and you don't need anything clever to start. A
simple, durable policy:
- Default to cheap/local. Scaffolding, refactors, known patterns, writing PRDs, planning, mechanical
edits: let the cheap Flash-class or local model do it. This is most of your day.
- Escalate deliberately. Reach for the frontier model when the task is genuinely hard: gnarly
debugging, unfamiliar/modern APIs the small models fumble, architecture decisions. Make it a choice,
not a reflex.
- Use the expensive model as a reviewer, not the author. A pattern I like: let a cheap model write
the code, then spend a few cents having a stronger model review the diff. You pay frontier prices
only for judgment on a small surface, not for generating every token.
- Research first for weak spots. If a small model is bad at something (say, the newest framework
APIs), do a research/docs pass first and feed it the findings. Cheap context beats expensive
guessing.
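The policy in the list above is simple enough to write down as a function. This is one possible encoding, not a prescribed one: the task-kind labels, the `Plan` fields, and `plan_task` are hypothetical, and `cheap` and `smart` are the gateway aliases from the config earlier.

```python
from dataclasses import dataclass
from typing import Optional

# Task kinds that justify the frontier model; everything else stays cheap.
# These labels are made up for illustration.
HARD_KINDS = frozenset({"debugging", "unfamiliar_api", "architecture"})


@dataclass(frozen=True)
class Plan:
    author: str              # alias that writes the output
    reviewer: Optional[str]  # alias that reviews the diff, if any
    research_first: bool     # do a docs pass before prompting


def plan_task(kind, needs_review=False, weak_spot=False):
    """Pick models for a task: cheap by default, frontier only on purpose."""
    if kind in HARD_KINDS:
        # Escalate deliberately: the frontier model authors the hard stuff.
        return Plan(author="smart", reviewer=None, research_first=False)
    return Plan(
        author="cheap",
        # Frontier model as reviewer, not author.
        reviewer="smart" if needs_review else None,
        # Feed the small model research findings where it is known to be weak.
        research_first=weak_spot,
    )
```

For example, `plan_task("refactor", needs_review=True)` keeps the cheap model as author and brings in `smart` only to review the diff.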
Done well, a long, hands-on session — research, plan, implement, review — can come in at a couple of
dollars instead of twenty or forty, with the expensive model touching only the parts that needed it.
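To see where a gap like that could come from, here is the back-of-envelope arithmetic. Every price and token count below is made up for illustration; plug in your provider's real per-million-token prices and your own session sizes.

```python
def cost_usd(input_tokens, output_tokens, price_in, price_out):
    """Cost of a batch of calls; prices are USD per million tokens."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


# Hypothetical prices in USD per million tokens (not real quotes).
CHEAP = {"price_in": 0.30, "price_out": 1.20}
FRONTIER = {"price_in": 5.00, "price_out": 25.00}

# Hypothetical long session: total tokens in and out across all calls.
SESSION_IN, SESSION_OUT = 3_000_000, 500_000
# Hypothetical review pass: the frontier model only reads and comments on diffs.
REVIEW_IN, REVIEW_OUT = 100_000, 10_000

all_frontier = cost_usd(SESSION_IN, SESSION_OUT, **FRONTIER)
mixed = (
    cost_usd(SESSION_IN, SESSION_OUT, **CHEAP)
    + cost_usd(REVIEW_IN, REVIEW_OUT, **FRONTIER)
)

print(f"all frontier:                   ${all_frontier:.2f}")
print(f"cheap author + frontier review: ${mixed:.2f}")
```

With these invented numbers the all-frontier session costs about $27.50 and the mixed one about $2.25; the ratio, not the exact figures, is the point.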
Verify it's good enough: the blind eval
Before you trust a cheap model with real work, prove it on your tasks — not a leaderboard. The only
benchmark I trust at this point is a blind A/B:
- Take a real prompt from your actual work.
- Run it 5 times on model A, 5 times on model B, saving all 10 outputs.
- Strip the labels, look at the results cold, and pick the ones you'd actually ship.
- Only then look at which model produced your favorites.
Do this once for the kind of task you do most, and you'll know — concretely, for your codebase —
whether the cheap model clears the bar. Re-run it when a shiny new model drops, since they appear
weekly now. (And remember the tokenizer caveat from Part 1: "fewer tokens" is only a fair comparison
when two models share a tokenizer.)
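The blinding step is the part people skip, so here is a small sketch that does it for you. The function names and the `sample-N` ID scheme are hypothetical; it assumes you have already collected the outputs from each model as lists of strings.

```python
import random


def blind(outputs_a, outputs_b, seed=None):
    """Shuffle outputs from two models and hide which model wrote which.

    Returns (samples, key): `samples` maps an anonymous ID to the text you
    judge; `key` maps the same ID to "A" or "B" and stays closed until the
    picks are final.
    """
    labeled = [("A", text) for text in outputs_a]
    labeled += [("B", text) for text in outputs_b]
    random.Random(seed).shuffle(labeled)

    samples, key = {}, {}
    for index, (label, text) in enumerate(labeled, start=1):
        sample_id = f"sample-{index}"
        samples[sample_id] = text
        key[sample_id] = label
    return samples, key


def tally(picks, key):
    """Count how many of the picked sample IDs came from each model."""
    counts = {"A": 0, "B": 0}
    for sample_id in picks:
        counts[key[sample_id]] += 1
    return counts
```

Write `samples` out to files or print them, pick the IDs you would ship, and only then call `tally(picks, key)` to see which model they came from.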
The whole stack, in one breath
Point your tools at OpenRouter with bring-your-own-key to escape a single expensive default; put a
self-hosted gateway in front when you're juggling providers, so you have one key, central budgets,
and one egress; run local models via Ollama/LM Studio/MLX for the work that should cost nothing and
stay on your machine; route each task to the cheapest model that can do it and use the frontier
model as an escalation and a reviewer; and blind-eval before you trust anything. That's the entire
cheap-and-local playbook — no lock-in, mostly cents, and a lot less of that little voice telling you
not to hit Enter.