A Practical Guide to Cheap and Local AI: OpenRouter, Gateways, and Local Models

Part 1 was the argument: price is a feature, most tasks don't need a frontier model, and the skill is
right-sizing the model
to the job. This part is the how — the practical stack for living that way without re-plumbing your
workflow every week. None of this is exotic; it's the setup a lot of us have quietly converged on.

Think of it as three tiers, cheapest-effort to most-control: aggregators, a gateway, and
fully local. You'll probably use all three.

Tier 1: OpenRouter (and friends) — one key, every model

The fastest way to get off a single expensive default is an aggregator. OpenRouter is the obvious
one: a single API key and a single OpenAI-compatible endpoint that fronts hundreds of models — open
and closed, cheap and frontier — so you can switch models by changing a string instead of rewriting
integration code.

Two things make it a cheapskate's best friend:

- Free and near-free models. At any given time OpenRouter carries a rotating set of models you can
  call for $0 (often the big new open-weight releases, like a 120B-class model, offered free while
  they're fresh), plus a long list of small models priced in cents per million tokens.
- It's OpenAI-compatible. Anything that speaks the OpenAI API speaks OpenRouter. That's the whole
  trick: you point your existing client at a different base URL.

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3-coder",
    "messages": [{"role": "user", "content": "Refactor this function to be pure."}]
  }'

Fireworks, Together, Baseten and similar providers play the same role with different tradeoffs (some let
you tune weights or host specific open models), but the pattern is identical: one compatible endpoint,
many models, pay per token.
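
Here is a rough Python sketch of the same idea, using only the standard library. It builds the same request the curl command above sends. The names build_chat_request, send, and the PROVIDERS table are illustrative and not part of any SDK. The "local" entry assumes an OpenAI-compatible server such as Ollama listening on localhost:11434, and LOCAL_API_KEY is a made-up variable name.

```python
import json
import os
import urllib.request

# Illustrative table: each provider is just a base URL plus the env var holding its key.
PROVIDERS = {
    "openrouter": ("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY"),
    # Assumption: a local OpenAI-compatible server (e.g. Ollama) that ignores the key.
    "local": ("http://localhost:11434/v1", "LOCAL_API_KEY"),
}


def build_chat_request(provider: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    base_url, key_var = PROVIDERS[provider]
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get(key_var, '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Switching provider or model is a change of strings, not of code.
prompt = "Refactor this function to be pure."
hosted = build_chat_request("openrouter", "qwen/qwen3-coder", prompt)
local = build_chat_request("local", "qwen3-coder:30b", prompt)
```

Calling send(hosted) or send(local) performs the actual network call. Nothing else in the code knows which provider it is talking to.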

Point your coding tools at it (BYOK)

The reason this matters for daily work: most agentic coding tools now support bring-your-own-key
and custom model endpoints. So you can keep your harness (the IDE integration, the agent loop, the
slash commands) and swap the brain underneath for something cheap. Wire the tool's "custom/OpenAI-compatible
provider" setting to OpenRouter's base URL and your key, and suddenly your premium coding
assistant is driving a free 120B model or a two-cents-per-task Flash model. That single move is where
most of the savings live.

Tier 2: A self-hosted gateway — one front door for everything

Once you're juggling more than one provider and more than one tool, you'll want a gateway: a small
proxy that sits in front of all your model providers and exposes one OpenAI-compatible endpoint to all
your apps. LiteLLM is the common choice (an open-source proxy), but the concept matters more than the
specific tool.

Why bother, when OpenRouter already aggregates? Because a gateway you control gives you things an
external aggregator can't:

- One key to rotate, one place to revoke. Your apps hold a key to your gateway, not to five
  upstream vendors. Leak one, rotate one.
- Central routing, budgets, and logging. Set spend limits, see exactly what each app costs, and
  change routing rules in one config instead of redeploying every client.
- A single egress point. All your AI traffic leaves through one door, which is easier to secure,
  meter, and reason about.
- Mix local and hosted behind the same URL. Your gateway can route some model names to OpenRouter
  and others to a local runner, and your apps never know the difference.
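
To make the "budgets and logging" point concrete, here is a toy per-app spend ledger in Python. It is not how LiteLLM or any other gateway implements budgets; it only sketches what a single choke point lets you do. The class names, app names, and dollar limits are all invented for illustration.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a call would push an app over its spend limit."""


class SpendLedger:
    """Toy per-app spend tracker; all amounts are in US dollars."""

    def __init__(self, limits_usd):
        self.limits_usd = dict(limits_usd)
        self.spent_usd = defaultdict(float)
        self.log = []  # one (app, model alias, cost) tuple per accepted call

    def charge(self, app, model, cost_usd):
        """Record one call, refusing it if the app's budget would be exceeded."""
        limit = self.limits_usd[app]
        if self.spent_usd[app] + cost_usd > limit:
            raise BudgetExceeded(f"{app} would exceed its ${limit:.2f} limit")
        self.spent_usd[app] += cost_usd
        self.log.append((app, model, cost_usd))


# Hypothetical apps and limits.
ledger = SpendLedger({"ide-agent": 5.00, "nightly-batch": 1.00})
ledger.charge("ide-agent", "cheap", 0.02)
ledger.charge("ide-agent", "smart", 0.40)
```

Because every app goes through the same door, the log answers "what does each app cost?" without touching any client.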

A minimal LiteLLM config that blends a hosted cheap model, a frontier model, and a local one behind
one endpoint looks roughly like this:

model_list:
  - model_name: cheap            # the daily driver
    litellm_params:
      model: openrouter/qwen/qwen3-coder
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: smart            # break glass for hard problems
    litellm_params:
      model: anthropic/claude-opus-4-8
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: local            # runs on your machine, $0/token
    litellm_params:
      model: ollama/qwen3-coder:30b
      api_base: http://localhost:11434

Now every app points at the gateway and just asks for cheap, smart, or local. Swapping what those
names mean is a one-line config change, not a code change.

Tier 3: Fully local — $0 per token, your hardware

The deepest tier is running the model yourself. No per-token bill, no network round-trip, no data
leaving the machine — which is exactly why the obsession exists. The two easiest on-ramps:

- Ollama: the simplest. Run ollama pull qwen3-coder:30b, then ollama run qwen3-coder:30b. It exposes an
  OpenAI-compatible endpoint on localhost:11434 (under /v1), so everything above just works against it.
- LM Studio: a GUI for discovering, downloading, and serving models, also with an OpenAI-compatible
  local server. Nice for browsing what fits your RAM.
- On Apple Silicon, MLX gives you the fastest native path for the open-weight models; on a beefy
  NVIDIA box, llama.cpp / vLLM cover the same ground.

The gate is memory. A ~30B coding model is very usable on a machine with enough unified memory; the
100B-plus models that rival hosted flagships need the high-end unified-memory hardware (128GB-class
Apple Silicon or the new NVIDIA mini-machines). Match the model to your RAM, not your ambitions —
a quantized model that fits comfortably beats a bigger one that swaps itself to death.
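
A back-of-the-envelope way to check the fit is that weight size is roughly parameter count times bits per weight, divided by eight. The Python sketch below does only that arithmetic. The 0.7 headroom factor (room left for the OS, context cache, and everything else) is my assumption, not a measured figure, and real quantized files usually run somewhat larger than the nominal bit width suggests.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits(params_billions: float, bits_per_weight: float, ram_gb: float,
         headroom: float = 0.7) -> bool:
    """True if the weights fit in the assumed usable fraction of memory."""
    return weights_gb(params_billions, bits_per_weight) <= ram_gb * headroom


print(weights_gb(30, 4))    # 15.0
print(weights_gb(120, 4))   # 60.0
print(fits(30, 4, 32))      # True
print(fits(120, 4, 64))     # False
```

Treat the result as a first filter only. Whether a model is pleasant to use at that size is still something to test on your own machine.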

Routing: send each task to the cheapest model that can do it

The tiers are plumbing. The strategy is routing — and you don't need anything clever to start. A
simple, durable policy:

- Default to cheap/local. Scaffolding, refactors, known patterns, writing PRDs, planning, mechanical
  edits: let the cheap Flash-class or local model do it. This is most of your day.
- Escalate deliberately. Reach for the frontier model when the task is genuinely hard: gnarly
  debugging, unfamiliar/modern APIs the small models fumble, architecture decisions. Make it a choice,
  not a reflex.
- Use the expensive model as a reviewer, not the author. A pattern I like: let a cheap model write
  the code, then spend a few cents having a stronger model review the diff. You pay frontier prices
  only for judgment on a small surface, not for generating every token.
- Research first for weak spots. If a small model is bad at something (say, the newest framework
  APIs), do a research/docs pass first and feed it the findings. Cheap context beats expensive
  guessing.
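
The policy above is simple enough to write down as a lookup. This Python sketch is one possible encoding. The task-kind labels and the Plan fields are names I made up, and the "cheap" and "smart" aliases are the ones from the gateway config earlier.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical task labels for the escalation cases named in the policy above.
HARD_TASKS = {"gnarly-debugging", "unfamiliar-api", "architecture"}


@dataclass(frozen=True)
class Plan:
    author: str              # model alias that writes the output
    reviewer: Optional[str]  # stronger model that reviews the diff, if any
    research_first: bool     # do a docs pass and feed the findings in as context


def route(task_kind: str, weak_spot: bool = False, review: bool = False) -> Plan:
    """Pick the cheapest plan for a task; escalation is an explicit choice."""
    if task_kind in HARD_TASKS:
        return Plan(author="smart", reviewer=None, research_first=False)
    # Default, including unknown task kinds: start cheap.
    return Plan(
        author="cheap",
        reviewer="smart" if review else None,
        research_first=weak_spot,
    )


print(route("refactor"))
print(route("architecture"))
print(route("scaffolding", weak_spot=True, review=True))
```

The useful property is that unknown task kinds fall through to the cheap model, so the expensive path is only reached when someone names the task as hard.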

Done well, a long, hands-on session — research, plan, implement, review — can come in at a couple of
dollars instead of twenty or forty, with the expensive model touching only the parts that needed it.
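
The arithmetic behind that claim is easy to check for your own workload. The Python sketch below uses invented prices and token counts purely to show the calculation. Substitute the real per-million-token prices of the models you use.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of a batch of calls given per-million-token prices."""
    total = input_tokens * in_price_per_m + output_tokens * out_price_per_m
    return total / 1_000_000


# Hypothetical prices in USD per million tokens: (input, output).
FRONTIER = (15.00, 75.00)
CHEAP = (0.30, 1.20)

# Hypothetical session: 1M tokens read, 200k tokens written.
all_frontier = cost_usd(1_000_000, 200_000, *FRONTIER)

# Same session authored by the cheap model, plus a frontier review of a small diff.
cheap_author = cost_usd(1_000_000, 200_000, *CHEAP)
frontier_review = cost_usd(20_000, 2_000, *FRONTIER)
mixed = cheap_author + frontier_review

print(f"all frontier: ${all_frontier:.2f}, cheap author + frontier review: ${mixed:.2f}")
```

With these made-up numbers the all-frontier session comes to $30.00 and the mixed one to about $0.99. The gap comes almost entirely from not paying frontier output prices for every generated token.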

Before you trust a cheap model with real work, prove it on your tasks — not a leaderboard. The only
benchmark I trust at this point is a blind A/B:

1. Take a real prompt from your actual work.
2. Run it 5 times on model A, 5 times on model B, saving all 10 outputs.
3. Strip the labels, look at the results cold, and pick the ones you'd actually ship.
4. Only then look at which model produced your favorites.

Do this once for the kind of task you do most, and you'll know — concretely, for your codebase —
whether the cheap model clears the bar. Re-run it when a shiny new model drops, since they appear
weekly now. (And remember the tokenizer caveat from Part 1: "fewer tokens" is only a fair comparison
when two models share a tokenizer.)
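
The label-stripping step of the blind A/B is the part worth automating, because it is the part you will be tempted to cheat on. Here is a small Python sketch. The function names blind and tally are mine, and the placeholder strings stand in for the ten saved outputs.

```python
import random
from typing import Optional


def blind(outputs_a, outputs_b, seed: Optional[int] = None):
    """Shuffle outputs from two models; return (anonymous samples, hidden answer key)."""
    labeled = [("A", text) for text in outputs_a] + [("B", text) for text in outputs_b]
    random.Random(seed).shuffle(labeled)
    samples = {f"sample-{i + 1}": text for i, (_, text) in enumerate(labeled)}
    key = {f"sample-{i + 1}": model for i, (model, _) in enumerate(labeled)}
    return samples, key


def tally(picks, key):
    """Count how many of the picked sample ids came from each model."""
    counts = {"A": 0, "B": 0}
    for sample_id in picks:
        counts[key[sample_id]] += 1
    return counts


# Placeholders: in practice these are 5 saved outputs per model for one real prompt.
outputs_a = [f"draft {n}" for n in range(1, 6)]
outputs_b = [f"draft {n}" for n in range(6, 11)]
samples, key = blind(outputs_a, outputs_b, seed=7)

# Read `samples` cold, write down the ids you would ship, and only then call tally().
```

Keep the key out of sight (a separate file works) until your picks are written down. Otherwise the exercise is no longer blind.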

The whole stack, in one breath

Point your tools at OpenRouter with bring-your-own-key to escape a single expensive default; put a
self-hosted gateway in front when you're juggling providers, so you have one key, central budgets,
and one egress; run local models via Ollama/LM Studio/MLX for the work that should cost nothing and
stay on your machine; route each task to the cheapest model that can do it and use the frontier
model as an escalation and a reviewer; and blind-eval before you trust anything. That's the entire
cheap-and-local playbook — no lock-in, mostly cents, and a lot less of that little voice telling you
not to hit Enter.

Previous: Price Is a Feature — the case for cheap and local AI models.