You Don't Need a GPU for Speech: Self-Hosting Whisper and TTS on a CPU VPS

Every time speech comes up — transcribe this voice note, give the agent a voice — the reflex is the same: we're going to need a GPU for that. I had the reflex too. Then I actually checked what my cheap VPS boxes could do, and the answer surprised me enough to write it down.

For real-time-ish speech on a server, you mostly don't need a GPU at all. You need to know one thing about your CPU, and it's probably already true.

The thing that actually matters isn't CUDA — it's AVX2

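Here's the myth-buster. The bottleneck for CPU speech isn't "do I have a graphics card." It's whether your CPU has the right SIMD instruction sets: specifically AVX2, plus FMA and F16C. AVX2 gives you 256-bit vector operations, FMA fuses a multiply and an add into one instruction, and F16C converts between 16-bit and 32-bit floats. Together they let the CPU chew through many numbers per clock, which is exactly the shape of work a speech model does.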

Any half-decent server CPU from the last several years has them. Before you plan any of this, check:

grep -oE 'avx2|fma|f16c' /proc/cpuinfo | sort -u
# avx2
# f16c
# fma

Three lines back? You're done shopping. Both of the boxes I run this on — an 8-core and an 18-core EPYC VPS, no GPU anywhere — have all three. That's the whole prerequisite. The engines below are built around these instructions; the GPU was never the point.
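If you'd rather run the same check from a deploy script than by eye, here is a minimal Python sketch. It assumes an x86 Linux box where /proc/cpuinfo has "flags" lines; the function names are mine, not from any library.

```python
from pathlib import Path

# The three instruction sets the grep above looks for.
REQUIRED_FLAGS = {"avx2", "fma", "f16c"}


def missing_simd_flags(cpuinfo_text: str) -> set[str]:
    """Return the required flags that no 'flags' line in cpuinfo_text lists."""
    present: set[str] = set()
    for line in cpuinfo_text.splitlines():
        # On x86 Linux each logical CPU has a line like "flags : fpu ... avx2 ..."
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present.update(value.split())
    return REQUIRED_FLAGS - present


def check_this_machine(path: str = "/proc/cpuinfo") -> set[str]:
    """Read the local cpuinfo file and report which required flags are absent."""
    return missing_simd_flags(Path(path).read_text())
```

An empty set from check_this_machine() means all three flags are present. Unlike the grep, this matches whole flag tokens rather than substrings, which is a slightly stricter test.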

STT — faster-whisper on CPU, int8

For speech-to-text, skip vanilla OpenAI Whisper and use faster-whisper. It's the same Whisper model reimplemented on CTranslate2, an inference engine tuned hard for CPUs with — you guessed it — AVX2/FMA. The trick that makes it fly is int8 quantization: the weights get squeezed from 16-bit floats to 8-bit integers, which is smaller in RAM and faster on integer-happy CPU cores, at a barely-perceptible accuracy cost.
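To make that concrete, here is roughly what loading the model on the int8 CPU path looks like with the faster-whisper Python package. Treat it as a sketch: the helper names, the default of 8 threads, and beam_size=5 are my choices for illustration, and the first call downloads the model.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the .text of each segment into one transcript string."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def transcribe_file(path: str, model_size: str = "small", cpu_threads: int = 8) -> str:
    """Transcribe one audio file on CPU with int8 weights (hypothetical helper)."""
    # Imported lazily so join_segments works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # device="cpu" with compute_type="int8" selects the quantized CPU path.
    model = WhisperModel(
        model_size, device="cpu", compute_type="int8", cpu_threads=cpu_threads
    )
    # transcribe() returns a lazy generator of segments plus an info object;
    # the actual decoding happens as the generator is consumed.
    segments, _info = model.transcribe(path, beam_size=5)
    return join_segments(segments)
```

In a long-running service you would build the WhisperModel once at startup and reuse it, since loading the weights is the slow part.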

The number that matters is real-time factor (RTF): seconds of compute per second of audio. On 8 AVX2 cores with the small model at int8, you're looking at an RTF around 0.1–0.4, so a 30-second voice note transcribes in roughly 3–12 seconds. RAM is trivial: base needs about 1 GB, small about 1.5 GB. A big box (18 cores) can even run medium or large-v3 at int8 and still stay usable if you want top accuracy.
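The RTF arithmetic is simple enough to keep in a helper, and measuring it on your own box beats trusting anyone's numbers, mine included. A small sketch, with function names that are purely illustrative:

```python
import time
from typing import Callable


def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """RTF = seconds of compute per second of audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def expected_compute_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate how long a clip takes to transcribe at a given RTF."""
    return audio_seconds * rtf


def measure_rtf(run: Callable[[], object], audio_seconds: float) -> float:
    """Time one call to run() and express it as an RTF for a clip of that length."""
    start = time.perf_counter()
    run()
    return real_time_factor(time.perf_counter() - start, audio_seconds)
```

For a 30-second clip, an RTF of 0.1 works out to 3 seconds of compute and an RTF of 0.4 to 12 seconds. Pass measure_rtf a zero-argument callable that transcribes a clip of known length to get the figure for your own cores.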

Pick your rung by taste: base for speed, small for the sweet spot, medium/large-v3 when accuracy matters more than latency.

TTS — Piper, and it's basically instant

For text-to-speech, Piper is the CPU pick. It was purpose-built for CPU — the voices are tiny (~60 MB each) and it runs faster than real-time even on a Raspberry Pi. On real server cores it's effectively instant, and it ships English and Spanish voices out of the box.

If you want nicer prosody and can spend a few seconds per sentence, Kokoro also runs on CPU. The heavier, more expressive engines — XTTS, F5, VibeVoice and friends — genuinely want a GPU, so I leave those off the VPS. For notifications, assistant replies, and read-backs, Piper is more than good enough and never makes you wait.
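For the Piper path, a minimal sketch of driving it from Python might look like this. It assumes the standalone piper binary is on your PATH and that you have already downloaded a voice file; the voice filename in the usage note and the helper names are assumptions, and other Piper distributions may take different arguments.

```python
import subprocess


def piper_command(model_path: str, output_wav: str, piper_bin: str = "piper") -> list[str]:
    """Build the argument list for one synthesis run (hypothetical helper)."""
    return [piper_bin, "--model", model_path, "--output_file", output_wav]


def speak_to_file(text: str, model_path: str, output_wav: str) -> None:
    """Synthesize text to a WAV file by piping it to Piper on standard input."""
    subprocess.run(
        piper_command(model_path, output_wav),
        input=text.encode("utf-8"),
        check=True,  # raise if Piper exits with an error
    )
```

A call would be something like speak_to_file("Backup finished.", "en_US-lessac-medium.onnx", "out.wav"), with the voice path swapped for whichever one you downloaded.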

Wrap both as OpenAI-compatible containers

The move that makes this painless: don't invent a new API. Package each engine as a Docker container that speaks the OpenAI audio API: /v1/audio/transcriptions for STT, /v1/audio/speech for TTS. Now anything in your stack that already knows how to talk to OpenAI talks to your box instead, by changing a base URL. No client rewrites, no vendor SDK lock-in, and no API key leaving your server.
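To show how little the client side needs, here is a standard-library sketch of a TTS call against such a container. The base URL and port are assumptions about how you published the container, and the "tts-1" and "alloy" defaults are just the OpenAI names; how your server maps them to a local voice depends on the wrapper you picked.

```python
import json
import urllib.request


def build_speech_request(
    base_url: str, text: str, voice: str = "alloy", model: str = "tts-1"
) -> urllib.request.Request:
    """Build a POST to the OpenAI-style speech endpoint under base_url."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url: str, text: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(base_url, text), timeout=60) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

With a container listening on, say, http://localhost:8000/v1, synthesize("http://localhost:8000/v1", "Deploy finished.", "reply.mp3") is the whole integration. An existing OpenAI client library should work the same way once its base URL points at that address.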

The quietly better part: co-locate, don't mesh

Here's the bit I didn't expect. For audio specifically, the CPU-VPS approach is simpler than the GPU alternatives, not just cheaper.

Because these are lightweight CPU containers, you run them right next to whatever needs them — the backend, the agent, the app server. They call localhost. That buys you three things a GPU box in another rack (or a GPU in your closet) can't:

Zero network latency — the audio never leaves the machine.
Always-on — a server doesn't sleep the way a workstation does; no "wake the GPU machine first."
No key, no egress — voice data stays on the box you already trust.

A GPU is the right call when you're batch-transcribing thousands of hours, or you need the heavyweight expressive voices, or you're doing low-latency streaming at scale. For a normal app that needs to turn voice notes into text and give an assistant something to say? A CPU VPS with AVX2 solves it completely — and you already own the hardware.

So before you spec a GPU for speech, run that one grep. If AVX2 comes back, you were done before you started.