Local AI Studio — Part 6: Giving the Reel a Soundtrack (Local Music Generation)

Part 5 ended with a 15-second reel that, in its first cut, had no sound at all. Fine for a benchmark,
but a reel without sound is half a reel. (I've since scored that beach reel too — with a bossa
nova, in the Tom Jobim spirit — but let's start from scratch.) So the obvious next move:

If I'm generating the picture locally, can I generate the music locally too — and glue them together without leaving the machine?

Yes. Here's a second reel — the Pacific coast this time, the sun and the water as the main
characters — generated and scored entirely on the Mac Studio, no cloud, no API bill:

The picture is the same pipeline as Part 5
(six SDXL stills → Ken Burns → crossfades). This post is about the sound.

How music generation works

Conceptually it's identical to image generation: a model takes a text prompt and produces
raw audio instead of pixels. You describe the music in words —

mellow acoustic guitar with soft marimba, gentle and cinematic, evoking ocean waves and a
rising sun over the sea, calm and spacious, fingerpicked, soft percussion, instrumental,
no vocals

— and out comes a .wav. Then a single ffmpeg command lays that track onto the video.

On Apple Silicon there are three realistic local options. I tried all three. One worked, one
failed in an instructive way, and one is locked behind a license.

The engine that works: MusicGen

MusicGen (from Meta) runs through Hugging
Face transformers, which I already had installed for other things. The whole generator is
about a dozen lines:

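import scipy.io.wavfile
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# The text prompt from above, describing the music in words
PROMPT = (
    "mellow acoustic guitar with soft marimba, gentle and cinematic, evoking ocean waves "
    "and a rising sun over the sea, calm and spacious, fingerpicked, soft percussion, "
    "instrumental, no vocals"
)

model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")
proc  = AutoProcessor.from_pretrained("facebook/musicgen-medium")

# Prefer the Apple GPU (MPS); otherwise run on the CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device)
inputs = proc(text=[PROMPT], padding=True, return_tensors="pt").to(device)

sr     = model.config.audio_encoder.sampling_rate           # 32000
tokens = int(15 * model.config.audio_encoder.frame_rate)    # ~50 Hz → 15s
audio  = model.generate(**inputs, max_new_tokens=tokens, do_sample=True, guidance_scale=3.0)

# audio has shape (batch, channels, samples); write the first clip's single channel
scipy.io.wavfile.write("track.wav", sr, audio[0, 0].float().cpu().numpy())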
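The token budget in that snippet is plain arithmetic, so here it is as a tiny sketch. The 50 Hz frame rate and 32 kHz sample rate are the values from the comments above; `tokens_for` and `samples_for` are hypothetical helper names of mine, not part of transformers.

```python
def tokens_for(seconds: float, frame_rate: float = 50.0) -> int:
    """Audio tokens to request for a clip of the given length."""
    return int(seconds * frame_rate)


def samples_for(seconds: float, sample_rate: int = 32000) -> int:
    """Raw samples the decoded clip should contain at that sample rate."""
    return int(seconds * sample_rate)


print(tokens_for(15))   # 750
print(samples_for(15))  # 480000
```

If you want a longer or shorter clip, change the seconds and pass the result as max_new_tokens.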

Two Apple-Silicon notes that matter:

Set PYTORCH_ENABLE_MPS_FALLBACK=1. MusicGen uses a few ops that aren't implemented on
MPS; this flag lets those quietly fall back to the CPU instead of crashing. With it, the
model runs on the GPU and just borrows the CPU for the handful of unsupported bits.

Have a CPU fallback path anyway. I wrap generate() in a try/except that retries the
whole thing on CPU if MPS throws. Belt and suspenders.
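Both notes can be sketched in a few lines. This is a minimal illustration, not my exact script: `run_with_fallback` is a hypothetical helper, and I'm assuming the flag needs to be in the environment before torch gets imported, which is why it is set at the very top.

```python
import os

# Assumption: set the flag before `import torch` so the MPS backend sees it.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def run_with_fallback(run, devices=("mps", "cpu")):
    """Call run(device) on each device in order; return (result, device) for the first success."""
    last_error = None
    for device in devices:
        try:
            return run(device), device
        except RuntimeError as err:  # also covers NotImplementedError for unsupported ops
            print(f"{device} failed: {err}")
            last_error = err
    raise RuntimeError("generation failed on every device") from last_error
```

In the generator, `run` would be a small function that moves the model and inputs to the given device and calls generate(); the helper then retries the whole thing on CPU if MPS throws.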

On the M1 Max, musicgen-medium produced a 15-second clip in about two to three minutes.
That's the track you're hearing on the reel above.

The engine that failed (and why I'm telling you): ACE-Step

ACE-Step is a newer 3.5B music model with native
ComfyUI nodes, which is lovely — it means I could drive it through the exact same API
harness I use for images. It downloaded, it ran, it finished in 48 seconds — faster than
MusicGen — with no errors at all.

And the output was pure static. White noise. Not a note of music.

If that sounds familiar, it should. It's the same failure mode as the
Wan rainbow-soup bug in Part 4:

The graph reports success while the result is garbage. On Apple Silicon, a clean
exit code is not proof that the math was right.

This is almost certainly an MPS precision / unsupported-op problem somewhere in ACE-Step's
transformer or its audio decoder — these newer models are tuned and tested on NVIDIA, and the
Metal backend silently produces nonsense for ops it handles differently. I haven't run the
fix to ground yet (candidates: the MPS-fallback flag, a different sampler/scheduler, or just
running it on CPU), so for now my rule is simple: on this hardware, MusicGen for music,
ACE-Step parked with a warning note.
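Since a clean exit code proves nothing here, a cheap automated check on the output is worth having. One option, sketched below under my own assumptions, is spectral flatness: the ratio of the geometric mean to the arithmetic mean of the power spectrum, which sits near zero for tonal audio and much higher for noise. The function names and the 0.3 threshold are illustrative guesses, not tuned values, and this is not something ComfyUI or ACE-Step provides.

```python
import numpy as np


def spectral_flatness(samples: np.ndarray) -> float:
    """Geometric mean over arithmetic mean of the power spectrum (near 0 = tonal)."""
    # The small constant avoids log(0) on empty frequency bins.
    power = np.abs(np.fft.rfft(samples.astype(np.float64))) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))


def looks_like_static(samples: np.ndarray, threshold: float = 0.3) -> bool:
    """Flag clips whose spectrum is suspiciously flat (white noise, or pure silence)."""
    return spectral_flatness(samples) > threshold


sr = 32000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)                      # one second of a 440 Hz sine
noise = np.random.default_rng(0).standard_normal(sr)    # one second of white noise

print(looks_like_static(tone), looks_like_static(noise))
```

For a real file, load the samples with scipy.io.wavfile.read("track.wav") and pass one channel in. Real music will not be as clean as a sine, so treat the threshold as a starting point to adjust by ear.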

I'm including the dead end on purpose. The honest map of "local AI on a Mac" has potholes in
it, and pretending otherwise just wastes the next person's afternoon.

The third option, Stable Audio Open, also has ComfyUI nodes — but its weights are
gated behind a license click, so it needs a Hugging Face token. Parked for another day.

Putting it together: one `ffmpeg` command

With a track.wav in hand, scoring the reel is a single command. Fade the audio out over the
final second (starting at 14 s), trim it to the reel's 15 seconds, and copy the video stream
untouched so the picture isn't re-encoded; only the audio gets encoded, to AAC:

ffmpeg -y -i reel.mp4 -i track.wav \
  -filter_complex "[1:a]afade=t=out:st=14:d=1,atrim=0:15[a]" \
  -map 0:v:0 -map "[a]" -c:v copy -c:a aac -b:a 192k -shortest scored.mp4

-c:v copy is the important bit — the picture isn't touched or degraded, we're just
adding an audio track to the existing file. It runs in well under a second.
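The 14 and 15 in that filter are hard-coded for a 15-second reel. If you script this for reels of other lengths, the command can be built from the duration instead. A minimal sketch, where `build_mux_cmd` is a hypothetical helper of mine that only assembles the same ffmpeg arguments shown above:

```python
def build_mux_cmd(video, audio, out, duration=15.0, fade=1.0):
    """Assemble the ffmpeg mux command; the fade-out ends exactly at `duration`."""
    fade_start = duration - fade
    audio_filter = (
        f"[1:a]afade=t=out:st={fade_start:g}:d={fade:g},"
        f"atrim=0:{duration:g}[a]"
    )
    return [
        "ffmpeg", "-y", "-i", video, "-i", audio,
        "-filter_complex", audio_filter,
        "-map", "0:v:0", "-map", "[a]",
        "-c:v", "copy", "-c:a", "aac", "-b:a", "192k",
        "-shortest", out,
    ]


cmd = build_mux_cmd("reel.mp4", "track.wav", "scored.mp4")
print(" ".join(cmd))
```

To actually run it, hand the list to subprocess.run(cmd, check=True). Passing a list rather than a shell string sidesteps quoting problems with the filter expression.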

The numbers

Everything below on the Mac Studio — M1 Max, 64 GB, MPS, fully local:

Step	Detail	Time
Reel stills	6 × SDXL @ 832×1216	~419 s
Ken Burns + crossfade	`ffmpeg`	~6 s
Music	MusicGen-medium, 15 s	~180 s
Mux	`ffmpeg`, copy video	< 1 s
Total	end-to-end	~10 min

The music is roughly a third of the total time now — and like the images, it's a one-time
cost on hardware I already own, not a metered API call.
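For anyone checking the arithmetic, here is the table summed up in a few lines. The only assumption is that I round the "< 1 s" mux step up to a full second.

```python
# Step timings in seconds, taken from the table above (mux rounded up from "< 1 s").
steps = {"stills": 419, "ken_burns": 6, "music": 180, "mux": 1}

total = sum(steps.values())
minutes = round(total / 60, 1)
music_share = round(steps["music"] / total, 2)

print(total, minutes, music_share)  # 606 10.1 0.3
```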

Where the series stands

The local studio can now do the whole job: image → motion → reel → soundtrack, all
offline on a Mac.

Part 1 — Install ·
Part 2 — Drive from code ·
Part 3 — FLUX vs SDXL ·
Part 4 — Video ·
Part 5 — A reel ·
Part 6 — Sound

No cloud, no API keys, no per-frame bill — and now with a soundtrack to match. Go turn your
Mac into a studio.