
ace-step

Apr 17, 2026

ai · music · diffusion · architecture

skimmed the ace-step paper finally.

they beat suno v4.5 on most benchmarks. 2 seconds on an a100. under 4gb vram. open source.

here's the split they use:

lm as planner

language model takes "upbeat rock about rain" and expands it — full structure, lyrics, instruments, timeline up to 10 minutes if you want. chain-of-thought style. metadata synthesis.

basically: the lm doesn't make sound. it makes the blueprint.
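here's a rough sketch of how i picture that planner stage, prompt in, blueprint out. the function names and plan fields are all my invention, not from the paper:

```python
# hypothetical sketch of the "lm as planner" stage.
# call_lm() stands in for whatever language model ace-step uses;
# the plan fields are my guess at what a blueprint could contain.
import json

def plan_song(prompt: str, call_lm) -> dict:
    """Expand a short prompt into a structured song blueprint."""
    instruction = (
        "Expand this prompt into a song plan as JSON with keys: "
        "structure (list of sections), lyrics, instruments, "
        "duration_seconds.\n\nPrompt: " + prompt
    )
    return json.loads(call_lm(instruction))

# usage (my_lm is any text-in, text-out model you have lying around):
# plan = plan_song("upbeat rock about rain", call_lm=my_lm)
```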

dit as musician

diffusion transformer (4b params on xl) actually synthesizes the audio. noise → waveforms. the usual diffusion dance but optimized hard.
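the actual sampler is whatever the paper tuned, but the bare shape of noise → waveforms is just a denoising loop. a minimal euler-style sketch, assuming the dit predicts a velocity field conditioned on the planner's blueprint (all names mine, not ace-step's real code):

```python
# generic velocity-prediction sampling loop, not ace-step's actual sampler.
import torch

@torch.no_grad()
def dit_synthesize(dit, plan_embedding, latent_shape, steps=50):
    """Integrate from pure noise toward a clean audio latent."""
    x = torch.randn(latent_shape)                 # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt)
        v = dit(x, t, plan_embedding)             # predicted velocity at time t
        x = x + v * dt                            # one euler step toward the data
    return x                                      # decode latent -> waveform elsewhere
```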

the interesting part

no human feedback. no rlhf. they call it "intrinsic reinforcement learning" — the model judges its own outputs internally. lm and dit iterate against each other until it sounds right.

this eliminates the bias you'd get from external reward models or human preference datasets. pure self-play between components.
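i have no idea what their internal judge actually is, but the iterate-until-it-sounds-right loop i'm imagining is roughly this. everything here is speculation, judge_fn is just a placeholder for whatever intrinsic reward they use:

```python
# speculative sketch of the self-play loop between planner and synthesizer.
def generate(prompt, plan_fn, render_fn, judge_fn, max_rounds=4, threshold=0.8):
    plan = plan_fn(prompt)                          # lm makes the blueprint
    audio = render_fn(plan)                         # dit renders it
    for _ in range(max_rounds):
        score, critique = judge_fn(audio, plan)     # internal judge, no human labels
        if score >= threshold:
            break
        # fold the critique back into the next plan and re-render
        plan = plan_fn(prompt + "\nrevise, fixing: " + critique)
        audio = render_fn(plan)
    return audio
```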

why it lands

speed without sacrifice. 10–120× faster than alternatives. and you can lora train on a few songs to capture your style.
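the lora bit looks like standard lora as far as i can tell. nothing ace-step-specific below, just the core idea: freeze the big pretrained weights and learn a tiny low-rank correction from your handful of songs:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # original path plus the low-rank correction (the only trainable part)
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```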

the lm/dit split feels like a pattern we'll see more of. one model plans, another executes. specialization beats generalization when latency matters.

hope i get time to deep dive into the lm and dit architectures after this. and hope i understand them 😬


the thought is mine. the words are written by janis, my openclaw agent.