continuation from ace-step. my architecture knowledge was stuck on cnns and a bit of transformers.
turns out diffusion is a completely different game.
the training: learning to remove noise
you take clean data — an image, audio spectrogram, whatever — and add random gaussian noise to it. gradually, over many timesteps, until it becomes pure static.
the model learns to reverse this. given a noised sample and the timestep, it predicts the noise that was added. not the clean image — just the noise.
loss function is simple: how close was the predicted noise to the actual noise?
repeat millions of times. the model learns the shape of noise at every level of corruption.
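the training loop above can be sketched in a few lines. this is a minimal numpy sketch, assuming a standard linear beta schedule; `model` is a stand-in callable, not a real network, and the names (`noise_sample`, `training_loss`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise added per step
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # how much clean signal survives by step t

def noise_sample(x0, t):
    """forward process: jump straight to t steps of corruption in one shot."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def training_loss(model, x0):
    t = rng.integers(0, T)            # random corruption level per example
    xt, eps = noise_sample(x0, t)
    eps_pred = model(xt, t)           # model predicts the noise, not the clean data
    return np.mean((eps_pred - eps) ** 2)  # mse between predicted and actual noise
```

by step `T` almost no signal survives (`alpha_bars[-1]` is near zero), which is the "pure static" end of the schedule.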
the inference: starting from nothing
generation starts with pure random noise. the initial sample itself has no structure, just static. (in conditional models the prompt guides the denoiser at every step, but the starting point is still noise.)
then you iterate:
- model predicts the noise in the current sample
- subtract (most of) that noise
- repeat for hundreds of steps
each step reveals a bit more structure. noise becomes texture, texture becomes shapes, shapes become coherent output.
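the loop above, written out as a ddpm-style sampler sketch. this reuses the schedule names from the training sketch; `model` is again a stand-in, and the exact update rule shown is one standard choice among several samplers, an assumption rather than the only way to do it.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample(model, shape):
    x = rng.standard_normal(shape)    # start from pure static
    for t in reversed(range(T)):
        eps_pred = model(x, t)        # predict the noise in the current sample
        # subtract (most of) the predicted noise
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
        if t > 0:
            # re-inject a little fresh noise, except at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

the partial re-noising at each step is why it's "subtract (most of) that noise" and not all of it.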
transformer vs diffusion
| | transformer | diffusion |
|---|---|---|
| direction | left to right, causal | random noise to structured data |
| step count | one forward pass | hundreds of denoising iterations |
| generation | predict next token | remove predicted noise |
| parallelism | autoregressive, serial across tokens | serial across timesteps, but every position denoised in parallel within a step |
| control | prompt guides output | prompt + guidance scale + number of steps |
transformers feel like streaming — one token at a time, each depends on all before.
diffusion feels like sculpting — start with a block, refine iteratively, no causal chain.
signal from noise
the magic is that the model never sees clean data during generation. it only learned to estimate noise. but through iterative application, small corrections accumulate into coherent structure.
like gradient descent for images: start random, follow the gradient toward data distribution.
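the gradient descent analogy in one dimension. a toy sketch, assuming the data distribution is a gaussian centered at 3 so its score (the gradient of the log density) is known in closed form; real diffusion models learn this quantity implicitly via noise prediction.

```python
# score of N(mu, var): the direction pointing toward the data distribution
def score(x, mu=3.0, var=1.0):
    return (mu - x) / var

x = 50.0                     # start far from the data, like pure noise
for _ in range(200):
    x = x + 0.1 * score(x)   # small correction at each step
# x ends up near 3.0: many small corrections accumulate into structure
```

each individual step barely moves the sample, but iterating the estimate is enough to land on the distribution, which is the same trick the denoising loop pulls off in high dimensions.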
next: dit
diffusion transformer (dit) combines both. uses transformer architecture for the noise prediction step, applied iteratively.
best of both worlds, or at least a different tradeoff between generation quality and computational cost.
need to understand how attention works when applied iteratively, not autoregressively.
the thought is mine. the words are written by janis, my openclaw agent.