
diffusion

Apr 21, 2026

ai · machine-learning · diffusion · architecture

continuation from ace-step. my architecture knowledge was stuck at cnns and a bit of transformers.

turns out diffusion is a completely different game.

the training: learning to remove noise

you take clean data — an image, audio spectrogram, whatever — and add random gaussian noise to it. gradually, over many timesteps, until it becomes pure static.

the model learns to reverse this. given a noised sample and the timestep, it predicts the noise that was added. not the clean image — just the noise.

loss function is simple: how close was the predicted noise to the actual noise?

repeat millions of times. the model learns the shape of noise at every level of corruption.
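the training step above fits in a few lines of numpy. everything here is a stand-in: `predict_noise` replaces a real u-net or transformer, and the linear beta schedule is one common choice, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # noise schedule: how much noise per step
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal retention at each step

def predict_noise(x_t, t, w):
    # placeholder for the real model (a u-net or transformer):
    # given a noised sample and timestep, guess the noise inside it
    return w * x_t

def training_step(x0, w):
    t = rng.integers(0, T)                    # pick a random corruption level
    eps = rng.standard_normal(x0.shape)       # the noise we will add
    # closed-form forward process: jump straight to step t in one shot
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
    eps_hat = predict_noise(x_t, t, w)
    return np.mean((eps_hat - eps) ** 2)      # mse between predicted and true noise

x0 = rng.standard_normal(64)  # stand-in for clean data (image, spectrogram, ...)
loss = training_step(x0, w=0.5)
```

the trick worth noticing: you never simulate the noising step by step during training. the closed-form jump to any timestep t is what makes sampling random timesteps cheap.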

the inference: starting from nothing

generation starts with pure random noise. the sample itself has no structure, just gaussian static; any prompt conditioning enters through the denoiser's inputs, not the starting sample.

then you iterate:

  1. model predicts the noise in the current sample
  2. subtract (most of) that noise
  3. repeat for hundreds of steps

each step reveals a bit more structure. noise becomes texture, texture becomes shapes, shapes become coherent output.
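a minimal sketch of that loop, assuming a ddpm-style update rule. `predict_noise` is a placeholder for the trained network; with a real model the final `x` would be a coherent sample instead of shrunken static.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def predict_noise(x_t, t):
    return x_t  # placeholder; the trained network goes here

x = rng.standard_normal(64)  # step 0: pure gaussian static
for t in reversed(range(T)):
    eps_hat = predict_noise(x, t)
    # subtract (most of) the predicted noise: the ddpm posterior mean
    x = (x - betas[t] / np.sqrt(1 - alphas_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        # re-inject a little fresh noise at every step except the last
        x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
```

note the re-injected noise: the loop doesn't just subtract, it subtracts then perturbs, which is why "subtract (most of) that noise" is the honest description.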

transformer vs diffusion

|             | transformer                | diffusion                                                      |
|-------------|----------------------------|----------------------------------------------------------------|
| direction   | left to right, causal      | random noise to structured data                                |
| step count  | one forward pass           | hundreds of denoising iterations                               |
| generation  | predict next token         | remove predicted noise                                         |
| parallelism | autoregressive, serial across tokens | parallel across all positions within a step; steps themselves are sequential |
| control     | prompt guides output       | prompt + guidance scale + number of steps                      |
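the "guidance scale" knob usually means classifier-free guidance: run the denoiser twice, with and without the prompt, and extrapolate toward the conditioned prediction. a minimal sketch (the function name and arrays are illustrative):

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, guidance_scale):
    # classifier-free guidance: push the noise estimate away from the
    # unconditional prediction, toward the prompt-conditioned one
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])   # prediction given the prompt
eps_u = np.array([0.0, 0.0])   # prediction with no prompt
blended = guided_noise(eps_c, eps_u, 7.5)
```

scale 1.0 reduces to the plain conditional prediction; larger scales amplify whatever the prompt changed, which is why high guidance makes outputs more literal but less diverse.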

transformers feel like streaming — one token at a time, each depends on all before.

diffusion feels like sculpting — start with a block, refine iteratively, no causal chain.

signal from noise

the magic is that the model never sees clean data during generation. it only learned to estimate noise. but through iterative application, small corrections accumulate into coherent structure.

like gradient descent for images: start random, follow the gradient toward data distribution.

next: dit

diffusion transformer (dit) combines both. uses transformer architecture for the noise prediction step, applied iteratively.

best of both worlds, or at least a different tradeoff between generation quality and computational cost.

need to understand how attention works when applied iteratively, not autoregressively.


the thought is mine. the words are written by janis, my openclaw agent.