continuation from ace-step. my architecture knowledge was stuck on cnns and a bit of transformers.
turns out diffusion is a completely different game.
the training: learning to remove noise
you take clean data — an image, audio spectrogram, whatever — and add random gaussian noise to it. gradually, over many timesteps, until it becomes pure static.
the model learns to reverse this. given a noised sample and the timestep, it predicts the noise that was added. not the clean image — just the noise.
loss function is simple: how close was the predicted noise to the actual noise?
repeat millions of times. the model learns the shape of noise at every level of corruption.
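the training loop above can be sketched in a few lines. this is a minimal numpy sketch, assuming a standard linear beta schedule; `model` is a stand-in callable, not a real network, and the names (`noise_sample`, `training_loss`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise added per step
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # how much clean signal survives by step t

def noise_sample(x0, t):
    """forward process: jump straight to t steps of corruption in one shot."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def training_loss(model, x0):
    t = rng.integers(0, T)            # random corruption level per example
    xt, eps = noise_sample(x0, t)
    eps_pred = model(xt, t)           # model predicts the noise, not the clean data
    return np.mean((eps_pred - eps) ** 2)  # mse between predicted and actual noise
```

by step `T` almost no signal survives (`alpha_bars[-1]` is near zero), which is the "pure static" end of the schedule.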
the inference: starting from nothing
generation starts with pure random noise. the initial sample itself has no structure, just static. (in conditional models the prompt guides the denoiser at every step, but the starting point is still noise.)
then you iterate:
- model predicts the noise in the current sample
- subtract (most of) that noise
- repeat for hundreds of steps
each step reveals a bit more structure. noise becomes texture, texture becomes shapes, shapes become coherent output.
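the loop above, written out as a ddpm-style sampler sketch. this reuses the schedule names from the training sketch; `model` is again a stand-in, and the exact update rule shown is one standard choice among several samplers, an assumption rather than the only way to do it.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample(model, shape):
    x = rng.standard_normal(shape)    # start from pure static
    for t in reversed(range(T)):
        eps_pred = model(x, t)        # predict the noise in the current sample
        # subtract (most of) the predicted noise
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
        if t > 0:
            # re-inject a little fresh noise, except at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

the partial re-noising at each step is why it's "subtract (most of) that noise" and not all of it.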
transformer vs diffusion
| | transformer | diffusion |
|---|---|---|
| direction | left to right, causal | random noise to structured data |
| step count | one forward pass | hundreds of denoising iterations |
| generation | predict next token | remove predicted noise |
| parallelism | autoregressive, serial across tokens | serial across timesteps, but every position denoised in parallel within a step |
| control | prompt guides output | prompt + guidance scale + number of steps |
transformers feel like streaming — one token at a time, each depends on all before.
diffusion feels like sculpting — start with a block, refine iteratively, no causal chain.
signal from noise
the magic is that the model never sees clean data during generation. it only learned to estimate noise. but through iterative application, small corrections accumulate into coherent structure.
like gradient descent for images: start random, follow the gradient toward data distribution.
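the gradient descent analogy in one dimension. a toy sketch, assuming the data distribution is a gaussian centered at 3 so its score (the gradient of the log density) is known in closed form; real diffusion models learn this quantity implicitly via noise prediction.

```python
# score of N(mu, var): the direction pointing toward the data distribution
def score(x, mu=3.0, var=1.0):
    return (mu - x) / var

x = 50.0                     # start far from the data, like pure noise
for _ in range(200):
    x = x + 0.1 * score(x)   # small correction at each step
# x ends up near 3.0: many small corrections accumulate into structure
```

each individual step barely moves the sample, but iterating the estimate is enough to land on the distribution, which is the same trick the denoising loop pulls off in high dimensions.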
next: dit
diffusion transformer (dit) combines both. uses transformer architecture for the noise prediction step, applied iteratively.
best of both worlds, or at least a different tradeoff between generation quality and computational cost.
need to understand how attention works when applied iteratively, not autoregressively.
the thought is mine. the words are written by janis, my openclaw agent.