Flash attention

Prologue

The Toy Model

The Machine

[Machine diagram: DRAM holds matrices A (a₀₀, a₀₁, a₁₀, a₁₁) and B (b₀₀, b₀₁, b₁₀, b₁₁); LOAD and STORE move values between DRAM and a register file (R0, R1, R2); a MAC compute unit computes a × b + c]

Instruction Set

Instruction set

| Instruction | Operation | Cycles |
|---|---|---|
| LOAD src, Rd | Rd = DRAM[src] | 1 |
| STORE Rs, dst | DRAM[dst] = Rs | 1 |
| MAC_Z R0, R1, Rd | Rd = R0 × R1 | 1 |
| MAC R0, R1, Ra, Rd | Rd = R0 × R1 + Ra | 1 |
Problem 1: Compute c₀₀ — write the instruction trace

$$A, B \in \mathbb{R}^{2 \times 2} \qquad C = A \times B \qquad c_{ij} = \sum_k a_{ik} \cdot b_{kj}$$

Write the instruction trace for $c_{00} = a_{00} b_{00} + a_{01} b_{10}$:

| Step | Instruction | R0 | R1 | R2 |
|---|---|---|---|---|
| 1 | LOAD a[0][0], R0 | $a_{00}$ | | |
| 2 | | | | |
| 3 | | | | |
| 4 | | | | |
| 5 | | | | |
| 6 | | | | |
| 7 | STORE R2, c[0][0] | | | $c_{00}$ |
Full trace for c₀₀

| Step | Instruction | R0 | R1 | R2 |
|---|---|---|---|---|
| 1 | LOAD a[0][0], R0 | $a_{00}$ | | |
| 2 | LOAD b[0][0], R1 | $a_{00}$ | $b_{00}$ | |
| 3 | MAC_Z R0, R1, R2 | $a_{00}$ | $b_{00}$ | $a_{00} b_{00}$ |
| 4 | LOAD a[0][1], R0 | $a_{01}$ | $b_{00}$ | $a_{00} b_{00}$ |
| 5 | LOAD b[1][0], R1 | $a_{01}$ | $b_{10}$ | $a_{00} b_{00}$ |
| 6 | MAC R0, R1, R2, R2 | $a_{01}$ | $b_{10}$ | $c_{00}$ |
| 7 | STORE R2, c[0][0] | $a_{01}$ | $b_{10}$ | $c_{00}$ |

Cost per output element: 4 LOADs + 2 MACs + 1 STORE = 7 cycles

The first MAC uses MAC_Z (multiply only, no accumulate) because there is no prior partial sum. The second uses MAC to accumulate into R2. The partial sum lives in R2 across both MACs — this is why we need three registers: two for operands (overwritten each iteration) and one for the running accumulator.
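To make the trace concrete, here is a minimal sketch of an interpreter for the toy ISA (the `run` helper and the string operand names are made up for this illustration; DRAM is modeled as a dict) that executes the seven instructions above and checks the result:

```python
# Minimal simulator for the toy machine (a sketch, not part of the text's ISA).
# DRAM is a dict from names like "a00" to floats; the register file is a dict.

def run(trace, dram):
    regs = {}
    for op, *args in trace:
        if op == "LOAD":          # LOAD src, Rd
            src, rd = args
            regs[rd] = dram[src]
        elif op == "STORE":       # STORE Rs, dst
            rs, dst = args
            dram[dst] = regs[rs]
        elif op == "MAC_Z":       # Rd = R0 * R1 (no accumulate)
            r0, r1, rd = args
            regs[rd] = regs[r0] * regs[r1]
        elif op == "MAC":         # Rd = R0 * R1 + Ra
            r0, r1, ra, rd = args
            regs[rd] = regs[r0] * regs[r1] + regs[ra]
    return dram

dram = {"a00": 1.0, "a01": 2.0, "b00": 3.0, "b10": 4.0}
trace = [
    ("LOAD", "a00", "R0"),
    ("LOAD", "b00", "R1"),
    ("MAC_Z", "R0", "R1", "R2"),
    ("LOAD", "a01", "R0"),
    ("LOAD", "b10", "R1"),
    ("MAC", "R0", "R1", "R2", "R2"),
    ("STORE", "R2", "c00"),
]
run(trace, dram)
assert dram["c00"] == 1.0 * 3.0 + 2.0 * 4.0  # 11.0
```

The seven entries in `trace` are exactly the seven rows of the table: 4 LOADs, 2 MACs, 1 STORE.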

Problem 2: Full matmul — Locality!

Now compute all four elements of $C$. Naturally $4 \times 7 = 28$ cycles … or less?

Hint: which elements share operands?

$c_{00}$ and $c_{01}$ both use row 0 of $A$. After finishing $c_{00}$, R0 holds $a_{01}$ — if you compute $c_{01}$ next, you can reuse it.

Discussion: minimum registers and modern GPUs

This is the smallest machine that can compute a matrix multiplication. The MAC unit needs two operand registers (R0, R1), and any dot product longer than one term needs a third register to hold the partial sum — because loading the next pair of operands overwrites R0 and R1, and there is no way to accumulate without a register (MAC takes a register, not a RAM address, as its addend). Three registers is the minimum.

In modern GPU architectures, the compute unit (e.g., a Tensor Core) sometimes has its own dedicated memory that it manages internally — a small, private buffer separate from the shared register file. In that case, the accumulator lives inside the compute unit itself, and the register file only needs to supply operands. This is a design choice in NVIDIA’s Blackwell architecture: by giving the Tensor Core its own accumulator storage, you reduce pressure on the register file — effectively shrinking the “register file” in our toy model back toward 2 registers, with the third hidden inside the MAC unit.

Enlarging the Register File

Problem 3: Outer product matmul — write the trace with 6 registers

$$C = \underbrace{\begin{bmatrix} a_{00} \\ a_{10} \end{bmatrix} \begin{bmatrix} b_{00} & b_{01} \end{bmatrix}}_{k=0} + \underbrace{\begin{bmatrix} a_{01} \\ a_{11} \end{bmatrix} \begin{bmatrix} b_{10} & b_{11} \end{bmatrix}}_{k=1}$$

We double the register file from 3 to 6 registers: how many cycles does the outer product approach take?

Hint: how to count LOADs in outer product

Each outer product $k$ loads one column of $A$ and one row of $B$, then performs 4 MACs (one per output element). For $k=0$, use MAC_Z (no prior accumulator). For $k=1$, use MAC to accumulate into the same registers. Count the LOADs carefully — can you reuse any operands within one outer product?

Full trace: outer product matmul

Outer product $k = 0$:

| Step | Instruction | R0 | R1 | R2 | R3 | R4 | R5 |
|---|---|---|---|---|---|---|---|
| 1 | LOAD a[0][0], R0 | $a_{00}$ | | | | | |
| 2 | LOAD b[0][0], R1 | $a_{00}$ | $b_{00}$ | | | | |
| 3 | MAC_Z R0, R1, R2 | $a_{00}$ | $b_{00}$ | $a_{00}b_{00}$ | | | |
| 4 | LOAD b[0][1], R1 | $a_{00}$ | $b_{01}$ | $a_{00}b_{00}$ | | | |
| 5 | MAC_Z R0, R1, R3 | $a_{00}$ | $b_{01}$ | $a_{00}b_{00}$ | $a_{00}b_{01}$ | | |
| 6 | LOAD a[1][0], R0 | $a_{10}$ | $b_{01}$ | $a_{00}b_{00}$ | $a_{00}b_{01}$ | | |
| 7 | MAC_Z R0, R1, R5 | $a_{10}$ | $b_{01}$ | $a_{00}b_{00}$ | $a_{00}b_{01}$ | | $a_{10}b_{01}$ |
| 8 | LOAD b[0][0], R1 | $a_{10}$ | $b_{00}$ | $a_{00}b_{00}$ | $a_{00}b_{01}$ | | $a_{10}b_{01}$ |
| 9 | MAC_Z R0, R1, R4 | $a_{10}$ | $b_{00}$ | $a_{00}b_{00}$ | $a_{00}b_{01}$ | $a_{10}b_{00}$ | $a_{10}b_{01}$ |

Outer product $k = 1$:

| Step | Instruction | R0 | R1 | R2 | R3 | R4 | R5 |
|---|---|---|---|---|---|---|---|
| 10 | LOAD a[0][1], R0 | $a_{01}$ | $b_{00}$ | $a_{00}b_{00}$ | $a_{00}b_{01}$ | $a_{10}b_{00}$ | $a_{10}b_{01}$ |
| 11 | LOAD b[1][0], R1 | $a_{01}$ | $b_{10}$ | $a_{00}b_{00}$ | $a_{00}b_{01}$ | $a_{10}b_{00}$ | $a_{10}b_{01}$ |
| 12 | MAC R0, R1, R2, R2 | $a_{01}$ | $b_{10}$ | $c_{00}$ | $a_{00}b_{01}$ | $a_{10}b_{00}$ | $a_{10}b_{01}$ |
| 13 | LOAD b[1][1], R1 | $a_{01}$ | $b_{11}$ | $c_{00}$ | $a_{00}b_{01}$ | $a_{10}b_{00}$ | $a_{10}b_{01}$ |
| 14 | MAC R0, R1, R3, R3 | $a_{01}$ | $b_{11}$ | $c_{00}$ | $c_{01}$ | $a_{10}b_{00}$ | $a_{10}b_{01}$ |
| 15 | LOAD a[1][1], R0 | $a_{11}$ | $b_{11}$ | $c_{00}$ | $c_{01}$ | $a_{10}b_{00}$ | $a_{10}b_{01}$ |
| 16 | MAC R0, R1, R5, R5 | $a_{11}$ | $b_{11}$ | $c_{00}$ | $c_{01}$ | $a_{10}b_{00}$ | $c_{11}$ |
| 17 | LOAD b[1][0], R1 | $a_{11}$ | $b_{10}$ | $c_{00}$ | $c_{01}$ | $a_{10}b_{00}$ | $c_{11}$ |
| 18 | MAC R0, R1, R4, R4 | $a_{11}$ | $b_{10}$ | $c_{00}$ | $c_{01}$ | $c_{10}$ | $c_{11}$ |

Store:

| Step | Instruction |
|---|---|
| 19 | STORE R2, c[0][0] |
| 20 | STORE R3, c[0][1] |
| 21 | STORE R4, c[1][0] |
| 22 | STORE R5, c[1][1] |

Cost: 10 LOADs + 8 MACs + 4 STOREs = 22 cycles with 6 registers

| | Inner product | Outer product |
|---|---|---|
| Registers | 3 | 6 |
| Cycles (full matmul) | 25 | 22 |
| LOADs | 13 | 10 |

More registers → each loaded value gets reused across multiple output elements before being replaced → fewer total LOADs.

Quiz: inner product with 6 registers

Adding three more registers also accelerates the inner product. With 6 registers, what is the best inner product scheme? How many cycles does it take, and where do the savings come from?

Softmax

The Machine

Machine diagram

[Machine diagram: DRAM holds the input x[0..n−1] and output out[0..n−1]; LOAD and STORE move values to a register file (R0, r1, r2); an ALU computes MAX, SUB, EXP, ADD, MUL, DIV]

Instruction Set

Instruction set

| Instruction | Operation |
|---|---|
| LOAD src, Rd | Rd = DRAM[src] |
| STORE Rs, dst | DRAM[dst] = Rs |
| MAX Ra, Rb, Rd | Rd = max(Ra, Rb) |
| SUB Ra, Rb, Rd | Rd = Ra - Rb |
| EXP Ra, Rd | Rd = exp(Ra) |
| ADD Ra, Rb, Rd | Rd = Ra + Rb |
| MUL Ra, Rb, Rd | Rd = Ra × Rb |
| DIV Ra, Rb, Rd | Rd = Ra / Rb |
| MOV Rs, Rd | Rd = Rs |

Safe Softmax

$$\text{softmax}(x_i) = \frac{e^{x_i - m}}{\sum_{j=0}^{n-1} e^{x_j - m}} \quad \text{where} \quad m = \max_j x_j$$

Problem 4: Safe softmax on the machine

With 3 registers, compute the safe softmax. How many DRAM reads and writes total?

Hint: Pass 1 — Get max

Find $m = \max(x_0, \ldots, x_{n-1})$.

| Instruction | R0 | r1 | r2 |
|---|---|---|---|
| LOAD x[0], r1 | | $x_0$ | |
| LOAD x[1], R0 | $x_1$ | $x_0$ | |
| MAX R0, r1, r1 | $x_1$ | $\max(x_0, x_1)$ | |
| LOAD x[2], R0 | $x_2$ | $\max(x_0, x_1)$ | |
| MAX R0, r1, r1 | $x_2$ | $\max(x_0, x_1, x_2)$ | |
| ⋮ | | | |
| LOAD x[n-1], R0 | $x_{n-1}$ | $\max(x_{0..n-2})$ | |
| MAX R0, r1, r1 | $x_{n-1}$ | $m$ | |

Cost: $n$ reads, 0 writes. r1 = $m$.

Hint: Pass 2 — Get sum of exponents

Compute $S = \sum e^{x_i - m}$ using the known max. The exponent is computed and immediately consumed — never written to DRAM.

| Instruction | R0 | r1 | r2 |
|---|---|---|---|
| (r1 = m from pass 1) | | $m$ | |
| LOAD x[0], R0 | $x_0$ | $m$ | |
| SUB R0, r1, R0 | $x_0 - m$ | $m$ | |
| EXP R0, R0 | $e^{x_0 - m}$ | $m$ | |
| MOV R0, r2 | $e^{x_0 - m}$ | $m$ | $e^{x_0 - m}$ |
| LOAD x[1], R0 | $x_1$ | $m$ | $e^{x_0 - m}$ |
| SUB R0, r1, R0 | $x_1 - m$ | $m$ | $e^{x_0 - m}$ |
| EXP R0, R0 | $e^{x_1 - m}$ | $m$ | $e^{x_0 - m}$ |
| ADD r2, R0, r2 | $e^{x_1 - m}$ | $m$ | $e^{x_0-m} + e^{x_1-m}$ |
| ⋮ | | $m$ | |

Cost: $n$ reads, 0 writes. r1 = $m$, r2 = $S$.

Hint: Pass 3 — Normalize

Compute $\text{softmax}(x_i) = e^{x_i - m} / S$. We recompute $e^{x_i - m}$ (since it was never stored) and divide by $S$.

| Instruction | R0 | r1 | r2 |
|---|---|---|---|
| (r1 = m, r2 = S from pass 2) | | $m$ | $S$ |
| LOAD x[0], R0 | $x_0$ | $m$ | $S$ |
| SUB R0, r1, R0 | $x_0 - m$ | $m$ | $S$ |
| EXP R0, R0 | $e^{x_0-m}$ | $m$ | $S$ |
| DIV R0, r2, R0 | $e^{x_0-m}/S$ | $m$ | $S$ |
| STORE R0, out[0] | $e^{x_0-m}/S$ | $m$ | $S$ |
| LOAD x[1], R0 | $x_1$ | $m$ | $S$ |
| SUB R0, r1, R0 | $x_1 - m$ | $m$ | $S$ |
| EXP R0, R0 | $e^{x_1-m}$ | $m$ | $S$ |
| DIV R0, r2, R0 | $e^{x_1-m}/S$ | $m$ | $S$ |
| STORE R0, out[1] | $e^{x_1-m}/S$ | $m$ | $S$ |
| ⋮ | | $m$ | $S$ |

Cost: $n$ reads, $n$ writes. The exponent $e^{x_i - m}$ is recomputed from $x_i$ and $m$ — we trade extra compute for avoiding DRAM storage of the intermediate array.

Hint: Total cost

| Pass | DRAM reads | DRAM writes |
|---|---|---|
| 1. Get max | $n$ | 0 |
| 2. Get sum of exponents | $n$ | 0 |
| 3. Normalize | $n$ | $n$ |
| Total | $3n$ | $n$ |

No intermediate array $e_0, \ldots, e_{n-1}$ in DRAM — the exponents are computed on the fly in both pass 2 and pass 3.
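The three passes translate directly into a short sketch. Here the list `x` plays the role of DRAM, and every element access is tallied so the $3n$ reads and $n$ writes can be checked (the counting scaffolding is ours, not part of the machine):

```python
# Three-pass safe softmax, with DRAM traffic counted explicitly.
import math

def safe_softmax(x):
    reads = writes = 0
    # Pass 1: global max
    m = x[0]; reads += 1
    for v in x[1:]:
        m = max(m, v); reads += 1
    # Pass 2: sum of shifted exponents (each exponent is consumed, never stored)
    s = 0.0
    for v in x:
        s += math.exp(v - m); reads += 1
    # Pass 3: recompute each exponent and normalize
    out = []
    for v in x:
        out.append(math.exp(v - m) / s); reads += 1; writes += 1
    return out, reads, writes

x = [1.0, 2.0, 3.0, 4.0]
out, reads, writes = safe_softmax(x)
assert reads == 3 * len(x) and writes == len(x)
assert abs(sum(out) - 1.0) < 1e-12
```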

One More Register

Two Paths to the Same Sum

Two paths to the same sum

We want $S = \sum_{i=0}^{n-1} e^{x_i - m}$ where $m = \max_j x_j$. Let $m_k = \max(x_0, \ldots, x_{k-1})$ be the running max after seeing $k$ elements.

Path 1: Global max first (two passes). Compute $m = m_n$ in one pass. Then define $s_k = \sum_{j=0}^{k-1} e^{x_j - m}$ — the partial sum using the known global max. The recurrence is:

$$s_k = s_{k-1} + e^{x_{k-1} - m}$$

Path 2: Running max (single pass). Define $s_k = \sum_{j=0}^{k-1} e^{x_j - m_k}$ — the partial sum using the running max $m_k$. Update both together:

$$m_k = \max(m_{k-1},\; x_{k-1}) \qquad s_k = s_{k-1} \cdot e^{m_{k-1} - m_k} + e^{x_{k-1} - m_k}$$

The rescaling factor $e^{m_{k-1} - m_k}$ corrects all previously accumulated terms for the new max. Since $m_k \geq m_{k-1}$, this factor is always $\leq 1$ — no overflow.

At $k = n$: $m_n = m$, so both definitions of $s_k$ agree: $s_n = S$.

This is the online softmax algorithm (Milakov & Gimelshein, 2018).
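Path 2 is a few lines of code. This sketch maintains the $(m, s)$ state exactly as in the recurrence and checks it against the two-pass answer:

```python
# Single-pass online softmax denominator (Path 2): maintain (m, s) and
# rescale s whenever the running max changes.
import math

def online_softmax_denominator(x):
    m = float("-inf")  # running max m_k
    s = 0.0            # running sum s_k, always relative to the current m
    for v in x:
        m_new = max(m, v)
        s = s * math.exp(m - m_new) + math.exp(v - m_new)
        m = m_new
    return m, s

x = [1.0, 3.0, 2.0, 5.0]
m, s = online_softmax_denominator(x)
assert m == 5.0
expected = sum(math.exp(v - m) for v in x)  # Path 1: global max first
assert abs(s - expected) < 1e-12
```

Note the first iteration works for free: `math.exp(-inf)` is 0.0, so the rescaling of the empty sum is a no-op.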

Numerical safety

Writing the rescaling as $e^{m_{k-1} - m_k}$ is safe because the exponent is $\leq 0$. Writing the equivalent $\frac{e^{m_{k-1}}}{e^{m_k}}$ is not — $e^{m_{k-1}}$ alone can overflow before the division happens. Same math, different numerical behavior. The choice of equivalent representation matters.
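A quick numerical check of this point: in float64, `exp` overflows around 709, so the two algebraically equal forms diverge once the maxima are large:

```python
# The two algebraically equal rescaling factors behave differently in float64.
import numpy as np

m_prev, m_new = np.float64(800.0), np.float64(900.0)

safe = np.exp(m_prev - m_new)  # exponent is -100: a tiny but exact value
with np.errstate(over="ignore", invalid="ignore"):
    unsafe = np.exp(m_prev) / np.exp(m_new)  # inf / inf -> nan

assert np.isfinite(safe) and safe <= 1.0
assert np.isnan(unsafe)
```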

Nothing new under the sun?

The trick you just saw — maintaining a running state and correcting it when a reference point changes — sits at the intersection of statistics (the mean), numerical analysis (overflow avoidance), and algorithms (online computation). No single course teaches it, because each considers it someone else’s job, or too simple to bother with.

Architecture courses assume you know the math. Numerical analysis courses assume you know the systems. Algorithm courses assume the problem is abstract and the numbers are exact. The student is left holding pieces from three different boxes, with nobody having shown them that the pieces fit together.

Online Softmax with Four Registers

Problem 5: Write the instruction trace for online softmax with 4 registers

Add one register to the machine (4 total: R0, r1, r2, r3). Using the online softmax recurrence from Path 2, write the instruction trace for the steady-state step — processing one new element $x_i$ and updating the running state $(m, s)$.

How many DRAM reads and writes does the full softmax take with 4 registers?

Hint: full instruction trace

Save old $m$ into r3 before updating r1 with the new max — the rescaling needs both.

Initialization (first element):

| Instruction | R0 | r1 | r2 | r3 |
|---|---|---|---|---|
| LOAD x[0], R0 | $x_0$ | | | |
| MOV R0, r1 | $x_0$ | $x_0$ | | |
| SUB R0, r1, R0 | $0$ | $x_0$ | | |
| EXP R0, R0 | $1$ | $x_0$ | | |
| MOV R0, r2 | $1$ | $x_0$ | $1$ | |

After init: r1 = $m = x_0$, r2 = $s = 1$.

Steady state (each subsequent element $x_i$):

| Step | Instruction | R0 | r1 | r2 | r3 |
|---|---|---|---|---|---|
| 1 | LOAD x[i], R0 | $x_i$ | $m$ | $s$ | |
| 2 | MOV r1, r3 | $x_i$ | $m$ | $s$ | $m$ |
| 3 | MAX r1, R0, r1 | $x_i$ | $m'$ | $s$ | $m$ |
| 4 | SUB r3, r1, r3 | $x_i$ | $m'$ | $s$ | $m - m'$ |
| 5 | SUB R0, r1, R0 | $x_i - m'$ | $m'$ | $s$ | $m - m'$ |
| 6 | EXP r3, r3 | $x_i - m'$ | $m'$ | $s$ | $e^{m - m'}$ |
| 7 | EXP R0, R0 | $e^{x_i - m'}$ | $m'$ | $s$ | $e^{m - m'}$ |
| 8 | MUL r2, r3, r2 | $e^{x_i - m'}$ | $m'$ | $s \cdot e^{m - m'}$ | $e^{m - m'}$ |
| 9 | ADD r2, R0, r2 | $e^{x_i - m'}$ | $m'$ | $s'$ | $e^{m - m'}$ |
Trace explanation and cost analysis

After each iteration: r1 = $m'$, r2 = $s'$. After processing all $n$ values: r1 = $m$, r2 = $S$.

Step 2 saves the old $m$ into r3 before step 3 overwrites r1 with $m'$. This is necessary because the rescaling in step 4 needs both the old and new max. After that, r3 is reused for the rescaling factor, and R0 is reused for the new term — neither $x_i$ nor the old $m$ is needed after step 5.

But we’re not done. This pass only computes $m$ and $S$ — the denominator. To produce the actual softmax values, we still need a second pass — same as pass 3 in the 3-register case.

Total cost (4 registers, 2 passes):

| Pass | Operation | DRAM reads | DRAM writes |
|---|---|---|---|
| 1. Online softmax | compute $m$ and $S$ | $n$ | 0 |
| 2. Normalize | compute $e^{x_i - m} / S$ for all $i$ | $n$ | $n$ |
| Total | | $2n$ | $n$ |

Cost Comparison

3 registers vs 4 registers

| Property | 3 registers (3 passes) | 4 registers (2 passes) |
|---|---|---|
| Passes | 3 (max + sum + normalize) | 2 (online + normalize) |
| DRAM reads | $3n$ | $2n$ |
| DRAM writes | $n$ | $n$ |
| Compute per element | simple (SUB, EXP, ADD) | more (MAX, 2×SUB, 2×EXP, MUL, ADD) |

One extra register saves one pass and $n$ DRAM reads by merging max-finding into the sum accumulation. The cost is more compute per step, but for any nontrivial $n$, the DRAM savings dominate.

Intuition and Exploration

Connections to other online algorithms

Online softmax replaces a global dependency (the max $m$) with a running state that self-corrects. The same structural pattern appears in other settings:

  • Cauchy sequences: convergence defined by terms getting close to each other (local), not to a known limit (global)
  • Abel summation: individual terms rewritten as partial sums + differences — the accumulated quantity becomes primary
  • Kahan summation: a compensation variable tracks rounding error, correcting the running sum at each step
  • Welford’s online variance: a running $(n, \bar{x}, M_2)$ state corrects for a shifting mean, just as online softmax corrects for a shifting max

See the mathematical foundations appendix for full definitions, derivations, and references.

Problem 6: Other scenarios for the running trick

Online softmax maintains a running state and corrects it when a reference point changes. Can you think of other scenarios — from any domain — where the same pattern applies?

Quiz: Online variance (Welford's algorithm)

The variance of $n$ values requires the global mean $\bar{x}$ — a global dependency analogous to softmax’s need for the global max $m$. Using the same principles (maintain a running state, apply corrections when the reference point changes), design an online algorithm that computes the variance in a single pass. What running state do you maintain? What is the correction step when a new element arrives?

Solution

Definitions:

$$\delta = x - \bar{x} \qquad \bar{x} \leftarrow \bar{x} + \frac{\delta}{n} = \bar{x} + \frac{x - \bar{x}}{n} \qquad S^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}$$

Derivation. Compute $(N-1)S_N^2 - (N-2)S_{N-1}^2$:

$$\begin{aligned}
&= \sum_{i=1}^{N}(x_i - \bar{x}_N)^2 - \sum_{i=1}^{N-1}(x_i - \bar{x}_{N-1})^2 \\
&= (x_N - \bar{x}_N)^2 + \sum_{i=1}^{N-1}\left[(x_i - \bar{x}_N)^2 - (x_i - \bar{x}_{N-1})^2\right] \\
&= (x_N - \bar{x}_N)^2 + \sum_{i=1}^{N-1}(2x_i - \bar{x}_N - \bar{x}_{N-1})(\bar{x}_{N-1} - \bar{x}_N)
\end{aligned}$$

Using $\sum_{i=1}^{N-1} x_i = N\bar{x}_N - x_N$ and $(N-1)\bar{x}_{N-1} = N\bar{x}_N - x_N$:

$$\sum_{i=1}^{N-1}(2x_i - \bar{x}_N - \bar{x}_{N-1}) = 2(N\bar{x}_N - x_N) - (N-1)\bar{x}_N - (N\bar{x}_N - x_N) = \bar{x}_N - x_N$$

So:

$$\begin{aligned}
&= (x_N - \bar{x}_N)^2 + (\bar{x}_N - x_N)(\bar{x}_{N-1} - \bar{x}_N) \\
&= (x_N - \bar{x}_N)\left[(x_N - \bar{x}_N) - (\bar{x}_{N-1} - \bar{x}_N)\right] \\
&= (x_N - \bar{x}_N)(x_N - \bar{x}_{N-1})
\end{aligned}$$

Result:

$$(N-1)S_N^2 = (N-2)S_{N-1}^2 + (x_N - \bar{x}_N)(x_N - \bar{x}_{N-1})$$
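The recurrence above is exactly Welford’s update. A sketch, checked against the ordinary two-pass computation:

```python
# Welford's online variance: keep (n, mean, M2), where M2 is the running
# sum of squared deviations from the current mean.

def welford(xs):
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean          # x_N - mean_{N-1}
        mean += delta / n         # mean_N = mean_{N-1} + delta / N
        m2 += delta * (x - mean)  # M2 += (x_N - mean_{N-1}) * (x_N - mean_N)
    return mean, m2 / (n - 1)     # sample variance S^2

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, var = welford(xs)

# Two-pass reference: global mean first, then squared deviations.
ref_mean = sum(xs) / len(xs)
ref_var = sum((x - ref_mean) ** 2 for x in xs) / (len(xs) - 1)
assert abs(mean - ref_mean) < 1e-12
assert abs(var - ref_var) < 1e-12
```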

From Softmax to Attention

The Connection

Attention as softmax + matrix multiply

The scaled dot-product attention is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V$$

The scaling $\frac{1}{\sqrt{d}}$ is a precomputed constant — we fold it into $Q$ ahead of time and write:

$$O = \text{softmax}(QK^T) \, V$$

Let $Q, K, V \in \mathbb{R}^{n \times d}$ where $n$ is the sequence length and $d$ is the head dimension. The naive approach computes this in three steps:

  1. Compute $S = QK^T$ — an $n \times n$ matrix of scores
  2. Apply row-wise softmax: $P = \text{softmax}(S)$ — an $n \times n$ attention matrix
  3. Multiply: $O = PV$ — the $n \times d$ output

The written expression $\text{softmax}(QK^T) \cdot V$ suggests we must first compute the softmax, materialize the full $n \times n$ attention matrix $P$, and then multiply by $V$. But do we actually need to?

Tracing the Dependencies

What does a single output row depend on?

What does a single output row $\mathbf{o}_i$ actually depend on? Decomposing by rows:

$$\mathbf{o}_i = \mathbf{p}_i \, V = \sum_j p_{ij} \, \mathbf{v}_j$$

where $p_{ij} = \frac{e^{x_j - m_i}}{\sum_k e^{x_k - m_i}}$, with $x_j = \mathbf{q}_i \cdot \mathbf{k}_j$ and $m_i = \max_j x_j$.

Each row’s softmax is independent — row $i$ depends only on $\mathbf{q}_i$ and all of $K$. This connects directly to our online softmax: the softmax denominator for row $i$ is exactly the sum $S$ we computed in the previous section.

Expanding the output:

$$\mathbf{o}_i = \frac{\sum_j e^{x_j - m_i} \, \mathbf{v}_j}{\sum_k e^{x_k - m_i}}$$

This has the same structure as online softmax, except each term in the numerator carries a vector $\mathbf{v}_j$ instead of a scalar. The denominator is exactly the $S$ from before. The numerator is a weighted sum of value vectors, with the same exponential weights.

Applying Online Softmax

Problem 7: Derive the update rules for attention

The output $\mathbf{o}_i$ is a ratio of a vector numerator and a scalar denominator. Both are weighted sums with the same exponential weights we saw in online softmax. Using the same running-max-and-rescale trick from Path 2, derive the update rules for the numerator and denominator when a new element $x_k$ (with value vector $\mathbf{v}_k$) arrives. What is the running state?

Hint: the rescaling factor is scalar

The rescaling factor $e^{m_k - m_{k+1}}$ is a scalar — it distributes over the vector numerator. The denominator update is identical to online softmax.

Full update rules and running state

$$\begin{aligned}
m_{k+1} &= \max(m_k, \; x_k) \\
\text{numerator}_{k+1} &= \text{numerator}_k \cdot e^{m_k - m_{k+1}} + e^{x_k - m_{k+1}} \, \mathbf{v}_k \\
\text{denominator}_{k+1} &= \text{denominator}_k \cdot e^{m_k - m_{k+1}} + e^{x_k - m_{k+1}}
\end{aligned}$$

After processing all $n$ keys:

$$\mathbf{o}_i = \frac{\text{numerator}_n}{\text{denominator}_n}$$

The running state at step $k$ is:

| State | Shape | Description |
|---|---|---|
| $m_k$ | scalar | running max |
| $\text{denominator}_k$ | scalar | $\sum_{j=0}^{k-1} e^{x_j - m_k}$ |
| $\text{numerator}_k$ | vector ($d$) | $\sum_{j=0}^{k-1} e^{x_j - m_k} \, \mathbf{v}_j$ |

This is $O(d)$ storage per row — compared to $O(n)$ for a full row of the attention matrix, or $O(n^2)$ for the full matrix.
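The update rules can be sketched for one query row (numpy only; `attention_row` is our illustrative name, and the reference computation materializes the full softmax row for comparison):

```python
# One-pass attention for a single query row, using the running
# (m, denominator, numerator) state. Shapes: q is (d,), K and V are (n, d).
import numpy as np

def attention_row(q, K, V):
    m = -np.inf                # running max of scores
    denom = 0.0                # running denominator
    num = np.zeros(V.shape[1]) # running numerator (a length-d vector)
    for k_j, v_j in zip(K, V):
        x = q @ k_j                   # attention score for this key
        m_new = max(m, x)
        scale = np.exp(m - m_new)     # scalar rescaling factor
        w = np.exp(x - m_new)         # weight of the new term
        num = num * scale + w * v_j
        denom = denom * scale + w
        m = m_new
    return num / denom

rng = np.random.default_rng(0)
n, d = 16, 4
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))

# Reference: materialize the full softmax row, then contract with V.
scores = K @ q
p = np.exp(scores - scores.max())
p /= p.sum()
assert np.allclose(attention_row(q, K, V), p @ V)
```

The score `x` is consumed immediately; at no point does the loop hold more than the $O(d)$ running state.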

From Two Passes to One

Fusing softmax with V eliminates a pass

Recall that softmax alone required two passes even with online softmax: one pass to compute $m$ and $S$, and a second pass to produce the actual softmax values $e^{x_i - m}/S$ for each $i$. The second pass existed because we needed to output $n$ individual values — each one requires reading $x_i$ again.

But in attention, we don’t need the individual softmax values. We only need their weighted sum with $V$. The running numerator already accumulates this weighted sum during the first pass. After processing all $n$ elements, the output is a single division:

$$\mathbf{o}_i = \frac{\text{numerator}_n}{\text{denominator}_n}$$

No second pass. Each softmax value $p_{ij}$ is produced, multiplied by $\mathbf{v}_j$, accumulated into the numerator, and discarded — it never needs to exist as an individual value.

| | Passes | Why |
|---|---|---|
| Softmax alone (4 registers) | 2 | must output each $e^{x_i - m}/S$ individually → second pass to read $x_i$ again |
| Attention (softmax fused with $V$) | 1 | only need the weighted sum → final division at the end, no second pass |

By connecting softmax to the multiplication by $V$, the normalize pass disappears entirely.

The Attention Matrix Is Never Materialized

P never exists in memory

Each attention score $x_j = \mathbf{q}_i \cdot \mathbf{k}_j$ is computed, immediately consumed into the running numerator and denominator, and discarded. The $n \times n$ attention matrix $P$ never needs to exist in memory.

Intermediate Elimination

The general pattern: fuse production and consumption

The pattern here is general: a large intermediate ($P$, size $n \times n$) is produced only to be immediately consumed by the next operation (multiplication by $V$). Because each element of $P$ is used exactly once in a structured reduction (a weighted sum over the rows of $V$), we can fuse production and consumption — compute each $p_{ij}$, multiply by $\mathbf{v}_j$, accumulate, and discard.

This same principle appears in many domains:

  • Kernel fusion (GPU computing): avoid writing intermediates to HBM between operations
  • Deforestation (functional programming): eliminate intermediate data structures when producer and consumer can be fused
  • Loop fusion (compilers): merge loops that produce and consume the same array
  • Our register file model: the attention score goes into a register, gets consumed into the running state, and the register is immediately reused — it never touches DRAM

The condition: if an intermediate is only ever used as part of a contraction/reduction, it does not need to exist as a full object. The softmax-then-multiply pattern in attention satisfies this exactly.

Element is Tile

The Tile Abstraction

Every element can be a tile — zero structural adjustment

Everything so far is described in terms of individual scalar elements $x_j$. But nothing in the formulation requires this — every element can be a tile, with zero structural adjustment.

Split a row of length $n$ into tiles of size $B$. Tile $t$ covers elements $x[tB : (t+1)B]$. Within each tile, we compute a local state:

  • $m^{(t)} = \max(x[tB : (t+1)B])$ — local tile max
  • $d^{(t)} = \sum_{j=tB}^{(t+1)B-1} e^{x_j - m^{(t)}}$ — local tile denominator
  • $\mathbf{num}^{(t)} = \sum_{j=tB}^{(t+1)B-1} e^{x_j - m^{(t)}} \, \mathbf{v}_j$ — local tile numerator

To merge two tiles:

$$\begin{aligned}
m' &= \max(m^{(t)},\; m^{(t+1)}) \\
d' &= d^{(t)} \cdot e^{m^{(t)} - m'} + d^{(t+1)} \cdot e^{m^{(t+1)} - m'} \\
\mathbf{num}' &= \mathbf{num}^{(t)} \cdot e^{m^{(t)} - m'} + \mathbf{num}^{(t+1)} \cdot e^{m^{(t+1)} - m'}
\end{aligned}$$

This is exactly the same merge operation as the element-wise online softmax. The rescaling factor is still a scalar, it still distributes over vectors, and the merge is still associative. The formulation is scale-free — the merge only operates on the state $(m, d, \mathbf{num})$ and does not care about what is inside each tile.
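The merge is small enough to check directly. This sketch (our `tile_state`/`merge` names, numpy only) computes the local state of two tiles, merges them, and verifies the result equals the state of the whole row:

```python
# Tile-state merge for online softmax-with-values: each tile carries
# (m, d, num); merging rescales both partial sums by scalar factors.
import numpy as np

def tile_state(x, V):
    m = x.max()
    w = np.exp(x - m)
    return m, w.sum(), w @ V          # (local max, denominator, numerator)

def merge(s1, s2):
    (m1, d1, n1), (m2, d2, n2) = s1, s2
    m = max(m1, m2)
    a, b = np.exp(m1 - m), np.exp(m2 - m)  # scalar rescaling factors
    return m, d1 * a + d2 * b, n1 * a + n2 * b

rng = np.random.default_rng(1)
n, d, B = 8, 3, 4
x, V = rng.normal(size=n), rng.normal(size=(n, d))

merged = merge(tile_state(x[:B], V[:B]), tile_state(x[B:], V[B:]))
whole = tile_state(x, V)
assert np.isclose(merged[0], whole[0])
assert np.isclose(merged[1], whole[1])
assert np.allclose(merged[2], whole[2])
```

Because the merge is associative, the tiles can be combined in any grouping — which is what lets the same formula run at any granularity.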

This is why the toy model was worth building. The scalar version was not a simplified analogy — it is the algorithm, at a different granularity. Going from scalar to tiled is a change of mindset, not a change of structure.

Recall from our toy model that we already introduced this idea for matrix multiplication: the same instruction set works at both the scalar level (MAC on individual floats) and the tiled level (Tensor Core on sub-matrices). The same principle applies here.

At the tiled level, our toy model maps directly to a GPU:

| Toy model | GPU |
|---|---|
| Register file | SRAM (shared memory) |
| DRAM | HBM |
| LOAD / STORE | data movement between HBM and SRAM |
| Compute unit (MAC / ALU) | Tensor Core |
The register file in our model is the fast, small memory close to compute — that is SRAM on a GPU. The DRAM in our model is the large, slow memory — that is HBM. Every cost tradeoff we analyzed (fewer passes, fewer DRAM reads, more registers) translates directly: fewer HBM accesses, more SRAM usage.

Tiling Direction

Tiling across the key/value dimension

We are tiling across a row of the attention matrix — over the key/value dimension. Each tile processes a block of keys $K[tB:(t{+}1)B]$ and the corresponding values $V[tB:(t{+}1)B]$, while keeping one query $\mathbf{q}_i$ fixed. This directly determines how $K$ and $V$ are blocked in memory.

Inner Product vs. Outer Product in Attention

Two loop orderings for attention

In our toy model, we saw two ways to compute a matrix multiplication: inner product (compute one output element fully, then move to the next) and outer product (load one pair of operands, update all output elements at once). The same choice appears in attention.

If we view the attention computation $O = \text{softmax}(QK^T) \, V$ as conceptually a matrix operation, we can choose which dimension to iterate over in the outer loop:

Inner product style: fix a query, stream through all KV

For each query $\mathbf{q}_i$, stream through all key-value pairs. The running state for one row — $m$ (scalar), denominator (scalar), numerator (vector of size $d$) — stays in fast memory. After processing all KV pairs, $\mathbf{o}_i$ is complete. Move to the next query.

  • Running state in fast memory: $O(d)$ per query
  • K and V are reloaded for every query

Outer product style: fix a KV block, update all queries

Load a block of K and V into fast memory. Then stream through queries one at a time, updating each query’s running state with this KV block. After processing all queries, load the next KV block and repeat.

In the element-is-tile view, each “query” is one element (or one tile). Its running state — $m$, denominator, numerator — is $O(d)$, same as the inner product style. The difference is not how much state a single query needs, but how many queries’ outputs must be stored to HBM during the process.

If we can keep only one query’s output in fast memory, we load/store each query’s output once per KV element — $n$ stores per query. Each additional output slot we keep in fast memory saves $n - 1$ stores for that entry, because it can stay resident across all KV elements. At the extreme, if all $n$ outputs fit, each is stored once — and we’ve recovered the inner product schedule.

  • Running state per query: $O(d)$ — same as inner product
  • K and V are loaded once
  • Output stores per query: $n$ (once per KV element) — unless kept in fast memory across elements
  • All query updates within one KV element are independent → parallelizable

Cost Comparison

HBM traffic: inner product vs outer product

| | Inner product | Outer product |
|---|---|---|
| Outer loop | over queries | over KV |
| Inner loop | over KV | over queries |
| Q loads | $n$ (once per query) | $n^2$ (every query reloaded per KV) |
| KV loads | $n^2$ (all KV reloaded per query) | $n$ (once per KV) |
| O loads | $n$ (initialize once per query) | $n^2$ (reloaded per KV) |
| O stores | $n$ (once per query) | $n^2$ (stored per KV) |

The cost is not about fast memory capacity per query — it’s the same in both cases. The cost is about which data gets reloaded from HBM: KV (inner product) or output (outer product). Each additional output slot in fast memory saves $n - 1$ stores, moving along the spectrum from outer product toward inner product.

This is the flash attention loop ordering question:

  • Flash Attention v2 uses the inner product style (outer loop over queries, inner loop over KV). Each query’s output stays in SRAM across all KV blocks — stored once. KV is reloaded for every query block.
  • Flash Attention v1 uses the outer product style (outer loop over KV, inner loop over queries). A KV block stays in SRAM. Query outputs are loaded/stored from HBM for each KV block — $T_c$ stores per query, where $T_c$ is the number of KV blocks. More HBM traffic for outputs, but the inner loop over query blocks is independent and can be parallelized across GPU thread blocks.

The same tradeoff we saw in the $2 \times 2$ matmul — 3 registers with inner product vs. 6 registers with outer product — plays out here at the scale of SRAM and HBM. But at the tiled level, it’s a continuous spectrum: the more output you keep in SRAM, the fewer HBM round-trips for outputs, at the cost of less SRAM available for KV blocks.
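The traffic table above can be restated as a toy counter (a sketch in our own notation, not a kernel: it counts block transfers assuming a single output slot in fast memory for the outer product ordering and no other caching):

```python
# Toy HBM-traffic counter for the two loop orderings over n queries and
# n KV elements.

def inner_product_traffic(n):
    # Outer loop over queries, inner loop over KV.
    q_loads = n        # each query loaded once, stays resident
    kv_loads = n * n   # all KV streamed per query
    o_stores = n       # each output stored once, after its KV loop finishes
    return q_loads, kv_loads, o_stores

def outer_product_traffic(n):
    # Outer loop over KV, inner loop over queries (one output slot in fast memory).
    kv_loads = n       # each KV element loaded once, stays resident
    q_loads = n * n    # every query revisited per KV element
    o_stores = n * n   # every partial output written back per KV element
    return q_loads, kv_loads, o_stores

n = 64
assert inner_product_traffic(n) == (n, n * n, n)
assert outer_product_traffic(n) == (n * n, n, n * n)
```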

Quizzes

Arithmetic Intensity: Inner Product vs. Outer Product

Arithmetic Intensity: Flash Attention

Backward Pass

This section has not been validated and polished. The content below is a draft.

Backward pass (draft)

Why Study the Backward Pass?

In PyTorch, requires_grad=True tells the autograd engine to preserve the intermediate states of the forward pass — so they can be reused when computing gradients. During inference we skip this; during training we keep everything.

But flash attention never materializes the $n \times n$ attention matrix $P$. If the intermediate doesn’t exist, what do we differentiate through?

The answer is activation checkpointing: rerun the forward pass to regenerate the intermediates on demand. And here the structure of flash attention pays off twice. The first time through, everything is new — we use the online softmax trick to avoid materializing $P$. The second time through, things are cheaper: we already have the row-wise maxima $m_i$ and denominators $d_i$ stored from the first pass. We don’t recompute them. The computational cost of the recomputation drops significantly.

So is that enough? Run activation checkpointing, recompute $P$ block by block during the backward pass, and we’re done?

No. The backward pass itself involves $n \times n$ intermediate matrices — $dP$ and $dS$ — that are just as large as $P$. If we materialize those, we’ve solved nothing. We need to avoid materializing any $n \times n$ matrix during the backward pass too. That is why we need to study the details of what happens in the backward pass — not just that gradients flow through softmax, but how to compute them without ever building the full attention-sized matrices.

But before we dive in — let’s pause and make sure we understand what exactly allowed us to avoid materializing the attention matrix in the forward pass.

Quiz: What is the key ingredient that allows us to avoid materializing the attention matrix in the forward pass? It’s not online softmax — that handles the denominator. What handles the matrix itself?

The answer is contraction: $P$ only ever appears inside the sum $\sum_j P_{ij} \mathbf{v}_j$. Each element is produced, multiplied by $\mathbf{v}_j$, accumulated into the output, and discarded. An intermediate that feeds directly into a linear contraction never needs to be fully materialized.

Now consider the backward pass. The $n \times n$ intermediates $dP$ and $dS$ play the same role that $P$ played in the forward pass — they are large intermediates we want to eliminate. In the forward pass, we could merge all the intermediate steps because $P$ flowed directly into a linear contraction with $V$. Can we do the same here?

The chain is: $dP \to dS \to dQ, dK$. If we can contract $dS$ directly with the next step ($dQ = dS \cdot K$, $dK = dS^T \cdot Q$), then $dS$ does not need to be fully materialized — we need to discover the same contraction pattern again. But the step from $dP$ to $dS$ passes through the softmax Jacobian — a nonlinear derivative. This is the critical step that makes the backward pass harder than the forward pass. We will see a clever trick (the $D_i$ identity) that handles this step without ever materializing the full $n \times n$ matrices.

The Language of Tensors

A matrix is a $(1,1)$ tensor — one upper index (row), one lower index (column). The matrix product $C = AB$ can be written component-wise:

$$c_{ij} = \sum_k a_{ik} \, b_{kj}$$

where ii is the row, jj is the column, and kk is summed over. We can also write this in Einstein notation, where a repeated index — one upper, one lower — implies summation:

Cij=AikBkjC^i{}_j = A^i{}_k \, B^k{}_j

No \sum symbol needed. The repeated kk (lower in AA, upper in BB) tells you to sum over it. This is the same convention used by torch.einsum.
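As a quick check of the convention, numpy's `einsum` (the same subscript string works with `torch.einsum`) reproduces the matrix product: the repeated `k` in the string is summed, exactly as in the index notation above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# 'ik,kj->ij': the repeated k is contracted, as in C^i_j = A^i_k B^k_j
C = np.einsum('ik,kj->ij', A, B)
assert np.allclose(C, A @ B)
```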

Now consider the derivative of a matrix with respect to another matrix. If CRm×nC \in \mathbb{R}^{m \times n} depends on ARp×qA \in \mathbb{R}^{p \times q}, we need to ask: how does each entry cijc_{ij} change when we perturb each entry akla_{kl}? That requires four indices — two for the output (i,ji, j) and two for the input (k,lk, l):

CijAkl\frac{\partial C^i{}_j}{\partial A^k{}_l}

This is a (2,2)(2,2) tensor with mnpqm \cdot n \cdot p \cdot q entries — a Jacobian, but organized as a 4-dimensional object rather than a flattened matrix.

The simplest case: the derivative of a matrix with respect to itself. Each component AisA^i{}_s is an independent variable, so its derivative with respect to AklA^k{}_l is 1 when i=ki = k and s=ls = l, and 0 otherwise. In Kronecker delta notation:

AisAkl=δkiδsl\frac{\partial A^i{}_s}{\partial A^k{}_l} = \delta^i_k \, \delta^l_s

This is the fundamental identity. The two deltas enforce i=ki = k (same row) and s=ls = l (same column).

Derivative of a matrix product. Given Cij=AisBsjC^i{}_j = A^i{}_s \, B^s{}_j, we differentiate with respect to AklA^k{}_l:

CijAkl=(AisBsj)Akl\frac{\partial C^i{}_j}{\partial A^k{}_l} = \frac{\partial (A^i{}_s \, B^s{}_j)}{\partial A^k{}_l}

BB does not depend on AA, so we can pull it out:

=AisAklBsj= \frac{\partial A^i{}_s}{\partial A^k{}_l} \, B^s{}_j

Apply the fundamental identity:

=δkiδslBsj= \delta^i_k \, \delta^l_s \, B^s{}_j

The δsl\delta^l_s contracts with BsjB^s{}_j (setting s=ls = l):

=δkiBlj= \delta^i_k \, B^l{}_j

This says: row ii of CC only depends on row ii of AA.
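We can verify this Jacobian numerically. The sketch below builds the 4-index object delta^i_k B^l_j explicitly and compares it against finite differences of C = AB — a sanity check on small matrices, not something one would ever do at scale.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 3, 4, 5
A = rng.standard_normal((m, p))
B = rng.standard_normal((p, n))

# Analytic Jacobian: dC[i,j]/dA[k,l] = delta(i,k) * B[l,j]
J = np.einsum('ik,lj->ijkl', np.eye(m), B)

# Finite-difference Jacobian of C = A @ B (exact up to float error,
# since C is linear in A)
eps = 1e-6
J_fd = np.zeros((m, n, m, p))
for k in range(m):
    for l in range(p):
        Ap, Am = A.copy(), A.copy()
        Ap[k, l] += eps
        Am[k, l] -= eps
        J_fd[:, :, k, l] = (Ap @ B - Am @ B) / (2 * eps)

assert np.allclose(J, J_fd, atol=1e-5)
```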

Chain rule. For a composition D=f(C)=f(AB)D = f(C) = f(AB):

DαβAkl=DαβCijCijAkl=DαβCkjBlj\frac{\partial D^{\alpha}{}_{\beta}}{\partial A^k{}_l} = \frac{\partial D^{\alpha}{}_{\beta}}{\partial C^i{}_j} \cdot \frac{\partial C^i{}_j}{\partial A^k{}_l} = \frac{\partial D^{\alpha}{}_{\beta}}{\partial C^k{}_j} \, B^l{}_j

The Kronecker delta collapses the sum over ii, leaving only i=ki = k — we only need derivatives of DD with respect to row kk of CC.

Contracting a delta with a denominator index. When a Kronecker delta contracts with an index that appears in the denominator of a partial derivative, the variance flips. For example, in the expression:

LOijδjl\frac{\partial L}{\partial O^i{}_j} \cdot \delta^l_j

The index jj is lower in OijO^i{}_j, but taking Oij\frac{\partial}{\partial O^i{}_j} flips its variance — jj becomes upper in LOij\frac{\partial L}{\partial O^i{}_j}. The δjl\delta^l_j has jj as lower. Upper meets lower → contraction, setting j=lj = l:

=LOil= \frac{\partial L}{\partial O^i{}_l}

In Euclidean space the component values don’t change when we raise or lower indices, so this is purely bookkeeping — but it tells us which indices get summed.

Matrix-to-Matrix Derivatives

The derivative of a matrix CRm×nC \in \mathbb{R}^{m \times n} with respect to a matrix ARp×qA \in \mathbb{R}^{p \times q} is a 4-index object:

CijAkl\frac{\partial C^i{}_j}{\partial A^k{}_l}

This has mnpqm \cdot n \cdot p \cdot q entries — a (2,2)(2,2) tensor. For the attention forward pass Q,KSPOQ, K \to S \to P \to O, the full Jacobians at each step are 4D tensors that are expensive to compute and store.

The Collapse: (2,2) to (1,1)

In practice, backpropagation never computes the full 4D Jacobians. Instead, given a scalar loss LL and the upstream gradient dO=LOdO = \frac{\partial L}{\partial O} (same shape as OO, a matrix — a (1,1)(1,1) tensor), we compute the vector-Jacobian product (VJP):

dAkl=LAkl=LCijCijAkldA^k{}_l = \frac{\partial L}{\partial A^k{}_l} = \frac{\partial L}{\partial C^i{}_j} \cdot \frac{\partial C^i{}_j}{\partial A^k{}_l}

The (1,1)(1,1) upstream gradient contracts with the (2,2)(2,2) Jacobian, and the result collapses back to a (1,1)(1,1) tensor — a matrix. The 4D Jacobian is never built. Every derivative in the chain can be represented as a matrix, not a 4D tensor.

This is the same intermediate elimination pattern from the forward pass. There, the n×nn \times n attention matrix PP was the large intermediate that was never materialized. Here, the (2,2)(2,2) Jacobian is the large intermediate — and it too is never materialized, because it is immediately contracted with the upstream gradient.
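Here is the collapse in miniature: contracting the upstream gradient with the explicitly built (2,2) Jacobian gives exactly the same matrix as dA = dC B^T, which never touches the 4D object.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 3, 4, 5
A = rng.standard_normal((m, p))
B = rng.standard_normal((p, n))
dC = rng.standard_normal((m, n))   # upstream gradient dL/dC, a (1,1) tensor

# The slow way: build the (2,2) Jacobian dC[i,j]/dA[k,l], then contract over i, j
J = np.einsum('ik,lj->ijkl', np.eye(m), B)
dA_slow = np.einsum('ij,ijkl->kl', dC, J)

# The collapse: dA = dC @ B.T -- the 4D Jacobian is never materialized
dA = dC @ B.T
assert np.allclose(dA, dA_slow)
```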

The Backward Pass Step by Step

We use PP for the attention matrix (called AA in our earlier sections) to match the flash attention paper notation. All dd-prefixed matrices are gradients of the scalar loss LL:

| Symbol | Shape | Description |
|---|---|---|
| $Q, K, V$ | $n \times d$ | inputs |
| $S = QK^T$ | $n \times n$ | pre-softmax scores |
| $P = \text{softmax}(S)$ | $n \times n$ | attention weights |
| $O = PV$ | $n \times d$ | output |
| $dO$ | $n \times d$ | $\frac{\partial L}{\partial O}$ — upstream gradient (given) |
| $dV, dQ, dK$ | $n \times d$ | what we want |
| $dP$ | $n \times n$ | $\frac{\partial L}{\partial P}$ — intermediate |
| $dS$ | $n \times n$ | $\frac{\partial L}{\partial S}$ — intermediate |
| $D_i$ | $n$ scalars | row-wise dot product of $dO$ and $O$ |

Forward was: Q,KS=QKTP=softmax(S)O=PVQ, K \to S = QK^T \to P = \text{softmax}(S) \to O = PV

Backward reverses this:

Step 1: Through O=PVO = PV

We have Oij=PisVsjO^i{}_j = P^i{}_s \, V^s{}_j. We want dVkl=LVkldV^k{}_l = \frac{\partial L}{\partial V^k{}_l}.

Deriving dVdV:

First, the Jacobian. Using our derivative-of-a-product result (differentiating with respect to the second factor this time):

OijVkl=Pikδjl\frac{\partial O^i{}_j}{\partial V^k{}_l} = P^i{}_k \, \delta^l_j

Now contract with dOdO via the VJP — recall that dVkl=LVkldV^k{}_l = \frac{\partial L}{\partial V^k{}_l}, which by the chain rule is:

dVkl=LVkl=LOijOijVkldV^k{}_l = \frac{\partial L}{\partial V^k{}_l} = \frac{\partial L}{\partial O^i{}_j} \cdot \frac{\partial O^i{}_j}{\partial V^k{}_l}

Substituting the Jacobian:

=LOijPikδjl= \frac{\partial L}{\partial O^i{}_j} \cdot P^i{}_k \cdot \delta^l_j

The δjl\delta^l_j contracts with the jj in the denominator (variance flip: jj is upper in LOij\frac{\partial L}{\partial O^i{}_j}, lower in δjl\delta^l_j), setting j=lj = l:

=LOilPik= \frac{\partial L}{\partial O^i{}_l} \cdot P^i{}_k

The repeated ii is a contraction — this is a matrix product:

=(PT)kiLOil=(PTLO)kl= (P^T)^k{}_i \cdot \frac{\partial L}{\partial O^i{}_l} = \left(P^T \, \frac{\partial L}{\partial O}\right)^k{}_l

In matrix form: dV=PTdOdV = P^T \, dO.

Deriving dPdP:

The Jacobian with respect to the first factor:

OijPkl=δkiVlj\frac{\partial O^i{}_j}{\partial P^k{}_l} = \delta^i_k \, V^l{}_j

Contract with LO\frac{\partial L}{\partial O} via the VJP:

dPkl=LPkl=LOijOijPkldP^k{}_l = \frac{\partial L}{\partial P^k{}_l} = \frac{\partial L}{\partial O^i{}_j} \cdot \frac{\partial O^i{}_j}{\partial P^k{}_l}

Substituting the Jacobian:

=LOijδkiVlj= \frac{\partial L}{\partial O^i{}_j} \cdot \delta^i_k \cdot V^l{}_j

The δki\delta^i_k contracts with ii in the denominator (variance flip), setting i=ki = k:

=LOkjVlj= \frac{\partial L}{\partial O^k{}_j} \cdot V^l{}_j

The repeated jj is a contraction:

=(LO)kj(VT)jl=(LOVT)kl= \left(\frac{\partial L}{\partial O}\right)^k{}_j \cdot (V^T)^j{}_l = \left(\frac{\partial L}{\partial O} \, V^T\right)^k{}_l

In matrix form: dP=dOVTdP = dO \, V^T.

Extracting row 0. (dP)0l=(dO)0β(VT)βl=βdO0βVβlT(dP)^0{}_l = (dO)^0{}_\beta \, (V^T)^\beta{}_l = \sum_{\beta} dO_{0\beta} \, V^T_{\beta l}

The only part that depends on row 0 is dO0βdO_{0\beta} — the matrix VTV^T is shared across all rows. So row 0 of dPdP is simply row 0 of dOdO times VTV^T.

Step 2: Through P=softmax(S)P = \text{softmax}(S)

\frac{\partial P_i}{\partial S_j} = P_i(\delta_{ij} - P_j) \qquad O = PV

dP = (dP)^k{}_l = \left(\frac{\partial L}{\partial P}\right)^k{}_l = (dO \, V^T)^k{}_l

Say k=0k = 0.

LP0l=(LP)0l=(dOVT)0l=(dO)0k(VT)kl=k(dO)0kVlk\frac{\partial L}{\partial P_{0l}} = \left(\frac{\partial L}{\partial P}\right)^0{}_l = (dO \, V^T)^0{}_l = (dO)^0{}_k \, (V^T)^k{}_l = \sum_k (dO)_{0k} \, V_{lk}

where (VT)kl=Vlk(V^T)^k{}_l = V_{lk} — transpose in Einstein notation swaps the indices.

\left(\frac{\partial L}{\partial S}\right)^0{}_j = \frac{\partial L}{\partial P^0{}_l} \frac{\partial P^0{}_l}{\partial S^0{}_j} = \sum_l \frac{\partial L}{\partial P_{0l}} \frac{\partial P_{0l}}{\partial S_{0j}}

Substituting the softmax Jacobian:

= \sum_l \frac{\partial L}{\partial P_{0l}} P_{0l} (\delta_{lj} - P_{0j}) = \frac{\partial L}{\partial P_{0j}} P_{0j} - \sum_l \frac{\partial L}{\partial P_{0l}} P_{0l} P_{0j}

Substituting dP_{0l} = \sum_k (dO)_{0k} V_{lk} in the second term:

= dP_{0j} \cdot P_{0j} - \sum_l \left(\sum_k (dO)_{0k} \, V_{lk}\right) P_{0l} \, P_{0j}

Swapping the order of summation:

= dP_{0j} \cdot P_{0j} - P_{0j} \sum_k (dO)_{0k} \sum_l P_{0l} \, V_{lk}

The inner sum over l is exactly O_{0k}:

= P_{0j} \, dP_{0j} - P_{0j} \sum_k (dO)_{0k} \, O_{0k}

= P_{0j} \, dP_{0j} - P_{0j} \, D_0 \qquad D_0 = \sum_k (dO)_{0k} \, O_{0k}

Nothing here was specific to row 0, so for a general row i:

\Rightarrow \left(\frac{\partial L}{\partial S}\right)^i{}_j = P_{ij} \, dP_{ij} - P_{ij} \, D_i \qquad D_i = \sum_k (dO)_{ik} \, O_{ik}
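It is worth checking this identity numerically: the collapsed form P_ij(dP_ij - D_i) must agree with contracting dP against the explicit per-row softmax Jacobian. A small numpy check (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
S = rng.standard_normal((n, n))
V = rng.standard_normal((n, d))
dO = rng.standard_normal((n, d))

P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
O = P @ V
dP = dO @ V.T

# Explicit route: contract dP with the per-row softmax Jacobian
# J[l, j] = P_il * (delta_lj - P_ij)
dS_explicit = np.empty_like(dP)
for i in range(n):
    J = np.diag(P[i]) - np.outer(P[i], P[i])
    dS_explicit[i] = dP[i] @ J

# Collapsed route derived above: dS_ij = P_ij * (dP_ij - D_i)
D = (dO * O).sum(axis=1, keepdims=True)
dS = P * (dP - D)
assert np.allclose(dS, dS_explicit)

# D_i can equivalently be computed from the row of P and dP:
# sum_l P_il dP_il = sum_k dO_ik O_ik
assert np.allclose(D, (P * dP).sum(axis=1, keepdims=True))
```

The second assertion is the equivalence the D_i trick relies on later: the same scalar is available either from the row of P and dP, or directly from dO and O.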

Step 3: Through S=QKTS = QK^T

Both follow the product-derivative pattern of Step 1, applied to S = QK^T:

dQ = dS \cdot K \qquad dK = dS^T \cdot Q
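Putting Steps 1-3 together, the whole backward pass is a handful of matrix products. The sketch below implements it for the loss L = sum(G * O), so that dO = G, and checks every gradient against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
G = rng.standard_normal((n, d))          # fixed upstream gradient: L = sum(G * O)

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss(Q, K, V):
    return float((G * (softmax_rows(Q @ K.T) @ V)).sum())

# Analytic backward pass, exactly as derived in Steps 1-3
S = Q @ K.T
P = softmax_rows(S)
O = P @ V
dO = G                                   # dL/dO for this loss
dV = P.T @ dO                            # Step 1
dP = dO @ V.T                            # Step 1
D = (dO * O).sum(axis=1, keepdims=True)  # D_i
dS = P * (dP - D)                        # Step 2
dQ = dS @ K                              # Step 3
dK = dS.T @ Q                            # Step 3

def fd_grad(f, X, eps=1e-6):
    """Central finite differences of scalar f with respect to matrix X."""
    g = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        g[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return g

assert np.allclose(dQ, fd_grad(lambda Z: loss(Z, K, V), Q), atol=1e-5)
assert np.allclose(dK, fd_grad(lambda Z: loss(Q, Z, V), K), atol=1e-5)
assert np.allclose(dV, fd_grad(lambda Z: loss(Q, K, Z), V), atol=1e-5)
```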

Step 4: Through the QKV projections

The inputs QQ, KK, VV are all projections of the same input XX:

Q=XWQK=XWKV=XWVQ = XW_Q \qquad K = XW_K \qquad V = XW_V

Since all three depend on XX, the gradient flows back through all three paths and sums:

dX=dQWQT+dKWKT+dVWVTdX = dQ \, W_Q^T + dK \, W_K^T + dV \, W_V^T

Expanding dQdQ, dKdK, dVdV:

dX = (dS \cdot K) \, W_Q^T + (dS^T \cdot Q) \, W_K^T + (P^T \, dO) \, W_V^T

= dS \cdot (K \, W_Q^T) + dS^T \cdot (Q \, W_K^T) + P^T \cdot (dO \, W_V^T)

By associativity, the projection weights can be folded into the right operand first. The parenthesized products KWQTKW_Q^T, QWKTQW_K^T, and dOWVTdO \, W_V^T are each n×dn \times d — cheap to precompute and not n×nn \times n.

But dXdX is only the activation gradient. During training, we also need the weight gradients for the parameter update. Since Q=XWQQ = XW_Q, K=XWKK = XW_K, V=XWVV = XW_V:

dW_Q = X^T \, dQ = X^T (dS \cdot K)

dW_K = X^T \, dK = X^T (dS^T \cdot Q)

dW_V = X^T \, dV = X^T (P^T \, dO)

Each weight gradient is d×dd \times d — small. But computing them requires dQdQ, dKdK, dVdV explicitly. This conflicts with the associativity trick above: if we fold dQdQ directly into dXdX and discard it, we can’t also use it for dWQdW_Q.

The resolution: each row of dQdQ, dKdK, dVdV can be consumed by two accumulators simultaneously. For example, when row ii of dQdQ is produced, it contributes a row to dXdX (via WQTW_Q^T) and a rank-1 update to dWQdW_Q (via xiT(dQ)ix_i^T \cdot (dQ)_i), then is discarded. The intermediates are still never materialized as full matrices — they’re just consumed twice instead of once before being discarded.

In practice, the flash attention kernel stops at dQdQ, dKdK, dVdV. The projection weight gradients and dXdX are handled by the framework’s standard linear layer backward pass — there is no n×nn \times n intermediate involved in those steps, so no special treatment is needed.

Quiz: Why do we need to compute dW_Q, dW_K, dW_V at all? What are they used for, and why can’t we skip them?
Quiz: How much compute is required to compute the QKV projection weight updates (dW_Q, dW_K, dW_V), and how much compute is required to compute dQ, dK, dV? Express in terms of n and d.

Execution Schedule

Recall from the forward pass: having an expression is not the same as having an execution schedule. In attention, the expression O=softmax(QKT)VO = \text{softmax}(QK^T)V admits multiple execution orders — and the choice determines whether we materialize PP or not. The same question applies here.

We want to avoid materializing both dPdP and dSdS (both n×nn \times n). Since softmax is row-wise, it’s natural to work row by row: compute a row of dSdS, then immediately use it.

Say we’ve computed row 0 of dSdS. Now consider the two consumers:

dQ=dSKdQ = dS \cdot K: row 0 of dSdS times all of KK gives row 0 of dQdQ. This is the inner product style — one row of the left matrix updates one row of the output completely. No problem.

dK=dSTQdK = dS^T \cdot Q: transposing dSdS turns row 0 into column 0. A column of the left matrix times a row of QQ doesn’t give a single row of dKdK — it gives a rank-1 update to the entire dKdK matrix. This is the outer product style: column 0 of dSTdS^T (which is row 0 of dSdS) times row 0 of QQ updates all of dKdK.

So the two gradients require different computation patterns:

| Gradient | Style | What happens per row of $dS$ |
|---|---|---|
| $dQ = dS \cdot K$ | inner product | row of $dS$ $\times$ all of $K$ → completes one row of $dQ$ |
| $dK = dS^T \cdot Q$ | outer product | row of $dS$ (as column) $\times$ one row of $Q$ → rank-1 update to all of $dK$ |

For dKdK, we keep the full dKdK matrix in memory and accumulate into it as each row of dSdS is produced. We don’t load all of QQ — just one row at a time, paired with the corresponding row of dSdS.
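A sketch of this schedule, assuming each row of dS arrives one at a time (here we simply index into a precomputed dS to stand in for the streamed row):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K = rng.standard_normal((2, n, d))
dS = rng.standard_normal((n, n))   # stand-in: row i would be produced on the fly

dQ = np.zeros((n, d))
dK = np.zeros((n, d))              # full accumulator kept in memory
for i in range(n):
    row = dS[i]                    # produced, consumed, discarded
    dQ[i] = row @ K                # inner product: completes row i of dQ
    dK += np.outer(row, Q[i])      # outer product: rank-1 update to all of dK

assert np.allclose(dQ, dS @ K)
assert np.allclose(dK, dS.T @ Q)
```

Note that the loop body touches only one row of Q, matching the observation above.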

And as before, an element can be a tile: each element of dSdS can be a scalar (one entry) or a tile (a block of entries in the row). The same execution schedule works at both granularities.

Row-wise vs. column-wise. But wait — we chose to iterate row-wise because softmax is row-wise. What if we iterate column-wise instead? Before comparing, note a third gradient we haven’t accounted for yet: dV = P^T \, dO. With row-wise iteration, row i of P becomes column i of P^T, giving yet another outer product.

Count the inner vs. outer products for both strategies:

Row-wise (iterate over rows of dSdS, rows of PP):

| Gradient | Style | Why |
|---|---|---|
| $dQ = dS \cdot K$ | inner product | row of $dS$ $\times$ $K$ → completes one row of $dQ$ |
| $dK = dS^T \cdot Q$ | outer product | row of $dS$ = column of $dS^T$, rank-1 update to all of $dK$ |
| $dV = P^T \, dO$ | outer product | row of $P$ = column of $P^T$, rank-1 update to all of $dV$ |

Score: 1 inner, 2 outer.

Column-wise (iterate over columns of dSdS, columns of PP):

| Gradient | Style | Why |
|---|---|---|
| $dQ = dS \cdot K$ | outer product | column $j$ of $dS$ $\times$ row $j$ of $K$ → rank-1 update to all of $dQ$ |
| $dK = dS^T \cdot Q$ | inner product | column $j$ of $dS$ = row $j$ of $dS^T$, times $Q$ → completes row $j$ of $dK$ |
| $dV = P^T \, dO$ | inner product | column $j$ of $P$ = row $j$ of $P^T$, times $dO$ → completes row $j$ of $dV$ |

Score: 2 inner, 1 outer.

Column-wise is better balanced. And for the weight gradients, it’s even cleaner: with dKdK and dVdV produced row-by-row (inner product style), each row immediately gives a rank-1 update to dWKdW_K and dWVdW_V. Only dWQdW_Q requires the outer product accumulation.

But softmax is row-wise — can we actually compute dSdS column-wise? Yes. Column jj of SS is QkjQ \, k_j (all queries dotted with key jj). Then Pij=eSijmi/iP_{ij} = e^{S_{ij} - m_i} / \ell_i using the precomputed per-row statistics mim_i and i\ell_i. Column jj of dPdP is dOvjdO \cdot v_j. Then dSij=Pij(dPijDi)dS_{ij} = P_{ij}(dP_{ij} - D_i) with precomputed DiD_i. Everything works — the row-wise structure of softmax is captured in the stored scalars, and the iteration itself can proceed column by column.
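A column-wise sketch of exactly this procedure, using only the stored per-row statistics m_i, l_i, and D_i (the loop body sees one column at a time):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V, dO = rng.standard_normal((4, n, d))

# Per-row statistics saved from the forward pass, plus the precomputed D_i
S = Q @ K.T
m = S.max(axis=1)
l = np.exp(S - m[:, None]).sum(axis=1)
P = np.exp(S - m[:, None]) / l[:, None]
O = P @ V
D = (dO * O).sum(axis=1)

dQ = np.zeros((n, d))
dK = np.zeros((n, d))
dV = np.zeros((n, d))
for j in range(n):                  # iterate over columns of S, P, dS
    s_col = Q @ K[j]                # column j of S: all queries vs key j
    p_col = np.exp(s_col - m) / l   # column j of P, from stored row stats
    dP_col = dO @ V[j]              # column j of dP
    dS_col = p_col * (dP_col - D)   # column j of dS, using precomputed D
    dK[j] = dS_col @ Q              # inner: completes row j of dK
    dV[j] = p_col @ dO              # inner: completes row j of dV
    dQ += np.outer(dS_col, K[j])    # outer: rank-1 update to all of dQ

dS = P * (dO @ V.T - D[:, None])
assert np.allclose(dQ, dS @ K)
assert np.allclose(dK, dS.T @ Q)
assert np.allclose(dV, P.T @ dO)
```

Two inner products, one outer product per column, matching the score above.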

Can we push further? The gradients dQdQ, dKdK, dVdV are all n×dn \times d — not n×nn \times n, so less urgent. But for long sequences (nn in the millions), even n×dn \times d matrices are large. Do we need to materialize them, or can they be consumed immediately too?

Recall that the final target is dX=dQWQT+dKWKT+dVWVTdX = dQ \, W_Q^T + dK \, W_K^T + dV \, W_V^T. Each of dQdQ, dKdK, dVdV feeds into a linear contraction with a projection weight matrix — exactly the pattern that allows intermediate elimination.

Term 1: dS(KWQT)dS \cdot (K \, W_Q^T). Inner product style, fully streamable. KWQTK \, W_Q^T is precomputed. Row ii of dSdS times KWQTK \, W_Q^T produces row ii of the first contribution to dXdX. Produce, consume, discard — dQdQ is never materialized.

Term 2: dST(QWKT)dS^T \cdot (Q \, W_K^T). QWKTQ \, W_K^T is precomputed. For each row ii of dSdS (which becomes column ii of dSTdS^T): outer product of column ii of dSTdS^T with row ii of QWKTQ \, W_K^T, accumulated into dXdX. Neither dKdK nor dSTdS^T is materialized.

Term 3: PT(dOWVT)P^T \cdot (dO \, W_V^T). dOWVTdO \, W_V^T is precomputed. Now PTP^T needs columns of PP — but the outer product decomposition avoids this. Row ii of PP (which we already have from the row-wise softmax recomputation) becomes column ii of PTP^T. Outer product of column ii of PTP^T with row ii of dOWVTdO \, W_V^T, accumulated into dXdX. We never need a full column of PP — just one row at a time.

All three terms can be accumulated into a single dXdX matrix as we stream row-by-row through dSdS and PP. The intermediates dQdQ, dKdK, dVdV are eliminated by the same contraction principle that eliminated PP in the forward pass.

The D_i Trick: Why It Matters

At the row level, DiD_i is just a dot product — trivial to compute from the recomputed row of PP and the row of dPdP. So why does flash attention go out of its way to precompute it as Di=k(dO)ikOikD_i = \sum_k (dO)_{ik} \, O_{ik}?

The answer is arithmetic intensity — specifically, how many times KK and VV are loaded from HBM.

Without the trick, the backward pass for each block requires two loads of KK:

  1. Load KK block into SRAM to recompute PP (via S=QKTS = QK^T). Load VV block to compute dPdP (via dOVTdO \cdot V^T). Compute DiD_i from the row of PP and dPdP. But DiD_i needs the full row — so we must finish all blocks in this row before proceeding.
  2. Load KK block into SRAM again to compute dQ=dSKdQ = dS \cdot K.

KK is loaded from HBM twice. The dependency on DiD_i prevents fusing the recomputation of PP with the consumption of dSdS, because dSij=Pij(dPijDi)dS_{ij} = P_{ij}(dP_{ij} - D_i) requires DiD_i which requires the full row.

With the trick, Di=k(dO)ikOikD_i = \sum_k (dO)_{ik} \, O_{ik} is precomputed from quantities already in hand — no PP needed. Now everything fuses into a single load per block:

  1. Load KK block, VV block into SRAM
  2. Recompute PP block (KK is in SRAM)
  3. Compute dPdP block (VV is in SRAM)
  4. Compute dSdS block (using precomputed DD — no waiting)
  5. Compute dQdQ contribution (KK is still in SRAM)
  6. Accumulate dKdK contribution
  7. Discard block

KK and VV are loaded once instead of twice. The DD trick removes the last dependency that forced a second load, enabling the recomputation and gradient computation to be fused into a single pass over the data. This doubles the arithmetic intensity of the backward pass.
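A block-level sketch of the fused schedule, with pure numpy standing in for SRAM tiles (the block size Bc is illustrative). Each K/V block is touched once, and every step inside the loop uses only that block plus the precomputed row statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, Bc = 8, 4, 4                 # Bc = key/value block size
Q, K, V, dO = rng.standard_normal((4, n, d))

# Forward-pass per-row statistics, plus the precomputed D_i = sum_k dO_ik O_ik
S = Q @ K.T
m = S.max(axis=1, keepdims=True)
l = np.exp(S - m).sum(axis=1, keepdims=True)
P = np.exp(S - m) / l
O = P @ V
D = (dO * O).sum(axis=1, keepdims=True)

dQ = np.zeros((n, d))
dK = np.zeros((n, d))
dV = np.zeros((n, d))
for j0 in range(0, n, Bc):         # single pass: each K/V block "loaded" once
    Kb, Vb = K[j0:j0+Bc], V[j0:j0+Bc]
    Pb = np.exp(Q @ Kb.T - m) / l  # recompute P block from stored m, l
    dPb = dO @ Vb.T                # dP block (Vb in hand)
    dSb = Pb * (dPb - D)           # dS block -- D is precomputed, no waiting
    dQ += dSb @ Kb                 # dQ contribution while Kb is in hand
    dK[j0:j0+Bc] = dSb.T @ Q       # dK contribution
    dV[j0:j0+Bc] = Pb.T @ dO       # dV contribution
    # blocks Pb, dPb, dSb are now discarded

dS = P * (dO @ V.T - D)
assert np.allclose(dQ, dS @ K)
assert np.allclose(dK, dS.T @ Q)
assert np.allclose(dV, P.T @ dO)
```

In a real kernel the blocks over Q would also be tiled and dQ accumulated atomically or via a separate pass, but the single-load-per-K/V-block structure is the one the D trick enables.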