Within the Chinese academic community, this site was relatively early to focus on linear attention. When the first related blog post, "Exploring Linear Attention: Must Attention Have a Softmax?", was written in 2020, discussion still centered on BERT-style softmax attention. In hindsight, pursuing linear attention in the BERT era was not particularly wise: training lengths were short and models were mainly encoders, so linear attention had little to offer. The author later wrote "Linear Transformer Probably Isn't the Model You've Been Waiting For" to make exactly this point.
That changed when ChatGPT's emergence pushed everyone toward decoder-only generative models, which align closely with the RNN form of linear attention. At the same time, the pursuit of longer training lengths made the quadratic-complexity bottleneck of softmax attention increasingly apparent. In this new context, linear attention has become increasingly competitive, and has even begun to show signs of "back-feeding" softmax attention.
Quadratic Complexity#
First, we introduce some notation:
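Throughout, \(n\) denotes the sequence length and \(d\) the head dimension, with
$$\boldsymbol{Q} = [\boldsymbol{q}_1,\cdots,\boldsymbol{q}_n]^{\top},\quad \boldsymbol{K} = [\boldsymbol{k}_1,\cdots,\boldsymbol{k}_n]^{\top},\quad \boldsymbol{V} = [\boldsymbol{v}_1,\cdots,\boldsymbol{v}_n]^{\top},\quad \boldsymbol{O} = [\boldsymbol{o}_1,\cdots,\boldsymbol{o}_n]^{\top}\in\mathbb{R}^{n\times d}$$
where \(\boldsymbol{q}_t,\boldsymbol{k}_t,\boldsymbol{v}_t,\boldsymbol{o}_t\in\mathbb{R}^d\) are column vectors and \(\boldsymbol{X}_{[:t]}\) denotes the first \(t\) rows of a matrix \(\boldsymbol{X}\).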
An attention model essentially constitutes a mapping \(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}\to \boldsymbol{O}\). This article primarily concerns causal scenarios, meaning \(\boldsymbol{o}_t\) depends at most on \(\boldsymbol{Q}_{[:t]},\boldsymbol{K}_{[:t]},\boldsymbol{V}_{[:t]}\). In principle, the \(d\) of \(\boldsymbol{Q},\boldsymbol{K}\) can differ from that of \(\boldsymbol{V},\boldsymbol{O}\); for instance, GAU and MLA exemplify this, but simplifying them to the same dimension doesn't alter the essential problem.
Standard softmax attention typically refers to the attention mechanism proposed in "Attention is All You Need":
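$$\boldsymbol{O} = \mathop{\text{softmax}}\left(\boldsymbol{Q}\boldsymbol{K}^{\top} + \log\boldsymbol{M}\right)\boldsymbol{V}$$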
Here, the scaling factor \(1/\sqrt{d}\) is omitted since it can always be absorbed into \(\boldsymbol{Q},\boldsymbol{K}\). \(\mathop{\text{softmax}}\) performs exponential normalization along the second dimension, and \(\boldsymbol{M}\in\mathbb{R}^{n\times n}\) is a lower triangular matrix called the mask matrix, defined as:
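$$\boldsymbol{M}_{ij} = \begin{cases}1, & i\geq j\\ 0, & i< j\end{cases}$$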
\(\log\boldsymbol{M}\) means taking the logarithm element-wise for \(\boldsymbol{M}\), where \(\log 0 = -\infty\). Writing softmax attention in component form yields:
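$$\boldsymbol{o}_t = \frac{\sum_{j=1}^t \exp(\boldsymbol{q}_t^{\top}\boldsymbol{k}_j)\,\boldsymbol{v}_j}{\sum_{j=1}^t \exp(\boldsymbol{q}_t^{\top}\boldsymbol{k}_j)}$$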
The denominator primarily maintains numerical stability; additionally, if we apply RMSNorm to \(\boldsymbol{O}\), the denominator automatically cancels. Thus, the core of softmax attention lies in the numerator part:
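$$\boldsymbol{O} = \left(\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M}\right)\boldsymbol{V}$$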
where \(\odot\) denotes Hadamard product, and \(\exp\) is applied element-wise. The denominator simply replaces \(\boldsymbol{V}\) with an \(n\times 1\) all-ones matrix, which we can supplement if needed. The standard implementation of softmax attention requires computing the \(n\times n\) matrix \(\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})\), so both space and time complexity are proportional to \(n^2\). The emergence of Flash Attention reduces space requirements, but quadratic time complexity remains unavoidable.
Original Form#
The earliest linear attention approaches mainly imitated and approximated softmax attention. The simplest scheme directly removes \(\exp\), yielding:
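$$\boldsymbol{O} = (\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{M})\boldsymbol{V} \tag{6}$$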
For simplicity, we stipulate that matrix multiplication takes precedence over Hadamard product, allowing us to omit a set of parentheses. Why is this form "linear" attention? To understand this quickly, consider the non-causal version without \(\odot \boldsymbol{M}\): \(\boldsymbol{O} = (\boldsymbol{Q}\boldsymbol{K}^{\top})\boldsymbol{V} = \boldsymbol{Q}(\boldsymbol{K}^{\top}\boldsymbol{V})\). Note that computing \(\boldsymbol{K}^{\top}\boldsymbol{V}\) has complexity \(\mathcal{O}(nd^2)\), resulting in a \(d\times d\) matrix, then multiplying by \(\boldsymbol{Q}\) also has complexity \(\mathcal{O}(nd^2)\), so overall complexity is linear in \(n\).
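As a quick sanity check of this complexity argument, here is a minimal NumPy sketch (illustrative only; shapes and names are arbitrary) comparing the two association orders in the non-causal case:

```python
import numpy as np

n, d = 1024, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Quadratic order: materialize the n x n matrix Q K^T first, then multiply by V.
O_quadratic = (Q @ K.T) @ V      # O(n^2 d) time, O(n^2) memory

# Linear order: materialize the d x d matrix K^T V first, then multiply by Q.
O_linear = Q @ (K.T @ V)         # O(n d^2) time, O(d^2) extra memory

assert np.allclose(O_quadratic, O_linear)
```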
For the causal version (6), we can understand it through component form:
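$$\boldsymbol{o}_t = \sum_{j=1}^t (\boldsymbol{k}_j^{\top}\boldsymbol{q}_t)\,\boldsymbol{v}_j = \left(\sum_{j=1}^t \boldsymbol{v}_j\boldsymbol{k}_j^{\top}\right)\boldsymbol{q}_t$$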
If we denote the parenthesized part as \(\boldsymbol{S}_t\), then:
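$$\boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{v}_t\boldsymbol{k}_t^{\top},\qquad \boldsymbol{o}_t = \boldsymbol{S}_t\boldsymbol{q}_t,\qquad \boldsymbol{S}_0 = \boldsymbol{0} \tag{8}$$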
Thus, the causal attention form can be written as a linear RNN with state \(\boldsymbol{S}_t\), so each step's complexity is constant, and total complexity is proportional to sequence length \(n\). Note the emergence of "linear RNN"—a broader concept where linear attention constitutes one type of linear RNN. Linear RNNs have developed independently for some time, such as previously introduced LRU and SSM, but recently competitive linear architectures have taken the form of linear attention.
Early linear attention exhibited obvious imitative characteristics of softmax attention, such as adding denominators to Equation (6) for normalization. To enable normalization, \(\boldsymbol{k}_j^{\top} \boldsymbol{q}_t\) must be non-negative, leading to applying non-negative activation functions to \(\boldsymbol{Q},\boldsymbol{K}\). A series of works represented by Performer and RFA even constructed models aiming to approximate \(\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})\).
However, subsequent research like "The Devil in Linear Transformer" found that normalization along the sequence dimension cannot completely avoid numerical instability; instead, post-hoc normalization is preferable, such as:
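$$\boldsymbol{o}_t = \mathop{\text{RMSNorm}}(\boldsymbol{S}_t\boldsymbol{q}_t)$$
(written here with RMSNorm, echoing the earlier remark that a norm on the output absorbs the denominator; the specific normalizer is a design choice).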
Since normalization along the sequence dimension is thus unnecessary, applying non-negative activation functions to \(\boldsymbol{Q},\boldsymbol{K}\) to keep \(\boldsymbol{k}_j^{\top} \boldsymbol{q}_t\) non-negative is no longer essential. Is there still value in applying (not necessarily non-negative) activation functions to \(\boldsymbol{Q},\boldsymbol{K}\)? In the author's view, adding activation functions is the researcher's freedom, and it is entirely possible that some activation function gives better results. However, an activation function does not change the linear attention form, so it does not affect our description; moreover, existing results show that omitting it already works well enough.
Varied Forgetting Gates#
From Equation (8), we can see that current linear attention essentially performs \(\mathop{\text{cumsum}}\), equally weighting all historical information. It's not hard to imagine that when enough tokens accumulate, each token's information proportion becomes extremely small, making it difficult for the fixed-size \(\boldsymbol{S}_t\) matrix to accurately reconstruct any token. An intuitive analogy is each token's memory becoming blurred.
To alleviate this issue, RetNet introduced forgetting effects to linear attention:
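$$\boldsymbol{S}_t = \gamma\,\boldsymbol{S}_{t-1} + \boldsymbol{v}_t\boldsymbol{k}_t^{\top},\qquad \boldsymbol{o}_t = \boldsymbol{S}_t\boldsymbol{q}_t \tag{10}$$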
where decay factor \(\gamma\in(0,1)\) is set as a constant in RetNet, though some make it trainable or change \(\gamma\) to a diagonal matrix, etc. The linear attention used in MiniMax-01 also follows this form. Note that decay factors existed before RetNet but mostly appeared in linear RNN forms like LRU and SSM mentioned earlier. RetNet was likely the first to combine it with linear attention. Adding decay factors makes models tend to forget more distant historical information, at least ensuring resolution for recent tokens—essentially embodying the "recency bias" characteristic aligned with language model properties, often leading to better performance.
Additionally, a noteworthy detail is that RetNet also applies RoPE to \(\boldsymbol{Q},\boldsymbol{K}\), equivalent to extending the decay factor to complex numbers \(\gamma e^{\text{i}\theta}\). From the LRU perspective, this considers complex eigenvalues. Although adding position encoding to RNNs might seem somewhat incongruous, some experiments like recent TransXSSM show that adding RoPE to linear attention also has certain positive effects. Of course, this might depend on specific model variants and experimental settings.
A simple generalization of Equation (10) replaces \(\gamma\) with a position-dependent function \(\gamma_t\), already seen in SSM. Later, works like DFW, Mamba, Mamba2 extended this to input-dependent forms, forming a series of works on "data-dependent decay," already very similar to forget gates in nonlinear RNNs like GRU and LSTM, except that to maintain model linearity, dependence on State (like \(\boldsymbol{S}_t\)) is removed.
Why do we prefer linear RNNs? Because for linear RNNs one can almost always find some way to parallelize training, which gives them a chance to match softmax attention in both training and inference efficiency. The "universal solution" for parallelization is to convert the recursion into a Prefix Sum problem and then apply an Associative Scan; we briefly introduced this general approach in the "Parallelization" section of "Google's New Work Attempts to 'Revive' RNN: Can RNN Shine Again?".
However, the "universal solution" isn't GPU-efficient. GPUs are most efficient at matrix multiplication, so the ideal is a parallel algorithm dominated by matrix multiplications; even without full parallelization, a chunk-by-chunk recursive form that makes heavy use of matrix multiplication can already significantly improve training efficiency. This in turn imposes requirements on the model; for example, only forget gates of outer-product form admit such a formulation. A typical counterexample is Mamba, whose forget gate is not of outer-product form and therefore cannot fully exploit the GPU, which led to subsequent variants such as Mamba2 and GLA.
Test-Time Training#
So far, linear attention has evolved from simple imitation of softmax attention to introducing static decay factors and even "data-dependent decay," forming its own characteristics and demonstrating value in many tasks. However, most progress relied on manual experience-based design. We can't help but ask: Are there higher-level principles guiding the design of linear attention or even general sequence models (Token-Mixer)?
For this question, TTT (Test Time Training) provides its own answer, treating sequence model construction as an "online learning" problem and proposing using optimizers to build (not necessarily linear) RNNs. Specifically, it views \(\boldsymbol{K},\boldsymbol{V}\) as corpus pairs \((\boldsymbol{k}_1, \boldsymbol{v}_1),(\boldsymbol{k}_2, \boldsymbol{v}_2),\cdots,(\boldsymbol{k}_t, \boldsymbol{v}_t)\), trains a model \(\boldsymbol{v} = \boldsymbol{f}(\boldsymbol{S}_t;\boldsymbol{k})\) based on this corpus, and finally outputs \(\boldsymbol{o}_t = \boldsymbol{f}(\boldsymbol{S}_t;\boldsymbol{q}_t)\), where \(\boldsymbol{S}_t\) are model parameters, and the model architecture is largely arbitrary.
How does this relate to RNNs? Simple: optimizers like SGD, Adam, etc., are essentially RNNs over model parameters! Actually, this viewpoint isn't new; back in 2017 when Meta Learning was prevalent, researchers already proposed and utilized this, though the idea then was to try using RNNs (LSTM) to simulate better optimizers. Details can be found in "Optimization as a Model for Few-Shot Learning".
As they say, "What goes around comes around." After many years, TTT instead proposes building RNNs through optimizers. Its process is: First, current model parameters are \(\boldsymbol{S}_{t-1}\); the optimizer (SGD) receives new data \((\boldsymbol{k}_t, \boldsymbol{v}_t)\), updates model parameters to \(\boldsymbol{S}_t\) based on this data, and finally returns prediction result \(\boldsymbol{f}(\boldsymbol{S}_{t-1};\boldsymbol{q}_t)\) for \(\boldsymbol{q}_t\), and so on. Thus, the RNN implemented by TTT can be uniformly written as:
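$$\boldsymbol{S}_t = \boldsymbol{S}_{t-1} - \eta_t\,\nabla_{\boldsymbol{S}_{t-1}}\mathcal{L}(\boldsymbol{f}(\boldsymbol{S}_{t-1};\boldsymbol{k}_t), \boldsymbol{v}_t),\qquad \boldsymbol{o}_t = \boldsymbol{f}(\boldsymbol{S}_{t-1};\boldsymbol{q}_t) \tag{11}$$
(some variants instead read out with the updated state, \(\boldsymbol{o}_t = \boldsymbol{f}(\boldsymbol{S}_t;\boldsymbol{q}_t)\), as linear attention conventionally does in the table below).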
where \(\mathcal{L}(\boldsymbol{f}(\boldsymbol{S}_{t-1};\boldsymbol{k}_t), \boldsymbol{v}_t)\) is the loss on the current data \((\boldsymbol{k}_t, \boldsymbol{v}_t)\) under the current parameters \(\boldsymbol{S}_{t-1}\), and \(\eta_t\) is the learning rate; following the previous section's "data-dependent decay", \(\eta_t\) can also be made data-dependent. This form covers many RNN models; for instance, Equations (8) and (10) are both special cases:
| RNN | State update \(\boldsymbol{S}_t\) | Output \(\boldsymbol{o}_t\) | \(\boldsymbol{f}(\boldsymbol{S};\boldsymbol{k})\) | \(\mathcal{L}(\boldsymbol{f}(\boldsymbol{S};\boldsymbol{k}),\boldsymbol{v})\) | \(\eta_t\) |
|---|---|---|---|---|---|
| (8) | \(\boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top}\) | \(\boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t\) | \(\boldsymbol{S}\boldsymbol{k}\) | \(-\boldsymbol{v}^{\top}(\boldsymbol{S}\boldsymbol{k})\) | 1 |
| (10) | \(\boldsymbol{S}_t = \gamma\boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top}\) | \(\boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t\) | \(\boldsymbol{S}\boldsymbol{k}\) | \(-\boldsymbol{v}^{\top}(\boldsymbol{S}\boldsymbol{k}) + \frac{1-\gamma}{2}\Vert\boldsymbol{S}\Vert_F^2\) | 1 |
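As a tiny sanity check of the table's row for Equation (8), a hypothetical one-step SGD routine on the inner-product loss reproduces exactly the update \(\boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top}\) of the earliest linear attention (a minimal NumPy sketch; names are ours):

```python
import numpy as np

def ttt_step_inner_product(S, k, v, eta=1.0):
    # Loss L(S) = -v^T (S k); its gradient w.r.t. S is -v k^T,
    # so one SGD step gives S - eta * (-v k^T) = S + eta * v k^T.
    grad = -np.outer(v, k)
    return S - eta * grad

d = 8
rng = np.random.default_rng(0)
S = rng.standard_normal((d, d))
k, v = rng.standard_normal(d), rng.standard_normal(d)

# The TTT step coincides with the state update of the earliest linear attention.
assert np.allclose(ttt_step_inner_product(S, k, v), S + np.outer(v, k))
```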
TTT's original paper explores nonlinear RNNs under mini-batch settings; later, Titans adds momentum to TTT's SGD; subsequently, "Test-Time Training Done Right" explores large-batch TTT usage and investigates "TTT + Muon" combinations. Note that TTT only uses optimizers to build RNNs; trainable parameters outside the RNN, like those in \(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}\), are still trained with the overall optimizer after constructing the entire model.
A more thought-provoking question: Why can TTT serve as a "guiding principle" for building RNNs? The core goal of RNNs is to effectively compress historical data into a fixed-size State, and model parameters are exactly fixed-size. Training models somewhat resembles compressing training data into model weights; TTT precisely leverages its high alignment with RNN objectives. Put bluntly, if viewing RNNs as compression tasks, TTT treats model \(\boldsymbol{f}\) as the "decompressor," its weights as the "compressed package," and the compression algorithm as SGD, with compression rate being loss \(\mathcal{L}\).
Thus, we no longer need to spend effort directly devising recursive formulas; instead, we design the model \(\boldsymbol{f}\) and the loss \(\mathcal{L}\). To judge whether an RNN is strong or reliable, we need only examine the corresponding \(\boldsymbol{f}\) and \(\mathcal{L}\).
Additionally, TTT using online learning to build RNNs means resulting RNNs necessarily align well with ICL (In-Context Learning) tasks—another advantage of TTT as a "guiding principle." Previously, "Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers" even reversed this, explaining ICL capability by removing softmax from softmax attention to get linear attention; from today's perspective, it constructed corresponding TTT.
Removing Old, Adding New#
For example, the loss function corresponding to the earliest linear attention is \(-\boldsymbol{v}^{\top}(\boldsymbol{S}\boldsymbol{k})\), which looks unreliable at first glance: it is unbounded below, so it may drive \(\boldsymbol{S}\) toward infinity. In contrast, RetNet adds an L2 regularization term to the loss, which removes this risk and, from an optimization perspective, also mitigates overfitting, yielding a better RNN.
However, using inner product as loss function, while concise and somewhat reasonable, doesn't directly encourage \(\boldsymbol{S}\boldsymbol{k}=\boldsymbol{v}\), so it's not an ideal regression loss. A better objective function should be squared loss: \(\frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k} - \boldsymbol{v}\Vert^2\). Substituting into TTT's formula (11) yields:
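$$\boldsymbol{S}_t = \boldsymbol{S}_{t-1} - \eta_t(\boldsymbol{S}_{t-1}\boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top},\qquad \boldsymbol{o}_t = \boldsymbol{S}_t\boldsymbol{q}_t$$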
This is DeltaNet, named from "Parallelizing Linear Transformers with the Delta Rule over Sequence Length", earlier proposed by "Linear Transformers Are Secretly Fast Weight Programmers". Note that \(\eta_t (\boldsymbol{S}_{t-1} \boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top} = (\boldsymbol{S}_{t-1} (\sqrt{\eta_t}\boldsymbol{k}_t) - (\sqrt{\eta_t}\boldsymbol{v}_t))(\sqrt{\eta_t}\boldsymbol{k}_t)^{\top}\), meaning \(\eta_t\) can always be absorbed into definitions of \(\boldsymbol{k}_t,\boldsymbol{v}_t\), so subsequent analysis only considers \(\eta_t=1\):
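$$\boldsymbol{S}_t = \boldsymbol{S}_{t-1} - (\boldsymbol{S}_{t-1}\boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top} = \boldsymbol{S}_{t-1}(\boldsymbol{I} - \boldsymbol{k}_t\boldsymbol{k}_t^{\top}) + \boldsymbol{v}_t\boldsymbol{k}_t^{\top},\qquad \boldsymbol{o}_t = \boldsymbol{S}_t\boldsymbol{q}_t \tag{13}$$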
If needed, we can replace \(\boldsymbol{k}_t,\boldsymbol{v}_t\) with \(\sqrt{\eta_t}\boldsymbol{k}_t,\sqrt{\eta_t}\boldsymbol{v}_t\) to restore \(\eta_t\). Comparing with linear attention's earliest form (8), DeltaNet's difference is subtracting \((\boldsymbol{S}_{t-1} \boldsymbol{k}_t)\boldsymbol{k}_t^{\top}\) before adding \(\boldsymbol{v}_t\boldsymbol{k}_t^{\top}\), where \(\boldsymbol{S}_{t-1} \boldsymbol{k}_t\) can be interpreted as the prediction result of new input \(\boldsymbol{k}_t\) under old model \(\boldsymbol{S}_{t-1}\).
Intuitively, "subtract then add" means first removing the model's old understanding of \(\boldsymbol{k}_t\), then supplementing new understanding based on \((\boldsymbol{k}_t,\boldsymbol{v}_t)\), achieving "remove old, add new" effect. This rule is called the "Delta Rule", precisely the source of "Delta" in DeltaNet. The Delta Rule isn't new; it's also called Least Mean Square, Widrow-Hoff Algorithm, etc., dating back to the 1960s. In fact, this field has few completely new things; many modifications can trace back to some "ancient" work, with current efforts mainly focusing on mining scalable parts.
Also note that chronologically, DeltaNet preceded TTT. Understanding RNNs from an online learning perspective had already sporadically appeared in some works before TTT, but TTT systematically proposed this "guiding principle" and applied it to build new RNN models, so we placed TTT first to make the entire introduction flow more naturally.
Some readers might question: Does DeltaNet still count as linear RNN? The answer is yes. By linear RNN, we mean the recurrence formula's dependency on State variables is linear, but dependency on inputs or \(\boldsymbol{q},\boldsymbol{k},\boldsymbol{v}\) can be nonlinear (though different dependency forms have varying parallel efficiency). From Equation (13), the right side always involves only the first power of \(\boldsymbol{S}_{t-1}\), so it satisfies the linear definition.
Inversion and Generalization#
As mentioned earlier, the most ideal (i.e., GPU-efficient) parallel algorithm for linear RNNs heavily utilizes matrix multiplication. To achieve this, first write DeltaNet as:
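$$\boldsymbol{S}_t = \boldsymbol{S}_{t-1} + (\boldsymbol{v}_t - \boldsymbol{S}_{t-1}\boldsymbol{k}_t)\boldsymbol{k}_t^{\top}$$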
Let \(\boldsymbol{u}_t = \boldsymbol{v}_t - \boldsymbol{S}_{t-1} \boldsymbol{k}_t\), then \(\boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{u}_t\boldsymbol{k}_t^{\top}\). That is, it simply replaces \(\boldsymbol{V}\) with \(\boldsymbol{U}=[\boldsymbol{u}_1,\boldsymbol{u}_2,\cdots,\boldsymbol{u}_n]^{\top}\) in the earliest linear attention. Iterating \(t-1\) times, we have:
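$$\boldsymbol{S}_{t-1} = \sum_{j=1}^{t-1}\boldsymbol{u}_j\boldsymbol{k}_j^{\top}\quad\Longrightarrow\quad \boldsymbol{u}_t = \boldsymbol{v}_t - \boldsymbol{S}_{t-1}\boldsymbol{k}_t = \boldsymbol{v}_t - \sum_{j=1}^{t-1}(\boldsymbol{k}_j^{\top}\boldsymbol{k}_t)\,\boldsymbol{u}_j$$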
The final equality in matrix form is \(\boldsymbol{U} = \boldsymbol{V} - (\boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)\boldsymbol{U}\), where \(\boldsymbol{M}^-=\boldsymbol{M} - \boldsymbol{I}\). This is a linear system; its solution can be directly expressed as:
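$$\boldsymbol{U} = (\boldsymbol{I} + \boldsymbol{B})^{-1}\boldsymbol{V},\qquad \boldsymbol{B} := \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-,\qquad \boldsymbol{O} = (\boldsymbol{Q}\boldsymbol{K}^{\top}\odot\boldsymbol{M})\,\boldsymbol{U} = (\boldsymbol{Q}\boldsymbol{K}^{\top}\odot\boldsymbol{M})(\boldsymbol{I} + \boldsymbol{B})^{-1}\boldsymbol{V}$$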
Here appears \((\boldsymbol{I}+\boldsymbol{B})^{-1}\), the inverse of an \(n\times n\) matrix, whose standard complexity is \(\mathcal{O}(n^3)\), higher than softmax attention! Fortunately, we never need the explicit inverse, only \(\boldsymbol{U}\), which can be obtained by solving the linear system \((\boldsymbol{I}+\boldsymbol{B})\boldsymbol{U}=\boldsymbol{V}\), reducing the complexity to \(\mathcal{O}(n^2)\). Further, exploiting the facts that \(\boldsymbol{I}+\boldsymbol{B}\) is lower triangular and that \(\boldsymbol{B}\) has low-rank structure, the complexity can be reduced to linear, and once written as block matrix multiplication the GPU can be fully utilized. Those details are best read in the original papers; this article focuses on the main mathematical principles.
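The equivalence itself can be checked numerically; the following is a minimal NumPy sketch (not the efficient chunked kernel) comparing the recurrent form of Equation (13) with the matrix form based on solving the linear system:

```python
import numpy as np

def deltanet_recurrent(Q, K, V):
    # S_t = S_{t-1}(I - k_t k_t^T) + v_t k_t^T,  o_t = S_t q_t
    n, d = Q.shape
    S = np.zeros((d, d))
    O = np.empty_like(V)
    for t in range(n):
        k, v, q = K[t], V[t], Q[t]
        S = S @ (np.eye(d) - np.outer(k, k)) + np.outer(v, k)
        O[t] = S @ q
    return O

def deltanet_matrix(Q, K, V):
    # O = (Q K^T ⊙ M) U with (I + B) U = V and B = K K^T ⊙ M^-.
    # In practice one exploits that I + B is lower triangular (and B low-rank);
    # here a generic solve is enough to check the algebra.
    n, _ = Q.shape
    M = np.tril(np.ones((n, n)))
    B = (K @ K.T) * (M - np.eye(n))          # strictly lower triangular
    U = np.linalg.solve(np.eye(n) + B, V)
    return (Q @ K.T * M) @ U

n, d = 64, 16
rng = np.random.default_rng(0)
Q, K, V = (0.1 * rng.standard_normal((n, d)) for _ in range(3))
assert np.allclose(deltanet_recurrent(Q, K, V), deltanet_matrix(Q, K, V))
```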
After DeltaNet, Gated DeltaNet (GDN) further introduced forget gates into DeltaNet—a predictable variation. Gated DeltaNet's original formulation is:
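$$\boldsymbol{S}_t = \boldsymbol{S}_{t-1}\left(\alpha_t(\boldsymbol{I} - \beta_t\boldsymbol{k}_t\boldsymbol{k}_t^{\top})\right) + \beta_t\boldsymbol{v}_t\boldsymbol{k}_t^{\top},\qquad \boldsymbol{o}_t = \boldsymbol{S}_t\boldsymbol{q}_t$$
with data-dependent scalars \(\alpha_t\) (the forget gate) and \(\beta_t\) (the write strength).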
But in the author's view, this formulation explicitly breaks the Delta Rule; a better formulation, as in Comba, multiplies the gate only onto the first \(\boldsymbol{S}_{t-1}\) of the Delta Rule form:
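$$\boldsymbol{S}_t = \gamma_t\,\boldsymbol{S}_{t-1} - \eta_t(\boldsymbol{S}_{t-1}\boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top},\qquad \boldsymbol{o}_t = \boldsymbol{S}_t\boldsymbol{q}_t$$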
This corresponds to taking the loss function \(\frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k} - \boldsymbol{v}\Vert^2 + \frac{1-\gamma_t}{2\eta_t}\Vert\boldsymbol{S}\Vert_F^2\). Mathematically, the two formulations are equivalent:
setting \(\gamma_t = \alpha_t\) and \(\eta_t = \alpha_t \beta_t\), and then absorbing \(1/\alpha_t\) into \(\boldsymbol{v}_t\), converts the former into the latter. So mathematically the two forms do not differ; and since most \(\alpha_t\) are close to 1, their capabilities likely differ little as well (Comba reports that (18) is slightly better), but the latter retains the Delta Rule form more transparently.
Theoretically, Gated DeltaNet can also be written in DeltaNet form: define \(\bar{\alpha}_t = \prod_{j=1}^t \alpha_j\), then dividing both sides of Equation (17) by \(\bar{\alpha}_t\) yields:
Combined with \(\boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t = (\bar{\alpha}_t^{-1}\boldsymbol{S}_t) (\bar{\alpha}_t\boldsymbol{q}_t)\), we find simply setting \(\bar{\alpha}_t\boldsymbol{q}_t,\bar{\alpha}_t^{-1}\boldsymbol{v}_t\) as new \(\boldsymbol{q}_t,\boldsymbol{v}_t\) reduces to DeltaNet form. However, this result only has theoretical derivation value in certain cases (e.g., deriving next section's attention matrix) because in practical computation, regardless of parameterization, for sufficiently large \(t\), either \(\bar{\alpha}_t\) or \(\bar{\alpha}_t^{-1}\) risks overflow.
After DeltaNet, another generalization is DeltaProduct, which expands \(\boldsymbol{k},\boldsymbol{v}\) several times then applies DeltaNet or Gated DeltaNet, attempting to enhance model state tracking capability. However, in the author's aesthetic judgment, rather than expanding by constant factors like DeltaProduct, it's better to attempt quadratic complexity RNNs like "The Space-Time Chapter: Treating Attention as Quadratic Complexity RNN", exploring opportunities to surpass softmax attention.
Back-Feeding Ongoing#
Speaking of surpassing softmax attention: as mentioned at the beginning, today's linear attention not only competes with softmax attention but has even begun to "back-feed" it. This may seem incredible, but on reflection it isn't hard to understand. In some sense, softmax attention has been regressing over the years, from MHA to GQA and MQA, each step a reduction aimed at compressing the KV Cache. Linear attention has no KV Cache problem, so it has been free to keep evolving in better directions.
To better see this, let's write aforementioned attention mechanisms in matrix form:
| Model | Formula |
|---|---|
| Softmax Attention | \((\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M})\boldsymbol{V}\) |
| Earliest Linear Attention | \((\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{M})\boldsymbol{V}\) |
| With Forget Gates | \((\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{\Gamma})\boldsymbol{V}\) |
| DeltaNet | \((\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{M})(\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)^{-1}\boldsymbol{V}\) |
| Gated DeltaNet | \(\begin{aligned}((\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{M})(\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)^{-1}\odot\boldsymbol{\Gamma})\boldsymbol{V} \\ =(\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{\Gamma})(\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{\Gamma}^-)^{-1}\boldsymbol{V}\end{aligned}\) |
where
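$$\boldsymbol{\Gamma}_{ij} = \begin{cases}\prod\limits_{l=j+1}^{i}\gamma_l, & i\geq j\\ 0, & i< j\end{cases}$$
(so \(\boldsymbol{\Gamma}_{ij}=\gamma^{i-j}\) for a constant decay, with the gates \(\alpha_l\) playing the role of \(\gamma_l\) in the Gated DeltaNet row)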
and \(\boldsymbol{\Gamma}^- = \boldsymbol{\Gamma} - \boldsymbol{I}\). Viewed this way, softmax attention's form remains only at the earliest linear attention stage (though this also proves its power). How is "back-feeding" achieved? First, we need a method to transform softmax attention into linear attention—not difficult. As early as "Transformer Upgrade Path 5: Linear Attention as Infinite-Dimensional", we summarized three schemes transforming softmax attention into infinite-dimensional linear attention.
In short, there exists a mapping \(\phi\) taking \(\boldsymbol{Q},\boldsymbol{K}\) from \(n\times d\) to \(n\times \infty\) and satisfying \(\exp(\boldsymbol{Q}\boldsymbol{K}^{\top}) = \phi(\boldsymbol{Q})\phi(\boldsymbol{K})^{\top}\); this is the "kernel trick." Then things become simple: we just replace \(\boldsymbol{Q},\boldsymbol{K}\) in the table above with \(\phi(\boldsymbol{Q}),\phi(\boldsymbol{K})\), then recover \(\exp\) and normalize, obtaining new softmax attention variants. For example, substituting into the forget-gate formula:
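$$\boldsymbol{O} = \left(\phi(\boldsymbol{Q})\phi(\boldsymbol{K})^{\top}\odot \boldsymbol{\Gamma}\right)\boldsymbol{V} = \left(\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{\Gamma}\right)\boldsymbol{V}$$
which, after normalization, is \(\mathop{\text{softmax}}(\boldsymbol{Q}\boldsymbol{K}^{\top} + \log\boldsymbol{\Gamma})\boldsymbol{V}\), i.e. softmax attention with an additive bias \(\log\boldsymbol{\Gamma}_{ij} = \sum_{l=j+1}^i\log\gamma_l\).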
If \(\gamma_t\) takes constant value, this is actually "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation"'s proposed ALIBI; if \(\gamma_t\) is input-dependent, then it's "Forgetting Transformer: Softmax Attention with a Forget Gate"'s proposed FoX.
A more interesting result is "Understanding Transformer from the Perspective of Associative Memory"'s proposed DeltaFormer—as the name suggests, it's the softmax attention version of DeltaNet. Replacing DeltaNet's \(\boldsymbol{Q},\boldsymbol{K}\) with \(\phi(\boldsymbol{Q}),\phi(\boldsymbol{K})\):
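$$\boldsymbol{O} = \underbrace{\left(\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M}\right)}_{\boldsymbol{A}}\Big(\boldsymbol{I} + \underbrace{\exp(\boldsymbol{K}\boldsymbol{K}^{\top})\odot \boldsymbol{M}^-}_{\boldsymbol{B}}\Big)^{-1}\boldsymbol{V}$$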
If normalization is needed, replace \(\exp\) with \(\text{softmax}\). Compared with softmax attention, DeltaFormer changes the original \(\boldsymbol{A}\boldsymbol{V}\) into \(\boldsymbol{A}(\boldsymbol{I}+\boldsymbol{B})^{-1}\boldsymbol{V}\). Note that:
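$$(\boldsymbol{I}+\boldsymbol{B})^{-1}\boldsymbol{V} = \boldsymbol{V} - \boldsymbol{B}\boldsymbol{V} + \boldsymbol{B}^2\boldsymbol{V} - \boldsymbol{B}^3\boldsymbol{V} + \cdots$$
where the series terminates because \(\boldsymbol{B}\) is strictly lower triangular and hence nilpotent.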
So DeltaFormer essentially first computes attention several times with \(\boldsymbol{K},\boldsymbol{K},\boldsymbol{V}\), sums the results into a new \(\boldsymbol{V}\), and then computes attention once with \(\boldsymbol{Q},\boldsymbol{K}\). This characteristic makes it particularly effective for multi-hop tasks (e.g., code). It also means DeltaFormer pairs especially well with MQA: the \((\boldsymbol{I}+\boldsymbol{B})^{-1}\boldsymbol{V}\) part involves only \(\boldsymbol{K},\boldsymbol{V}\), and under MQA \(\boldsymbol{K},\boldsymbol{V}\) have only a single head, which significantly reduces computation compared with MHA.
However, in the author's view, this fixed-coefficient summation may come with no free lunch: in the author's experiments, DeltaFormer's language-model loss changes little, which suggests that if the loss on some tasks drops significantly, the loss on other tasks must rise.
Hardcore Encoding#
Another noteworthy back-feeding work is PaTH Attention, from "PaTH Attention: Position Encoding via Accumulating Householder Transformations", which back-feeds DeltaNet to softmax attention from a position encoding perspective.
In "Transformer Upgrade Path 6: Completeness Analysis of Rotary Position Encoding", we pointed out that for any orthogonal matrix \(\boldsymbol{\Omega}\), \(\boldsymbol{R}_m = \boldsymbol{\Omega}^m\) is generalized RoPE. Besides rotation matrices, what other easily constructed orthogonal matrices exist? PaTH uses Householder matrices: let \(\boldsymbol{w}\) be any column vector with length \(\sqrt{2}\), then \(\boldsymbol{I}-\boldsymbol{w}\boldsymbol{w}^{\top}\) is an orthogonal matrix, which we also derived in "Orthogonal Matrix Transforming One Unit Vector to Another", with geometric meaning being mirror reflection.
It is easy to see that this is exactly the factor \(\boldsymbol{I}-\boldsymbol{k}_t\boldsymbol{k}_t^{\top}\) that multiplies \(\boldsymbol{S}_{t-1}\) in DeltaNet, so PaTH directly borrows this part, abandoning the \(\boldsymbol{\Omega}^m\) form and the \(\Vert\boldsymbol{w}\Vert=\sqrt{2}\) constraint, and uses a product of \(\boldsymbol{I}-\boldsymbol{w}\boldsymbol{w}^{\top}\) factors to express position information:
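Schematically, the pre-softmax score between positions \(i \geq j\) becomes
$$\boldsymbol{q}_i^{\top}\boldsymbol{R}_{i,j}\,\boldsymbol{k}_j,\qquad \boldsymbol{R}_{i,j} = (\boldsymbol{I}-\boldsymbol{w}_i\boldsymbol{w}_i^{\top})(\boldsymbol{I}-\boldsymbol{w}_{i-1}\boldsymbol{w}_{i-1}^{\top})\cdots(\boldsymbol{I}-\boldsymbol{w}_{j+1}\boldsymbol{w}_{j+1}^{\top})$$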
Writing \(\boldsymbol{R}_{i,j}\) recursively: \(\boldsymbol{R}_{i,j} = (\boldsymbol{I}-\boldsymbol{w}_i\boldsymbol{w}_i^{\top})\boldsymbol{R}_{i-1,j}\), with \(\boldsymbol{R}_{j,j} = \boldsymbol{I}\). Comparing with DeltaNet's Equation (13), this corresponds to \(\boldsymbol{v}_t\) being always zero but the initial value \(\boldsymbol{S}_0\) no longer being zero. Using the same process as in the "Inversion and Generalization" section:
where \(\boldsymbol{W}=[\boldsymbol{w}_1,\boldsymbol{w}_2,\cdots,\boldsymbol{w}_n]^{\top}\), slices are interpreted as in NumPy, e.g., \(\boldsymbol{W}_{[j:i]}=[\boldsymbol{w}_{j+1},\boldsymbol{w}_{j+2},\cdots,\boldsymbol{w}_i]^{\top}\), and slicing binds more tightly than transposition. Note that the matrix being inverted is lower triangular, and triangular matrices have an important property: the diagonal elements of the inverse are the reciprocals of the original diagonal elements; for block-triangular matrices, the diagonal blocks of the inverse are likewise the inverses of the original diagonal blocks. So we can write:
The subsequent transformations are easier to follow in component form; after some detailed derivation (omitted here for brevity), we obtain the full (pre-softmax) attention matrix:
Impressed? There's more. Direct inversion complexity is \(\mathcal{O}(n^3)\), definitely unacceptable; we must leverage \(\boldsymbol{W}\boldsymbol{W}^{\top}\)'s low-rank characteristics to reduce complexity to \(\mathcal{O}(n^2)\), then derive backpropagation, finally implement efficiently similar to Flash Attention. These details must be explored in the original paper—extremely hardcore throughout.
From a position encoding perspective, PaTH is a type of CoPE (Contextual Position Encoding)—positions aren't indices \(1,2,3,\cdots\) but automatically generated position signals based on context content. Similarly, FoX can be seen as contextual ALIBI. Context-dependent position information is a main characteristic of current linear attention and possibly the main direction for back-feeding softmax attention.
Simplification, Infinite Delight#
Let's explore PaTH a little more deeply; this not only helps in understanding PaTH itself but also builds familiarity with DeltaNet, since the two are closely related. This section examines two special cases of PaTH that make the connection between PaTH and DeltaNet more transparent.
The first special case: \(\boldsymbol{W}=\boldsymbol{K}\), substituting into (28) yields:
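$$\left(\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{M}\right)\left(\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-\right)^{-1}$$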
Does this look familiar? This is exactly DeltaNet's attention matrix! From this special case, the difference between PaTH and DeltaFormer is that DeltaFormer uses the kernel trick to apply \(\exp\) to DeltaNet's \(\boldsymbol{Q}\boldsymbol{K}^{\top}\) and \(\boldsymbol{K}\boldsymbol{K}^{\top}\), whereas PaTH applies \(\exp\) directly to DeltaNet's attention matrix.
The second special case reintroduces the \(\Vert\boldsymbol{w}\Vert=\sqrt{2}\) constraint, so that each \(\boldsymbol{I}-\boldsymbol{w}\boldsymbol{w}^{\top}\) is orthogonal. Introducing:
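$$\boldsymbol{R}_i = \boldsymbol{R}_{i,0} = (\boldsymbol{I}-\boldsymbol{w}_i\boldsymbol{w}_i^{\top})(\boldsymbol{I}-\boldsymbol{w}_{i-1}\boldsymbol{w}_{i-1}^{\top})\cdots(\boldsymbol{I}-\boldsymbol{w}_1\boldsymbol{w}_1^{\top})$$
Since each factor is both symmetric and orthogonal under this constraint, the overlapping factors in \(\boldsymbol{R}_i\boldsymbol{R}_j^{\top}\) cancel in pairs.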
Then \(\boldsymbol{R}_{i,j} = \boldsymbol{R}_i \boldsymbol{R}_j^{\top}\). This equality means we can implement relative-position PaTH using absolute position like RoPE: just multiply each \(\boldsymbol{q}_i^{\top},\boldsymbol{k}_i^{\top}\) by \(\boldsymbol{R}_i\), then apply standard softmax attention implementation. What operation is multiplying by \(\boldsymbol{R}_i\)? Repeating the previous section's expansion process, we obtain:
Does this look familiar? Actually, the second part is \(\text{DeltaNet}(\boldsymbol{Q},\boldsymbol{W},\boldsymbol{W})\)! So in this case, PaTH implementation equivalently achieves:
That is, DeltaNet is used to add position encoding to \(\boldsymbol{Q},\boldsymbol{K}\). Viewed this way, PaTH (under the \(\Vert\boldsymbol{w}\Vert=\sqrt{2}\) constraint) amounts to an intra-layer mixture of softmax attention and DeltaNet. Of course, we could also set aside the preceding derivation and, even when \(\Vert\boldsymbol{w}\Vert\neq\sqrt{2}\), simply implement the model according to the above equation. This resembles the Canon Layers scheme of using convolution to inject position information into \(\boldsymbol{Q},\boldsymbol{K}\), except that here the convolution is not a short convolution but a DeltaNet-style long convolution.
Unconventional Approaches#
Finally, let's examine another recent linear attention model worth noting: MesaNet (and Atlas, a similar contemporaneous work). TTT's online-learning perspective tells us that DeltaNet essentially uses SGD to optimize the objective \(\frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k} - \boldsymbol{v}\Vert^2\). Looking closely, \(\boldsymbol{S}\boldsymbol{k}\) is only a linear function of \(\boldsymbol{k}\), so this is really just a linear regression problem, and linear regression has a closed-form solution!
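Concretely, writing \(\boldsymbol{G}_t,\boldsymbol{H}_t\) for the two running moments, the minimizer of the accumulated squared loss is
$$\boldsymbol{S}_t = \mathop{\text{argmin}}_{\boldsymbol{S}}\sum_{j=1}^t \frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k}_j - \boldsymbol{v}_j\Vert^2 = \boldsymbol{G}_t\boldsymbol{H}_t^{-1},\qquad \boldsymbol{G}_t = \sum_{j=1}^t \boldsymbol{v}_j\boldsymbol{k}_j^{\top},\quad \boldsymbol{H}_t = \sum_{j=1}^t \boldsymbol{k}_j\boldsymbol{k}_j^{\top},$$
with the output read off as \(\boldsymbol{o}_t = \boldsymbol{S}_t\boldsymbol{q}_t\).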
MesaNet uses this analytical solution to build sequence models, originating from "Uncovering mesa-optimization algorithms in Transformers", with efficient training implemented by "MesaNet: Sequence Modeling by Locally Optimal Test-Time Training". MesaNet adds forget gates to \(\boldsymbol{G}_t,\boldsymbol{H}_t\) in the above formula, then adds diagonal matrix \(\boldsymbol{\Lambda}_t\) during inversion to avoid singularity. The overall model is:
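Schematically (the exact gating and parameterization differ in details),
$$\boldsymbol{G}_t = \gamma_t\boldsymbol{G}_{t-1} + \boldsymbol{v}_t\boldsymbol{k}_t^{\top},\qquad \boldsymbol{H}_t = \gamma_t\boldsymbol{H}_{t-1} + \boldsymbol{k}_t\boldsymbol{k}_t^{\top},\qquad \boldsymbol{o}_t = \boldsymbol{G}_t(\boldsymbol{H}_t + \boldsymbol{\Lambda}_t)^{-1}\boldsymbol{q}_t$$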
Clearly, \(\boldsymbol{G}_t,\boldsymbol{H}_t\) have linear complexity in the sequence length, so the computation of \(\boldsymbol{o}_t\) is also of linear complexity. Thus MesaNet still belongs to the linear attention family, and thanks to the analytical solution it essentially guarantees better results than DeltaNet, and even Gated DeltaNet, in most cases. From a signal-processing perspective, the difference between MesaNet and DeltaNet is the difference between Recursive Least Squares and Least Mean Squares.
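For intuition, here is a schematic NumPy sketch of such a recurrence; the constant gate \(\gamma\) and scalar regularizer \(\lambda\), as well as the function name, are illustrative simplifications rather than the paper's exact parameterization:

```python
import numpy as np

def mesanet_recurrent(Q, K, V, gamma=0.95, lam=1e-2):
    # Schematic MesaNet-style step: keep running (gated) moments
    #   G_t = gamma * G_{t-1} + v_t k_t^T,   H_t = gamma * H_{t-1} + k_t k_t^T,
    # and read out o_t = G_t (H_t + Lambda)^{-1} q_t via a linear solve.
    n, d = Q.shape
    G = np.zeros((d, d))
    H = np.zeros((d, d))
    Lam = lam * np.eye(d)            # diagonal regularizer, keeps H_t + Lam invertible
    O = np.empty_like(V)
    for t in range(n):
        k, v, q = K[t], V[t], Q[t]
        G = gamma * G + np.outer(v, k)
        H = gamma * H + np.outer(k, k)
        O[t] = G @ np.linalg.solve(H + Lam, q)   # avoids forming the explicit inverse
    return O
```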
It seems to be all advantages, so why does the author classify it as "unconventional"? In the author's view, MesaNet "succeeds because of the analytical solution, and is also limited by it." The analytical solution lets it generally outperform DeltaNet, but it also gives a feeling of a dead end, because almost any slight modification no longer admits an analytical solution. Throughout the history of mathematics, branches that relied on analytical solutions have largely faded today, because analytical solutions are too rare and too unrepresentative.
Implementation-wise, MesaNet needs to invert the matrix \(\boldsymbol{H}_t + \boldsymbol{\Lambda}_t\), which is not triangular. Although \((\boldsymbol{H}_t + \boldsymbol{\Lambda}_t)^{-1} \boldsymbol{q}_t\) can still be obtained by solving a linear system rather than forming an explicit inverse, the lack of triangular structure makes the solve considerably more expensive. Computing all of the \((\boldsymbol{H}_t + \boldsymbol{\Lambda}_t)^{-1} \boldsymbol{q}_t\) in parallel at low cost will remain a long-term difficulty for MesaNet; the current paper uses the conjugate gradient method for approximate solutions, which is workable but not perfect.
In terms of theoretical capability, MesaNet is not strictly superior to DeltaNet either. MesaNet's update rules for \(\boldsymbol{G}_t,\boldsymbol{H}_t\) are still simple moving averages, and its matrix inversion involves no interaction between tokens, so its capability ceiling likely does not exceed that of DeltaNet with the Delta Rule. Intuitively, MesaNet tries to remember all of the \(\boldsymbol{k},\boldsymbol{v}\) pairs, and "wanting everything" can leave every memory somewhat blurred, whereas DeltaNet's principle is "remove the old, add the new": precisely because it removes the old, it can maintain long-term, precise memories of selected content.
We can also see this non-optimality from a special example: all the attention variants so far, except MesaNet, allow the choice of sharing \(\boldsymbol{K}\) and \(\boldsymbol{V}\), where "allow" means not necessarily optimal but at least trainable to non-trivial results. MesaNet does not, because if \(\boldsymbol{K}\) and \(\boldsymbol{V}\) are identical then \(\boldsymbol{G}_t = \boldsymbol{H}_t\), and MesaNet's \(\boldsymbol{S}_t\) degenerates to (essentially) the identity matrix.
Overall, MesaNet is an aesthetically pleasing model, but analytical solution also increases its complexity and limits flexibility, leaving much room for exploration. Readers wanting to learn more about building sequence models based on linear regression can read TTR, which discusses various linear regression objective-based sequence models in detail.
Nascent Development#
This article briefly outlines the development trajectory of linear attention and introduces mathematical principles of some models. Linear attention started by imitating softmax attention, gradually developed its own characteristics, and now has become a highly competitive sequence modeling solution, even reciprocally providing new ideas for softmax attention development. This process itself is full of interest and inspiration.
From simple removal of exponential operations to introduction of forgetting mechanisms, from Delta Rule principles to TTT's online learning perspective, from hardware-efficient parallel algorithm design to back-feeding softmax attention, each advancement reflects the research community's deepening understanding of sequence modeling and continuous innovation in computational efficiency and model capability trade-offs.
Looking ahead, as model scales continue expanding and sequence lengths increase, linear attention's advantages will likely become more pronounced. Meanwhile, cross-fertilization between linear and softmax attention will likely yield more interesting results. The development path of linear attention remains nascent, with unlimited possibilities awaiting exploration.
Original Article: Su Jianlin. A Brief History of Linear Attention: From Imitation and Innovation to Back-Feeding. Scientific Spaces.