Research Context

It is well known that although Transformer models based on attention mechanisms exhibit good parallel performance, their space and time complexity are both $\mathcal{O}(n^2)$, where $n$ is the sequence length. Consequently, when $n$ is large, the computational requirements of Transformer models become prohibitive. Recently, numerous efforts have been devoted to reducing Transformer computational costs, including model pruning, quantization, distillation, and other compression techniques, as well as modifying the attention structure to reduce complexity to $\mathcal{O}(n\log n)$ or even $\mathcal{O}(n)$.

I recently read the paper "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", which introduced me to the exploration of linear attention. After reviewing several related papers, I gained valuable insights and have compiled my understanding of linear attention in this article.

Attention Mechanisms#

Standard Attention Formulation

Currently, the most popular attention mechanism is Scaled-Dot Attention, formulated as:

(1) \[ \text{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}) = \text{softmax}\left(\boldsymbol{Q}\boldsymbol{K}^{\top}\right)\boldsymbol{V} \]

Here, $\boldsymbol{Q}\in\mathbb{R}^{n\times d_k}$, $\boldsymbol{K}\in\mathbb{R}^{m\times d_k}$, and $\boldsymbol{V}\in\mathbb{R}^{m\times d_v}$. For simplicity, we omit the explicit scaling factor of attention. Since this article focuses on self-attention scenarios, we uniformly set $\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}\in\mathbb{R}^{n\times d}$ for convenience of presentation. In typical settings we have $n > d$ or even $n\gg d$ (e.g., in BERT base the per-head dimension is $d=64$). For related background, readers can refer to my article "A Brief Reading of 'Attention is All You Need' (Introduction + Code)", as well as some improvements discussed in "Breaking Through the Bottleneck: Building a More Powerful Transformer" and "Google's New Work Synthesizer: We Still Don't Fully Understand Self-Attention"; these will not be discussed in depth here.

Removing Softmax#

Readers might not have realized that the key factor limiting the efficiency of attention is actually the Softmax in its definition!

In fact, a simple derivation reveals this conclusion. The step $\boldsymbol{Q}\boldsymbol{K}^{\top}$ yields an $n\times n$ matrix, which is what makes the complexity of attention $\mathcal{O}(n^2)$. Without the Softmax, attention would simply be the product of three matrices, $\boldsymbol{Q}\boldsymbol{K}^{\top}\boldsymbol{V}$. Since matrix multiplication is associative, we could first compute $\boldsymbol{K}^{\top}\boldsymbol{V}$, obtaining a $d\times d$ matrix, and then left-multiply it by $\boldsymbol{Q}$. Since $d \ll n$, the overall complexity of this route is only $\mathcal{O}(n)$ (both steps cost $\mathcal{O}(nd^2)$, i.e., linear in the sequence length).

Complexity Breakthrough

In other words, removing Softmax from attention can reduce the complexity to the ideal linear level $\mathcal{O}(n)$! This is exactly our ultimate goal: Linear Attention, attention with linear complexity. Therefore, the theme of this article is to explore linear attention after removing Softmax.
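To make the reordering concrete, here is a minimal numpy sketch (my own illustration, not code from any of the cited papers) verifying that the two multiplication orders give the same result, while only the first ever materializes the $n\times n$ matrix:

```python
import numpy as np

n, d = 2048, 64                        # sequence length and head dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Naive order: (Q K^T) V materializes an n x n matrix -> O(n^2 d) time, O(n^2) memory.
out_quadratic = (Q @ K.T) @ V

# Reordered: Q (K^T V) only ever builds a d x d intermediate -> O(n d^2) time, O(d^2) extra memory.
out_linear = Q @ (K.T @ V)

print(np.allclose(out_quadratic, out_linear))  # True: associativity gives the same result either way
```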

Generalized Definition#

Generalized Attention Formulation

The question is: can direct removal of Softmax still be considered attention? Can it achieve the effects of standard attention? To answer this, we first equivalently rewrite the definition of Scaled-Dot Attention from equation (1) as (vectors in this article are column vectors):

(2) \[ \text{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})_i = \frac{\sum\limits_{j=1}^n e^{\boldsymbol{q}_i^{\top}\boldsymbol{k}_j}\boldsymbol{v}_j}{\sum\limits_{j=1}^n e^{\boldsymbol{q}_i^{\top}\boldsymbol{k}_j}} \]

Thus, Scaled-Dot Attention is essentially a weighted average of $\boldsymbol{v}_j$ with weights $e^{\boldsymbol{q}_i^{\top}\boldsymbol{k}_j}$. Therefore, we can propose a generalized definition of attention:

(3) \[ \text{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})_i = \frac{\sum\limits_{j=1}^n \text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j)\boldsymbol{v}_j}{\sum\limits_{j=1}^n \text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j)} \]

That is, we replace $e^{\boldsymbol{q}_i^{\top}\boldsymbol{k}_j}$ with a general function $\text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j)$ of $\boldsymbol{q}_i$ and $\boldsymbol{k}_j$. To preserve distributional properties similar to those of attention, we require that $\text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j)\geq 0$ always holds. In other words, to define a new form of attention, we should keep the form of equation (3) and satisfy $\text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j)\geq 0$.

This general form of attention is also known as Non-Local Networks in computer vision, originating from the paper "Non-local Neural Networks".
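As a sanity check on equations (2) and (3), the following small numpy snippet (my own, with arbitrary toy shapes) implements the generalized form and confirms that choosing $\text{sim}$ to be the elementwise exponential of the dot products recovers unscaled softmax attention:

```python
import numpy as np

def generalized_attention(Q, K, V, sim):
    """Eq. (3), assuming sim acts elementwise on the matrix of dot products q_i^T k_j."""
    W = sim(Q @ K.T)                               # n x n matrix of non-negative weights
    return (W @ V) / W.sum(axis=1, keepdims=True)  # weighted average of the v_j for each query

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))

# With sim = exp, Eq. (3) reduces to Eq. (2), i.e. unscaled softmax attention of Eq. (1).
out = generalized_attention(Q, K, V, np.exp)

A = np.exp(Q @ K.T)
reference = (A / A.sum(axis=1, keepdims=True)) @ V
print(np.allclose(out, reference))  # True
```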

Examples#

Practical Implementations

If we directly remove Softmax, then $\text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j) = \boldsymbol{q}_i^{\top}\boldsymbol{k}_j$. However, the problem is that inner products cannot guarantee non-negativity, so this is not a reasonable choice. Below, we briefly introduce several feasible schemes.

Note: Among the linear attention methods introduced below, the first two come from the computer vision field, while the third is my own conception. None have been extensively tested on NLP tasks yet, providing experimental directions for NLP researchers working on model improvements (^_^). Incidentally, there are numerous improvements to attention in the CV field (beyond those mentioned below, such as EMANet), many of which are worth referencing for NLP researchers.

Kernel Function Form#

Kernel-Based Approach

A natural idea is: if each element of $\boldsymbol{q}_i,\boldsymbol{k}_j$ is non-negative, then the inner product will naturally be non-negative. To achieve this, we can apply activation functions $\phi,\varphi$ to $\boldsymbol{q}_i,\boldsymbol{k}_j$ respectively:

(4) \[ \text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j) = \phi(\boldsymbol{q}_i)^{\top} \varphi(\boldsymbol{k}_j) \]

where $\phi(\cdot),\varphi(\cdot)$ are activation functions with non-negative ranges. The paper "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" mentioned at the beginning of this article chooses $\phi(x)=\varphi(x)=\text{elu}(x)+1$.

If we must tell a story, equation (4) can be associated with "kernel methods", especially when $\phi=\varphi$: in that case $\phi$ plays the role of the feature map of some kernel, and $\langle \phi(\boldsymbol{q}_i), \phi(\boldsymbol{k}_j)\rangle$ is the inner product induced by that kernel. Relevant discussion can be found in the paper "Transformer dissection: An unified understanding for transformer's attention via the lens of kernel", which will not be elaborated further here.
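For concreteness, here is a rough numpy sketch of non-causal linear attention in the sense of equation (4), with $\phi(x)=\varphi(x)=\text{elu}(x)+1$; this is my own toy rendering, not the reference implementation of "Transformers are RNNs":

```python
import numpy as np

def elu_plus_one(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) for x <= 0, hence strictly positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    phi_Q, phi_K = elu_plus_one(Q), elu_plus_one(K)  # n x d feature maps, all entries positive
    KV = phi_K.T @ V                                 # d x d summary of keys and values
    Z = phi_Q @ phi_K.sum(axis=0)                    # length-n normalizers sum_j phi(q_i)^T phi(k_j)
    return (phi_Q @ KV) / Z[:, None]                 # n x d output, O(n d^2) overall

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))

# Cross-check against the direct O(n^2) evaluation of Eq. (3) with this sim.
W = elu_plus_one(Q) @ elu_plus_one(K).T
reference = (W @ V) / W.sum(axis=1, keepdims=True)
print(np.allclose(linear_attention(Q, K, V), reference))  # True
```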

Ingenious Use of Softmax#

Dimension-Wise Normalization

An earlier paper, "Efficient Attention: Attention with Linear Complexities", presents a more interesting alternative. It observes that for $\boldsymbol{Q}, \boldsymbol{K}\in\mathbb{R}^{n\times d}$, if $\boldsymbol{Q}$ is normalized along the $d$ dimension and $\boldsymbol{K}$ is normalized along the $n$ dimension, then the rows of $\boldsymbol{Q}\boldsymbol{K}^{\top}$ automatically sum to 1, i.e., the product is already normalized. Thus, it proposes:

(5) \[ \text{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}) = \text{softmax}_2\left(\boldsymbol{Q}\right)\text{softmax}_1(\boldsymbol{K})^{\top}\boldsymbol{V} \]

where $\text{softmax}_1$ and $\text{softmax}_2$ denote Softmax operations along the first ($n$) and second ($d$) dimensions, respectively. In other words, we apply Softmax to $\boldsymbol{Q},\boldsymbol{K}$ individually, rather than after computing $\boldsymbol{Q}\boldsymbol{K}^{\top}$.

If we directly take $\phi(\boldsymbol{q}_i)=\text{softmax}(\boldsymbol{q}_i)$ and $\varphi(\boldsymbol{k}_j)=\text{softmax}(\boldsymbol{k}_j)$, the result is clearly another special case of equation (4) (in equation (5) itself, the Softmax over $\boldsymbol{K}$ is taken across the $n$ positions rather than within each vector, but the spirit is the same). Moreover, similar designs have appeared more than once in computer vision, for example in A2-Nets.
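The following numpy snippet is my own reading of equation (5): Softmax over the feature dimension for $\boldsymbol{Q}$, Softmax over the sequence dimension for $\boldsymbol{K}$, and then the reassociated product. It also checks the "automatic normalization" claim on the implicit attention matrix:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 16, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

Qs = softmax(Q, axis=1)        # softmax_2: normalize each query over the d dimension
Ks = softmax(K, axis=0)        # softmax_1: normalize each key feature over the n dimension

out = Qs @ (Ks.T @ V)          # reassociated: only a d x d intermediate, linear in n

# The implicit n x n attention matrix Qs Ks^T has rows that automatically sum to 1.
A = Qs @ Ks.T
print(np.allclose(A.sum(axis=1), 1.0), out.shape)   # True (16, 8)
```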

Personal Conception#

Novel Formulation

Here, I present my own conception. Its starting point is not equation (4), but rather a direct approximation of the original definition in equation (2). From the first-order Taylor expansion of $e^x$, we have:

(6) \[ e^{\boldsymbol{q}_i^{\top}\boldsymbol{k}_j} \approx 1 + \boldsymbol{q}_i^{\top}\boldsymbol{k}_j \]

If $\boldsymbol{q}_i^{\top}\boldsymbol{k}_j\geq -1$, the right-hand side is guaranteed to be non-negative, allowing us to set $\text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j)=1 + \boldsymbol{q}_i^{\top}\boldsymbol{k}_j$. By now, readers may have realized that to ensure $\boldsymbol{q}_i^{\top}\boldsymbol{k}_j\geq -1$, we only need to apply $l_2$ normalization to $\boldsymbol{q}_i$ and $\boldsymbol{k}_j$ separately: the inner product of two unit vectors is a cosine similarity and therefore lies in $[-1, 1]$. The final scheme I propose is thus:

(7) \[ \text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j) = 1 + \left( \frac{\boldsymbol{q}_i}{\Vert \boldsymbol{q}_i\Vert}\right)^{\top}\left(\frac{\boldsymbol{k}_j}{\Vert \boldsymbol{k}_j\Vert}\right) \]

Although this differs from the form in equation (4), it is theoretically closer to the original Scaled-Dot Attention.
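A small numpy sketch of equation (7) follows, computed in the reassociated linear-complexity form and cross-checked against the direct $\mathcal{O}(n^2)$ evaluation of equation (3); the split into a constant term and a cosine term is simply a direct reading of the formula, not an official implementation:

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_linear_attention(Q, K, V):
    Qn, Kn = l2_normalize(Q), l2_normalize(K)    # now q_i^T k_j lies in [-1, 1], so 1 + q_i^T k_j >= 0
    n = K.shape[0]
    num = V.sum(axis=0) + Qn @ (Kn.T @ V)        # sum_j (1 + q_i^T k_j) v_j, reassociated
    den = n + Qn @ Kn.sum(axis=0)                # sum_j (1 + q_i^T k_j)
    return num / den[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))

# Cross-check against the direct O(n^2) form of Eq. (3) with this sim.
W = 1.0 + l2_normalize(Q) @ l2_normalize(K).T
reference = (W @ V) / W.sum(axis=1, keepdims=True)
print(np.allclose(cosine_linear_attention(Q, K, V), reference))  # True
```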

Comparative Analysis

There are many works focused on reducing the computational complexity of attention by modifying its form. Below, we briefly enumerate some of them.

Sparse Attention#

Sparse Pattern Approaches

We previously introduced OpenAI's Sparse Attention, which reduces attention computation by "retaining only values in small regions and forcing most attention values to zero." After special design, most elements of the attention matrix become zero, thus theoretically saving memory usage and computation. Subsequent similar works include "Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection" and "Longformer: The Long-Document Transformer".

However, this approach has two obvious shortcomings:

1. The choice of which attention regions to retain is made by hand, based on human prior rather than learned by the model.

2. It requires specific design optimizations in programming to achieve efficient implementation, making it difficult to generalize.

Reformer#

LSH-Based Optimization

Reformer is another representative improvement that reduces attention complexity to $\mathcal{O}(n\log n)$. In a sense, Reformer is also a form of sparse attention, but its sparse pattern is not predetermined: it uses LSH (Locality Sensitive Hashing) to (approximately) find the largest attention weights quickly and computes only those. Additionally, Reformer adopts reversible (RevNet-style) residual layers and redesigns the backpropagation process accordingly to reduce memory usage.

Thus, compared to the aforementioned sparse attention, Reformer addresses its first shortcoming but still suffers from the second: implementation complexity. Implementing LSH-based attention is much more complex than standard attention, and rewriting the backpropagation process for reversible networks is beyond the reach of most readers.

Linformer#

Dimensionality Reduction Approach

A work very similar to the linear attention discussed in this article is Facebook's recently released Linformer, which retains the original Scaled-Dot Attention form but projects $\boldsymbol{K},\boldsymbol{V}$ using two $m\times n$ matrices $\boldsymbol{E},\boldsymbol{F}$ before applying attention:

(8) \[ \text{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}) = \text{softmax}\left(\boldsymbol{Q}(\boldsymbol{E}\boldsymbol{K})^{\top}\right)\boldsymbol{F}\boldsymbol{V} \]

Thus, $\boldsymbol{Q}(\boldsymbol{E}\boldsymbol{K})^{\top}$ becomes only an $n\times m$ matrix. The authors claim that even for large sequence lengths $n$, $m$ can remain a moderate constant, making this attention linear as well. Similar ideas to Linformer appear in earlier CV papers such as "Asymmetric Non-local Neural Networks for Semantic Segmentation".
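For illustration only, here is a toy numpy version of equation (8) with random stand-ins for the projections $\boldsymbol{E},\boldsymbol{F}$ (in Linformer these are learned parameters); it just shows that the attention matrix shrinks from $n\times n$ to $n\times m$:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, m, d = 1024, 64, 32
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((m, n)) / np.sqrt(n) for _ in range(2))  # stand-ins for learned projections

A = softmax(Q @ (E @ K).T)          # n x m attention matrix instead of n x n
out = A @ (F @ V)                   # n x d output
print(A.shape, out.shape)           # (1024, 64) (1024, 32)
```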

Critique: However, I believe the claim that "$m$ can remain unchanged for ultra-long sequences" is questionable. The original paper only tested on MLM tasks for long sequences, and MLM does not heavily rely on long-range dependencies, making this experiment less convincing. Therefore, whether Linformer is truly linear remains debatable.

Autoregressive Generation#

Sequence Generation Considerations

Another drawback of Linformer is that the operations $\boldsymbol{E}\boldsymbol{K},\boldsymbol{F}\boldsymbol{V}$ directly "blend" information from the entire sequence, making it difficult to simply mask future information (Causal Masking). Consequently, it cannot perform language modeling, Seq2Seq, and other autoregressive generation tasks, which explains why the original authors only tested MLM tasks. In contrast, the linear attention methods introduced in this article can all achieve this. Taking equations (3) and (4) as examples, if we want to mask future information, we only need to change the summation $\sum\limits_{j=1}^n$ to $\sum\limits_{j=1}^i$:

(9) \[ \text{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})_i = \frac{\sum\limits_{j=1}^i \left(\phi(\boldsymbol{q}_i)^{\top} \varphi(\boldsymbol{k}_j)\right)\boldsymbol{v}_j}{\sum\limits_{j=1}^i \phi(\boldsymbol{q}_i)^{\top} \varphi(\boldsymbol{k}_j)}=\frac{ \phi(\boldsymbol{q}_i)^{\top} \sum\limits_{j=1}^i\varphi(\boldsymbol{k}_j)\boldsymbol{v}_j^{\top}}{ \phi(\boldsymbol{q}_i)^{\top} \sum\limits_{j=1}^i\varphi(\boldsymbol{k}_j)} \]

There are two ways to implement this equation. The first method involves setting $\boldsymbol{S}_i=\sum\limits_{j=1}^i\varphi(\boldsymbol{k}_j)\boldsymbol{v}_j^{\top}$ and $\boldsymbol{z}_i=\sum\limits_{j=1}^i\varphi(\boldsymbol{k}_j)$:

(10) \[ \text{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})_i =\frac{ \phi(\boldsymbol{q}_i)^{\top} \boldsymbol{S}_i}{ \phi(\boldsymbol{q}_i)^{\top} \boldsymbol{z}_i},\quad \begin{aligned}&\boldsymbol{S}_i=\boldsymbol{S}_{i-1}+\varphi(\boldsymbol{k}_i)\boldsymbol{v}_i^{\top}\\ &\boldsymbol{z}_i=\boldsymbol{z}_{i-1}+\varphi(\boldsymbol{k}_i) \end{aligned} \]

This indicates that such attention can be implemented recursively as an RNN model, with minimal space complexity but serial computation, making it well suited to decoding at prediction time. The second method involves directly computing the position-wise outer products of $\varphi(\boldsymbol{K}),\boldsymbol{V}\in\mathbb{R}^{n\times d}$, yielding an $n\times d\times d$ tensor, and then applying $\text{cumsum}$ along the $n$ dimension, thus obtaining $\boldsymbol{S}_1,\boldsymbol{S}_2,\dots,\boldsymbol{S}_n$ in one go. This approach is fastest but consumes the most memory, making it suitable for training. However, we often have $d^2\gg n$, in which case the memory required during training is prohibitive, so in practice the RNN form is often preferred.
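To make the recursion in equation (10) concrete, here is my own step-by-step numpy sketch of the RNN form (written for clarity rather than speed, with $\phi=\varphi=\text{elu}(x)+1$ as before), cross-checked against the masked quadratic form of equation (9); it is not the paper's optimized implementation:

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # strictly positive feature map

def causal_linear_attention(Q, K, V):
    n, d = Q.shape
    phi_Q, phi_K = elu_plus_one(Q), elu_plus_one(K)
    S = np.zeros((d, V.shape[1]))      # S_i = sum_{j<=i} phi(k_j) v_j^T
    z = np.zeros(d)                    # z_i = sum_{j<=i} phi(k_j)
    out = np.empty_like(V)
    for i in range(n):                 # serial scan over positions, O(n d^2) total
        S += np.outer(phi_K[i], V[i])
        z += phi_K[i]
        out[i] = (phi_Q[i] @ S) / (phi_Q[i] @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))

# Cross-check against the masked O(n^2) form of Eq. (9): zero out weights with j > i.
W = np.tril(elu_plus_one(Q) @ elu_plus_one(K).T)
reference = (W @ V) / W.sum(axis=1, keepdims=True)
print(np.allclose(causal_linear_attention(Q, K, V), reference))  # True
```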

Downsampling Techniques#

Sequence Length Reduction

From the results, Linformer's $\boldsymbol{E}\boldsymbol{K}, \boldsymbol{F}\boldsymbol{V}$ essentially shorten (downsample) the sequence. The most straightforward method to shorten a sequence is Pooling, so I previously attempted to incorporate Pooling techniques into Transformers. Recent similar works include IBM's "PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination" and Google's "Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing". Besides Pooling, there are other downsampling techniques, such as using 1D convolutions with stride > 1. Based on this idea, perhaps we could replace the Position-Wise fully connected layers in FFN with 1D convolutions with stride > 1? In any case, many variations can be explored in this direction, but like Linformer, such blending makes autoregressive generation challenging.
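As a toy illustration of the downsampling idea (my own sketch, with an arbitrary choice of average pooling and stride), the following numpy snippet shortens $\boldsymbol{K},\boldsymbol{V}$ before attention, in the same spirit as Linformer's $\boldsymbol{E}\boldsymbol{K},\boldsymbol{F}\boldsymbol{V}$ but with a fixed rather than learned operator:

```python
import numpy as np

def avg_pool_1d(x, stride):
    # Strided average pooling along the sequence dimension (trims a ragged tail).
    n, d = x.shape
    n_trim = (n // stride) * stride
    return x[:n_trim].reshape(-1, stride, d).mean(axis=1)   # (n // stride, d)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, stride = 1024, 32, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

K_short, V_short = avg_pool_1d(K, stride), avg_pool_1d(V, stride)   # length n/stride
out = softmax(Q @ K_short.T) @ V_short                              # n x (n/stride) attention matrix
print(out.shape)   # (1024, 32)
```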

Conclusion#

Summary and Implications

This article has introduced several works that modify the structure of attention to reduce its computational complexity. The main idea is to remove the Softmax from standard attention, which lets the complexity of attention drop to the ideal $\mathcal{O}(n)$ level (Linear Attention). Compared with other efficiency-oriented modifications, this change not only reduces the complexity to $\mathcal{O}(n)$ but also retains all "token-token" attention relationships and preserves the possibility of autoregressive generation.

The removal of Softmax enables linear complexity while maintaining essential attention properties.

The key contributions of this exploration are:

  1. Complexity Reduction: Achieving $\mathcal{O}(n)$ complexity compared to the original $\mathcal{O}(n^2)$ of standard attention.
  2. Generalized Formulation: Proposing a generalized attention definition that allows for various similarity functions beyond exponential dot products.
  3. Practical Implementations: Presenting multiple practical approaches including kernel methods, dimension-wise normalization, and normalized inner products.
  4. Comparative Analysis: Contrasting linear attention with other efficiency-focused approaches like sparse attention, Reformer, and Linformer.
  5. Autoregressive Compatibility: Maintaining the ability to perform causal masking for autoregressive generation tasks.

These findings suggest that the Softmax operation, while integral to standard attention formulations, may not be strictly necessary for achieving effective attention mechanisms. The exploration of linear attention opens new avenues for efficient Transformer architectures capable of handling longer sequences while preserving modeling capabilities.

Future research directions include empirical validation of these linear attention variants on diverse NLP tasks, investigation of hybrid approaches combining different efficiency techniques, and exploration of theoretical guarantees for approximation quality compared to standard attention.

Citation Information

Original Article: Su Jianlin. Exploring Linear Attention: Must Attention Have a Softmax? Scientific Spaces.

How to cite this translation:

Su, J. Exploring Linear Attention: Must Attention Have a Softmax? [Translated by Academic Translation Team]. Scientific Spaces.

BibTeX:

@article{su2025exploring_linear_attention,
  title   = {Exploring Linear Attention: Must Attention Have a Softmax?},
  author  = {Su, Jianlin},
  journal = {Scientific Spaces},
  year    = {2025},
  url     = {https://kexue.fm/archives/7546},
  note    = {Translated by Academic Translation Team}
}