Problem Context

From the article "A Brief History of Linear Attention: From Imitation and Innovation to Back-Feeding", we observe that DeltaNet and subsequent linear attention models are fundamentally associated with the inverse matrix \((\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot\boldsymbol{M}^-)^{-1}\). This article specifically investigates efficient computation methods for inverting such triangular matrices characterized by "diagonal plus low-rank" structure.

Basic Results#

Problem Formalization

We formally define the problem as follows:

Given matrices \(\boldsymbol{Q},\boldsymbol{K}\in\mathbb{R}^{n\times d}\) and a diagonal matrix \(\boldsymbol{\Lambda}\in\mathbb{R}^{n\times n}\), with \(n\gg d\), define \[ \boldsymbol{T} = \boldsymbol{\Lambda} + \boldsymbol{Q}\boldsymbol{K}^{\top}\odot\boldsymbol{M}^- \] where \(\boldsymbol{M}^-=\boldsymbol{M} - \boldsymbol{I}\), and matrix \(\boldsymbol{M}\) is defined as \[ M_{i,j} = \left\{\begin{aligned} &1, &i \geq j \\ &0, &i < j\end{aligned}\right. \] We seek to compute the inverse matrix \(\boldsymbol{T}^{-1}\) and to show that it can be computed in \(\mathcal{O}(n^2)\) time.
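For concreteness, here is a minimal NumPy sketch of these definitions; the sizes and the particular diagonal \(\boldsymbol{\Lambda}\) are chosen purely for illustration and are not part of the original statement.

import numpy as np

# Illustrative sizes only; in practice n >> d
n, d = 6, 2
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
Lam = np.diag(np.random.rand(n) + 1.0)   # some diagonal matrix Lambda

M = np.tril(np.ones((n, n)))             # M[i, j] = 1 if i >= j else 0
M_minus = M - np.eye(n)                  # strictly lower triangular mask M^-
T = Lam + (Q @ K.T) * M_minus            # the matrix to be inverted

# T is lower triangular with diagonal Lambda, hence invertible whenever Lambda is
print(np.allclose(T, np.tril(T)))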

First, note that without the strictly lower triangular mask \(\odot\boldsymbol{M}^-\), the inverse could be computed directly via the "Woodbury matrix identity":

(1) \[ (\boldsymbol{\Lambda} + \boldsymbol{Q}\boldsymbol{K}^{\top})^{-1} = \boldsymbol{\Lambda}^{-1} - \boldsymbol{\Lambda}^{-1} \boldsymbol{Q}(\boldsymbol{I} + \boldsymbol{K}^{\top}\boldsymbol{\Lambda}^{-1}\boldsymbol{Q})^{-1}\boldsymbol{K}^{\top}\boldsymbol{\Lambda}^{-1} \]
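As a quick sanity check, identity (1) is easy to verify numerically; the following is a minimal sketch, with sizes and scaling chosen only for illustration.

import numpy as np

n, d = 500, 16
Q = np.random.randn(n, d) / d**0.5
K = np.random.randn(n, d) / d**0.5
Lam = np.random.rand(n) + 1.0                  # diagonal of Lambda, stored as a vector

lhs = np.linalg.inv(np.diag(Lam) + Q @ K.T)    # direct O(n^3) inverse

# Right-hand side of (1): only a d x d matrix needs to be inverted
QL = Q / Lam[:, None]                          # Lambda^{-1} Q
small = np.linalg.inv(np.eye(d) + K.T @ QL)    # (I + K^T Lambda^{-1} Q)^{-1}
rhs = np.diag(1.0 / Lam) - QL @ small @ (K / Lam[:, None]).T

print(np.allclose(lhs, rhs))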

It is easy to check that evaluating the right-hand side costs only \(\mathcal{O}(n^2)\) when \(d\) is treated as a constant: only a \(d\times d\) matrix needs to be inverted, and the remaining products cost at most \(\mathcal{O}(n^2 d)\). However, once \(\odot\boldsymbol{M}^-\) is applied, \(\boldsymbol{T}\) itself is no longer exactly "diagonal plus low-rank", so the identity cannot be used directly. Since \(\boldsymbol{T}\) is lower triangular, a natural fallback is recursion, based on the block matrix identity:

(2) \[ \begin{bmatrix}\boldsymbol{A} & \boldsymbol{0} \\ \boldsymbol{C} & \boldsymbol{B}\end{bmatrix}^{-1} = \begin{bmatrix}\boldsymbol{A}^{-1} & \boldsymbol{0} \\ -\boldsymbol{B}^{-1}\boldsymbol{C}\boldsymbol{A}^{-1} & \boldsymbol{B}^{-1}\end{bmatrix} \]

This allows us to write \(\boldsymbol{T}^{-1}\) in recursive form (convention: slicing binds more tightly than inversion, so \(\boldsymbol{T}_{[:l,:l]}^{-1}\) means \((\boldsymbol{T}_{[:l,:l]})^{-1}\)):

(3) \[ \boldsymbol{T}_{[:l+1,:l+1]}^{-1} = \begin{bmatrix}\boldsymbol{T}_{[:l,:l]}^{-1} & \boldsymbol{0} \\ -\boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\boldsymbol{T}_{[l:l+1,:l]}\boldsymbol{T}_{[:l,:l]}^{-1} & \boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\end{bmatrix} \]

The dominant computation is \(\boldsymbol{T}_{[l:l+1,:l]}\boldsymbol{T}_{[:l,:l]}^{-1}\), a product of a \(1\times l\) matrix with an \(l\times l\) matrix, which costs \(\mathcal{O}(l^2)\). The cost of step \(l\) therefore grows quadratically with \(l\), giving a total complexity of \(\mathcal{O}(n^3)\).
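For reference, here is a minimal sketch of this \(\mathcal{O}(n^3)\) baseline, applying recursion (3) one row at a time; it assumes \(\boldsymbol{\Lambda}=\boldsymbol{I}\), the same setting as the test code given later.

import numpy as np

n, d = 400, 32
Q = np.random.randn(n, d) / d**0.5
K = np.random.randn(n, d) / d**0.5
T = np.tril(Q @ K.T, -1) + np.eye(n)     # Lambda = I

Tinv = np.zeros((n, n))
for l in range(n):
    # New diagonal entry and new row of the inverse, per identity (3)
    Tinv[l, l] = 1.0 / T[l, l]
    Tinv[l, :l] = -Tinv[l, l] * (T[l, :l] @ Tinv[:l, :l])   # costs O(l^2) at step l

print(np.allclose(Tinv @ T, np.eye(n)))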

Low-Rank Structure#

Leveraging Low-Rank Structure

Of course, this is because we have not yet used the low-rank structure of \(\boldsymbol{T}\) (i.e. of \(\boldsymbol{Q}\boldsymbol{K}^{\top}\) before the \(\odot\boldsymbol{M}^-\) mask). Incorporating this structure, we obtain \(\boldsymbol{T}_{[l:l+1,:l]} = \boldsymbol{Q}_{[l:l+1]}\boldsymbol{K}_{[:l]}^{\top}\). Substituting into the above equation yields:

(4) \[ \boldsymbol{T}_{[:l+1,:l+1]}^{-1} = \begin{bmatrix}\boldsymbol{T}_{[:l,:l]}^{-1} & \boldsymbol{0} \\ -\boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\boldsymbol{Q}_{[l:l+1]}\boldsymbol{K}_{[:l]}^{\top}\boldsymbol{T}_{[:l,:l]}^{-1} & \boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\end{bmatrix} \]

Note that \(\boldsymbol{K}_{[:l]}^{\top}\boldsymbol{T}_{[:l,:l]}^{-1}\in\mathbb{R}^{d\times l}\). If we can treat it as a recursive variable, then each iteration's complexity reduces to \(\mathcal{O}(l)\), successfully lowering total complexity to \(\mathcal{O}(n^2)\). Following this approach, we have:

(5) \[ \begin{aligned} \boldsymbol{K}_{[:l+1]}^{\top}\boldsymbol{T}_{[:l+1,:l+1]}^{-1} =&\, \begin{bmatrix}\boldsymbol{K}_{[:l]}^{\top} & \boldsymbol{K}_{[l:l+1]}^{\top}\end{bmatrix}\begin{bmatrix}\boldsymbol{T}_{[:l,:l]}^{-1} & \boldsymbol{0} \\ -\boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\boldsymbol{Q}_{[l:l+1]}\boldsymbol{K}_{[:l]}^{\top}\boldsymbol{T}_{[:l,:l]}^{-1} & \boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\end{bmatrix} \\[6pt] =&\, \begin{bmatrix}\boldsymbol{K}_{[:l]}^{\top}\boldsymbol{T}_{[:l,:l]}^{-1} & \boldsymbol{0}\end{bmatrix} + \boldsymbol{K}_{[l:l+1]}^{\top}\underbrace{\begin{bmatrix}-\boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\boldsymbol{Q}_{[l:l+1]}\boldsymbol{K}_{[:l]}^{\top}\boldsymbol{T}_{[:l,:l]}^{-1} & \boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\end{bmatrix}}_{\text{which is exactly }(\boldsymbol{T}^{-1})_{[l:l+1,:l+1]}} \end{aligned} \]

This recursive update likewise avoids \(\mathcal{O}(l^2)\) operations, so the approach is workable: we only need to introduce one extra variable that caches \(\boldsymbol{K}_{[:l]}^{\top}\boldsymbol{T}_{[:l,:l]}^{-1}\). Replacing \(l+1\) with \(l+c\) gives the chunked recursive form.

Complexity Analysis

The key insight is that by maintaining the intermediate result \(\boldsymbol{K}_{[:l]}^{\top}\boldsymbol{T}_{[:l,:l]}^{-1}\in\mathbb{R}^{d\times l}\) as a recursive variable, we never again multiply against the full \(l\times l\) block \(\boldsymbol{T}_{[:l,:l]}^{-1}\). With chunk size \(c\), each step only performs:

  • \(\mathcal{O}(c^3)\) operations for inverting the \(c \times c\) diagonal block
  • \(\mathcal{O}(c^2 d + c d l)\) operations for forming the new rows of the inverse from the cached \(d \times l\) variable
  • \(\mathcal{O}(c d l)\) operations for updating the cached variable

Since \(d\) and \(c\) are constants independent of \(n\), this works out to \(\mathcal{O}(n)\) per row of the inverse on average, i.e. \(\mathcal{O}(n^2)\) for the full inverse, which is optimal up to constants because \(\boldsymbol{T}^{-1}\) itself has \(n^2\) entries. The explicit count follows.
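Explicitly, writing \(l = kc\) for the starting row of the \(k\)-th chunk and summing over the \(n/c\) chunks:

\[ \sum_{k=0}^{n/c-1}\mathcal{O}\big(c^3 + c^2 d + c d\cdot kc\big) = \mathcal{O}\big(n c^2 + n c d + \tfrac{1}{2}d n^2\big) = \mathcal{O}(n^2) \]

for fixed \(c\) and \(d\), with the \(d n^2\) term dominating when \(n\gg c,d\).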

Implementation and Verification

Test code for the full inverse computation:

Python Implementation
import numpy as np

n, d, c = 1000, 100, 200
Q = np.random.randn(n, d) / d**0.5
K = np.random.randn(n, d) / d**0.5
T = np.tril(Q @ K.T, -1) + np.eye(n)  # Lambda = I: strictly lower part of QK^T plus identity

# Y will hold T^{-1}; Z caches K_{[:l]}^T @ T_{[:l,:l]}^{-1} (a d x l slice of a d x n buffer)
Y, Z = np.zeros((n, n)), np.zeros((d, n))

# Chunked recursive computation
for l in range(0, n, c):
    # Invert diagonal block
    Y[l:l + c, l:l + c] = np.linalg.inv(T[l:l + c, l:l + c])
    
    # Compute off-diagonal blocks using cached variable
    Y[l:l + c, :l] = - Y[l:l + c, l:l + c] @ Q[l:l + c] @ Z[:, :l]
    
    # Update cached variable
    Z[:, :l + c] += K[l:l + c].T @ Y[l:l + c, :l + c]

# Verification
print("Verification result:", np.allclose(Y @ T, np.eye(n)))

This implementation demonstrates the \(\mathcal{O}(n^2)\) complexity algorithm for computing the full inverse matrix \(\boldsymbol{T}^{-1}\). The chunk size \(c\) provides a tunable parameter balancing memory access patterns and computational efficiency.
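Since the diagonal blocks here are unit lower triangular (because \(\boldsymbol{\Lambda}=\boldsymbol{I}\) in this test), a possible variant, not part of the original code, is to replace the general-purpose np.linalg.inv with a triangular solve. Below is a minimal sketch of that variant, assuming SciPy is available:

import numpy as np
from scipy.linalg import solve_triangular

n, d, c = 1000, 100, 200
Q = np.random.randn(n, d) / d**0.5
K = np.random.randn(n, d) / d**0.5
T = np.tril(Q @ K.T, -1) + np.eye(n)

Y, Z = np.zeros((n, n)), np.zeros((d, n))
I_c = np.eye(c)

for l in range(0, n, c):
    Tb = T[l:l + c, l:l + c]                 # unit lower triangular block (Lambda = I)
    Y[l:l + c, l:l + c] = solve_triangular(Tb, I_c, lower=True, unit_diagonal=True)
    if l > 0:
        rhs = Q[l:l + c] @ Z[:, :l]
        Y[l:l + c, :l] = -solve_triangular(Tb, rhs, lower=True, unit_diagonal=True)
    Z[:, :l + c] += K[l:l + c].T @ Y[l:l + c, :l + c]

print("Verification result:", np.allclose(Y @ T, np.eye(n)))

Whether this is faster than np.linalg.inv depends on the chunk size and the underlying LAPACK; the point is only that the triangular, unit-diagonal structure of the blocks need not be thrown away.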

Multiplication Computation#

Computing \(\boldsymbol{T}^{-1}\boldsymbol{V}\)

Using the same approach, we can also demonstrate:

For any matrix \(\boldsymbol{V}\in\mathbb{R}^{n\times d}\), computing \(\boldsymbol{T}^{-1}\boldsymbol{V}\) requires only \(\mathcal{O}(n)\) complexity.

The proof requires only slight modifications to the previous process. First, we have:

(6) \[ \begin{aligned} (\boldsymbol{T}^{-1}\boldsymbol{V})_{[:l+1]} =&\, \boldsymbol{T}_{[:l+1,:l+1]}^{-1}\boldsymbol{V}_{[:l+1]} \\[6pt] =&\, \begin{bmatrix}\boldsymbol{T}_{[:l,:l]}^{-1} & \boldsymbol{0} \\ -\boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\boldsymbol{Q}_{[l:l+1]}\boldsymbol{K}_{[:l]}^{\top}\boldsymbol{T}_{[:l,:l]}^{-1} & \boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\end{bmatrix}\begin{bmatrix}\boldsymbol{V}_{[:l]} \\ \boldsymbol{V}_{[l:l+1]}\end{bmatrix} \\[6pt] =&\, \begin{bmatrix}\boldsymbol{T}_{[:l,:l]}^{-1}\boldsymbol{V}_{[:l]} \\ -\boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\boldsymbol{Q}_{[l:l+1]}\boldsymbol{K}_{[:l]}^{\top}\boldsymbol{T}_{[:l,:l]}^{-1}\boldsymbol{V}_{[:l]} + \boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}\boldsymbol{V}_{[l:l+1]}\end{bmatrix} \\[6pt] =&\, \begin{bmatrix}(\boldsymbol{T}^{-1}\boldsymbol{V})_{[:l]} \\ \boldsymbol{T}_{[l:l+1,l:l+1]}^{-1}(\boldsymbol{V}_{[l:l+1]} - \boldsymbol{Q}_{[l:l+1]}\boldsymbol{K}_{[:l]}^{\top}(\boldsymbol{T}^{-1}\boldsymbol{V})_{[:l]})\end{bmatrix} \end{aligned} \]

Then:

(7) \[ \begin{aligned} \boldsymbol{K}_{[:l+1]}^{\top}(\boldsymbol{T}^{-1}\boldsymbol{V})_{[:l+1]} =&\, \begin{bmatrix}\boldsymbol{K}_{[:l]}^{\top} & \boldsymbol{K}_{[l:l+1]}^{\top}\end{bmatrix}\begin{bmatrix}(\boldsymbol{T}^{-1}\boldsymbol{V})_{[:l]} \\ (\boldsymbol{T}^{-1}\boldsymbol{V})_{[l:l+1]} \end{bmatrix} \\[8pt] =&\,\boldsymbol{K}_{[:l]}^{\top}(\boldsymbol{T}^{-1}\boldsymbol{V})_{[:l]} + \boldsymbol{K}_{[l:l+1]}^{\top}(\boldsymbol{T}^{-1}\boldsymbol{V})_{[l:l+1]} \end{aligned} \]

Therefore, caching \(\boldsymbol{K}_{[:l]}^{\top}(\boldsymbol{T}^{-1}\boldsymbol{V})_{[:l]}\in\mathbb{R}^{d\times d}\) suffices to make each step's cost independent of \(l\), giving a total complexity of \(\mathcal{O}(n)\). As before, replacing \(l+1\) with \(l+c\) gives the chunked form.
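Before chunking, recursions (6) and (7) can already be read off as a per-row update. Here is a minimal sketch of that update, assuming \(\boldsymbol{\Lambda}=\boldsymbol{I}\) as in the test code below, so that each \(1\times 1\) diagonal block is simply the scalar \(1\):

import numpy as np

n, d = 300, 32
Q = np.random.randn(n, d) / d**0.5
K = np.random.randn(n, d) / d**0.5
V = np.random.randn(n, d)
T = np.tril(Q @ K.T, -1) + np.eye(n)

Y = np.zeros((n, d))   # accumulates (T^{-1} V), one row per step
Z = np.zeros((d, d))   # cached K_{[:l]}^T @ (T^{-1} V)_{[:l]}
for l in range(n):
    Y[l] = V[l] - Q[l] @ Z       # eq. (6): the 1x1 diagonal block equals 1, so no division needed
    Z += np.outer(K[l], Y[l])    # eq. (7): rank-1 update of the d x d state

print(np.allclose(T @ Y, V))

This per-row form is exactly what the chunked code below vectorizes over blocks of \(c\) rows.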

Complexity Advantage

The \(\mathcal{O}(n)\) complexity for computing \(\boldsymbol{T}^{-1}\boldsymbol{V}\) represents a significant improvement over the \(\mathcal{O}(n^3)\) naive approach or even the \(\mathcal{O}(n^2)\) full inverse computation. This efficiency is particularly valuable in linear attention applications where we typically need \(\boldsymbol{T}^{-1}\boldsymbol{V}\) rather than the full inverse matrix.

The key optimization lies in maintaining the small \(d \times d\) matrix \(\boldsymbol{K}_{[:l]}^{\top}(\boldsymbol{T}^{-1}\boldsymbol{V})_{[:l]}\) as state, avoiding operations that scale with the growing dimension \(l\). Each chunk operation involves:

  • \(\mathcal{O}(c^3)\) for inverting the \(c \times c\) diagonal block
  • \(\mathcal{O}(c^2 d)\) for applying the inverted block to the current chunk
  • \(\mathcal{O}(c d^2)\) for reading and updating the \(d \times d\) state

With \(n/c\) chunks, the total complexity is \(\mathcal{O}(n)\) for fixed \(c\) and \(d\).

Implementation for Matrix Multiplication

Test code for computing \(\boldsymbol{T}^{-1}\boldsymbol{V}\):

Python Implementation
import numpy as np

n, d, c = 1000, 100, 200
Q = np.random.randn(n, d) / d**0.5
K = np.random.randn(n, d) / d**0.5
V = np.random.randn(n, d) / d**0.5
T = np.tril(Q @ K.T, -1) + np.eye(n)  # Lambda = I: strictly lower part of QK^T plus identity

# Y will hold T^{-1} V; Z caches K_{[:l]}^T @ (T^{-1} V)_{[:l]} (a d x d state)
Y, Z = np.zeros((n, d)), np.zeros((d, d))

# Chunked recursive computation
for l in range(0, n, c):
    # Invert diagonal block
    X = np.linalg.inv(T[l:l + c, l:l + c])
    
    # Compute current chunk using cached state
    Y[l:l + c] = X @ (V[l:l + c] - Q[l:l + c] @ Z)
    
    # Update cached state
    Z += K[l:l + c].T @ Y[l:l + c]

# Verification
print("Verification result:", np.allclose(T @ Y, V))

This implementation efficiently computes \(\boldsymbol{T}^{-1}\boldsymbol{V}\) with \(\mathcal{O}(n)\) complexity, making it highly practical for large-scale applications in linear attention models where such computations are frequent.
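As a simple sanity check of the savings, the chunked computation can be compared against a direct dense solve of \(\boldsymbol{T}\boldsymbol{Y}=\boldsymbol{V}\). The sketch below is illustrative only: the sizes are arbitrary and the timings will depend on hardware and the BLAS backend.

import time
import numpy as np

n, d, c = 2000, 100, 200
Q = np.random.randn(n, d) / d**0.5
K = np.random.randn(n, d) / d**0.5
V = np.random.randn(n, d)
T = np.tril(Q @ K.T, -1) + np.eye(n)

t0 = time.time()
Y, Z = np.zeros((n, d)), np.zeros((d, d))
for l in range(0, n, c):
    X = np.linalg.inv(T[l:l + c, l:l + c])
    Y[l:l + c] = X @ (V[l:l + c] - Q[l:l + c] @ Z)
    Z += K[l:l + c].T @ Y[l:l + c]
t1 = time.time()

Y_ref = np.linalg.solve(T, V)   # direct O(n^3) dense solve as reference
t2 = time.time()

print("max abs difference:", np.abs(Y - Y_ref).max())
print("chunked: %.3fs, dense solve: %.3fs" % (t1 - t0, t2 - t1))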

Article Summary#

Summary and Implications

This article examines efficient inversion methods for triangular matrices with "diagonal plus low-rank" structure, which commonly appear in modern linear attention models. The key contributions are:

  1. Full Inverse Computation: We presented an \(\mathcal{O}(n^2)\) algorithm for computing the complete inverse \(\boldsymbol{T}^{-1}\) by leveraging the low-rank structure and maintaining appropriate cached variables during recursive computation.
  2. Matrix Multiplication Optimization: We demonstrated that computing \(\boldsymbol{T}^{-1}\boldsymbol{V}\) can be achieved with \(\mathcal{O}(n)\) complexity, a significant improvement relevant for practical applications.
  3. Chunked Implementation: Both algorithms can be implemented in chunked formats, providing flexibility for memory management and parallelization opportunities.
  4. Practical Verification: Python implementations validate the correctness and efficiency of the proposed methods.

These results have important implications for linear attention architectures like DeltaNet, where such matrix inversions are fundamental operations. The \(\mathcal{O}(n^2)\) and \(\mathcal{O}(n)\) complexities represent substantial improvements over naive \(\mathcal{O}(n^3)\) approaches, making these models more computationally feasible for long-sequence applications.

The techniques developed here—leveraging structural properties, maintaining appropriate cached state, and employing chunked recursion—provide a template for optimizing other structured matrix computations in machine learning and numerical linear algebra.

Citation Information

Original Article: Su Jianlin. Efficient Inversion Methods for "Diagonal + Low-Rank" Triangular Matrices. Scientific Spaces.

How to cite this translation:

Su, J. Efficient Inversion Methods for "Diagonal + Low-Rank" Triangular Matrices [Translated by Juanxi Tian]. Scientific Spaces.

BibTeX:

@article{su2025efficient_inversion_diagonal_low_rank,
  title   = {Efficient Inversion Methods for "Diagonal + Low-Rank" Triangular Matrices},
  author  = {Su, Jianlin},
  journal = {Scientific Spaces},
  year    = {2025},
  url     = {https://kexue.fm/archives/11218},
  note    = {Translated by Juanxi Tian (ScalingOpt Team)}
}