This article focuses on Interpolative Decomposition (ID), which can be understood as a structured low-rank factorization in which one factor consists of columns taken directly from the original matrix (or rows, if one prefers to work with rows). In other words, ID tries to identify a few key columns as a "skeleton" (also often called a "sketch") with which to approximate the original matrix.
Many readers may not have heard of ID, and even Wikipedia provides only vague descriptions (link). However, ID—like SVD—has long been built into SciPy (see scipy.linalg.interpolative), which testifies to its practical value.
Basic Definition#
The first three articles introduced the pseudo-inverse, SVD, and CR approximation, respectively. All of them can be viewed as seeking low-rank approximations with specific structures:
$$\mathop{\arg\min}_{\operatorname{rank}(\tilde{\boldsymbol{M}})\leq r}\left\Vert \tilde{\boldsymbol{M}} - \boldsymbol{M}\right\Vert_F^2$$
where $\boldsymbol{M}\in\mathbb{R}^{n\times m}$. When no additional constraints are imposed, the optimal solution is given by the SVD. When we specify $\tilde{\boldsymbol{M}}=\boldsymbol{A}\boldsymbol{B}$ and fix one of $\boldsymbol{A},\boldsymbol{B}$, the optimal choice of the other factor is given by the pseudo-inverse. If we require $\boldsymbol{M}=\boldsymbol{X}\boldsymbol{Y}$ and $\tilde{\boldsymbol{M}}=\boldsymbol{X}_{[:, S]}\boldsymbol{Y}_{[S,:]}$, we arrive at the problem addressed by CR approximation.
CR approximation constructs a low-rank approximation by selecting rows/columns from the original matrix. This makes the approximation results more interpretable and also suitable for some nonlinear scenarios. However, CR approximation presupposes that matrix $\boldsymbol{M}$ originates from the product of two matrices, with its original purpose being to reduce matrix computation cost. For scenarios where matrix $\boldsymbol{M}$ is given directly, a similar low-rank approximation is provided by Interpolative Decomposition (ID).
Specifically, in ID we have $\tilde{\boldsymbol{M}}=\boldsymbol{C}\boldsymbol{Z}$, where $\boldsymbol{C}=\boldsymbol{M}_{[:,S]}$ consists of several columns of $\boldsymbol{M}$ and $\boldsymbol{Z}$ is arbitrary. That is, we use several columns of $\boldsymbol{M}$ as a skeleton to approximate $\boldsymbol{M}$ itself:
$$\mathop{\arg\min}_{S,\,\boldsymbol{Z}}\left\Vert \boldsymbol{M}_{[:,S]}\boldsymbol{Z} - \boldsymbol{M}\right\Vert_F^2$$
According to the results of "The Path to Low-Rank Approximation (Part 1): Pseudo-Inverse", once $\boldsymbol{C}$ is determined, the optimal $\boldsymbol{Z}$ is $\boldsymbol{C}^{\dagger} \boldsymbol{M}$. The real difficulty of ID therefore lies only in optimizing $S$, i.e., in column selection. This is a combinatorial optimization problem whose exact solution is NP-Hard, so the primary focus is on finding approximation algorithms with a suitable balance of efficiency and accuracy.
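To make the setup concrete, here is a minimal NumPy sketch (the matrix, the index set $S$, and the rank are all made-up illustrative choices, not taken from the original article): once a column selection is fixed, the optimal coefficient matrix is just the pseudo-inverse solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 100, 50, 8
# an approximately rank-r matrix plus a little noise (illustrative test data)
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m)) \
    + 1e-3 * rng.standard_normal((n, m))

S = rng.choice(m, size=r, replace=False)  # an arbitrary (not optimized) column selection
C = M[:, S]                               # skeleton: r columns of M
Z = np.linalg.pinv(C) @ M                 # optimal Z once C is fixed
print(np.linalg.norm(M - C @ Z) / np.linalg.norm(M))  # relative approximation error
```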
Geometric Interpretation#
Before attempting to solve ID, let's look more closely at its geometric interpretation, which will help us better grasp its application scenarios and solution approaches. First, write $\boldsymbol{C}$ in column-vector form: $\boldsymbol{C}=(\boldsymbol{c}_1,\boldsymbol{c}_2,\cdots,\boldsymbol{c}_r)$. Then for any column vector $\boldsymbol{z}=(z_1,z_2,\cdots,z_r)^{\top}$, we have:
$$\boldsymbol{C}\boldsymbol{z} = z_1\boldsymbol{c}_1 + z_2\boldsymbol{c}_2 + \cdots + z_r\boldsymbol{c}_r$$
Thus the geometric meaning of $\boldsymbol{C}\boldsymbol{z}$ is a linear combination of the column vectors of $\boldsymbol{C}$. Note that $\boldsymbol{c}_1,\boldsymbol{c}_2,\cdots,\boldsymbol{c}_r$ are selected from $\boldsymbol{M}=(\boldsymbol{m}_1,\boldsymbol{m}_2,\cdots,\boldsymbol{m}_m)$. ID therefore says that we select several columns as (approximate) basis vectors and represent all remaining columns as linear combinations of this basis. This is the meaning of the "I" (Interpolative) in ID.
Strictly speaking, "interpolative" refers to interpolation. To better highlight this interpolative character, some of the literature adds the condition $|z_{i,j}| \leq 1$ (where $z_{i,j}$ denotes any element of matrix $\boldsymbol{Z}$). This condition is actually quite restrictive: enforcing it exactly is likely also NP-Hard, so many works relax it to $|z_{i,j}| \leq 2$, a bound that most approximation algorithms satisfy in practice. If there are no other requirements and we only care about minimizing the approximation error, this restriction can be dropped altogether.
QR Decomposition#
ID solution algorithms fall into two main categories: deterministic algorithms and randomized algorithms. Deterministic algorithms typically involve higher computational cost but often yield better approximation quality, whereas randomized algorithms offer higher efficiency at the price of slightly lower accuracy. Note that both are merely approximation algorithms with acceptable practical performance; neither rules out extreme cases in which it fails completely.
The first algorithm, generally considered the standard one, is based on QR decomposition, more precisely column-pivoting QR decomposition (rendered in the Chinese original as "列主元QR分解", though the author feels "列驱QR分解" would be a more intuitive name). It is a deterministic algorithm. Why is ID associated with QR decomposition? We can understand this by examining how $\boldsymbol{Z}$ is found.
As mentioned earlier, if $\boldsymbol{C}$ is already given, the optimal solution for $\boldsymbol{Z}$ is $\boldsymbol{C}^{\dagger}\boldsymbol{M}$. This answer is correct but not very intuitive. Without loss of generality, assume $\boldsymbol{c}_1,\boldsymbol{c}_2,\cdots,\boldsymbol{c}_r$ are linearly independent. From a geometric perspective, finding the optimal approximation in the form $\boldsymbol{C}\boldsymbol{Z}$ essentially involves projecting each column vector of $\boldsymbol{M}$ onto the $r$-dimensional subspace spanned by $\boldsymbol{c}_1,\boldsymbol{c}_2,\cdots,\boldsymbol{c}_r$. To compute this projection result, we can first perform Gram-Schmidt orthogonalization on $\boldsymbol{c}_1,\boldsymbol{c}_2,\cdots,\boldsymbol{c}_r$ to transform them into an orthonormal basis. Projection onto an orthonormal basis becomes much simpler, and the orthogonalization process naturally corresponds to QR decomposition.
Gram-Schmidt orthogonalization recursively executes the following steps:
$$\boldsymbol{q}_i = \frac{\boldsymbol{c}_i - \sum_{j=1}^{i-1}(\boldsymbol{q}_j^{\top}\boldsymbol{c}_i)\,\boldsymbol{q}_j}{\left\Vert \boldsymbol{c}_i - \sum_{j=1}^{i-1}(\boldsymbol{q}_j^{\top}\boldsymbol{c}_i)\,\boldsymbol{q}_j\right\Vert},\qquad i=1,2,\cdots,r$$
The result expresses $\boldsymbol{C}$ as:
$$\boldsymbol{C} = \boldsymbol{Q}\boldsymbol{R},\qquad \boldsymbol{Q}=(\boldsymbol{q}_1,\boldsymbol{q}_2,\cdots,\boldsymbol{q}_r),\quad R_{j,i}=\boldsymbol{q}_j^{\top}\boldsymbol{c}_i\ (j\leq i)$$
where $\boldsymbol{Q}$ has orthonormal columns and $\boldsymbol{R}$ is upper triangular.
With $\boldsymbol{q}_1,\boldsymbol{q}_2,\cdots,\boldsymbol{q}_r$ in hand, the optimal approximation of the $k$-th column $\boldsymbol{m}_k$ of $\boldsymbol{M}$ within the column space of $\boldsymbol{C}$, and the corresponding error, are respectively:
$$\sum_{i=1}^{r}(\boldsymbol{q}_i^{\top}\boldsymbol{m}_k)\,\boldsymbol{q}_i,\qquad \left\Vert \boldsymbol{m}_k - \sum_{i=1}^{r}(\boldsymbol{q}_i^{\top}\boldsymbol{m}_k)\,\boldsymbol{q}_i\right\Vert$$
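As a quick numerical check of this picture (a sketch with made-up dimensions and column indices, assuming NumPy), projecting onto the orthonormal basis obtained from a QR decomposition of $\boldsymbol{C}$ reproduces the pseudo-inverse solution:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((100, 50))
C = M[:, [3, 10, 42]]                      # some fixed skeleton columns (illustrative)

Q, R = np.linalg.qr(C)                     # Gram-Schmidt in matrix form (reduced QR)
proj = Q @ (Q.T @ M)                       # project every column of M onto span(C)
errors = np.linalg.norm(M - proj, axis=0)  # per-column residual norms

# the projection agrees with the pseudo-inverse solution C @ (C^+ M)
assert np.allclose(proj, C @ (np.linalg.pinv(C) @ M))
```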
Column-Pivoting QR#
Of course, the above results are obtained under the premise that $\boldsymbol{C}$ is known. How do we select a relatively good set of $r$ columns from $\boldsymbol{M}$ to form $\boldsymbol{C}$? Column-pivoting QR decomposition provides one reference answer.
Generally, if we perform Gram-Schmidt orthogonalization on $\boldsymbol{m}_1,\boldsymbol{m}_2,\cdots,\boldsymbol{m}_m$, we proceed in the given order: first $\boldsymbol{m}_1$, then $\boldsymbol{m}_2,\boldsymbol{m}_3,\cdots$. Column-pivoting QR decomposition instead reorders the orthogonalization according to norm magnitude. Written in formulas, at step $i$ it picks
$$k_i = \mathop{\arg\max}_{k\notin\{k_1,\cdots,k_{i-1}\}}\left\Vert \boldsymbol{m}_k - \sum_{j=1}^{i-1}(\boldsymbol{q}_j^{\top}\boldsymbol{m}_k)\,\boldsymbol{q}_j\right\Vert,\qquad \boldsymbol{q}_i = \frac{\boldsymbol{m}_{k_i} - \sum_{j=1}^{i-1}(\boldsymbol{q}_j^{\top}\boldsymbol{m}_{k_i})\,\boldsymbol{q}_j}{\left\Vert \boldsymbol{m}_{k_i} - \sum_{j=1}^{i-1}(\boldsymbol{q}_j^{\top}\boldsymbol{m}_{k_i})\,\boldsymbol{q}_j\right\Vert}$$
In essence, column-pivoting QR decomposition selects, at each step, the column with the largest remaining error and orthonormalizes it. Apart from this change in processing order, it involves no computational differences from standard QR decomposition. The final form of column-pivoting QR decomposition can therefore be written as:
$$\boldsymbol{M}\boldsymbol{\Pi} = \boldsymbol{Q}\boldsymbol{R}$$
where $\boldsymbol{\Pi}$ is a column permutation matrix. Because each step selects the column with maximum error (norm), for any $k$ the first column of the submatrix $\boldsymbol{R}_{[k-1:,k-1:]}$ has the largest norm, no smaller than the norm of any of its remaining columns:
$$\left\Vert \boldsymbol{R}_{[k-1:,\,k-1]}\right\Vert \geq \left\Vert \boldsymbol{R}_{[k-1:,\,j]}\right\Vert,\qquad \forall\, j\geq k-1$$
From this we can also deduce $|R_{1,1}|\geq |R_{2,2}| \geq \cdots\geq |R_{m,m}|$. These properties lead us to believe that if we want a rank-$r$ approximation of $\boldsymbol{M}\boldsymbol{\Pi}$, retaining only the first $r$ rows of $\boldsymbol{R}$ should be a good choice:
$$\boldsymbol{M}\boldsymbol{\Pi}\approx \boldsymbol{Q}_{[:,:r]}\boldsymbol{R}_{[:r,:]} = \boldsymbol{Q}_{[:,:r]}\boldsymbol{R}_{[:r,:r]}\left(\boldsymbol{R}_{[:r,:r]}^{-1}\boldsymbol{R}_{[:r,:]}\right)$$
Recall our earlier convention that slicing has priority over inversion, so here $\boldsymbol{R}_{[:r,:r]}^{-1}$ means $(\boldsymbol{R}_{[:r,:r]})^{-1}$. It is not difficult to see that $\boldsymbol{Q}_{[:,:r]}\boldsymbol{R}_{[:r,:r]}$ is in fact made up of $r$ columns of $\boldsymbol{M}$. Thus the above expression actually provides an ID approximation:
$$\boldsymbol{M}\boldsymbol{\Pi}\approx \underbrace{\boldsymbol{Q}_{[:,:r]}\boldsymbol{R}_{[:r,:r]}}_{\boldsymbol{C}}\;\underbrace{\boldsymbol{R}_{[:r,:r]}^{-1}\boldsymbol{R}_{[:r,:]}}_{\boldsymbol{Z}}$$
(an ID of $\boldsymbol{M}$ itself is recovered by permuting the columns of $\boldsymbol{Z}$ back with $\boldsymbol{\Pi}^{\top}$).
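The following sketch assembles these pieces with NumPy/SciPy (the matrix, the rank, and the helper name are my own illustrative choices, not SciPy's implementation): run column-pivoting QR, keep the first $r$ pivot columns as $\boldsymbol{C}$, and form $\boldsymbol{Z}$ from the triangular factor.

```python
import numpy as np
from scipy.linalg import qr

def id_via_pivoted_qr(M, r):
    """Illustrative ID built on column-pivoting QR (notation matches the text above)."""
    Q, R, perm = qr(M, mode='economic', pivoting=True)   # M[:, perm] = Q @ R
    S = perm[:r]                                         # indices of the first r pivot columns
    C = M[:, S]                                          # equals Q[:, :r] @ R[:r, :r]
    Z_perm = np.linalg.solve(R[:r, :r], R[:r, :])        # R11^{-1} [R11, R12] = [I, R11^{-1} R12]
    Z = np.empty_like(Z_perm)
    Z[:, perm] = Z_perm                                  # undo the column permutation
    return S, C, Z

M = np.random.default_rng(2).standard_normal((80, 40))
S, C, Z = id_via_pivoted_qr(M, r=10)
print(np.abs(Z).max(), np.linalg.norm(M - C @ Z))
```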
The above is the ID solution algorithm based on column-pivoting QR decomposition, which is also the built-in algorithm in SciPy (with rand=False). Note that this algorithm cannot guarantee $|z_{i,j}| \leq 1$ or even $|z_{i,j}| \leq 2$; however, according to reports in the literature, in practice it almost never produces $|z_{i,j}| > 2$, making it a fairly robust solver. In addition, SciPy also ships column-pivoting QR decomposition itself: it can be activated by setting pivoting=True in scipy.linalg.qr.
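For reference, here is a sketch of how the SciPy routines mentioned above can be called (the matrix and the rank are arbitrary illustrative choices):

```python
import numpy as np
import scipy.linalg.interpolative as sli
from scipy.linalg import qr

M = np.random.default_rng(3).standard_normal((200, 120))
r = 20

# deterministic ID (rand=False), i.e. the pivoted-QR based path
idx, proj = sli.interp_decomp(M, r, rand=False)
C = sli.reconstruct_skel_matrix(M, r, idx)            # the r skeleton columns of M
approx = sli.reconstruct_matrix_from_id(C, idx, proj) # C @ Z, the ID approximation
print(np.linalg.norm(M - approx))

# column-pivoting QR itself is exposed via pivoting=True
Q, R, perm = qr(M, mode='economic', pivoting=True)
```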
Randomized Algorithms#
Each orthogonalization step in column-pivoting QR decomposition must traverse all remaining vectors to find the one with maximum error, which becomes prohibitively expensive when $m$ is large. Likewise, when $n$ is large, the norm and inner-product computations themselves become costly. Randomized algorithms emerged in response, aiming to reduce $n$ or $m$ so as to lower the computational complexity.
First, consider the approach to reduce $n$, i.e., reduce the dimension of each column vector of $\boldsymbol{M}$. A common method is random projection, similar to the "JL Lemma" introduced in "The Amazing Johnson-Lindenstrauss Lemma: Theory". Specifically, assume $\boldsymbol{\Omega}\in\mathbb{R}^{d\times n}$ is some random projection matrix (where $d\ll n$) whose elements are independently and identically sampled from some distribution such as $\mathcal{N}(0,1/n)$. We then perform column-pivoting QR decomposition on the smaller matrix $\boldsymbol{\Omega}\boldsymbol{M}\in\mathbb{R}^{d\times m}$ to determine the positions of the selected $r$ columns. More detailed discussion can be found in "Randomized algorithms for pivoting and for computing interpolatory and CUR factorizations".
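A minimal sketch of this idea, assuming a Gaussian sketching matrix and NumPy/SciPy (all sizes are made-up): the pivoted QR runs only on the small sketch, while $\boldsymbol{Z}$ is still fitted against the full matrix.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(4)
n, m, d, r = 5000, 300, 40, 20
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))    # a rank-r test matrix

Omega = rng.normal(0.0, 1.0 / np.sqrt(n), size=(d, n))  # entries i.i.d. N(0, 1/n), d << n
Y = Omega @ M                                            # d x m sketch of M
_, _, perm = qr(Y, mode='economic', pivoting=True)       # pivoted QR on the small matrix only
S = perm[:r]                                             # column positions chosen from the sketch
C = M[:, S]
Z = np.linalg.pinv(C) @ M                                # Z is still fitted on the full M
print(np.linalg.norm(M - C @ Z) / np.linalg.norm(M))
```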
Based on my limited research, SciPy's randomized algorithm for solving ID also uses a similar idea, but replaces the random sampling matrix with the more structured "Subsampled Randomized Fourier Transform (SRFT)", reducing the computational cost of the step $\boldsymbol{\Omega}\boldsymbol{M}$ from $\mathcal{O}(mnd)$ to $\mathcal{O}(mn\log d)$. However, I am not familiar with the details of SRFT and SciPy's implementation. Interested readers may refer to "Enabling very large-scale matrix computations via randomization", "A brief introduction to Randomized Linear Algebra", and other materials for further investigation.
Another reason for not delving deeply into complex random projection methods like SRFT is that the paper "Efficient Algorithms for Constructing an Interpolative Decomposition" found that simpler column sampling often yields better results and is particularly easy to understand: randomly sample $k > r$ columns from $\boldsymbol{M}$, then use column-pivoting QR decomposition to select $r$ columns from these $k$ columns as $\boldsymbol{C}$, and finally solve for $\boldsymbol{Z}$ based on $\boldsymbol{C}$. This reduces the matrix size for column-pivoting QR decomposition from $n\times m$ to $n\times k$.
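As I read it, the procedure amounts to something like the following sketch (the function name and all sizes are my own illustrative choices, not the paper's code):

```python
import numpy as np
from scipy.linalg import qr

def column_sampling_id(M, r, k, rng):
    """Sample k > r candidate columns, run pivoted QR only on them, then fit Z on the full M."""
    m = M.shape[1]
    cand = rng.choice(m, size=k, replace=False)        # k randomly sampled candidate columns
    _, _, perm = qr(M[:, cand], mode='economic', pivoting=True)
    S = cand[perm[:r]]                                 # r columns chosen among the candidates
    C = M[:, S]
    Z = np.linalg.pinv(C) @ M
    return S, C, Z

rng = np.random.default_rng(5)
M = rng.standard_normal((1000, 400))
S, C, Z = column_sampling_id(M, r=30, k=90, rng=rng)
print(np.abs(Z).max(), np.linalg.norm(M - C @ Z))
```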
Experiments show that this simple approach at most slightly increases the risk of $|z_{i,j}| > 2$ in individual tasks but offers clear advantages in error:
Comparison of maximum absolute values of Z matrix between column sampling (Optim-RID) and SciPy built-in algorithm (SciPy-RID)
Improving Accuracy#
From the above tables, one might notice a potentially surprising result: the randomized column sampling method (Optim-RID) not only outperforms the similarly randomized SciPy-RID in terms of error, but in some tasks even outperforms deterministic algorithms SciPy-ID and Optim-ID (which are mathematically equivalent, both based on complete column-pivoting QR decomposition, differing only in implementation efficiency).
This seemingly counterintuitive phenomenon actually illustrates a fact: although column-pivoting QR decomposition can serve as a good baseline for ID, its ability to select basis vectors may not be significantly better than random selection. The main role of column-pivoting QR decomposition is to guarantee with high probability that $|z_{i,j}| < 2$. This is not difficult to understand. Take $r=1$ as an example: column-pivoting QR decomposition simply returns the column with the largest norm. But is the column with the largest norm necessarily a good basis (one that minimizes reconstruction error)? Clearly not. A good basis should represent the direction shared by most vectors, and maximum norm does not reflect this.
For ID, column-pivoting QR decomposition is essentially a greedy algorithm that turns the problem of selecting $r$ columns into $r$ successive single-column selections. For the $r=1$ subproblem, in scenarios where $m$ is not too large or high precision is required, the complexity of exact solution by enumeration is acceptable:
$$\mathop{\arg\min}_{i}\sum_{j=1}^{m}\left\Vert \boldsymbol{m}_j-\frac{\boldsymbol{m}_i\boldsymbol{m}_i^{\top}}{\Vert\boldsymbol{m}_i\Vert^2}\boldsymbol{m}_j\right\Vert^2$$
That is, traverse all $\boldsymbol{m}_i$, project all remaining columns onto $\boldsymbol{m}_i$ to compute the total error, and select the $\boldsymbol{m}_i$ with minimum total error. Its complexity is proportional to $m^2$. If we replace the operation of selecting the column with maximum norm at each step of column-pivoting QR decomposition with selecting the $\boldsymbol{m}_i$ that minimizes the total error in the above expression, we can identify better bases, thereby achieving lower reconstruction error (at the cost, naturally, of higher complexity and even less guarantee that $|z_{i,j}| < 2$).
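A vectorized sketch of this $r=1$ enumeration (assuming NumPy; the function name is mine): the Gram matrix supplies all pairwise inner products at once, so the search itself is dominated by an $m\times m$ matrix of inner products.

```python
import numpy as np

def best_single_column(M):
    """Enumerate every column as the sole basis vector and pick the one with least total error."""
    norms2 = (M * M).sum(axis=0)                  # ||m_i||^2 for every column
    gram = M.T @ M                                # all inner products m_i^T m_j
    # total squared residual when every column is projected onto m_i:
    #   sum_j ||m_j||^2 - sum_j (m_i^T m_j)^2 / ||m_i||^2
    total_err = norms2.sum() - (gram ** 2).sum(axis=1) / norms2
    return int(np.argmin(total_err))

M = np.random.default_rng(6).standard_normal((50, 200))
i = best_single_column(M)                         # index of the best single "skeleton" column
```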
In summary, due to the NP-Hard nature of exact solution, ID has numerous solution approaches. The ones listed above are only a very limited selection. Interested readers can explore further using keywords such as Randomized Linear Algebra and Column Subset Selection. It is particularly worth noting that Randomized Linear Algebra, which aims to accelerate various matrix operations through randomized methods, has itself become a rich discipline. The randomized ID discussed in this article and the sampling-based CR approximation from the previous article are both classic examples within this field.
Article Summary#
This article introduced Interpolative Decomposition (ID), which approximates the original matrix by selecting several of its columns as a "skeleton." It is a structured low-rank factorization with a relatively intuitive geometric interpretation. The core difficulty lies in column selection, which is fundamentally an NP-Hard discrete optimization problem.
Key points covered include:
- Formal Definition: ID seeks $\boldsymbol{M} \approx \boldsymbol{C}\boldsymbol{Z}$ where $\boldsymbol{C}$ consists of selected columns from $\boldsymbol{M}$
- Geometric Interpretation: ID represents remaining columns as linear combinations of selected basis columns
- Deterministic Algorithm: Column-pivoting QR decomposition as the standard approach
- Randomized Methods: Random projection and column sampling techniques for scalability
- Accuracy Trade-offs: Exploration of greedy vs. exhaustive search approaches
- Practical Implementation: SciPy integration and experimental comparisons
ID provides a valuable alternative to SVD and CR approximation when interpretability and preservation of original matrix structure are important considerations in low-rank approximation tasks.
Original Article: Su Jianlin. The Path to Low-Rank Approximation (Part 4): Interpolative Decomposition. Scientific Spaces.