Series Continuation

In the previous article "The Path to Low-Rank Approximation (Part 2): SVD", we demonstrated that SVD provides the optimal low-rank approximation for any matrix. However, that optimal approximation is unconstrained—it only minimizes the error without considering specific matrix structures. In many practical applications, due to requirements for interpretability or nonlinear processing, we often seek approximation decompositions with particular structural characteristics.

Therefore, starting with this article, we will explore several low-rank approximations with specific structural constraints. This article focuses on Column-Row (CR) Approximation, which offers a simple scheme for accelerating matrix multiplication operations.

Problem Background#

The general formulation for optimal rank-$r$ approximation of a matrix is:

(1) \[ \mathop{\text{argmin}}_{\text{rank}(\tilde{\boldsymbol{M}})\leq r}\Vert \tilde{\boldsymbol{M}} - \boldsymbol{M}\Vert_F^2 \]

where $\boldsymbol{M},\tilde{\boldsymbol{M}}\in\mathbb{R}^{n\times m},r < \min(n,m)$. In the first two articles of this series, we have discussed two scenarios:

1. If $\tilde{\boldsymbol{M}}$ has no additional constraints, then the optimal solution for $\tilde{\boldsymbol{M}}$ is $\boldsymbol{U}_{[:,:r]}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{V}_{[:,:r]}^{\top}$, where $\boldsymbol{M}=\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}$ is the Singular Value Decomposition (SVD) of $\boldsymbol{M}$;

2. If we require $\tilde{\boldsymbol{M}}=\boldsymbol{A}\boldsymbol{B}$ (with $\boldsymbol{A}\in\mathbb{R}^{n\times r},\boldsymbol{B}\in\mathbb{R}^{r\times m}$) and $\boldsymbol{A}$ (or $\boldsymbol{B}$) is already given, then the optimal solution for $\tilde{\boldsymbol{M}}$ is $\boldsymbol{A} \boldsymbol{A}^{\dagger} \boldsymbol{M}$ (or $\boldsymbol{M} \boldsymbol{B}^{\dagger} \boldsymbol{B}$), where ${}^\dagger$ denotes the "pseudo-inverse".
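
As a quick numerical check of both recalled results, here is a minimal NumPy sketch (the matrix sizes and random data are arbitrary choices for illustration); by optimality of the truncated SVD, its error can never exceed that of the projection onto a randomly chosen $\boldsymbol{A}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 8, 6, 3
M = rng.standard_normal((n, m))

# 1. Unconstrained optimum: keep the top-r singular triplets.
U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
M_svd = U[:, :r] @ np.diag(sigma[:r]) @ Vt[:r, :]

# 2. A is given: the best M_tilde = A B is A A^+ M.
A = rng.standard_normal((n, r))
M_given_A = A @ np.linalg.pinv(A) @ M

err_svd = np.linalg.norm(M_svd - M, "fro")
err_A = np.linalg.norm(M_given_A - M, "fro")
assert err_svd <= err_A + 1e-12   # SVD truncation is optimal over all rank-r choices
```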

Both results have extensive applications, but neither explicitly introduces structural relationships between $\tilde{\boldsymbol{M}}$ and $\boldsymbol{M}$. This makes it difficult to intuitively understand the connection between $\tilde{\boldsymbol{M}}$ and $\boldsymbol{M}$—in other words, $\tilde{\boldsymbol{M}}$ lacks strong interpretability.

Structured Approximation Motivation

Furthermore, when objectives involve nonlinear operations such as $\phi(\boldsymbol{X}\boldsymbol{W})$, we typically cannot use arbitrary real projection matrices for dimensionality reduction. Instead, we often require the use of "selection matrices." For instance, $\phi(\boldsymbol{X}\boldsymbol{W})\boldsymbol{S} = \phi(\boldsymbol{X}\boldsymbol{W}\boldsymbol{S})$ does not hold for arbitrary matrices $\boldsymbol{S}$, but it does hold for selection matrices $\boldsymbol{S}$.
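
As a concrete illustration (a minimal NumPy sketch; taking $\phi$ to be ReLU is my own choice of example), an elementwise nonlinearity commutes with a column-selection matrix but generally not with a dense projection matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
phi = lambda t: np.maximum(t, 0.0)   # an elementwise nonlinearity (ReLU, as an example)

X = rng.standard_normal((4, 5))
W = rng.standard_normal((5, 6))

# Selection matrix keeping columns 0, 2, 5.
S = np.zeros((6, 3))
S[[0, 2, 5], [0, 1, 2]] = 1.0

print(np.allclose(phi(X @ W) @ S, phi(X @ W @ S)))   # True: selection commutes with phi

P = rng.standard_normal((6, 3))                      # a generic dense projection
print(np.allclose(phi(X @ W) @ P, phi(X @ W @ P)))   # almost surely False
```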

Therefore, we now focus on low-rank approximation under selection matrix constraints. Specifically, we have $\boldsymbol{X}\in\mathbb{R}^{n\times l},\boldsymbol{Y}\in\mathbb{R}^{l\times m}$, with $\boldsymbol{M}=\boldsymbol{X}\boldsymbol{Y}$. Our task is to select $r$ columns from $\boldsymbol{X}$ and the corresponding $r$ rows from $\boldsymbol{Y}$ to construct $\tilde{\boldsymbol{M}}$:

(2) \[ \mathop{\text{argmin}}_S\Vert \underbrace{\boldsymbol{X}_{[:,S]}}_{\boldsymbol{C}}\underbrace{\boldsymbol{Y}_{[S,:]}}_{\boldsymbol{R}} - \boldsymbol{X}\boldsymbol{Y}\Vert_F^2\quad\text{s.t.}\quad S\subset \{0,1,\cdots,l-1\}, |S|=r \]

Here $S$ can be understood as a slice (following Python slicing conventions). We refer to $\boldsymbol{X}_{[:,S]}\boldsymbol{Y}_{[S,:]}$ as the "CR approximation" of $\boldsymbol{X}\boldsymbol{Y}$. Note that this slicing operation can also be equivalently described using selection matrices. Suppose columns $1,2,\cdots,r$ of $\boldsymbol{X}_{[:,S]}$ are columns $s_1,s_2,\cdots,s_r$ of $\boldsymbol{X}$, respectively. We can then define a selection matrix $\boldsymbol{S}\in\{0,1\}^{l\times r}$:

(3) \[ S_{i,j}=\left\{\begin{aligned}&1, &i = s_j \\ &0, &i\neq s_j\end{aligned}\right. \]

That is, the $s_j$-th element of the $j$-th column of $\boldsymbol{S}$ equals 1, while all others equal 0. This yields $\boldsymbol{X}_{[:,S]}=\boldsymbol{X}\boldsymbol{S}$ and $\boldsymbol{Y}_{[S,:]}=\boldsymbol{S}^{\top} \boldsymbol{Y}$.
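
A short sketch of this equivalence (the index set below is a hypothetical example; any $r$ distinct indices work):

```python
import numpy as np

rng = np.random.default_rng(2)
n, l, m, r = 4, 7, 5, 3
X = rng.standard_normal((n, l))
Y = rng.standard_normal((l, m))

idx = [1, 4, 6]                    # the slice S = (s_1, s_2, s_3)
S = np.zeros((l, r))
S[idx, np.arange(r)] = 1.0         # S_{i,j} = 1 iff i = s_j

assert np.allclose(X[:, idx], X @ S)       # X_{[:,S]} = X S
assert np.allclose(Y[idx, :], S.T @ Y)     # Y_{[S,:]} = S^T Y

M_cr = X[:, idx] @ Y[idx, :]               # the CR approximation C R of X Y
```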

Preliminary Approximation#

If we express $\boldsymbol{X},\boldsymbol{Y}$ respectively as:

(4) \[ \boldsymbol{X} = (\boldsymbol{x}_1,\boldsymbol{x}_2,\cdots,\boldsymbol{x}_l),\quad \boldsymbol{Y}=\begin{pmatrix}\boldsymbol{y}_1^{\top} \\ \boldsymbol{y}_2^{\top} \\ \vdots \\ \boldsymbol{y}_l^{\top}\end{pmatrix} \]

where $\boldsymbol{x}_i\in\mathbb{R}^{n\times 1},\boldsymbol{y}_i\in\mathbb{R}^{m\times 1}$ are column vectors, then $\boldsymbol{X}\boldsymbol{Y}$ can be written as:

(5) \[ \boldsymbol{X}\boldsymbol{Y} = \sum_{i=1}^l \boldsymbol{x}_i\boldsymbol{y}_i^{\top} \]

Finding the optimal CR approximation of $\boldsymbol{X}\boldsymbol{Y}$ can be equivalently expressed as:

(6) \[ \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_l\in\{0,1\}}\left\Vert\sum_{i=1}^l \lambda_i \boldsymbol{x}_i\boldsymbol{y}_i^{\top} - \sum_{i=1}^l\boldsymbol{x}_i\boldsymbol{y}_i^{\top}\right\Vert_F^2\quad\text{s.t.}\quad \sum_{i=1}^l \lambda_i = r \]

Computational Complexity Analysis

We know that the Frobenius norm of a matrix essentially treats the matrix as a flattened vector and computes its length. Therefore, this optimization problem is fundamentally equivalent to: given $l$ vectors $\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_l\in\mathbb{R}^d$, solve:

(7) \[ \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_l\in\{0,1\}}\left\Vert\sum_{i=1}^l \lambda_i \boldsymbol{v}_i - \sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2\quad\text{s.t.}\quad \sum_{i=1}^l \lambda_i = r \]

where $\boldsymbol{v}_i = \text{vec}(\boldsymbol{x}_i \boldsymbol{y}_i^{\top})$ and $d=nm$. Defining $\gamma_i = 1 - \lambda_i$, we can further simplify to:

(8) \[ \mathop{\text{argmin}}_{\gamma_1,\gamma_2,\cdots,\gamma_l\in\{0,1\}}\left\Vert\sum_{i=1}^l \gamma_i \boldsymbol{v}_i\right\Vert^2\quad\text{s.t.}\quad \sum_{i=1}^l \gamma_i = l-r \]

If I understand correctly, solving this optimization problem exactly is NP-hard. Therefore, in general, we can only resort to approximate algorithms.

A simple case that admits an exact solution occurs when $\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_l$ are mutually orthogonal. In this scenario:

(9) \[ \left\Vert\sum_{i=1}^l \gamma_i \boldsymbol{v}_i\right\Vert^2 = \sum_{i=1}^l \gamma_i^2 \Vert\boldsymbol{v}_i\Vert^2 \]

Therefore, its minimum value is the sum of the smallest $l-r$ values of $\Vert\boldsymbol{v}_i\Vert^2$. That is, set $\gamma_i = 1$ for the $l-r$ vectors $\boldsymbol{v}_i$ with the smallest norms, and $\gamma_i = 0$ for the remaining ones. When the mutual orthogonality condition is not strictly satisfied, we can still use "selecting the $l-r$ vectors $\boldsymbol{v}_i$ with the smallest norms" as an approximate solution.

Returning to the original CR approximation problem, we have $\Vert\boldsymbol{x}_i\boldsymbol{y}_i^{\top}\Vert_F = \Vert\boldsymbol{x}_i\Vert \Vert \boldsymbol{y}_i\Vert$. Thus, a baseline for the CR approximation of $\boldsymbol{X}\boldsymbol{Y}$ is to retain the $r$ column/row pairs for which the product $\Vert\boldsymbol{x}_i\Vert \Vert \boldsymbol{y}_i\Vert$ of the column norm of $\boldsymbol{X}$ and the corresponding row norm of $\boldsymbol{Y}$ is largest.
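
A minimal sketch of this baseline (the function name and interface are my own):

```python
import numpy as np

def cr_by_norms(X, Y, r):
    """Keep the r column/row pairs with the largest ||x_i|| * ||y_i||."""
    scores = np.linalg.norm(X, axis=0) * np.linalg.norm(Y, axis=1)
    idx = np.argsort(-scores)[:r]          # indices of the r largest products
    return X[:, idx], Y[idx, :], idx

# Usage: relative error of the baseline CR approximation.
rng = np.random.default_rng(3)
X, Y = rng.standard_normal((10, 50)), rng.standard_normal((50, 8))
C, R, idx = cr_by_norms(X, Y, r=10)
print(np.linalg.norm(C @ R - X @ Y, "fro") / np.linalg.norm(X @ Y, "fro"))
```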

Sampling Perspective#

In some scenarios, we can relax Equation (6) to:

(10) \[ \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_l\in\mathbb{R}}\left\Vert\sum_{i=1}^l \lambda_i \boldsymbol{x}_i\boldsymbol{y}_i^{\top} - \sum_{i=1}^l\boldsymbol{x}_i\boldsymbol{y}_i^{\top}\right\Vert_F^2\quad\text{s.t.}\quad \sum_{i=1}^l \#[\lambda_i\neq 0] = r \]

where $\#[\lambda_i\neq 0]$ outputs 1 if $\lambda_i\neq 0$, and 0 otherwise. This relaxed version essentially extends the CR approximation form from $\boldsymbol{C}\boldsymbol{R}$ to $\boldsymbol{C}\boldsymbol{\Lambda}\boldsymbol{R}$, where $\boldsymbol{\Lambda}\in\mathbb{R}^{r\times r}$ is a diagonal matrix. That is, it allows us to adjust the diagonal matrix $\boldsymbol{\Lambda}$ to achieve higher accuracy. Correspondingly, Equation (7) becomes:

(11) \[ \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_l\in\mathbb{R}}\left\Vert\sum_{i=1}^l \lambda_i \boldsymbol{v}_i - \sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2\quad\text{s.t.}\quad \sum_{i=1}^l \#[\lambda_i\neq 0] = r \]

Probabilistic Sampling Approach

With this relaxation, we can approach the problem from a sampling perspective. First, we introduce an arbitrary $l$-element probability distribution $\boldsymbol{p}=(p_1,p_2,\cdots,p_l)$. We can then write:

(12) \[ \sum_{i=1}^l\boldsymbol{v}_i = \sum_{i=1}^l p_i\times\frac{\boldsymbol{v}_i}{p_i} = \mathbb{E}_{i\sim \boldsymbol{p}} \left[\frac{\boldsymbol{v}_i}{p_i}\right] \]

That is, the mathematical expectation of $\boldsymbol{v}_i/p_i$ precisely equals the target we aim to approximate. Therefore, we can construct an approximation through independent and identically distributed (i.i.d.) sampling from distribution $\boldsymbol{p}$:

(13) \[ \sum_{i=1}^l\boldsymbol{v}_i = \mathbb{E}_{i\sim \boldsymbol{p}} \left[\frac{\boldsymbol{v}_i}{p_i}\right] \approx \frac{1}{r}\sum_{j=1}^r \frac{\boldsymbol{v}_{s_j}}{p_{s_j}},\quad s_1,s_2,\cdots,s_r\sim \boldsymbol{p} \]

This implies that when $i$ is one of $s_1,s_2,\cdots,s_r$, we set $\lambda_i = (r p_i)^{-1}$; otherwise $\lambda_i=0$. Some readers might question why we use i.i.d. sampling instead of sampling without replacement, which might seem more appropriate for approximation purposes. The reason is simply that i.i.d. sampling makes subsequent analysis more straightforward.

Up to this point, our theoretical results are independent of the choice of distribution $\boldsymbol{p}$—meaning they hold for any $\boldsymbol{p}$. This provides us with the possibility to select an optimal $\boldsymbol{p}$. How do we measure the quality of $\boldsymbol{p}$? Clearly, we want the error of each sampling estimate to be as small as possible. Therefore, we can use the expected squared error of the sampling estimate:

(14) \[ \mathbb{E}_{i\sim \boldsymbol{p}} \left[\left\Vert\frac{\boldsymbol{v}_i}{p_i} - \sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2\right] = \left(\sum_{i=1}^l \frac{\Vert\boldsymbol{v}_i\Vert^2}{p_i}\right) - \left\Vert\sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2 \]

to compare the effectiveness of different $\boldsymbol{p}$ distributions. Using the inequality of arithmetic and geometric means (AM-GM inequality) with an arbitrary constant $Z>0$, whose value will be fixed below by the normalization $\sum_{i=1}^l p_i=1$, we have:

(15) \[ \sum_{i=1}^l \frac{\Vert\boldsymbol{v}_i\Vert^2}{p_i} = \left(\sum_{i=1}^l \frac{\Vert\boldsymbol{v}_i\Vert^2}{p_i} + p_i Z^2\right) - Z^2\geq \left(\sum_{i=1}^l 2\Vert\boldsymbol{v}_i\Vert Z\right) - Z^2 \]

Equality is achieved when $\Vert\boldsymbol{v}_i\Vert^2 / p_i = p_i Z^2$, yielding the optimal $\boldsymbol{p}$:

(16) \[ p_i^* = \frac{\Vert\boldsymbol{v}_i\Vert}{Z},\quad Z = \sum\limits_{i=1}^l \Vert\boldsymbol{v}_i\Vert \]

The corresponding error becomes:

(17) \[ \mathbb{E}_{i\sim \boldsymbol{p}} \left[\left\Vert\frac{\boldsymbol{v}_i}{p_i} - \sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2\right] = \left(\sum_{i=1}^l \Vert\boldsymbol{v}_i\Vert\right)^2 - \left\Vert\sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2 \]
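
The error formula in Equation (14) and the closed form in Equation (17) can be checked numerically; the sketch below (random vectors, with the expectation computed exactly by enumerating over $i$) also confirms that the choice $p_i^*\propto\Vert\boldsymbol{v}_i\Vert$ does no worse than the uniform distribution:

```python
import numpy as np

rng = np.random.default_rng(4)
l, d = 6, 10
V = rng.standard_normal((d, l))          # columns are v_1, ..., v_l
target = V.sum(axis=1)

def expected_sq_error(p):
    # E_{i~p} || v_i / p_i - sum_j v_j ||^2, computed exactly by enumeration
    return sum(p[i] * np.sum((V[:, i] / p[i] - target) ** 2) for i in range(l))

norms = np.linalg.norm(V, axis=0)
p_opt = norms / norms.sum()              # p_i^* = ||v_i|| / Z
p_uni = np.full(l, 1.0 / l)

rhs = (norms ** 2 / p_opt).sum() - np.sum(target ** 2)    # Equation (14) at p = p*
closed = norms.sum() ** 2 - np.sum(target ** 2)           # Equation (17)

assert np.isclose(expected_sq_error(p_opt), rhs)
assert np.isclose(rhs, closed)
assert expected_sq_error(p_opt) <= expected_sq_error(p_uni) + 1e-9
```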

Historical Reference

The optimal $p_i$ is exactly proportional to $\Vert\boldsymbol{v}_i\Vert$. Therefore, the $r$ vectors $\boldsymbol{v}_i$ with the highest probabilities are precisely those with the largest norms, connecting this result with the approximation from the previous section. This result originates from the 2006 paper "Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication". Originally intended for accelerating matrix multiplication, it demonstrates that by sampling columns/rows of $\boldsymbol{X},\boldsymbol{Y}$ according to probabilities $p_i\propto \Vert \boldsymbol{x}_i\boldsymbol{y}_i^{\top}\Vert_F = \Vert \boldsymbol{x}_i\Vert \Vert\boldsymbol{y}_i\Vert$ and scaling by $(r p_i)^{-1/2}$, we can obtain a CR approximation of $\boldsymbol{X}\boldsymbol{Y}$, thereby reducing multiplication complexity from $\mathcal{O}(lmn)$ to $\mathcal{O}(rmn)$.
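
A sketch of the resulting sampled CR approximation (a minimal implementation of the sampling scheme described above; the function name and test sizes are my own):

```python
import numpy as np

def cr_by_sampling(X, Y, r, rng=None):
    """Sample r column/row pairs i.i.d. with p_i proportional to ||x_i|| ||y_i||."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.linalg.norm(X, axis=0) * np.linalg.norm(Y, axis=1)
    p = p / p.sum()
    idx = rng.choice(len(p), size=r, replace=True, p=p)    # i.i.d. sampling
    scale = 1.0 / np.sqrt(r * p[idx])                      # scaling by (r p_i)^(-1/2)
    C = X[:, idx] * scale                                  # scaled columns of X
    R = Y[idx, :] * scale[:, None]                         # scaled rows of Y
    return C, R                                            # C @ R estimates X @ Y

rng = np.random.default_rng(5)
X, Y = rng.standard_normal((20, 500)), rng.standard_normal((500, 30))
C, R = cr_by_sampling(X, Y, r=100, rng=rng)
print(np.linalg.norm(C @ R - X @ Y, "fro") / np.linalg.norm(X @ Y, "fro"))
```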

Extended Discussions#

Whether sorting by norm or random sampling according to $p_i\propto \Vert\boldsymbol{v}_i\Vert$, both methods allow us to construct a CR approximation with linear complexity [i.e., $\mathcal{O}(l)$]. This is certainly ideal for real-time computation. However, since both sorting and sampling depend solely on $\Vert\boldsymbol{v}_i\Vert$, the accuracy can only be considered moderate. If we can accept higher complexity, how can we improve the accuracy of CR approximation?

Greedy Algorithm for Improved Approximation

We can attempt to change the sorting unit to $k$-tuples. For simplicity, assume $k \leq l-r$ and $k$ divides $l-r$. There are $\binom{l}{k}$ ways to choose $k$ vectors from the $l$ vectors $\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_l$. For each combination $\{s_1,s_2,\cdots,s_k\}$, we can compute the norm of the vector sum $\Vert \boldsymbol{v}_{s_1} + \boldsymbol{v}_{s_2} + \cdots + \boldsymbol{v}_{s_k}\Vert$. With these data, we can greedily construct an approximate solution to Equation (8):

Initialize: $\Omega = \{1,2,\cdots,l\},\Theta=\{\}$

For $t=1,2,\cdots,(l-r)/k$, execute:

  $\Theta = \Theta\,\cup\,\mathop{\text{argmin}}\limits_{\{s_1,s_2,\cdots,s_k\}\subset \Omega}\Vert \boldsymbol{v}_{s_1} + \boldsymbol{v}_{s_2} + \cdots + \boldsymbol{v}_{s_k}\Vert$;

  $\Omega = \Omega\,\backslash\,\Theta$;

Return $\Theta$.

Essentially, this algorithm repeatedly selects the $k$ vectors with the smallest combined norm from the remaining vectors, performing this selection $(l-r)/k$ times to obtain $l-r$ vectors. This is a natural generalization of sorting by individual vector norms.
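
A brute-force sketch of this greedy procedure (enumeration over $k$-tuples via itertools; only feasible when $\binom{l}{k}$ is small, and the function name is my own):

```python
import numpy as np
from itertools import combinations

def greedy_k_tuple_selection(V, r, k):
    """Greedily discard l - r vectors, k at a time, minimizing the norm of each discarded sum.

    V: (d, l) array whose columns are v_1, ..., v_l. Returns the discarded index
    set Theta; the kept indices define the CR slice. Assumes k divides l - r.
    """
    l = V.shape[1]
    omega, theta = set(range(l)), set()
    for _ in range((l - r) // k):
        best = min(combinations(sorted(omega), k),
                   key=lambda s: np.linalg.norm(V[:, list(s)].sum(axis=1)))
        theta |= set(best)
        omega -= set(best)
    return theta

# k = 1 recovers "drop the l - r smallest-norm vectors"; k = 2 considers pairs.
rng = np.random.default_rng(6)
V = rng.standard_normal((9, 12))
print(sorted(greedy_k_tuple_selection(V, r=6, k=2)))   # 6 discarded indices
```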

Complexity Analysis

The complexity of this approach is $\mathcal{O}\big(\binom{l}{k}\big)$. When $k > 1$ and $l$ is relatively large, this may become computationally prohibitive, which indirectly reflects the difficulty of solving the original problem exactly.

Another question worth considering is: if we allow the CR approximation to be relaxed to $\boldsymbol{C}\boldsymbol{\Lambda}\boldsymbol{R}$, what is the optimal solution for $\boldsymbol{\Lambda}$? If no structure is imposed on $\boldsymbol{\Lambda}$, the answer can be given by the pseudo-inverse:

(18) \[ \boldsymbol{\Lambda}^* = \mathop{\text{argmin}}_{\boldsymbol{\Lambda}}\Vert \boldsymbol{C}\boldsymbol{\Lambda}\boldsymbol{R} - \boldsymbol{X}\boldsymbol{Y}\Vert_F^2 = \boldsymbol{C}^{\dagger}\boldsymbol{X}\boldsymbol{Y}\boldsymbol{R}^{\dagger} \]

What if $\boldsymbol{\Lambda}$ must be a diagonal matrix? Then we can first reformulate the problem as: given $\{\boldsymbol{u}_1,\boldsymbol{u}_2,\cdots,\boldsymbol{u}_r\}\subset\{\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_l\}$, find:

(19) \[ \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_r}\left\Vert\sum_{i=1}^r \lambda_i \boldsymbol{u}_i - \sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2 \]

Denote $\boldsymbol{U} = (\boldsymbol{u}_1,\boldsymbol{u}_2,\cdots,\boldsymbol{u}_r), \boldsymbol{V} = (\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_l), \boldsymbol{\lambda}=(\lambda_1,\lambda_2,\cdots,\lambda_r)^{\top}$. The optimization objective can then be written as:

(20) \[ \mathop{\text{argmin}}_{\boldsymbol{\lambda}}\left\Vert\boldsymbol{U}\boldsymbol{\lambda} - \boldsymbol{V}\boldsymbol{1}_{l\times 1}\right\Vert^2 \]

This can also be expressed using the pseudo-inverse to obtain the optimal solution:

(21) \[ \boldsymbol{\lambda}^* = \boldsymbol{U}^{\dagger}\boldsymbol{V}\boldsymbol{1}_{l\times 1} = (\boldsymbol{U}^{\top}\boldsymbol{U})^{-1}\boldsymbol{U}^{\top}\boldsymbol{V}\boldsymbol{1}_{l\times 1} \]

The last equality assumes $\boldsymbol{U}^{\top}\boldsymbol{U}$ is invertible, which is typically satisfied. If not, replace $(\boldsymbol{U}^{\top}\boldsymbol{U})^{-1}$ with $(\boldsymbol{U}^{\top}\boldsymbol{U})^{\dagger}$.

The practical issue is that directly applying this formula to the original problem incurs excessive computational cost, because $\boldsymbol{v}_i = \text{vec}(\boldsymbol{x}_i \boldsymbol{y}_i^{\top})$ is an $mn$-dimensional vector. Consequently, $\boldsymbol{V}$ has size $mn\times l$ and $\boldsymbol{U}$ has size $mn\times r$, which becomes problematic when $m,n$ are large.

Utilizing the relationship $\boldsymbol{v}_i = \text{vec}(\boldsymbol{x}_i \boldsymbol{y}_i^{\top})$ helps us further simplify the above expression. Let $\boldsymbol{u}_i = \text{vec}(\boldsymbol{c}_i \boldsymbol{r}_i^{\top})$. Then:

(22) \[ \begin{aligned} (\boldsymbol{U}^{\top}\boldsymbol{V})_{i,j} &= \langle \boldsymbol{c}_i \boldsymbol{r}_i^{\top}, \boldsymbol{x}_j \boldsymbol{y}_j^{\top}\rangle_F \\ &= \text{Tr}(\boldsymbol{r}_i \boldsymbol{c}_i^{\top}\boldsymbol{x}_j \boldsymbol{y}_j^{\top}) \\ &= (\boldsymbol{c}_i^{\top}\boldsymbol{x}_j)(\boldsymbol{r}_i^{\top} \boldsymbol{y}_j) \\[5pt] &= [(\boldsymbol{C}^{\top}\boldsymbol{X})\odot (\boldsymbol{R}\boldsymbol{Y}^{\top})]_{i,j} \end{aligned} \]

That is, $\boldsymbol{U}^{\top}\boldsymbol{V}=(\boldsymbol{C}^{\top}\boldsymbol{X})\odot (\boldsymbol{R}\boldsymbol{Y}^{\top})$ and $\boldsymbol{U}^{\top}\boldsymbol{U}=(\boldsymbol{C}^{\top}\boldsymbol{C})\odot (\boldsymbol{R}\boldsymbol{R}^{\top})$, where $\odot$ denotes the Hadamard product. Through these identity transformations, the computational cost of $\boldsymbol{U}^{\top}\boldsymbol{V}$ and $\boldsymbol{U}^{\top}\boldsymbol{U}$ is significantly reduced.
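
A sketch of this diagonal-$\boldsymbol{\Lambda}$ refinement using the Hadamard-product identities above (pinv is used in place of the explicit inverse for robustness; the function name and test sizes are my own):

```python
import numpy as np

def optimal_diagonal_lambda(X, Y, idx):
    """Given selected indices, return lambda* for the C diag(lambda) R approximation of X Y."""
    C, R = X[:, idx], Y[idx, :]                      # C = X_{[:,S]}, R = Y_{[S,:]}
    UtU = (C.T @ C) * (R @ R.T)                      # U^T U = (C^T C) Hadamard (R R^T)
    UtV = (C.T @ X) * (R @ Y.T)                      # U^T V = (C^T X) Hadamard (R Y^T)
    return np.linalg.pinv(UtU) @ UtV.sum(axis=1)     # (U^T U)^+ U^T V 1

rng = np.random.default_rng(7)
X, Y = rng.standard_normal((15, 40)), rng.standard_normal((40, 12))
idx = np.argsort(-np.linalg.norm(X, axis=0) * np.linalg.norm(Y, axis=1))[:10]
lam = optimal_diagonal_lambda(X, Y, idx)

err_plain = np.linalg.norm(X[:, idx] @ Y[idx, :] - X @ Y, "fro")
err_diag = np.linalg.norm(X[:, idx] @ np.diag(lam) @ Y[idx, :] - X @ Y, "fro")
print(err_diag, err_plain)   # the diagonal rescaling can only reduce the error
```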

Article Summary#

This article introduced Column-Row (CR) approximation for matrix multiplication, a structured low-rank approximation built by selecting specific columns and rows of the factors. Compared to the optimal low-rank approximation provided by SVD, CR approximation offers a more intuitive interpretation, since its factors are actual columns of $\boldsymbol{X}$ and rows of $\boldsymbol{Y}$.

We explored:

  1. The fundamental formulation of CR approximation as a selection matrix constrained optimization problem
  2. The NP-hard nature of the exact solution and baseline approximation strategies based on norm sorting
  3. A probabilistic sampling perspective that connects to efficient matrix multiplication algorithms
  4. Extended discussions on greedy algorithms and computational optimizations for diagonal scaling matrices

CR approximation provides a practical trade-off between computational efficiency and structural interpretability, making it particularly valuable in scenarios where preserving specific matrix structure or enabling nonlinear operations is essential.

Citation Information

Original Article: Su Jianlin. The Path to Low-Rank Approximation (Part 3): CR Approximation. Scientific Spaces.

How to cite this translation:

Su, J. The Path to Low-Rank Approximation (Part 3): CR Approximation [Translated by ScalingOpt Team]. Scientific Spaces.

BibTeX:

@article{su2025lowrank_cr,
  title   = {The Path to Low-Rank Approximation (Part 3): CR Approximation},
  author  = {Su, Jianlin},
  journal = {Scientific Spaces},
  year    = {2025},
  url     = {https://kexue.fm/archives/10567},
  note    = {Translated by ScalingOpt Team}
}