Series Introduction

Many readers, like the author, may experience a sense of familiarity mixed with unfamiliarity regarding low-rank matrix approximation. Familiarity arises because the concept and significance of low-rank approximation are not difficult to grasp, and techniques such as LoRA (Low-Rank Adaptation) based on low-rank approximation have proliferated, embedding the concept deeply in our collective consciousness through constant exposure. However, low-rank approximation encompasses a vast territory, and research papers in this area frequently introduce novel techniques that are both unfamiliar and awe-inspiring, leading to a sense of partial understanding or ambiguity.

Therefore, in this series of articles, I will attempt to systematically organize the theoretical content related to low-rank matrix approximation to complete our understanding of this important topic. In this first article, we focus on a relatively simple concept within the low-rank approximation series—the pseudo-inverse.

Optimization Perspective#

The pseudo-inverse (also known as the "generalized inverse"), as the name suggests, represents a "generalized matrix inverse." It effectively extends the concept of the inverse matrix to non-invertible matrices.

Consider the matrix equation $\boldsymbol{A}\boldsymbol{B}=\boldsymbol{M}$. If $\boldsymbol{A}$ is a square matrix and invertible, we directly obtain $\boldsymbol{B}=\boldsymbol{A}^{-1}\boldsymbol{M}$. But what if $\boldsymbol{A}$ is non-invertible or not even square? In such cases, we may be unable to find $\boldsymbol{B}$ satisfying $\boldsymbol{A}\boldsymbol{B}=\boldsymbol{M}$. If we wish to proceed, we typically reformulate the problem as an optimization:

(1) \[ \mathop{\text{argmin}}_\boldsymbol{B} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2 \]

where $\boldsymbol{A}\in\mathbb{R}^{n\times r}, \boldsymbol{B}\in\mathbb{R}^{r\times m}, \boldsymbol{M}\in\mathbb{R}^{n\times m}$. Note that the field is $\mathbb{R}$, indicating that this series focuses on real matrices. Here, $\Vert\cdot\Vert_F$ denotes the Frobenius norm of a matrix, measuring the distance between $\boldsymbol{A}\boldsymbol{B} - \boldsymbol{M}$ and the zero matrix, defined as:

(2) \[ \Vert \boldsymbol{M}\Vert_F = \sqrt{\sum_{i=1}^n\sum_{j=1}^m M_{i,j}^2} \]

Low-Rank Approximation Context

In essence, we transition from seeking an exact inverse to minimizing the squared error between $\boldsymbol{A}\boldsymbol{B}$ and $\boldsymbol{M}$. The theme of this series is low-rank approximation, so we assume $r \ll \min(n,m)$. In machine learning terms, this corresponds to reconstructing the full matrix $\boldsymbol{M}$ from a low-dimensional (and therefore lossy) input matrix $\boldsymbol{A}$ through the linear transformation $\boldsymbol{B}$.
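
As a quick illustration (a minimal NumPy sketch with randomly generated matrices, not part of the original derivation), objective (1) is an ordinary least-squares problem that decouples across the columns of $\boldsymbol{M}$, so it can be solved directly with `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 50, 40, 5                      # low-rank setting: r << min(n, m)
A = rng.standard_normal((n, r))          # "input" matrix A
M = rng.standard_normal((n, m))          # target matrix M

# argmin_B ||A B - M||_F^2 decouples over the columns of M,
# so one least-squares solve against all columns of M suffices.
B, *_ = np.linalg.lstsq(A, M, rcond=None)

print(B.shape)                           # (r, m)
print(np.linalg.norm(A @ B - M, 'fro'))  # residual ||AB - M||_F
```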

When $m=n$ and $\boldsymbol{M}$ is taken as the identity matrix $\boldsymbol{I}_n$, we obtain a result dependent solely on $\boldsymbol{A}$, denoted as:

(3) \[ \boldsymbol{A}^{\dagger} = \mathop{\text{argmin}}_\boldsymbol{B} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 \]

This behaves similarly to an inverse of $\boldsymbol{A}$, hence it is called the "(right) pseudo-inverse" of $\boldsymbol{A}$. Similarly, if $\boldsymbol{B}$ is given, we can switch the optimization variable to $\boldsymbol{A}$, obtaining the "(left) pseudo-inverse" of $\boldsymbol{B}$:

(4) \[ \boldsymbol{B}^{\dagger} = \mathop{\text{argmin}}_\boldsymbol{A} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 \]

Before proceeding further, let us supplement some background on the Frobenius norm. Vector norms are presumably familiar to most readers, the classic example being the $p$-norm: for $\boldsymbol{x}=(x_1,x_2,\cdots,x_m)$, its $p$-norm is defined as:

(5) \[ \Vert \boldsymbol{x}\Vert_p = \sqrt[\uproot{10}p]{\sum_{i=1}^m |x_i|^p} \]

Among $p$-norms, the most common is $p=2$, which is the frequently mentioned vector length, also known as the "Euclidean norm." If the norm subscript is omitted and written simply as $\Vert \boldsymbol{x}\Vert$, it almost always defaults to $p=2$.
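
For instance (a small NumPy check on an arbitrary vector), `np.linalg.norm` computes $p$-norms directly through its `ord` argument:

```python
import numpy as np

x = np.array([3.0, -4.0, 12.0])
print(np.linalg.norm(x, ord=1))   # 19.0  (sum of absolute values)
print(np.linalg.norm(x, ord=2))   # 13.0  (Euclidean length)
print(np.linalg.norm(x))          # 13.0  (the default is p = 2)
```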

Matrix norms are slightly more complex: there are at least two distinct norms in common use. One is the Frobenius norm already mentioned earlier, which is computed by flattening the matrix into a vector:

(6) \[ \Vert \boldsymbol{M}\Vert_F = \Vert \text{vec}(\boldsymbol{M})\Vert_2 = \sqrt{\sum_{i=1}^n\sum_{j=1}^m M_{i,j}^2} \]

Other matrix norms will be introduced when encountered. Due to the diversity of matrix norms, the subscript ${}_F$ in $\Vert \boldsymbol{M}\Vert_F$ generally cannot be omitted to avoid confusion. The Frobenius norm is derived by treating a matrix as a vector and applying the vector norm definition. This inspires us to try extending more vector operations to matrices, such as the inner product:

(7) \[ \langle \boldsymbol{P}, \boldsymbol{Q} \rangle_F = \langle \text{vec}(\boldsymbol{P}), \text{vec}(\boldsymbol{Q}) \rangle = \sum_{i=1}^n\sum_{j=1}^m P_{i,j} Q_{i,j} \]

This is called the Frobenius inner product of matrices $\boldsymbol{P},\boldsymbol{Q}$, where $\boldsymbol{P},\boldsymbol{Q}\in\mathbb{R}^{n\times m}$. It can be expressed using the matrix trace operation:

(8) \[ \langle \boldsymbol{P}, \boldsymbol{Q} \rangle_F = \text{Tr}(\boldsymbol{P}^{\top} \boldsymbol{Q}) \]

This can be directly proved using the definitions of matrix multiplication and trace (the reader may attempt this proof). When $\boldsymbol{P},\boldsymbol{Q}$ arise from products of multiple matrices, converting to equivalent trace operations often aids simplification. For example, using it we can prove that orthogonal transformations preserve the Frobenius norm: assuming $\boldsymbol{U}$ is an orthogonal matrix, using $\Vert \boldsymbol{M}\Vert_F^2 = \langle \boldsymbol{M}, \boldsymbol{M} \rangle_F$ and the relationship between Frobenius inner product and trace, we obtain:

(9) \[ \Vert \boldsymbol{U}\boldsymbol{M}\Vert_F^2 = \text{Tr}((\boldsymbol{U}\boldsymbol{M})^{\top} \boldsymbol{U}\boldsymbol{M})= \text{Tr}(\boldsymbol{M}^{\top} \boldsymbol{U}^{\top}\boldsymbol{U}\boldsymbol{M})=\text{Tr}(\boldsymbol{M}^{\top} \boldsymbol{M}) = \Vert \boldsymbol{M}\Vert_F^2 \]
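
All of these identities are easy to confirm numerically. Below is a small sketch with random matrices, where the orthogonal matrix is taken from a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 4
P = rng.standard_normal((n, m))
Q = rng.standard_normal((n, m))
M = rng.standard_normal((n, m))

# Frobenius norm = 2-norm of the flattened matrix, Eq. (6)
assert np.isclose(np.linalg.norm(M, 'fro'), np.linalg.norm(M.ravel()))

# Frobenius inner product = trace of P^T Q, Eq. (8)
assert np.isclose(np.sum(P * Q), np.trace(P.T @ Q))

# Orthogonal transformations preserve the Frobenius norm, Eq. (9)
U, _ = np.linalg.qr(rng.standard_normal((n, n)))   # U is orthogonal
assert np.isclose(np.linalg.norm(U @ M, 'fro'), np.linalg.norm(M, 'fro'))
```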

Matrix Derivatives#

Matrix Calculus Approach

Returning to our main topic: for an optimization objective, the ideal outcome is naturally an analytical solution obtained by differentiation, and Equation (1) achieves precisely that! This conclusion can be roughly "eyeballed": $\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}$ is a linear function of $\boldsymbol{A}$ (or of $\boldsymbol{B}$), so $\Vert \boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}\Vert_F^2$ is a quadratic function of $\boldsymbol{A}$ (or of $\boldsymbol{B}$), and quadratic functions have analytical minima.

To find the derivative of $\mathcal{L}=\Vert \boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}\Vert_F^2$ with respect to $\boldsymbol{B}$, we first need the derivative of $\mathcal{L}$ with respect to $\boldsymbol{E}=\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}$, then the derivative of $\boldsymbol{E}$ with respect to $\boldsymbol{B}$, and finally combine them via the chain rule:

(10) \[ \frac{\partial \mathcal{L}}{\partial B_{i,j}} = \sum_{k,l}\frac{\partial \mathcal{L}}{\partial E_{k,l}}\frac{\partial E_{k,l}}{\partial B_{i,j}} \]

By definition $\mathcal{L}=\Vert \boldsymbol{E}\Vert_F^2 = \sum_{i,j} E_{i,j}^2$, clearly among the many squared terms in the summation, only when $(i,j)=(k,l)$ is the derivative with respect to $E_{k,l}$ non-zero. Therefore, $\mathcal{L}$'s derivative with respect to $E_{k,l}$ is simply the derivative of $E_{k,l}^2$ with respect to $E_{k,l}$, i.e., $\frac{\partial \mathcal{L}}{\partial E_{k,l}}=2E_{k,l}$. Next, from the definition of matrix multiplication:

(11) \[ E_{k,l} = \left(\sum_{\alpha} A_{k,\alpha}B_{\alpha,l}\right) - M_{k,l} \]

Similarly, only when $(\alpha,l)=(i,j)$ does the above derivative with respect to $B_{i,j}$ produce a non-zero result $A_{k,i}$. Thus we can write $\frac{\partial E_{k,l}}{\partial B_{i,j}}=A_{k,i}\delta_{l,j}$, where $\delta_{l,j}$ is the Kronecker delta, indicating the condition $l=j$. Combining results, we obtain:

(12) \[ \frac{\partial \mathcal{L}}{\partial B_{i,j}} = 2\sum_{k,l}E_{k,l}A_{k,i}\delta_{l,j} = 2\sum_k E_{k,j}A_{k,i} \]

If we adopt the convention that the gradient of a scalar with respect to a matrix has the same shape as the matrix itself, we can write:

(13) \[ \frac{\partial \mathcal{L}}{\partial \boldsymbol{B}} = 2\boldsymbol{A}^{\top}(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}) \]

Although the derivation requires some careful bookkeeping, the result is quite intuitive: $\frac{\partial \mathcal{L}}{\partial \boldsymbol{B}}$ should be the product of $2(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M})$ and $\boldsymbol{A}$ (by analogy with scalar differentiation). Since we have established that $\frac{\partial \mathcal{L}}{\partial \boldsymbol{B}}$ has the same shape as $\boldsymbol{B}$ (i.e., $r\times m$), we need to combine $2(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M})\in\mathbb{R}^{n\times m}$ and $\boldsymbol{A}\in\mathbb{R}^{n\times r}$ into an $r\times m$ result, and the right-hand side of Equation (13) is the only shape-compatible way to do so. By the same principle, we can immediately write:

(14) \[ \frac{\partial \mathcal{L}}{\partial \boldsymbol{A}} = 2(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M})\boldsymbol{B}^{\top} \]
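
Gradients derived by hand are easy to get wrong, so here is a quick check of (13) and (14) against central finite differences (a sketch with small random matrices; the step size $10^{-6}$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r = 7, 5, 3
A = rng.standard_normal((n, r))
B = rng.standard_normal((r, m))
M = rng.standard_normal((n, m))

def loss(A, B):
    return np.linalg.norm(A @ B - M, 'fro') ** 2

# Analytic gradients, Eqs. (13) and (14)
grad_B = 2 * A.T @ (A @ B - M)
grad_A = 2 * (A @ B - M) @ B.T

# Central finite differences, entry by entry
def num_grad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

assert np.allclose(grad_B, num_grad(lambda B_: loss(A, B_), B))
assert np.allclose(grad_A, num_grad(lambda A_: loss(A_, B), A))
```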

Basic Results#

Now that we have obtained $\frac{\partial \mathcal{L}}{\partial \boldsymbol{B}}$ and $\frac{\partial \mathcal{L}}{\partial \boldsymbol{A}}$, setting them to zero yields the respective optimal solutions:

(15) \[ \begin{aligned} 2\boldsymbol{A}^{\top}(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M})&=0 \quad\Rightarrow\quad (\boldsymbol{A}^{\top} \boldsymbol{A})^{-1}\boldsymbol{A}^{\top}\boldsymbol{M} = \mathop{\text{argmin}}_\boldsymbol{B} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2 \\ 2(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M})\boldsymbol{B}^{\top}&=0 \quad\Rightarrow\quad \boldsymbol{M}\boldsymbol{B}^{\top}(\boldsymbol{B} \boldsymbol{B}^{\top})^{-1} = \mathop{\text{argmin}}_\boldsymbol{A} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2 \end{aligned} \]

Substituting $\boldsymbol{M}=\boldsymbol{I}_n$, we obtain:

(16) \[ \begin{aligned} \boldsymbol{A}^{\dagger} &= (\boldsymbol{A}^{\top} \boldsymbol{A})^{-1}\boldsymbol{A}^{\top} \\ \boldsymbol{B}^{\dagger} &= \boldsymbol{B}^{\top}(\boldsymbol{B} \boldsymbol{B}^{\top})^{-1} \end{aligned} \]

If $\boldsymbol{A}$ or $\boldsymbol{B}$ is an invertible square matrix, it is straightforward to prove that the pseudo-inverse equals the ordinary inverse, i.e., $\boldsymbol{A}^{\dagger}=\boldsymbol{A}^{-1},\boldsymbol{B}^{\dagger}=\boldsymbol{B}^{-1}$. Additionally, from the above equations we can verify:

Key Properties of Pseudo-Inverses

1. $(\boldsymbol{A}^{\dagger})^{\dagger}=\boldsymbol{A},(\boldsymbol{B}^{\dagger})^{\dagger}=\boldsymbol{B}$, meaning the pseudo-inverse of a pseudo-inverse equals the original matrix. This indicates that while serving as an approximate inverse, the pseudo-inverse preserves the original matrix's information.

2. $\boldsymbol{A}\boldsymbol{A}^{\dagger}\boldsymbol{A}=\boldsymbol{A},\boldsymbol{B}^{\dagger}\boldsymbol{B}\boldsymbol{B}^{\dagger}=\boldsymbol{B}^{\dagger}$, meaning $\boldsymbol{A}\boldsymbol{A}^{\dagger},\boldsymbol{B}^{\dagger}\boldsymbol{B}$, although not identity matrices $\boldsymbol{I}$, act as identity operators on $\boldsymbol{A},\boldsymbol{B}^{\dagger}$ respectively.

Incidentally, the concept of matrix pseudo-inverse is actually quite broad, encompassing many different forms. What we have introduced here is the most common "Moore–Penrose inverse." Besides this, there exist "Drazin inverse," "Bott–Duffin inverse," etc., but as the author is not familiar with these, we will not elaborate further. Readers may refer to the Wikipedia entry on "Generalized Inverse".
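
Before moving on, here is a quick numerical sanity check of Equation (16) and the two properties above. This is a sketch that assumes a tall matrix $\boldsymbol{A}$ with full column rank and a wide matrix $\boldsymbol{B}$ with full row rank (so the inverses in (16) exist), and uses `np.linalg.pinv` as the reference:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 8, 3
A = rng.standard_normal((n, r))          # tall; full column rank with probability 1
B = rng.standard_normal((r, n))          # wide; full row rank with probability 1

A_dag = np.linalg.inv(A.T @ A) @ A.T     # Eq. (16)
B_dag = B.T @ np.linalg.inv(B @ B.T)     # Eq. (16)

assert np.allclose(A_dag, np.linalg.pinv(A))
assert np.allclose(B_dag, np.linalg.pinv(B))

# Property 1: the pseudo-inverse of the pseudo-inverse is the original matrix
assert np.allclose(np.linalg.pinv(A_dag), A)
assert np.allclose(np.linalg.pinv(B_dag), B)

# Property 2: A A^+ A = A and B^+ B B^+ = B^+
assert np.allclose(A @ A_dag @ A, A)
assert np.allclose(B_dag @ B @ B_dag, B_dag)
```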

General Form#

However, the story is not yet complete. Equations (16) hold under the crucial premise that the corresponding $\boldsymbol{A}^{\top} \boldsymbol{A}$ or $\boldsymbol{B} \boldsymbol{B}^{\top}$ is invertible. What if they are non-invertible?

Let's take $\boldsymbol{A}^{\dagger}$ as an example. Suppose $\boldsymbol{A}^{\top} \boldsymbol{A}$ is non-invertible. This implies that $\boldsymbol{A}$ has rank less than $r$. We can only find $s < r$ column vectors forming a maximal linearly independent set, constituting matrix $\boldsymbol{A}_s\in\mathbb{R}^{n\times s}$. Then $\boldsymbol{A}$ can be expressed as $\boldsymbol{A}_s \boldsymbol{P}$, where $\boldsymbol{P}\in\mathbb{R}^{s\times r}$ is the matrix mapping $\boldsymbol{A}_s$ to $\boldsymbol{A}$. At this point:

(17) \[ \mathop{\text{argmin}}_\boldsymbol{B} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 = \mathop{\text{argmin}}_\boldsymbol{B} \Vert \boldsymbol{A}_s \boldsymbol{P}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 \]

If the optimal solution for $\boldsymbol{B}$ is still denoted as $\boldsymbol{A}^{\dagger}$, we can only determine:

(18) \[ \boldsymbol{P}\boldsymbol{A}^{\dagger} = \boldsymbol{A}_s^{\dagger} = (\boldsymbol{A}_s^{\top} \boldsymbol{A}_s)^{-1}\boldsymbol{A}_s^{\top} \]

Since we assumed $\boldsymbol{A}_s$ is a maximal linearly independent set, $\boldsymbol{A}_s^{\top} \boldsymbol{A}_s$ is necessarily invertible, so the above is well-defined. However, going from $\boldsymbol{A}^{\dagger}$ to $\boldsymbol{P}\boldsymbol{A}^{\dagger}$ is a dimension-reduction process, meaning multiple $\boldsymbol{A}^{\dagger}$ satisfy $\boldsymbol{P}\boldsymbol{A}^{\dagger} = \boldsymbol{A}_s^{\dagger}$. That is, the optimal solution of objective (3) is not unique when $\boldsymbol{A}^{\top} \boldsymbol{A}$ is non-invertible; we cannot determine a unique pseudo-inverse $\boldsymbol{A}^{\dagger}$ solely from objective (3).

Regularization Approach

One possible approach is to supplement conditions like $(\boldsymbol{A}^{\dagger})^{\dagger}=\boldsymbol{A}$ or $\boldsymbol{A}^{\dagger}\boldsymbol{A}\boldsymbol{A}^{\dagger}=\boldsymbol{A}^{\dagger}$, which combined with $\boldsymbol{P}\boldsymbol{A}^{\dagger} = \boldsymbol{A}_s^{\dagger}$ would uniquely determine $\boldsymbol{A}^{\dagger}$. However, this feels too much like patching. In practice, we can handle this problem more elegantly and uniformly using a clever technique. The issue arises because when $\boldsymbol{A}^{\top} \boldsymbol{A}$ is non-invertible, the objective function (3) has multiple optimal solutions. We can add a regularization term to make it unique, then take the limit as the regularization weight approaches zero:

(19) \[ \boldsymbol{A}^{\dagger} = \lim_{\epsilon\to 0}\,\mathop{\text{argmin}}_\boldsymbol{B} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 + \epsilon\Vert \boldsymbol{B}\Vert_F^2 \]

Here $\epsilon > 0$, and $\epsilon\to 0$ means approaching zero from the positive side. From this we can solve:

(20) \[ \boldsymbol{A}^{\dagger} = \lim_{\epsilon\to 0}\,(\boldsymbol{A}^{\top} \boldsymbol{A} + \epsilon \boldsymbol{I}_r)^{-1}\boldsymbol{A}^{\top} \]

When $\epsilon > 0$, $\boldsymbol{A}^{\top} \boldsymbol{A} + \epsilon \boldsymbol{I}_r$ is necessarily invertible (the reader may prove this). Therefore the above expression is well-defined. Since as $\epsilon\to 0$ the regularization term becomes negligible, this limit must exist. Note that we refer to the existence of the overall limit: when $\boldsymbol{A}^{\top} \boldsymbol{A}$ is non-invertible, the limit $\lim\limits_{\epsilon\to 0}\,(\boldsymbol{A}^{\top} \boldsymbol{A} + \epsilon \boldsymbol{I}_r)^{-1}$ does not exist (the result would involve infinities). Only after multiplying by $\boldsymbol{A}^{\top}$ and then taking the overall limit do we obtain a well-behaved result.

What advantages does Equation (20) offer as a general extension of the pseudo-inverse? First, we already have expression (16) for $\boldsymbol{A}^{\dagger}$ when $\boldsymbol{A}^{\top} \boldsymbol{A}$ is invertible, and Equation (20), as its generalization, keeps essentially the same intuitive form. Second, this formal consistency means that the properties of $\boldsymbol{A}^{\dagger}$ established for invertible $\boldsymbol{A}^{\top} \boldsymbol{A}$, such as $(\boldsymbol{A}^{\dagger})^{\dagger}=\boldsymbol{A}$, carry over, allowing us to discuss $\boldsymbol{A}^{\dagger}$ almost without worrying about the invertibility of $\boldsymbol{A}^{\top} \boldsymbol{A}$.
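
To see the limit (20) at work numerically, the sketch below builds a deliberately rank-deficient $\boldsymbol{A}$ (so $\boldsymbol{A}^{\top}\boldsymbol{A}$ is singular) and compares the regularized expression against `np.linalg.pinv` as $\epsilon$ shrinks; the specific values of $\epsilon$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, s = 10, 6, 3
# Build A = A_s P with rank s < r, so A^T A is singular
A = rng.standard_normal((n, s)) @ rng.standard_normal((s, r))

def reg_pinv(eps):
    # (A^T A + eps I)^-1 A^T, computed via a linear solve rather than an explicit inverse
    return np.linalg.solve(A.T @ A + eps * np.eye(r), A.T)

for eps in [1e-2, 1e-4, 1e-6, 1e-8]:
    err = np.linalg.norm(reg_pinv(eps) - np.linalg.pinv(A), 'fro')
    print(f"eps = {eps:.0e}   distance to pinv(A) = {err:.2e}")
# The distance decreases as eps shrinks (until floating-point effects eventually dominate),
# even though A^T A itself is not invertible.
```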

Numerical Computation#

Practical Computation Methods

Of course, Equation (20) is currently merely a formal definition. If we directly use it for numerical computation, we would need to choose a sufficiently small $\epsilon$ and compute $(\boldsymbol{A}^{\top} \boldsymbol{A} + \epsilon \boldsymbol{I}_r)^{-1}$, inevitably facing severe numerical instability. To obtain a stable computation method, we utilize the fact that real symmetric matrices can always be orthogonally diagonalized (the Spectral Theorem), decomposing $\boldsymbol{A}^{\top} \boldsymbol{A}$ as:

(21) \[ \boldsymbol{A}^{\top} \boldsymbol{A} = \boldsymbol{U}\boldsymbol{\Lambda} \boldsymbol{U}^{\top} \]

where $\boldsymbol{U}$ is an orthogonal matrix, $\boldsymbol{\Lambda}=\text{diag}(\lambda_1,\lambda_2,\cdots,\lambda_r)$ is a diagonal matrix of eigenvalues. Due to the positive semidefiniteness of $\boldsymbol{A}^{\top} \boldsymbol{A}$, its eigenvalues are always non-negative. Using this decomposition, we have:

(22) \[ \begin{aligned} (\boldsymbol{A}^{\top} \boldsymbol{A} + \epsilon \boldsymbol{I}_r)^{-1}\boldsymbol{A}^{\top} &= (\boldsymbol{U}\boldsymbol{\Lambda} \boldsymbol{U}^{\top} + \epsilon \boldsymbol{I}_r)^{-1} \boldsymbol{A}^{\top} \\ &= [\boldsymbol{U}(\boldsymbol{\Lambda} + \epsilon \boldsymbol{I}_r) \boldsymbol{U}^{\top}]^{-1}\boldsymbol{A}^{\top} \\ &= \boldsymbol{U}(\boldsymbol{\Lambda} + \epsilon \boldsymbol{I}_r)^{-1} \boldsymbol{U}^{\top}\boldsymbol{A}^{\top} \end{aligned} \]

For $(\boldsymbol{\Lambda} + \epsilon \boldsymbol{I}_r)^{-1}$ we have:

(23) \[ (\boldsymbol{\Lambda} + \epsilon \boldsymbol{I}_r)^{-1} = \text{diag}\Big((\lambda_1 + \epsilon)^{-1},(\lambda_2 + \epsilon)^{-1},\cdots,(\lambda_r + \epsilon)^{-1}\Big) \]

If $\lambda_i > 0$, then $\lim\limits_{\epsilon\to 0}\,(\lambda_i + \epsilon)^{-1}=\lambda_i^{-1}$, a finite result that poses no difficulty. The problem arises when $\lambda_i = 0$, where $(\lambda_i + \epsilon)^{-1}=\epsilon^{-1}\to\infty$ as $\epsilon\to 0$. However, we know that as $\epsilon\to 0$ the regularization effect vanishes, so the limit (20) cannot produce infinite values. Therefore, if some $\lambda_i=0$, the multiplication by $\boldsymbol{U}^{\top}\boldsymbol{A}^{\top}$ on the right must cancel the infinity brought by $\epsilon^{-1}$, and the only way to cancel such an infinity is multiplication by $0$, i.e., $\epsilon^{-1}\times 0 = 0$.

In other words, if $\lambda_i=0$, then the factor multiplied by $(\lambda_i+\epsilon)^{-1}$ from $\boldsymbol{U}^{\top}\boldsymbol{A}^{\top}$ must be $0$. Since "0 times any number yields 0," the actual value of $(\lambda_i+\epsilon)^{-1}$ when $\lambda_i=0$ becomes unimportant; we can simply set it to 0. Thus, we obtain a straightforward general method for computing $\boldsymbol{A}^{\dagger}$:

(24) \[ \boldsymbol{A}^{\dagger} = \boldsymbol{U}\boldsymbol{\Lambda}^{\dagger}\boldsymbol{U}^{\top}\boldsymbol{A}^{\top}, \quad \boldsymbol{A}^{\top} \boldsymbol{A} = \boldsymbol{U}\boldsymbol{\Lambda} \boldsymbol{U}^{\top} \]

where $\boldsymbol{\Lambda}^{\dagger}$ means: if a diagonal element is zero, leave it unchanged; if non-zero, take its reciprocal.

Some readers might question: since "0 times any number yields 0," why keep zero $\lambda_i$ unchanged? Could we arbitrarily assign another value? Actually, choosing any other value wouldn't affect the result. However, since we use the notation $\boldsymbol{\Lambda}^{\dagger}$, we should maintain consistency with Equation (20), i.e., it should match the direct computation result when substituting diagonal matrix $\boldsymbol{\Lambda}$ into Equation (20):

(25) \[ \boldsymbol{\Lambda}^{\dagger} = \lim_{\epsilon\to 0}\,(\boldsymbol{\Lambda}^{\top} \boldsymbol{\Lambda} + \epsilon \boldsymbol{I}_r)^{-1}\boldsymbol{\Lambda}^{\top} = \text{diag}(\kappa_1,\kappa_2,\cdots,\kappa_r),\quad \kappa_i = \begin{cases}\lambda_i^{-1}, & \lambda_i\neq 0 \\ 0, & \lambda_i=0 \end{cases} \]
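
Putting (24) and (25) together gives a compact routine. The sketch below is illustrative rather than production code: the eigenvalue cutoff `tol` is a hypothetical tolerance chosen for the example, and in practice `np.linalg.pinv` (which works from the SVD of $\boldsymbol{A}$) is the standard tool:

```python
import numpy as np

def pinv_via_spectral(A, tol=1e-10):
    """Pseudo-inverse via Eq. (24): A^+ = U Lambda^+ U^T A^T, with A^T A = U Lambda U^T."""
    lam, U = np.linalg.eigh(A.T @ A)      # spectral decomposition of the symmetric PSD matrix A^T A
    kappa = np.zeros_like(lam)
    nonzero = lam > tol                   # treat eigenvalues below the tolerance as exactly zero
    kappa[nonzero] = 1.0 / lam[nonzero]   # Eq. (25): invert non-zero eigenvalues, keep zeros at zero
    return U @ np.diag(kappa) @ U.T @ A.T

rng = np.random.default_rng(5)
A_full = rng.standard_normal((10, 4))                                 # full column rank
A_defic = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 4))  # rank 2 < 4, so A^T A is singular

for A in (A_full, A_defic):
    assert np.allclose(pinv_via_spectral(A), np.linalg.pinv(A))
```

Note that in floating-point arithmetic the "zero" eigenvalues of $\boldsymbol{A}^{\top}\boldsymbol{A}$ come out as tiny non-zero numbers, which is why the sketch uses a small threshold instead of an exact comparison with zero.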

Article Summary#

In this article, we introduced the pseudo-inverse from a low-rank approximation perspective. This concept extends the notion of matrix inverses to non-square or non-invertible matrices, enabling more effective analysis and solution of general matrix equations. The optimization-based derivation provides both theoretical insight and practical computational methods, establishing important foundations for subsequent discussions on low-rank approximation techniques.

Citation Information

Original Article: Su Jianlin. The Path to Low-Rank Approximation (Part 1): Pseudo-Inverse. Scientific Spaces.

How to cite this translation:

Su, J. The Path to Low-Rank Approximation (Part 1): Pseudo-Inverse [Translated by ScalingOpt Team]. Scientific Spaces.

BibTeX:

@article{su2025lowrank_pseudoinverse,
  title   = {The Path to Low-Rank Approximation (Part 1): Pseudo-Inverse},
  author  = {Su, Jianlin},
  journal = {Scientific Spaces},
  year    = {2025},
  url     = {https://kexue.fm/archives/10565},
  note    = {Translated by ScalingOpt Team}
}