Studying over the past couple of days has gradually clarified my understanding of linear algebra. In this post I summarize the key new material.
When facing matrices and singular values, one should establish the following understanding:
✅ A matrix is a spatial operator.
✅ Singular value decomposition helps you break down the essence of a matrix: rotation → stretching → rotation.
✅ The ordering of the singular values by size tells you in which directions the matrix truly acts strongly and in which directions it has almost no effect.
1. Orthogonal Matrix#
The core definition of an orthogonal matrix
A real $n \times n$ matrix $Q$ is called an orthogonal matrix if it satisfies
$$Q^\top Q = Q Q^\top = I_n.$$
Here, $Q^\top$ is the transpose of $Q$, and $I_n$ is the $n$-dimensional identity matrix.
The inverse is the transpose: $Q^{-1} = Q^\top$.
This simplifies calculations and ensures numerical stability.
An orthogonal matrix is a real matrix that "preserves the inner product"—it rotates or flips the coordinate system but never stretches or distorts it.
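To see these properties concretely, here is a minimal PyTorch sketch using an assumed 2-D rotation matrix (my own example, not from the notes above): it checks $Q^\top Q = I$, $Q^{-1} = Q^\top$, and length preservation.

```python
import math
import torch

# Assumed example: a 2-D rotation matrix is orthogonal.
theta = math.pi / 4
Q = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])

print(torch.allclose(Q.T @ Q, torch.eye(2), atol=1e-6))      # True: Q^T Q = I
print(torch.allclose(Q.T, torch.linalg.inv(Q), atol=1e-6))   # True: Q^{-1} = Q^T

# Rotation preserves length: ||Qv|| == ||v||
v = torch.tensor([3.0, 4.0])
print((Q @ v).norm(), v.norm())                              # both 5.0
```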
2. Angle Brackets#
Here $\langle \cdot,\cdot \rangle$ is the symbol for the "inner product." In the most common case, the real vector space $\mathbb{R}^n$, it is equivalent to the dot product we are familiar with:
$$\langle u, v \rangle = u^\top v = \sum_{i=1}^{n} u_i v_i.$$
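A quick sketch of this equivalence in PyTorch, with two assumed example vectors:

```python
import torch

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([4.0, 5.0, 6.0])

print(torch.dot(u, v))   # 32.0: the inner product <u, v>
print((u * v).sum())     # 32.0: same thing, the sum of elementwise products
```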
3. Matrix Position Exchange#
- Eliminate a factor on the left
  - Suppose $A$ sits on the left of $X$, i.e. $AX = B$ with $A$ invertible.
  - Multiply both sides by its inverse from the left: $A^{-1}AX = A^{-1}B \;\Rightarrow\; X = A^{-1}B$.
  - Note:
    - You must multiply from the left on both sides consistently;
    - Do not multiply from the right (that would disrupt the order of multiplication).
- Eliminate a factor on the right
  - Suppose $B$ sits on the right of $X$, i.e. $XB = C$ with $B$ invertible.
  - Multiply both sides by its inverse from the right: $XBB^{-1} = CB^{-1} \;\Rightarrow\; X = CB^{-1}$.
  - If $B$ is an orthogonal matrix, $B^{-1} = B^\top$, thus $X = CB^\top$.
Why can't the order be reversed?
- Matrix multiplication is not commutative; once you multiply on the wrong side, the new factor lands in a different position of the product and no longer cancels the one you wanted to remove.
- The same operation must be applied to the same side of both sides of the equation for the equality to hold.
- This is essentially the same as the order of function composition or coordinate transformations: which transformation acts first and which acts second must be written in the corresponding positions of the product and cannot be arbitrarily exchanged.
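A small PyTorch sketch of the left/right rule, with assumed matrices $A$ and $B$: multiplying by $A^{-1}$ on the correct (left) side recovers $X$ from $AX = B$, while multiplying on the wrong side does not.

```python
import torch

A = torch.tensor([[2.0, 1.0],
                  [1.0, 3.0]])
B = torch.tensor([[5.0, 4.0],
                  [6.0, 2.0]])

X = torch.linalg.inv(A) @ B                        # correct: X = A^{-1} B
print(torch.allclose(A @ X, B, atol=1e-5))         # True

X_wrong = B @ torch.linalg.inv(A)                  # multiplying on the wrong (right) side
print(torch.allclose(A @ X_wrong, B, atol=1e-5))   # False in general: A and B do not commute
```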
4. Similar Diagonalizable Matrix#
A similar diagonalizable matrix (commonly referred to as a "diagonalizable matrix") means:
There exists an invertible matrix $P$ such that
$$P^{-1} A P = \Lambda,$$
where $\Lambda$ is a diagonal matrix.
At this point, we say can be diagonalized through a similarity transformation, or simply is diagonalizable.
The "mechanical process" of diagonalization
- Find the eigenvalues: solve $\det(A - \lambda I) = 0$.
- Find the eigenvectors: for each $\lambda_i$, solve $(A - \lambda_i I)v = 0$.
- Assemble $P$: form the matrix $P = [\,v_1 \; v_2 \; \cdots \; v_n\,]$ by stacking the linearly independent eigenvectors as columns.
- Obtain $\Lambda$: fill the corresponding eigenvalues into the diagonal: $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.

Then we have $A = P \Lambda P^{-1}$.
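A minimal numerical sketch of this process, using an assumed $2 \times 2$ matrix and `torch.linalg.eig`:

```python
import torch

# Assumed diagonalizable matrix with eigenvalues 5 and 2.
A = torch.tensor([[4.0, 1.0],
                  [2.0, 3.0]])

eigvals, P = torch.linalg.eig(A)    # eigenvalues and eigenvectors (complex dtype)
Lam = torch.diag(eigvals)           # Lambda: eigenvalues on the diagonal

A_rebuilt = P @ Lam @ torch.linalg.inv(P)
print(torch.allclose(A_rebuilt.real, A, atol=1e-5))   # True: A = P Lambda P^{-1}
```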
5. Singular Value Decomposition#
| Symbol | Meaning |
|---|---|
| $A$ | Given real symmetric matrix ($A = A^\top$) |
| $Q$ | Orthogonal matrix: $Q^\top Q = I$, with orthonormal (unit-length, mutually orthogonal) column vectors |
| $\Lambda$ | Diagonal matrix: $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ |

The expression $A = Q \Lambda Q^\top$ is called orthogonal similarity diagonalization; geometrically, it means "rotate (or mirror) the coordinate system → $A$ only retains independent stretching."
- Why can "real symmetric matrices always be orthogonally diagonalized"?
Spectral Theorem:
For any real symmetric matrix $A$, there exists an orthogonal matrix $Q$ such that $Q^\top A Q = \Lambda$ is diagonal, and the diagonal elements are the eigenvalues of $A$.
- Real eigenvalues: Symmetry guarantees all eigenvalues are real.
- Orthogonal eigenvectors: If $\lambda_i \neq \lambda_j$, the corresponding eigenvectors must be orthogonal.
- Repeated roots can also take orthogonal bases: The same eigenvalue may correspond to multiple vectors; in this case, perform Gram–Schmidt in the subspace they span.
- Step-by-step textual analysis
| Step | Explanation |
|---|---|
| 1. Find all eigenvalues and eigenvectors of $A$ | Solve $\det(A - \lambda I) = 0$ to obtain all $\lambda_i$; for each $\lambda_i$ solve $(A - \lambda_i I)v = 0$ to find the eigenvectors. |
| 2. Arrange the eigenvalues in a chosen order on the diagonal to obtain the diagonal matrix $\Lambda$ | For example, arrange them in ascending order as $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. The order is not important as long as it is consistent with the order of the column vectors later. |
| 3. Eigenvectors corresponding to different eigenvalues are orthogonal; for repeated eigenvalues, use Gram–Schmidt to orthogonalize and normalize | If $\lambda_i \neq \lambda_j$, the corresponding vectors are automatically orthogonal, so nothing needs to be done. If a $\lambda$ is repeated (geometric multiplicity > 1), first take a set of linearly independent eigenvectors, then perform Gram–Schmidt in that subspace to make them pairwise orthogonal and normalize them to unit length. |
| 4. Arrange the processed eigenvectors as columns, in the same order as the eigenvalues on the diagonal, to obtain the orthogonal matrix $Q$ | At this point $Q^\top Q = I$, and we have $A = Q \Lambda Q^\top$. |
- A specific small example
Let $A$ be a small real symmetric matrix.
① Find the eigenvalues.
② Find the eigenvectors for each $\lambda_i$.
③ Normalize them.
④ Assemble $Q$ and verify $A = Q \Lambda Q^\top$.
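Here is a sketch of the same four steps with an assumed $2 \times 2$ symmetric matrix (not the matrix from the original example), using `torch.linalg.eigh`, which handles symmetric matrices and returns orthonormal eigenvectors:

```python
import torch

# Assumed 2x2 symmetric matrix with eigenvalues 1 and 3.
A = torch.tensor([[2.0, 1.0],
                  [1.0, 2.0]])

eigvals, Q = torch.linalg.eigh(A)   # eigh: for symmetric/Hermitian matrices
print(eigvals)                      # tensor([1., 3.])
print(torch.allclose(Q.T @ Q, torch.eye(2), atol=1e-6))   # True: Q is orthogonal

Lam = torch.diag(eigvals)
print(torch.allclose(Q @ Lam @ Q.T, A, atol=1e-5))        # True: A = Q Lambda Q^T
```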
6. Determinant (det)#
The determinant is an operation that maps a square matrix $A \in \mathbb{R}^{n \times n}$ to a scalar $\det A$.
This scalar encapsulates the most core geometric and algebraic information of the matrix: volume scaling factor, invertibility, product of eigenvalues, etc.
Formula
| Order | Formula |
|---|---|
| $2 \times 2$ | $\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$ |
| $3 \times 3$ | "Sarrus' rule" or expand along the first row (see the example below) |

Core properties (which any definition must satisfy)
Property | Explanation |
---|---|
Multiplicativity | $\det(AB) = \det(A)\,\det(B)$ |
Invertibility Criterion | $A$ is invertible $\iff \det A \neq 0$ |
Linearity in Rows/Columns | Each row (or column) is linear in its elements |
Alternating | Swapping two rows (or columns) ⇒ determinant changes sign |
Diagonal Product | Upper/lower triangular matrix: $\det A = a_{11} a_{22} \cdots a_{nn}$ |
Eigenvalue Product | $\det A = \lambda_1 \lambda_2 \cdots \lambda_n$ (including multiplicities) |
3×3 Manual Calculation Example
Let $A = (a_{ij})$ be a $3 \times 3$ matrix.
Expand along the first row:
$$\det A = a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31}).$$
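As a numerical check, here is a sketch with an assumed $3 \times 3$ matrix, comparing the library determinant with the first-row expansion:

```python
import torch

# Assumed example matrix.
A = torch.tensor([[1.0, 2.0, 3.0],
                  [0.0, 4.0, 5.0],
                  [1.0, 0.0, 6.0]])

det_lib = torch.linalg.det(A)

# Cofactor expansion along the first row.
a = A
det_manual = (a[0, 0] * (a[1, 1] * a[2, 2] - a[1, 2] * a[2, 1])
              - a[0, 1] * (a[1, 0] * a[2, 2] - a[1, 2] * a[2, 0])
              + a[0, 2] * (a[1, 0] * a[2, 1] - a[1, 1] * a[2, 0]))

print(det_lib, det_manual)   # both 22.0
```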
In summary
"Taking the determinant" means: collapsing an square matrix into a single number through a set of alternating, linear rules, and this number encodes key information such as the matrix's volume scaling, direction, invertibility, and product of eigenvalues.
7. Rank of a Matrix#
What exactly is the "rank" of a matrix?
Equivalent Perspective | Intuitive Explanation |
---|---|
Linear Independence | The number of linearly independent vectors that can be selected from the rows (or columns) is the rank. |
Dimensionality of Space | The dimension of the subspace spanned by the column vectors (column space) = the dimension of the subspace spanned by the row vectors (row space) = the rank. |
Full Rank Minor | The order of the largest non-zero determinant in the matrix = the rank. |
Singular Values | In the SVD $A = U \Sigma V^\top$, the number of non-zero singular values = the rank. |
Linear Independence
Below are three comparative cases using small $3 \times 3$ matrices to make the statement "rank = how many linearly independent column (or row) vectors can be selected" concrete.

| Case | Relationship among the columns $v_1, v_2, v_3$ | Rank |
|------|-------------------------------------------------|------|
| Case 1 | All three columns lie on the same line (e.g. $2v_1 - v_2 = 0$): only 1 independent vector | 1 |
| Case 2 | $v_1, v_2$ are not collinear ⇒ they span a 2-dimensional plane; $v_3$ lies in this plane | 2 |
| Case 3 | No column can be linearly expressed by the other two ⇒ the three columns span the entire $\mathbb{R}^3$ | 3 |
How to determine "independence"?
- Manual calculation: form a matrix with the columns and perform elimination → the number of non-zero rows is the rank.
- Concept: if there exist constants $c_1, c_2, c_3$, not all 0, such that $c_1 v_1 + c_2 v_2 + c_3 v_3 = 0$, the vectors are dependent; otherwise they are independent.
  - Case 1: $2v_1 - v_2 = 0$ → dependent
  - Case 2: only $v_3 = v_1 + v_2$ is a dependency; $v_1, v_2$ themselves are independent
  - Case 3: every non-trivial combination ≠ 0 → all three vectors are independent

In summary: the rank = how much independent information (dimension) this matrix can truly "hold."
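A sketch with three assumed $3 \times 3$ matrices whose columns follow the relations of the three cases above; `torch.linalg.matrix_rank` confirms the ranks 1, 2, 3.

```python
import torch

# Case 1: v2 = 2*v1, v3 = 3*v1   -> all columns on one line   -> rank 1
A1 = torch.tensor([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0]])

# Case 2: v3 = v1 + v2           -> columns span a plane      -> rank 2
A2 = torch.tensor([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0]])

# Case 3: no non-trivial relation -> columns span all of R^3  -> rank 3
A3 = torch.tensor([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

for A in (A1, A2, A3):
    print(torch.linalg.matrix_rank(A).item())   # 1, 2, 3
```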
8. Low-Rank Approximation#
Why does a truncated SVD (low-rank approximation) only require storing $k(m+n) + k$ numbers?
When the original matrix $A \in \mathbb{R}^{m \times n}$ is truncated to rank $k$ and written as
$$A \approx U_k \Sigma_k V_k^\top,$$
| Block | Shape | Number of Scalars to Save | Explanation |
|---|---|---|---|
| $U_k$ | $m \times k$ | $mk$ | Left singular vectors: only take the first $k$ columns |
| $V_k$ | $n \times k$ | $nk$ | Right singular vectors: similarly, only the first $k$ columns |
| $\Sigma_k$ | $k \times k$ diagonal | $k$ | Only keep the $k$ singular values on the diagonal |
Adding these three parts gives $mk + nk + k = k(m+n) + k$.
- $U_k$ and $V_k$: each has $k$ columns, and each column stores a vector of length equal to its number of rows ($m$ or $n$), giving $mk + nk$ scalars.
- $\Sigma_k$: it is a diagonal matrix, so only the $k$ diagonal elements are needed, not $k^2$.

Therefore, replacing the original storage with a rank-$k$ SVD approximation reduces the parameter count from $mn$ to $k(m+n) + k$.
If $k \ll \min(m, n)$, the savings are considerable.
Lower rank = reduced information dimension, low-rank storage = reduced parameter count/memory simultaneously
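A sketch with assumed sizes $m = 1000$, $n = 500$, $k = 20$, counting the stored scalars of a truncated SVD against storing the full matrix:

```python
import torch

m, n, k = 1000, 500, 20
A = torch.randn(m, n)

U, S, Vh = torch.linalg.svd(A, full_matrices=False)   # U: (m, r), S: (r,), Vh: (r, n)
U_k, S_k, Vh_k = U[:, :k], S[:k], Vh[:k, :]           # keep only the top-k pieces

A_k = U_k @ torch.diag(S_k) @ Vh_k                    # rank-k approximation of A

full_params    = m * n              # 500,000 scalars to store A directly
lowrank_params = k * (m + n) + k    # 30,020 scalars to store U_k, V_k, Sigma_k
print(full_params, lowrank_params)
```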
9. Norms#
The "double vertical bars" $|,\cdot,|$ in linear algebra represent norms.
-
For a vector , the most commonly used is the Euclidean norm:
In the figure, is the sum of the squares of each component of the vector . -
For a matrix , if also written as , it commonly refers to the Frobenius norm: . However, the figures here involve vectors.
In contrast, the single vertical bar usually denotes absolute value (scalar) or determinant . Thus, double vertical bars signify the "length" of vectors/matrices, while single vertical bars denote scalar size or determinant—different objects and meanings.
Common Euclidean Distance for Vectors -- 2-Norm (L2 Norm)#
```python
import torch

b = torch.tensor([3.0, 4.0])
print(b.norm())  # Outputs tensor(5.), i.e. sqrt(3^2 + 4^2) = 5.0
```
`.norm()` is a method of PyTorch tensors (`torch.Tensor`).
Common Frobenius Norm for Matrices#
Matrices also have a "length": the most commonly used is the Frobenius norm.

| Name | Notation | Formula (for $A \in \mathbb{R}^{m \times n}$) | Analogy with Vectors |
|---|---|---|---|
| Frobenius Norm | $\Vert A\Vert_F$ | $\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij}^{2}}$ | Similar to the vector 2-norm $\Vert v\Vert = \sqrt{\sum_i v_i^2}$ |
1. Why can it also be written as a "matrix dot product"?
The commonly used inner product on the space of matrices is
$$\langle A, B \rangle = \mathrm{tr}(A^\top B),$$
where $\mathrm{tr}(\cdot)$ is the trace operation (the sum of the diagonal elements).
Taking the inner product of $A$ with itself gives
$$\langle A, A \rangle = \mathrm{tr}(A^\top A) = \sum_{i,j} A_{ij}^2 .$$
Thus:
$$\|A\|_F = \sqrt{\mathrm{tr}(A^\top A)}.$$
This is the matrix version of $\|v\| = \sqrt{v^\top v}$: just replace the vector dot product with the "trace dot product."
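A quick check of this identity with an assumed $2 \times 2$ matrix:

```python
import torch

A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])

fro_direct = A.pow(2).sum().sqrt()                  # sqrt(1 + 4 + 9 + 16) = sqrt(30)
fro_trace  = torch.trace(A.T @ A).sqrt()            # same value via the trace inner product
fro_torch  = torch.linalg.matrix_norm(A, ord='fro')

print(fro_direct, fro_trace, fro_torch)             # all approximately 5.4772
```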
The Frobenius norm is indeed equal to the square root of the sum of the squares of all singular values, that is:
$$\|A\|_F = \sqrt{\sum_i \sigma_i^2}.$$
Here:
- $\|A\|_F$ is the Frobenius norm of the matrix $A$
- $\sigma_i$ are the singular values of $A$
Expanded explanation:
The Frobenius norm is defined as:
$$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}.$$
But singular value decomposition (SVD) tells us:
$$A = U \Sigma V^\top,$$
where $\Sigma$ is a diagonal matrix with the singular values $\sigma_i$ on its main diagonal.
Since multiplying by the orthogonal matrices $U$ and $V$ does not change the Frobenius norm, we can compute directly:
$$\|A\|_F = \|U \Sigma V^\top\|_F = \|\Sigma\|_F.$$
Thus, ultimately:
$$\|A\|_F = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2}.$$
Beware of misconceptions
Note:
✅ Not the square root of a single singular value, nor the maximum singular value
✅ Is the sum of the squares of all singular values, then taking the square root
The spectral norm looks at "the direction that stretches the most," while the Frobenius norm accumulates all energy.
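A sketch verifying both statements on an assumed random matrix: the Frobenius norm equals $\sqrt{\sum_i \sigma_i^2}$, while the spectral norm keeps only the largest $\sigma$.

```python
import torch

A = torch.randn(4, 3)

S = torch.linalg.svdvals(A)                          # singular values of A
fro_from_svd = S.pow(2).sum().sqrt()                 # sqrt of the sum of squared singular values
fro_direct   = torch.linalg.matrix_norm(A, ord='fro')

print(torch.allclose(fro_from_svd, fro_direct, atol=1e-5))   # True
print(S.max(), torch.linalg.matrix_norm(A, ord=2))           # spectral norm = largest sigma only
```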
Spectral Norm of a Matrix#
✅ Definition of Spectral Norm
The spectral norm of a matrix $A$ is defined as:
$$\|A\|_2 = \max_{\|x\| = 1} \|Ax\|.$$
Simply put, it is the maximum value to which matrix stretches a unit vector.
Singular values inherently represent the stretching transformations of the matrix.
It equals the maximum singular value of $A$:
$$\|A\|_2 = \sigma_{\max}(A).$$
From another perspective: the spectral norm ≈ the maximum length to which a unit vector is stretched after being input into the matrix.
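A sketch of this "maximum stretch" view with an assumed diagonal matrix: sampling many unit vectors approaches the exact spectral norm $\sigma_{\max} = 3$.

```python
import torch

A = torch.tensor([[3.0, 0.0],
                  [0.0, 1.0]])

# Sample many random unit vectors and measure how much each is stretched by A.
x = torch.randn(10000, 2)
x = x / x.norm(dim=1, keepdim=True)     # unit vectors
stretch = (x @ A.T).norm(dim=1)         # ||A x|| for each unit vector x

print(stretch.max())                            # approximately 3.0
print(torch.linalg.matrix_norm(A, ord=2))       # 3.0, the exact spectral norm
```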
✅ Its relationship with the Frobenius norm
- Frobenius Norm → looks at the overall energy (sum of squares of all matrix elements)
- Spectral Norm → looks at the maximum stretch in a single direction
In other words:
- The Frobenius norm is like the total "volume" of the matrix
- The spectral norm is like the "most extreme" stretching rate in a single direction
✅ Example: Why is it important?
Imagine a linear layer $y = Wx$ in a neural network:
- If $\|W\|_2$ is very large, even small perturbations in the input will be amplified, making the network prone to overfitting and sensitive to noise.
- If $\|W\|_2$ is moderate, the output stays stable against input perturbations, leading to better generalization.
Thus, modern methods (like spectral normalization) directly constrain the spectral norm of $W$ to a certain range during training.
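A minimal sketch of this idea using PyTorch's built-in spectral-norm parametrization (the layer sizes here are arbitrary assumptions for illustration):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

# Wrap a linear layer so that its weight is reparameterized to have spectral norm ~1.
layer = spectral_norm(nn.Linear(16, 16))

W = layer.weight                                   # weight after the spectral-norm parametrization
print(torch.linalg.matrix_norm(W, ord=2).item())   # approximately 1.0: the largest singular value is constrained
```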
⚠ Straightforward drawbacks
The spectral norm is powerful, but:
- It only focuses on the single maximum direction, ignoring the stretching in other directions;
- It is more expensive to compute than the Frobenius norm (it requires a singular value decomposition rather than a simple sum of squares).
Summary Comparison
| | Euclidean Norm (2-norm, ‖v‖) | Frobenius Norm (‖A‖F) |
|---|---|---|
| Object | Vector $v \in \mathbb{R}^n$ | Matrix $A \in \mathbb{R}^{m \times n}$ |
| Definition | $\sqrt{\sum_i v_i^2}$ | $\sqrt{\sum_{i,j} A_{ij}^2}$ |
| Equivalent Expression | $\sqrt{v^\top v}$ | $\sqrt{\mathrm{tr}(A^\top A)} = \sqrt{\sum_i \sigma_i^2}$ |
| Geometric Meaning | Length of a vector in $n$-dimensional Euclidean space | Length when viewing the matrix elements as one "long vector" |
| Unit/Scale | Same metric as the coordinate axes | Same; does not depend on the arrangement of rows and columns |
| Common Uses | Error measurement, $L_2$ regularization, distance | Weight decay, matrix approximation error, kernel methods |
| Relationship with Spectral Norm | $\Vert v\Vert_2 = \Vert v\Vert$ (only one singular value) | $\Vert A\Vert_2 \le \Vert A\Vert_F$; equal if rank = 1 |
- Same idea, different dimensions
  - The Euclidean norm is the square root of a vector's dot product with itself: $\|v\| = \sqrt{v^\top v}$.
  - The Frobenius norm treats all matrix elements as one long vector and does the same; written in matrix language as $\|A\|_F = \sqrt{\mathrm{tr}(A^\top A)}$. This is "transpose → multiply → take trace."
- When to use which?
Scenario | Recommended Norm | Reason |
---|---|---|
Prediction error, gradient descent | Euclidean (vector residual) | Residuals are naturally column vectors |
Regularization of network weights (Dense / Conv) | Frobenius | Does not care about parameter shape, only overall magnitude |
Comparing matrix approximation quality (SVD, PCA) | Frobenius | Easily corresponds with the sum of squares of singular values |
Stability/Lipschitz bounds | Spectral Norm ($\Vert A\Vert_2$) | Concerned with the amplification rate rather than total energy |
- Intuitive Differences
  - Euclidean: measures the length in a single direction;
  - Frobenius: measures the total energy of all elements, so for matrices no particular row or column is special; all elements are treated equally.
One-sentence memory:
Euclidean Norm: The "ruler" for vectors.
Frobenius Norm: Measuring the overall size of a matrix after "flattening" it with the same ruler.
10. Transpose of Matrix Multiplication#
In matrix algebra, there is a fixed "flip the order" rule for the transpose of a product of two (or more) matrices:
$$(AB)^\top = B^\top A^\top.$$
This means: transpose each matrix, then reverse the order of multiplication.
This property holds for any dimension-compatible real (or complex) matrices and extends recursively:
$$(ABC)^\top = C^\top B^\top A^\top.$$
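A quick numerical check with assumed random matrices:

```python
import torch

A = torch.randn(2, 3)
B = torch.randn(3, 4)
C = torch.randn(4, 5)

print(torch.allclose((A @ B).T, B.T @ A.T, atol=1e-5))               # True: (AB)^T = B^T A^T
print(torch.allclose((A @ B @ C).T, C.T @ B.T @ A.T, atol=1e-5))     # True: the rule chains
```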
xLog Editing Markdown Document Notes
- Ensure all mathematical expressions are enclosed in `$$ … $$`.
- If there are standalone `$n \times n$` style expressions, change them to `n × n` or `$$n\\times n$$`.
Reference video: