%matplotlib inline
# Install a local copy of MathJax so the notebook can render LaTeX math offline.
from IPython.external.mathjax import install_mathjax
install_mathjax()
Scalars
Vectors
Matrices
Tensors
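A minimal NumPy sketch (added as an illustration, not part of the original notes; the shapes are arbitrary) of what these objects look like:

import numpy as np

s = 3.5                               # scalar: a single number
v = np.array([1.0, 2.0, 3.0])         # vector: a 1-D array of numbers
M = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])            # matrix: a 2-D array (here 3 x 2)
T = np.zeros((2, 3, 4))               # tensor: an array with more than two axes
print(v.shape, M.shape, T.shape)      # (3,) (3, 2) (2, 3, 4)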
Adding Matrices
Multiplying Matrices and Vectors
The matrix-vector product $A\mathbf{x} = \mathbf{b}$ can be written more explicitly as

$$
\begin{aligned}
a_{1,1}x_1 + a_{1,2}x_2 + \cdots + a_{1,n}x_n &= b_1 \\
a_{2,1}x_1 + a_{2,2}x_2 + \cdots + a_{2,n}x_n &= b_2 \\
&\;\;\vdots \\
a_{m,1}x_1 + a_{m,2}x_2 + \cdots + a_{m,n}x_n &= b_m
\end{aligned}
$$

Equivalently, $A\mathbf{x} = \sum_i x_i A_{:,i}$ is a weighted sum of the columns of $A$; in general, this kind of operation is called a linear combination.
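A small NumPy sketch (the matrix and vector are arbitrary examples) showing that $A\mathbf{x}$ equals the corresponding linear combination of the columns of $A$:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])            # an arbitrary 3 x 2 matrix (m = 3, n = 2)
x = np.array([10.0, -1.0])

b = A @ x                             # matrix-vector product A x

# The same result as a linear combination of the columns of A, weighted by x.
b_combo = x[0] * A[:, 0] + x[1] * A[:, 1]
print(np.allclose(b, b_combo))        # True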
The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors.
Determining whether $A\mathbf{x} = \mathbf{b}$ has a solution thus amounts to testing whether $\mathbf{b}$ is in the span of the columns of $A$. This particular span is known as the column space, or the range, of $A$.
In order for the system $A\mathbf{x} = \mathbf{b}$ to have a solution for all values of $\mathbf{b} \in \mathbb{R}^m$, we therefore require that the column space of $A$ be all of $\mathbb{R}^m$.
Having $n \ge m$ is only a necessary condition for every point to have a solution. However, it is not a sufficient condition, because it is possible for some of the columns to be redundant; consider, for example, a $2 \times 2$ matrix whose two columns are equal to each other.
Formally, this kind of redundancy is known as linear dependence. A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors.
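A rough way to detect this kind of redundancy numerically is to compare the rank of the matrix with its number of columns; a NumPy sketch with made-up matrices:

import numpy as np

A = np.array([[1.0, 1.0],
              [2.0, 2.0]])            # both columns equal: linearly dependent
print(np.linalg.matrix_rank(A))       # 1 -- the column space is only a line in R^2

B = np.array([[1.0, 0.0],
              [0.0, 1.0]])            # linearly independent columns
print(np.linalg.matrix_rank(B))       # 2 -- the column space is all of R^2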
The $L^p$ norm is defined as $\|\mathbf{x}\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$ for $p \in \mathbb{R}$, $p \ge 1$.
The $L^2$ norm, with $p = 2$, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point identified by $\mathbf{x}$.
Also, the $L^1$ norm is commonly used in machine learning when the difference between zero and nonzero elements is very important.
The $L^\infty$ norm, also known as the max norm, is the absolute value of the element with the largest magnitude: $\|\mathbf{x}\|_\infty = \max_i |x_i|$.
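A quick NumPy check of the three norms on an arbitrary example vector:

import numpy as np

x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x, ord=2))       # L2 (Euclidean) norm: 5.0
print(np.linalg.norm(x, ord=1))       # L1 norm: 7.0
print(np.linalg.norm(x, ord=np.inf))  # L-infinity (max) norm: 4.0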
A diagonal matrix $D$ is one in which $D_{i,j} = 0$ for all $i \ne j$; the identity matrix is one example.
We write $\mathrm{diag}(\mathbf{v})$ to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector $\mathbf{v}$.
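In NumPy, $\mathrm{diag}(\mathbf{v})$ can be built with np.diag; multiplying by it just rescales each component (toy vectors below, not from the original notes):

import numpy as np

v = np.array([1.0, 2.0, 3.0])
D = np.diag(v)                        # square diagonal matrix with v on the diagonal
x = np.array([10.0, 10.0, 10.0])
print(D @ x)                          # [10. 20. 30.]
print(v * x)                          # same result: elementwise scaling by v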
A symmetric matrix is any matrix that is equal to its own transpose: $A = A^\top$.
A vector $\mathbf{x}$ and a vector $\mathbf{y}$ are orthogonal to each other if $\mathbf{x}^\top \mathbf{y} = 0$.
If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.
An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal: $A^\top A = A A^\top = I$.
This implies that
$A^{-1} = A^\top$, so orthogonal matrices are of interest because their inverse is very cheap to compute.
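A small sketch checking these properties for a 2-D rotation matrix, which is orthogonal (the angle is arbitrary):

import numpy as np

theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 2-D rotation matrix

print(np.allclose(Q.T @ Q, np.eye(2)))           # True: rows and columns are orthonormal
print(np.allclose(np.linalg.inv(Q), Q.T))        # True: the inverse is just the transpose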
Eigendecomposition: a kind of matrix decomposition in which we decompose a matrix into a set of eigenvectors and eigenvalues.
Eigenvector: an eigenvector of a square matrix $A$ is a non-zero vector $\mathbf{v}$ such that multiplication by $A$ alters only the scale of $\mathbf{v}$: $A\mathbf{v} = \lambda\mathbf{v}$.
The scalar $\lambda$ is known as the eigenvalue corresponding to this eigenvector.
If a matrix $A$ has eigenvectors $\{\mathbf{v}^{(1)}, \dots, \mathbf{v}^{(n)}\}$ and corresponding eigenvalues $\{\lambda_1, \dots, \lambda_n\}$, then by concatenating the eigenvectors into a matrix $V = [\mathbf{v}^{(1)}, \dots, \mathbf{v}^{(n)}]$ (i.e. one column per eigenvector), and concatenating the eigenvalues into a vector $\boldsymbol{\lambda}$, the matrix

$$V \,\mathrm{diag}(\boldsymbol{\lambda})\, V^{-1}$$

has the desired eigenvalues and eigenvectors.
Every real symmetric matrix can be decomposed as $A = Q \Lambda Q^\top$, where $Q$ is an orthogonal matrix composed of eigenvectors of $A$, and $\Lambda$ is a diagonal matrix, with $\Lambda_{i,i}$ being the eigenvalue corresponding to $Q_{:,i}$.
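A NumPy sketch of the symmetric case using np.linalg.eigh (the matrix is a made-up example):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # a real symmetric matrix
lam, Q = np.linalg.eigh(A)            # eigenvalues and an orthonormal set of eigenvectors

print(np.allclose(A, Q @ np.diag(lam) @ Q.T))      # True: A = Q diag(lambda) Q^T
print(np.allclose(A @ Q[:, 0], lam[0] * Q[:, 0]))  # True: A v = lambda v for each column of Q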
The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values.
Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition.
If a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.
Writing the SVD as $A = U D V^\top$: suppose that $A$ is an $m \times n$ matrix. Then $U$ is defined to be an $m \times m$ matrix, $D$ to be an $m \times n$ matrix, and $V$ to be an $n \times n$ matrix.
The matrices $U$ and $V$ are both defined to be orthogonal matrices. The matrix $D$ is defined to be a diagonal matrix. Note that $D$ is not necessarily square.
The elements along the diagonal of $D$ are known as the singular values of the matrix $A$. The columns of $U$ are known as the left-singular vectors. The columns of $V$ are known as the right-singular vectors.
We can actually interpret the singular value decomposition of $A$ in terms of the eigendecomposition of functions of $A$.
The left-singular vectors of $A$ are the eigenvectors of $A A^\top$.
The right-singular vectors of $A$ are the eigenvectors of $A^\top A$.
The non-zero singular values of $A$ are the square roots of the eigenvalues of $A^\top A$.
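A NumPy sketch verifying these relationships on a random non-square matrix (the size and seed are arbitrary):

import numpy as np

np.random.seed(0)
A = np.random.randn(4, 3)

U, s, Vt = np.linalg.svd(A)           # A = U D V^T, with the singular values in s

# The squared non-zero singular values are the eigenvalues of A^T A.
print(np.allclose(np.sort(s ** 2), np.linalg.eigvalsh(A.T @ A)))  # True

# Reconstruct A from the factorization (D must be padded out to an m x n matrix).
D = np.zeros((4, 3))
D[:3, :3] = np.diag(s)
print(np.allclose(A, U @ D @ Vt))     # True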
Matrix inversion is not defined for matrices that are not square.
Suppose we want to make a left-inverse $B$ of a matrix $A$, so that we can solve a linear equation

$$A\mathbf{x} = \mathbf{y}$$

by left-multiplying each side to obtain

$$\mathbf{x} = B\mathbf{y}.$$

The Moore-Penrose pseudoinverse allows us to make progress in these cases: it is defined as $A^{+} = V D^{+} U^\top$, where $U$, $D$, and $V$ are the singular value decomposition of $A$, and the pseudoinverse $D^{+}$ of a diagonal matrix $D$ is obtained by taking the reciprocal of all of its non-zero elements and then transposing the resulting matrix.
When $A$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions.
Specifically, it provides the solution $\mathbf{x} = A^{+}\mathbf{y}$ with minimal Euclidean norm $\|\mathbf{x}\|_2$ among all possible solutions.
When $A$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\mathbf{x}$ for which $A\mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of the Euclidean norm $\|A\mathbf{x} - \mathbf{y}\|_2$.
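A NumPy sketch of both cases using np.linalg.pinv (random matrices, arbitrary seed):

import numpy as np

np.random.seed(0)

# More columns than rows: infinitely many solutions; pinv picks the smallest-norm one.
A_wide = np.random.randn(2, 5)
y = np.random.randn(2)
x_min = np.linalg.pinv(A_wide) @ y
print(np.allclose(A_wide @ x_min, y))             # True: an exact solution

# More rows than columns: usually no exact solution; pinv gives the least-squares fit.
A_tall = np.random.randn(5, 2)
z = np.random.randn(5)
x_ls = np.linalg.pinv(A_tall) @ z
print(np.linalg.norm(A_tall @ x_ls - z))          # residual of the best achievable fit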
The trace operator, $\mathrm{Tr}(A) = \sum_i A_{i,i}$, gives an alternative way of writing the Frobenius norm of a matrix: $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2} = \sqrt{\mathrm{Tr}(A A^\top)}$.
The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position: $\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB) = \mathrm{Tr}(BCA)$,
or more generally,
$$\mathrm{Tr}\left(\prod_{i=1}^{n} F^{(i)}\right) = \mathrm{Tr}\left(F^{(n)} \prod_{i=1}^{n-1} F^{(i)}\right)$$
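A quick NumPy check of both trace identities on random matrices (shapes chosen only so the products conform):

import numpy as np

np.random.seed(0)
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
C = np.random.randn(5, 3)

# Frobenius norm via the trace: ||A||_F = sqrt(Tr(A A^T)).
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T))))  # True

# Cyclic invariance of the trace: Tr(ABC) = Tr(CAB) = Tr(BCA).
t = np.trace(A @ B @ C)
print(np.isclose(t, np.trace(C @ A @ B)), np.isclose(t, np.trace(B @ C @ A)))  # True True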
The determinant of a square matrix, denoted $\det(A)$, is a function mapping matrices to real scalars. The determinant is equal to the product of all the matrix's eigenvalues.
The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space.
If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume.
If the determinant is 1, then the transformation is volume-preserving.
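A final NumPy sketch (toy matrices) illustrating these properties of the determinant:

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# The determinant equals the product of the eigenvalues (here 2 * 3 = 6).
print(np.isclose(np.linalg.det(A), np.prod(np.linalg.eigvals(A))))  # True

# |det| is the factor by which the matrix scales volume (area in 2-D);
# a rotation has det = 1 and is volume-preserving.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
print(np.linalg.det(R))               # 1.0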