Continuous Normalizing Flows
$$
\begin{aligned}
\text{simple “base” distribution (Gaussian)} &\to \text{“complex” data distribution}, p(x) && \text{(generative direction)}\\
x &\to z && \text{(normalizing map } f\text{)}
\end{aligned}
$$
- When $f$ is a diffeomorphism, $\log p(x)$ can be computed analytically (change of variables).
- Coming up with a parameterization of diffeomorphisms is challenging!
- one particular family comes from ODEs.
- pick a vector field $v_\theta(x,t)$
- $f_\theta(x) = \mathrm{ODESolve}(v_\theta, x, 0, T)$
- $f_\theta$ is a diffeomorphism
- write an augmented ODE to compute $\log p(x)$
- compute the derivatives of $\log p(x)$ using the adjoint method
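A minimal sketch of the ODESolve map and the augmented log-density ODE, assuming a linear vector field $v_\theta(x,t) = Ax$ so that the Jacobian trace is simply $tr(A)$ (the function and variable names are mine, not from the lecture):

```python
import numpy as np

# Sketch under the assumption v_theta(x, t) = A x, so tr(dv/dx) = tr(A).
# The augmented ODE integrates the state X(t) together with the running
# correction  d/dt log p(X(t)) = -tr(dv/dx), giving log p(x) at the end.
def cnf_logp(x, A, T=1.0, n_steps=1000):
    h = T / n_steps
    trace_integral = 0.0
    for _ in range(n_steps):
        x = x + h * (A @ x)                # forward Euler step for dX/dt = v(X, t)
        trace_integral += h * np.trace(A)  # accumulate the integral of tr(dv/dx)
    d = x.shape[0]                         # x is now approximately z = f_theta(x)
    log_base = -0.5 * (x @ x) - 0.5 * d * np.log(2 * np.pi)  # standard-normal base
    return log_base + trace_integral       # change of variables along the flow

A = np.array([[-1.0, 0.3], [-0.3, -1.0]])
print(cnf_logp(np.array([0.5, -1.0]), A))  # approximate log p(x)
```

In practice $v_\theta$ is a neural network, the trace is computed (or stochastically estimated) with automatic differentiation, and a proper ODE solver replaces the Euler loop.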
Neural ODEs
Consider
$$
\frac{d}{dt} X(t) = v(X(t), t), \quad X(0)=x
$$
$$
\begin{aligned}
X(t+h) &= X(t) + h\frac{d}{dt}X(t)+O(h^2) \\
&= X(t) + hv(X(t), t) + O(h^2)\\
\end{aligned}
$$
Forward Euler Discretization
$$
\begin{aligned}
& \hat{X}(k+1) = \hat{X}(k)+hv(\hat{X}(k),k \cdot h) \\
& \hat{X}(0) = x \\
& \hat{X}(k) \approx X(k \cdot h)\\
\end{aligned}
$$
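A small sketch of this discretization (the helper name and the example ODE are mine):

```python
import numpy as np

# Forward Euler:  X_hat(k+1) = X_hat(k) + h * v(X_hat(k), k*h),  X_hat(0) = x.
def euler_solve(v, x0, T, n_steps):
    h = T / n_steps
    xs = [np.asarray(x0, dtype=float)]
    for k in range(n_steps):
        xs.append(xs[-1] + h * v(xs[-1], k * h))
    return np.stack(xs)                      # xs[k] approximates X(k*h)

# Example: dX/dt = -X with X(0) = 1; the exact solution is exp(-t).
traj = euler_solve(lambda x, t: -x, [1.0], T=1.0, n_steps=100)
print(traj[-1], np.exp(-1.0))                # Euler estimate vs exact value at t = 1
```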
Regular neural network:
One-hidden-layer network example:
$$f(x) = w_2 \cdot \sigma(w_1 x) \quad[\text{from the first part of the class}]$$
deep fully connected neural network:
$$
\begin{aligned}
f(x) &= w_H \cdot \sigma (w_{H-1} \cdot \sigma (w_{H-2} \cdot \sigma(\cdots \sigma(w_1x)))) \\
&=f_H \circ f_{H-1} \circ \cdots \circ f_1
\end{aligned}
$$
residual network:
Layer: $f_i(x) = x+\sigma(w_ix)$
network: $f(x) = f_H \circ f_{H-1} \circ \cdots \circ f_1$
Forward Euler:
$$
\hat{X}(k) = \hat{X}(k-1)+hv(\hat{X}(k-1),(k-1) \cdot h)
$$
which has exactly the same form as a residual layer!!!
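A toy sketch of the analogy (the width, depth, $\tanh$ nonlinearity, and random weights are my choices): each residual layer has the same "state plus update" shape as one forward Euler step, with the step size $h$ absorbed into the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 4, 8
Ws = [rng.standard_normal((d, d)) / d for _ in range(H)]  # layer weights w_1, ..., w_H

def residual_net(x, Ws):
    for W in Ws:                    # f = f_H ∘ ... ∘ f_1
        x = x + np.tanh(W @ x)      # residual layer: f_i(x) = x + sigma(w_i x)
    return x

def euler_net(x, Ws, h=1.0):
    for k, W in enumerate(Ws):      # X_hat(k+1) = X_hat(k) + h * v(X_hat(k), k*h)
        x = x + h * np.tanh(W @ x)  # with v(x, t_k) = sigma(w_k x)
    return x

x = rng.standard_normal(d)
assert np.allclose(residual_net(x, Ws), euler_net(x, Ws, h=1.0))  # identical for h = 1
```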
PCA
principal component analysis:
Input:
A data matrix $X \in R^{n \times d}$
An integer $k \in \{1,\dots,d\}$
Output:
An orthonormal basis $V \in R^{d \times k}$
Idea:
Construct a linear projection into $R^k$ that “preserves” as much information about the data as possible!!!
- center the data:
$$
\begin{aligned}
\bar{X} &= X - \frac{1}{n} \cdot 1_n \cdot (1_n)^T \cdot X \\
\bar{x}_i &= x_i - \frac{1}{n} \sum_{j=1}^n x_j \\
\end{aligned}
$$
i.e., subtract the mean of all the $x_i$ from each $x_i$
- compute an SVD of $\bar{X}$:
$$
\begin{aligned}
\bar{X} &= U \Sigma V^T \quad (n \geq d) \\
U &\in R^{n \times d} \\
\Sigma &\in R^{d \times d} \\
V &\in R^{d \times d} \\
\end{aligned}
$$
Assume $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_d)$ with $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_d$. Return
$$
V = \left[ \begin{matrix} \vdots &\vdots &\vdots \\ V_1 &\dots &V_d \\ \vdots &\vdots &\vdots \end{matrix} \right],
\qquad
V[:, :k] = \left[ \begin{matrix} \vdots &\vdots &\vdots \\ V_1 &\dots &V_k \\ \vdots &\vdots &\vdots \end{matrix} \right]
\quad \text{(the first } k \text{ columns of } V\text{)}
$$
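A minimal NumPy sketch of the whole procedure: center, take an SVD, and return the first $k$ right singular vectors (the function and variable names are mine):

```python
import numpy as np

def pca(X, k):
    """Return V[:, :k], an orthonormal basis of the top-k principal directions."""
    Xbar = X - X.mean(axis=0, keepdims=True)             # center the data
    U, S, Vt = np.linalg.svd(Xbar, full_matrices=False)  # Xbar = U Sigma V^T, sigma_i descending
    return Vt[:k].T                                      # first k columns of V, shape (d, k)

X = np.random.default_rng(0).standard_normal((100, 5))   # n = 100 points in R^5
Vk = pca(X, k=2)
print(Vk.T @ Vk)                                         # ≈ I_2: columns are orthonormal
```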
Suppose that $W$ is a subspace of $R^d$. Given $x \in R^d$, we want the closest point in $W$:
$$
\underset{y \in W}{argmin} |x-y|
$$
From linear algebra, decompose
$$
x = x_{\parallel} + x_{\perp}, \quad x_{\parallel} \in W,\; x_{\perp} \perp W
$$
$$
\begin{aligned}
|x-y|^2 & = |(x_{\parallel}-y)+x_{\perp}|^2 \\
&= |x_{\parallel}-y|^2 + |x_{\perp}|^2
\end{aligned}
$$
Thus,
$$
\underset{y \in W}{argmin} |x-y| = P_W x
$$
- $P_W$ is the orthogonal projection onto $W$!
This is the reconstruction error:
$|x-P_W x| = |P_W^{\perp} x|$, where $P_W^{\perp} = I - P_W$
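A quick numeric check of this identity, writing $P_W = V V^T$ for an orthonormal basis $V$ of $W$ (the random example is mine):

```python
import numpy as np

rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((5, 2)))  # orthonormal basis of a 2-dim W in R^5
P_W = V @ V.T                                      # orthogonal projection onto W
x = rng.standard_normal(5)
lhs = np.linalg.norm(x - P_W @ x)                  # |x - P_W x|
rhs = np.linalg.norm((np.eye(5) - P_W) @ x)        # |P_W^perp x|
assert np.isclose(lhs, rhs)
```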
Idea 1:
minimize the reconstruction error $|P_W^{\perp} x|$ over all $k$-dimensional subspaces $W$, for a single fixed $x \in R^d$.
This is trivial: any $W$ containing $\mathrm{span}\{x\}$ gives zero error, so a single point is not informative.
Idea 2:
minimize the expected reconstruction error over a distribution $D$ of data:
$$
\begin{aligned}
& \underset{\dim(W) = k}{argmin}\,\underset{X \sim D}{\mathbb{E}}\, |P_W^{\perp} X|^2 \\
& = \underset{P \in O(d,k)}{argmin}\,\underset{X \sim D}{\mathbb{E}}\, |(I-PP^T) X|^2
\end{aligned}
$$
where
$$
O(d,k) = \left\{ U \in R^{d \times k} \mid U^T U = I_k \right\} \quad (k \leq d)
$$
$$
= \underset{P \in O(d,k)}{argmin}\underset{X \sim D}{\mathbb{E}} X^T(I-PP^T)^2X
$$
$$
(I-PP^T)^2 = I-2PP^T+P(P^TP)P^T = I-PP^T
$$
$$
\begin{aligned}
& = \underset{P \in O(d,k)}{argmin}\underset{X \sim D}{\mathbb{E}} X^T(I-PP^T)X \\
& = \underset{P \in O(d,k)}{argmin}\underset{X \sim D}{\mathbb{E}} tr(X^T(I-PP^T)X) \\
& = \underset{P \in O(d,k)}{argmin}\underset{X \sim D}{\mathbb{E}} tr((I-PP^T)XX^T) \\
& = \underset{P \in O(d,k)}{argmin}\, tr\left((I-PP^T)\underset{X \sim D}{\mathbb{E}}[XX^T]\right) \\
\end{aligned}
$$
- write $\underset{X \sim D}{\mathbb{E}}[XX^T]$ as $\Sigma$
$$
\begin{aligned}
& = \underset{P \in O(d,k)}{argmin}\, tr((I-PP^T) \Sigma) \\
& = \underset{P \in O(d,k)}{argmin} \left[ tr(\Sigma) - tr(PP^T \Sigma) \right]\\
& = \underset{P \in O(d,k)}{argmax}\, tr(PP^T \Sigma) \\
\end{aligned}
$$
- $\Sigma$ is a positive semi-definite matrix
$$\Sigma = U \Lambda U^T$$
$U \in R^{d \times d}$, $\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_d)$ with
$\lambda_1 \geq \dots \geq \lambda_d$
Thus, for the problem above, the optimal solution is to set
$$
P = U[:,:k] \quad \text{(the first } k \text{ columns of } U\text{)}
$$
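A small numeric sanity check of this claim (the random test matrices are mine): no random $P \in O(d,k)$ beats the top-$k$ eigenvectors of $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 6, 2
A = rng.standard_normal((d, d))
Sigma = A @ A.T                                        # a PSD "covariance" matrix
lam, U = np.linalg.eigh(Sigma)                         # eigenvalues in ascending order
P_star = U[:, ::-1][:, :k]                             # top-k eigenvectors of Sigma
best = np.trace(P_star @ P_star.T @ Sigma)             # equals lambda_1 + ... + lambda_k
for _ in range(1000):
    P, _ = np.linalg.qr(rng.standard_normal((d, k)))   # a random element of O(d, k)
    assert np.trace(P @ P.T @ Sigma) <= best + 1e-8
print(best, lam[::-1][:k].sum())                       # the two values agree
```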
If $\bar{X} \in R^{n \times d}$, then
$\frac{1}{n} \bar{X}^T \bar{X} = \frac{1}{n} \sum_{i=1}^n \bar{x}_i \bar{x}_i^T$
and, using the SVD $\bar{X} = U \Sigma V^T$,
$$
\begin{aligned}
\bar{X}^T \bar{X} &= V \Sigma U^T \, U \Sigma V^T \\
&= V \Sigma^2 V^T
\end{aligned}
$$
Hence, taking $D$ to be the empirical distribution $\frac{1}{n} \sum_{i=1}^n \delta_{\bar{x}_i}$ (so that $\Sigma = \frac{1}{n}\bar{X}^T\bar{X}$),
the minimizer $\underset{P \in O(d,k)}{argmin}\,\underset{X \sim D}{\mathbb{E}}\, |(I-PP^T) X|^2$ is precisely what PCA returns!!
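A quick check (my own example) that the right singular vectors of $\bar{X}$ are exactly the eigenvectors of the sample covariance $\frac{1}{n}\bar{X}^T\bar{X}$, so the SVD route and the eigendecomposition route return the same basis:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 4))
Xbar = X - X.mean(axis=0)                               # center
_, S, Vt = np.linalg.svd(Xbar, full_matrices=False)     # Xbar = U Sigma V^T
lam, evecs = np.linalg.eigh(Xbar.T @ Xbar / len(Xbar))  # sample covariance, ascending eigenvalues
assert np.allclose(S**2 / len(Xbar), lam[::-1])         # sigma_i^2 / n = lambda_i
print(np.abs(Vt) - np.abs(evecs[:, ::-1].T))            # ≈ 0: same directions up to sign
```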
Variational inference / autoencoder perspective on PCA
$$
\begin{aligned}
(R^k) \quad & Z \sim p(Z) = N(0, \beta \cdot I_k)\\
(R^d) \quad & X|Z \sim p(X|Z) = N(UZ, I_d)
\end{aligned}
$$
$$
U \in O(d, k) = \left\{ P \in R^{d \times k} \mid P^T P = I_k \right\}
$$
$$
\underset{X \sim D}{\mathbb{E}}\, \log P_u(x) \;\underset{\text{(ELBO)}}{\geq}\; \underset{X \sim D}{\mathbb{E}} \left[ \underset{q(z|x)}{\mathbb{E}}\, \log P_u(x|z) - KL(q(z|x)\,\|\,p(z))\right]
$$
$$
q_u(z|x) = N(U^Tx, I_k)
$$
$$
L[u] = \underset{X \sim D}{\mathbb{E}}\left[-\frac{1}{2} |(I-uu^T)x|^2 -\frac{1}{2\beta}|u^T x|^2\right] + \mathrm{const}(d,k,\beta)
$$
- as above, $\Sigma = \underset{X \sim D}{\mathbb{E}}[xx^T]$
- $|(I-uu^T)x|^2$ is the reconstruction error
- $\frac{1}{2\beta}|u^T x|^2$ comes from the KL term against the prior
$$
L[u] = -\frac{1}{2} \left[ tr((I-uu^T)\Sigma) + \frac{1}{\beta}\, tr(uu^T \Sigma)\right] + \mathrm{const}(d,k,\beta)
$$
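Expanding the trace makes the connection back to PCA explicit (assuming $\beta > 1$):
$$
L[u] = -\frac{1}{2} \left[ tr(\Sigma) - \left(1 - \frac{1}{\beta}\right) tr(uu^T \Sigma)\right] + \mathrm{const}(d,k,\beta)
$$
so maximizing $L[u]$ over $u \in O(d,k)$ again amounts to maximizing $tr(uu^T \Sigma)$, and the optimum is the top-$k$ eigenvectors of $\Sigma$: exactly the basis that PCA returns.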