Variational Inference & Variational Autoencoders
$Z \sim P_\theta(Z)$ [latent variable]
$X|Z \sim P_\theta(X|Z)$ [observation model]
Given $x_1, x_2, \dots, x_n$, we want to learn $\theta$.
In general: $P(Z|X) = \frac{P(Z,X)}{\int P(Z',X)\,dZ'}$
If $Z$ is finite, the integral becomes a sum: $\int P(Z',X)\,dZ' = \sum_{Z' \in [m]} P(Z',X)$.
If $Z$ is continuous, this normalizing integral is computationally challenging!
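For intuition, here is a minimal sketch of the finite case in Python (a hypothetical 3-state mixture; all numbers are illustrative): the normalizing integral reduces to a sum over the $m$ states.

```python
import torch

# hypothetical model: Z in {0,1,2}, X|Z=k ~ N(mus[k], 1)
prior = torch.tensor([0.5, 0.3, 0.2])        # P(Z)
mus = torch.tensor([-2.0, 0.0, 3.0])

x = 1.0
lik = torch.exp(-0.5 * (x - mus) ** 2) / (2 * torch.pi) ** 0.5  # P(x|Z)
joint = prior * lik                           # P(Z, x)
posterior = joint / joint.sum()               # the integral is a finite sum
print(posterior)
```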
The idea is to directly parameterize an approximate posterior $q(Z|X)$ in place of the true posterior $P(Z|X)$!
Evidence Lower Bound (ELBO)
$$
\begin{aligned}
\log P_\theta(x) &= \log \int P_\theta(z,x) \, dz \\
&= \log \int P_\theta(x|z) P_\theta(z) \, dz \\
&= \log \int \frac{P_\theta(x|z) P_\theta(z)}{q(z|x)} \cdot q(z|x) \, dz \\
&= \log \underset{q(z|x)}{\mathbb{E}} \left[ \frac{P_\theta(x|z) P_\theta(z)}{q(z|x)} \right]
\end{aligned}
$$
By Jensen's inequality (since $\log$ is concave):
$$
\begin{aligned}
\log P_\theta(x) &\geq \underset{q(z|x)}{\mathbb{E}} \log \left[ \frac{P_\theta(x|z) P_\theta(z)}{q(z|x)} \right] \\
&= \underset{q(z|x)}{\mathbb{E}} \log P_\theta(x|z) - \underset{q(z|x)}{\mathbb{E}} \left[ \log \frac{q(z|x)}{P_\theta(z)} \right] \\
&= \underset{q(z|x)}{\mathbb{E}} \log P_\theta(x|z) - KL(q(z|x) \,\|\, P_\theta(z)) \qquad \text{(ELBO)}
\end{aligned}
$$
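As a sanity check, here is a sketch on a toy conjugate model where everything is in closed form: $Z \sim N(0,1)$, $X|Z \sim N(Z, \sigma^2)$, so the evidence is $X \sim N(0, 1+\sigma^2)$. The ELBO sits below the log evidence for any $q$ and matches it exactly when $q$ is the true posterior. All numbers are illustrative.

```python
import math

sigma2 = 0.5   # observation noise variance
x = 1.3

# exact log evidence: X ~ N(0, 1 + sigma2)
log_evidence = -0.5 * math.log(2 * math.pi * (1 + sigma2)) - x**2 / (2 * (1 + sigma2))

def elbo(m, s2):
    # E_q[log p(x|z)] for q = N(m, s2), p(x|z) = N(z, sigma2)
    exp_loglik = -0.5 * math.log(2 * math.pi * sigma2) - ((x - m)**2 + s2) / (2 * sigma2)
    kl = 0.5 * (s2 + m**2 - 1 - math.log(s2))   # KL(N(m, s2) || N(0, 1))
    return exp_loglik - kl

print(log_evidence, elbo(0.0, 1.0))                           # bound is loose
print(log_evidence, elbo(x / (1 + sigma2), sigma2 / (1 + sigma2)))  # q = true posterior: tight
```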
Variational Inference (VI) population objective:
$$
\begin{aligned}
\max_{\theta} \underset{P(X)}{\mathbb{E}} \log P_\theta(X) &\geq \max_{\theta, q} \underset{P(X)}{\mathbb{E}} \left[ \underset{q(z|X)}{\mathbb{E}} \log P_\theta(X|z) - KL(q(z|X) \,\|\, P_\theta(z)) \right] \\
&\geq \max_{\theta, \phi} \underset{P(X)}{\mathbb{E}} \left[ \underset{q_\phi(z|X)}{\mathbb{E}} \log P_\theta(X|z) - KL(q_\phi(z|X) \,\|\, P_\theta(z)) \right]
\end{aligned}
$$
In general, $KL(q_\phi(z|x) \,\|\, P_\theta(z))$ is hard to calculate.
Let's consider a Gaussian posterior and prior:
$$
q_\phi(z|x) = N(e_{\mu,\phi}(x), e_{\Sigma,\phi}(x))
$$
$$
P_\theta(z) = N(\mu_p, \Sigma_p)
$$
Given $q_\phi(z|x)$ and $P_\theta(z)$ as above, the divergence $KL(q_\phi(z|x) \,\|\, P_\theta(z))$ has a closed form!
$$
KL(q_\phi(z|x) \,\|\, P_\theta(z)) = \frac{1}{2} \left[ \mathrm{tr}\big(\Sigma_p^{-1} \, e_{\Sigma,\phi}(x)\big) + \|\mu_p - e_{\mu,\phi}(x)\|_{\Sigma_p^{-1}}^2 - l + \log \frac{\det \Sigma_p}{\det e_{\Sigma,\phi}(x)} \right]
$$
- $l$ is the dimension of the latent $z$
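A quick numerical check of this closed form, as a sketch with illustrative values (`mu_q`, `Sigma_q` stand in for $e_{\mu,\phi}(x)$, $e_{\Sigma,\phi}(x)$), compared against `torch.distributions.kl_divergence`:

```python
import torch
from torch.distributions import MultivariateNormal, kl_divergence

l = 2
mu_q = torch.tensor([0.5, -0.3])
Sigma_q = torch.diag(torch.tensor([0.4, 0.9]))
mu_p = torch.zeros(l)
Sigma_p = torch.eye(l)

Sp_inv = torch.linalg.inv(Sigma_p)
diff = mu_p - mu_q
kl_closed = 0.5 * (
    torch.trace(Sp_inv @ Sigma_q)
    + diff @ Sp_inv @ diff            # squared norm weighted by Sigma_p^{-1}
    - l
    + torch.logdet(Sigma_p) - torch.logdet(Sigma_q)
)

kl_ref = kl_divergence(MultivariateNormal(mu_q, Sigma_q),
                       MultivariateNormal(mu_p, Sigma_p))
print(kl_closed.item(), kl_ref.item())   # the two values should match
```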
Computing the gradient with respect to $\phi$
First, consider a simpler function:
$$
F(\phi) = \underset{q_\phi(z)}{\mathbb{E}} [f(z)]
$$
$$
\hat{F}_n(\phi) = \frac{1}{n} \sum_{i=1}^n f(z_i) \quad, \quad z_i \sim q_\phi(z)
$$
Clearly, the estimator is unbiased: $\mathbb{E}[\hat{F}_n(\phi)] = F(\phi)$.
But in any auto-diff package the gradient comes out as
$\nabla_\phi \hat{F}_n(\phi) = 0$,
because the samples $z_i$ are drawn outside the computation graph and treated as constants with no dependence on $\phi$.
A gradient of zero means nothing can be learned, so we need the Reparameterization Trick!!
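A sketch of this failure mode in PyTorch, with $\phi$ reduced to a single mean parameter and $f(z) = z^2$ for illustration: `.sample()` detaches the draw from the graph, so the estimator carries no gradient back to $\phi$.

```python
import torch
from torch.distributions import Normal

mu = torch.tensor(1.0, requires_grad=True)   # phi is just a mean here
z = Normal(mu, 1.0).sample((1000,))          # sampling detaches z from mu
F_hat = (z ** 2).mean()                      # Monte Carlo estimate of F(phi)
print(F_hat.requires_grad)                   # False: autodiff sees no path to mu,
                                             # so the gradient w.r.t. phi is identically 0
```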
Let’s assume that $q_\phi(z) = N(\mu_\phi, \Sigma_\phi)$
$$
Z \overset{d}{=} \mu_\phi + \Sigma_\phi^{1/2} \cdot g, \qquad Z \sim q_\phi(z), \; g \sim N(0,I)
$$
Then the objective can be rewritten as an expectation over $g$:
$$
F(\phi) = \underset{q_\phi(z)}{\mathbb{E}} [f(z)]=\underset{g \sim N(0,I)}{\mathbb{E}} \left[ f(\mu_\phi+ \Sigma_\phi^{1/2} \cdot g) \right]
$$
$$
\hat{F}_n(\phi) = \frac{1}{n} \sum_{i=1}^n f(\mu_\phi + \Sigma_\phi^{1/2} \cdot g_i) \quad, \quad g_i \sim N(0,I)
$$
Then:
$$\mathbb{E}[\hat{F}_n(\phi)] = F(\phi)$$
$$\mathbb{E}[\nabla_\phi \hat{F}_n(\phi)] = \nabla_\phi F(\phi)$$
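The same toy example after reparameterization, as a sketch: here $F(\mu) = \mathbb{E}[(\mu + g)^2] = \mu^2 + 1$, so the true gradient is $2\mu$, and the Monte Carlo gradient now matches it.

```python
import torch

mu = torch.tensor(1.0, requires_grad=True)
g = torch.randn(100_000)            # g_i ~ N(0, I), drawn outside the graph
z = mu + 1.0 * g                    # z = mu + Sigma^{1/2} g: differentiable in mu
F_hat = (z ** 2).mean()
F_hat.backward()
print(mu.grad)                      # approx. 2*mu = 2.0
```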
Finite Sample VI objective
Given $x_1, x_2, \dots, x_n$, we want to learn $\theta$.
Draw $g_1, \dots, g_n \sim N(0,I)$ and maximize:
$$
\max_{\theta, \phi} \hat{L}_n(\theta, \phi) = \frac{1}{n} \sum_{i=1}^n \left[ \log P_\theta\big(x_i \,\big|\, z = e_{\mu,\phi}(x_i) + e_{\Sigma,\phi}^{1/2}(x_i) \cdot g_i\big) - KL\big(q_\phi(z|x_i) \,\|\, P_\theta(z)\big) \right]
$$
We can run gradient ascent on $\hat{L}_n(\theta, \phi)$ with respect to both $\theta$ and $\phi$.
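Putting the pieces together, here is a minimal sketch of one gradient-ascent step on $\hat{L}_n$, assuming a standard-normal prior $P_\theta(z) = N(0, I)$, a diagonal encoder covariance, and a decoder with identity covariance; the architecture and sizes are illustrative, not prescribed by these notes.

```python
import torch
import torch.nn as nn

d, l = 10, 2                                   # data dim, latent dim

enc = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 2 * l))  # e_phi
dec = nn.Sequential(nn.Linear(l, 32), nn.ReLU(), nn.Linear(32, d))      # d_theta
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

x = torch.randn(64, d)                         # stand-in batch x_1..x_n

mu, log_var = enc(x).chunk(2, dim=-1)          # e_{mu,phi}(x), log of diagonal e_{Sigma,phi}(x)
g = torch.randn_like(mu)                       # g_i ~ N(0, I)
z = mu + torch.exp(0.5 * log_var) * g          # reparameterized latent sample

rec = -0.5 * ((x - dec(z)) ** 2).sum(-1)       # log P_theta(x|z) up to constants
kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1 - log_var).sum(-1)  # KL(N(mu, Sigma) || N(0, I))

loss = -(rec - kl).mean()                      # ascent on L_n-hat = descent on its negation
opt.zero_grad()
loss.backward()
opt.step()
```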
Stepping back, the whole process is:
$$
\text{data} \xrightarrow{\text{encode}} \text{latent space} \xrightarrow{\text{decode}} \text{new data}
$$
- the encoder is $q_\phi(z|x)$
- the decoder is $p_\theta(x|z)$
Reconstruction loss
Suppose $p_\theta(x|z) = N(d_{\mu, \theta}(z), d_{\Sigma, \theta}(z))$
$$
\log p_\theta(x|z) = -\frac{1}{2}\|x - d_{\mu,\theta}(z)\|_{d_{\Sigma,\theta}(z)^{-1}}^2 - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log \det d_{\Sigma,\theta}(z)
$$
where $z = e_{\mu,\phi}(x) + e_{\Sigma,\phi}^{1/2}(x) \cdot g$ is the reparameterized latent code produced from the data point $x$.
Thus, taking $d_{\Sigma,\theta}(z) = I$ for simplicity, the quadratic term becomes
$$
-\frac{1}{2}\|x - d_{\mu,\theta}(z)\|^2 = -\frac{1}{2}\left\|x - d_{\mu,\theta}\big(e_{\mu,\phi}(x) + e_{\Sigma,\phi}^{1/2}(x) \cdot g\big)\right\|^2
$$
- $x - d_{\mu,\theta}(z)$ is the decoder error
- $e_{\mu,\phi}(x) + e_{\Sigma,\phi}^{1/2}(x) \cdot g$ is the output of the encoder
- together, the term measures the decoder's error on the encoder's output
- the whole term is the reconstruction error!!
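For completeness, a sketch of this log-likelihood with a diagonal decoder covariance; `d_mu` and `d_log_var` are hypothetical decoder outputs, not names from these notes.

```python
import math
import torch

def gaussian_log_lik(x, d_mu, d_log_var):
    # log N(x; d_mu, diag(exp(d_log_var))), matching the formula above
    quad = ((x - d_mu) ** 2 / torch.exp(d_log_var)).sum(-1)
    return -0.5 * (quad + x.shape[-1] * math.log(2 * math.pi) + d_log_var.sum(-1))

x = torch.randn(4, 10)
d_mu, d_log_var = torch.randn(4, 10), torch.zeros(4, 10)
print(gaussian_log_lik(x, d_mu, d_log_var))  # with d_log_var = 0 this is -||x - d_mu||^2/2 - const
```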