Variational Inference & Variational Autoencoders

$Z \sim P_\theta(Z)$ [latent variable]
$X|Z \sim P_\theta(X|Z)$ [observation model]

Given $x_1, x_2, \dots, x_n$, we want to learn $\theta$.

In general: $P(Z|X) = \frac{P(Z,X)}{\int P(Z',X)dZ'}$

  • If $Z$ is finite
    $\int P(Z',X)dZ' = \sum_{Z' \in [m]} P(Z',X)$

  • If $Z$ is continuous
    this integral is computationally challenging!

The idea is to directly parameterize an approximate posterior $q(Z|X)$ in place of the true posterior $P(Z|X)$.

Evidence Lower Bound (ELBO)

$$
\begin{aligned}
logP_\theta(x) &= log \int P_\theta(z,x) dz \\
&= log \int P_\theta(x|z) P_\theta(z) dz \\
&= log \int \frac{P_\theta(x|z) P_\theta(z)}{q(z|x)} \cdot q(z|x) dz \\
&= log \underset{q(z|x)}{\mathbb{E}} \left[\frac{P_\theta(x|z) P_\theta(z)}{q(z|x)}\right] \\
\end{aligned}
$$

By Jensen's inequality,

$$
\begin{aligned}
logP_\theta(x) & \geq \underset{q(z|x)}{\mathbb{E}} log\left[\frac{P_\theta(x|z) P_\theta(z)}{q(z|x)}\right]\\
& = \underset{q(z|x)}{\mathbb{E}} log P_\theta(x|z) - \underset{q(z|x)}{\mathbb{E}}\left[log \frac{q(z|x)}{P_\theta(z)}\right]\\
& = \underset{q(z|x)}{\mathbb{E}} log P_\theta(x|z) - KL(q(z|x)||P_\theta(z)) \qquad \text{ELBO}\\
\end{aligned}
$$
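Why is this a sensible objective for $q$? A quick check (a standard identity, using $P_\theta(x|z) P_\theta(z) = P_\theta(z|x) P_\theta(x)$) shows that the gap in the bound is exactly the KL divergence from $q(z|x)$ to the true posterior, so the ELBO is tight when $q(z|x) = P_\theta(z|x)$:

$$
logP_\theta(x) - \underset{q(z|x)}{\mathbb{E}} log\left[\frac{P_\theta(x|z) P_\theta(z)}{q(z|x)}\right] = \underset{q(z|x)}{\mathbb{E}} log\left[\frac{q(z|x)}{P_\theta(z|x)}\right] = KL(q(z|x)||P_\theta(z|x)) \geq 0
$$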

Variational Inference (VI) population objective:

$$
\begin{aligned}
\underset{\theta}{max} \underset{P(X)}{\mathbb{E}} log P_\theta(X) &\geq \underset{\theta , q(z|x)}{max} \underset{P(X)}{\mathbb{E}} \left[ \underset{q(z|x)}{\mathbb{E}} log P_\theta(x|z) - KL(q(z|x)||P_\theta(z)) \right] \\
& \geq \underset{\theta , \phi}{max} \underset{P(X)}{\mathbb{E}} \left[ \underset{q_\phi(z|x)}{\mathbb{E}} log P_\theta(x|z) - KL(q_\phi(z|x)||P_\theta(z)) \right]
\end{aligned}
$$

In general, it is hard to compute $KL(q_\phi(z|x)||P_\theta(z))$ in closed form.
Let's consider a Gaussian posterior and prior:
$$
q_\phi(z|x) = N(e_{\mu,\phi}(x), e_{\Sigma,\phi}(x))
$$
$$
P_\theta(z) = N(\mu_p, \Sigma_p)
$$

Given $q_\phi(z|x)$ and $P_\theta(z)$ as above,
$KL(q_\phi(z|x)||P_\theta(z))$ has a closed form:

$$
KL(q_\phi(z|x)||P_\theta(z)) = \frac{1}{2} \left[ tr(\Sigma_p^{-1} \cdot e_{\Sigma, \phi}(x)) + |\mu_p - e_{\mu,\phi}(x)|_{\Sigma_p^{-1}}^2 - l + log \frac{det\, \Sigma_p}{det\, e_{\Sigma,\phi}(x)} \right]
$$

  • $l$ is the dimension of the latent $z$
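As a sanity check, here is a minimal numerical sketch of this closed form (assuming NumPy and full covariance matrices; the function and variable names are illustrative, not from the notes):

```python
import numpy as np

def gaussian_kl(mu_q, Sigma_q, mu_p, Sigma_p):
    """KL( N(mu_q, Sigma_q) || N(mu_p, Sigma_p) ) via the closed form above."""
    l = mu_q.shape[0]                                # latent dimension
    Sigma_p_inv = np.linalg.inv(Sigma_p)
    trace_term = np.trace(Sigma_p_inv @ Sigma_q)     # tr(Sigma_p^{-1} e_Sigma(x))
    diff = mu_p - mu_q
    quad_term = diff @ Sigma_p_inv @ diff            # |mu_p - e_mu(x)|^2_{Sigma_p^{-1}}
    logdet_term = np.log(np.linalg.det(Sigma_p) / np.linalg.det(Sigma_q))
    return 0.5 * (trace_term + quad_term - l + logdet_term)

# Example: encoder output N(1, 0.5 I) against a standard normal prior N(0, I)
l = 2
print(gaussian_kl(np.ones(l), 0.5 * np.eye(l), np.zeros(l), np.eye(l)))
```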

Computing the gradient with respect to $\phi$

First, consider a simpler function:
$$
F(\phi) = \underset{q_\phi(z)}{\mathbb{E}} [f(z)]
$$

$$
\hat{F}_n(\phi) = \frac{1}{n} \sum_{i=1}^n f(z_i) \quad, \quad z_i \sim q_\phi(z)
$$

Clearly, this estimator is unbiased: $\mathbb{E}[\hat{F}_n(\phi)] = F(\phi)$.

But the gradient computed by any auto-diff package is
$\nabla_\phi \hat{F}_n(\phi) = 0$,
because the samples $z_i$ are treated as constants: the dependence on $\phi$ is hidden inside the sampling distribution.
A zero gradient cannot drive learning, so we use the Reparameterization Trick!

Let’s assume that $q_\phi(z) = N(\mu_\phi, \Sigma_\phi)$
$$
Z \overset{d}{=} \mu_\phi + \Sigma_\phi^{1/2} \cdot g \quad , \quad Z \sim q_\phi(z), \; g \sim N(0,I)
$$
Then the expectation can be rewritten:
$$
F(\phi) = \underset{q_\phi(z)}{\mathbb{E}} [f(z)]=\underset{g \sim N(0,I)}{\mathbb{E}} \left[ f(\mu_\phi+ \Sigma_\phi^{1/2} \cdot g) \right]
$$

$$
\hat{F}_n(\phi) = \frac{1}{n} \sum_{i=1}^n f(\mu_\phi+ \Sigma_\phi^{1/2} \cdot g_i) \quad, \quad g_i \sim N(0,I)
$$

Then:
$$\mathbb{E}[\hat{F}_n(\phi)] = F(\phi)$$
$$\mathbb{E}[\nabla_\phi \hat{F}_n(\phi)] = \nabla_\phi F(\phi)$$
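A minimal sketch of the difference between the two estimators (assuming PyTorch and a diagonal Gaussian $q_\phi$; `mu`, `log_sigma`, and `f` are illustrative placeholders):

```python
import torch

mu = torch.zeros(3, requires_grad=True)           # phi: mean of q_phi(z)
log_sigma = torch.zeros(3, requires_grad=True)    # phi: log std (diagonal Sigma_phi)
f = lambda z: (z ** 2).sum()                      # any differentiable f(z)

# Naive estimator: draw z ~ q_phi(z) as an opaque sample. Autodiff treats
# z_naive as a constant, so the gradient of f(z_naive) w.r.t. phi is 0.
z_naive = torch.distributions.Normal(mu, log_sigma.exp()).sample()

# Reparameterized estimator: z = mu + sigma * g with g ~ N(0, I), so z is a
# differentiable function of phi and gradients flow back through mu, log_sigma.
g = torch.randn(3)
z_reparam = mu + log_sigma.exp() * g
f(z_reparam).backward()
print(mu.grad, log_sigma.grad)                    # non-zero, unbiased gradient estimates
```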

Finite Sample VI objective

Given $x_1, x_2, \dots, x_n$, we want to learn $\theta$.
Draw $g_1,…,g_n \sim N(0,I)$
$$
\underset{\theta , \phi}{max}\; \hat{L}_n(\theta, \phi) = \frac{1}{n} \sum_{i=1}^n \left[ logP_\theta(x_i|z=e_{\mu,\phi}(x_i)+e_{\Sigma,\phi}^{1/2}(x_i) \cdot g_i) - KL(q_\phi(z|x_i) || P_\theta(z))\right]
$$

We can then run gradient ascent on $\hat{L}_n(\theta, \phi)$, jointly over $\theta$ and $\phi$.
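A minimal training-step sketch of this finite-sample objective (assuming PyTorch, a diagonal Gaussian encoder, a fixed standard-normal prior $P_\theta(z) = N(0, I)$, and an identity-covariance decoder; network sizes and names are illustrative):

```python
import torch
import torch.nn as nn

d, l = 784, 16                                    # data dimension, latent dimension

# Encoder e_phi(x): outputs the mean and log-variance of q_phi(z|x)
encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 2 * l))
# Decoder d_theta(z): outputs the mean of p_theta(x|z) (covariance fixed to I)
decoder = nn.Sequential(nn.Linear(l, 256), nn.ReLU(), nn.Linear(256, d))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def neg_elbo(x):
    mu, log_var = encoder(x).chunk(2, dim=-1)
    g = torch.randn_like(mu)                                     # g_i ~ N(0, I)
    z = mu + (0.5 * log_var).exp() * g                           # reparameterized latent
    log_px_z = -0.5 * ((x - decoder(z)) ** 2).sum(-1)            # log p_theta(x|z) up to a constant
    kl = 0.5 * (log_var.exp() + mu ** 2 - 1 - log_var).sum(-1)   # KL(q_phi(z|x) || N(0, I))
    return -(log_px_z - kl).mean()                               # -L_n(theta, phi)

x = torch.rand(32, d)                             # stand-in minibatch of data
opt.zero_grad()
neg_elbo(x).backward()
opt.step()                                        # one ascent step on L_n(theta, phi)
```

Minimizing $-\hat{L}_n$ with a standard optimizer is the same thing as the gradient ascent step on $\hat{L}_n(\theta, \phi)$ described above.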

Stepping back, the whole process is:
$$
\text{data} \underset{encode}{\to} \text{latent space} \underset{decode}{\to} \text{new data}
$$

  • encode uses $q_\phi(z|x)$
  • decode uses $p_\theta(x|z)$

Reconstruction loss

Suppose $p_\theta(x|z) = N(d_{\mu, \theta}(z), d_{\Sigma, \theta}(z))$

$$
log p_\theta(x|z) = -\frac{1}{2}|x-d_{\mu, \theta}(z)|^2_{d_{\Sigma, \theta}(z)^{-1}}-\frac{1}{2}log((2\pi)^d)-\frac{1}{2}log\, det\, d_{\Sigma, \theta}(z)
$$

where $z = e_{\mu,\phi}(x)+e_{\Sigma,\phi}^{1/2}(x) \cdot g$ is the reparameterized latent sample produced by the encoder.

Thus, taking $d_{\Sigma, \theta}(z) = I$ for simplicity, the quadratic term becomes
$$
-\frac{1}{2}|x-d_{\mu, \theta}(z)|^2 = -\frac{1}{2}|x-d_{\mu, \theta}(e_{\mu,\phi}(x)+e_{\Sigma,\phi}^{1/2}(x) \cdot g)|^2
$$

  • $x-d_{\mu, \theta}(z)$ is the decoder error
  • $e_{\mu,\phi}(x)+e_{\Sigma,\phi}^{1/2}(x) \cdot g$ is the output of the encoder
  • so this term is the decoder error evaluated on the encoded representation of $x$
  • the whole expression is the reconstruction error!
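A small sketch of the decoder log-likelihood term above (assuming PyTorch and a diagonal decoder covariance; `dec_mu` and `dec_log_var` are illustrative placeholders for the outputs of $d_{\mu,\theta}$ and the log-diagonal of $d_{\Sigma,\theta}$):

```python
import math
import torch

def decoder_log_likelihood(x, dec_mu, dec_log_var):
    """log p_theta(x|z) for a Gaussian decoder with diagonal covariance."""
    d = x.shape[-1]
    # -1/2 |x - d_mu(z)|^2_{Sigma^{-1}} : the (covariance-weighted) reconstruction error
    quad = -0.5 * ((x - dec_mu) ** 2 / dec_log_var.exp()).sum(-1)
    # -1/2 log((2 pi)^d) - 1/2 log det d_Sigma(z)
    norm = -0.5 * d * math.log(2 * math.pi) - 0.5 * dec_log_var.sum(-1)
    return quad + norm
```

With $d_{\Sigma, \theta}(z) = I$, `dec_log_var` is zero and only the squared reconstruction error remains, matching the simplification above.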