Latent Variable Models
Data is generated by:
$Z \sim p(z)$ [Latent variable]
$X \sim p(x|z)$ [Observation]
only $X$ is revealed to the learner
Example: Gaussian mixture model (GMM)
$k$ Gaussian distributions
$(\mu_1, \sigma_1),…,(\mu_k, \sigma_k)$
draw index $i \in [k] \sim $ Categorical ($\pi_1,…,\pi_k$)
draw $X|i\sim N(\mu_i, \sigma_i^2)$
observe $X$
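To make the generative process concrete, here is a minimal NumPy sketch of sampling from a 1-D GMM; the weights, means, and standard deviations below are made-up illustration values, not anything from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D mixture with k = 3 components (made-up parameters).
pi = np.array([0.5, 0.3, 0.2])      # categorical prior over components
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

def sample_gmm(n):
    """Draw n observations: first the latent index Z, then X | Z."""
    z = rng.choice(len(pi), size=n, p=pi)   # Z ~ Categorical(pi_1,...,pi_k), hidden
    x = rng.normal(mu[z], sigma[z])         # X | Z=i ~ N(mu_i, sigma_i^2)
    return x                                # only X is revealed to the learner

x = sample_gmm(1000)
```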
Example: Binomial model
Two coins with heads probabilities $P_1, P_2$
Draw an index $i \in [2] \sim Cat(\pi_1, \pi_2)$
Draw 10 tosses (iid) from Bernoulli$(P_i)$
Observe the number of heads $n_H \in \{0,1,…,10\}$
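A similar sketch for the two-coin generative process (the heads probabilities below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.3, 0.8])    # illustrative heads probabilities P_1, P_2 (made up)
pi = np.array([0.5, 0.5])   # probability of picking each coin

def sample_coin_counts(n):
    """Each round: pick a coin, toss it 10 times, reveal only n_H."""
    z = rng.choice(2, size=n, p=pi)       # latent coin index, never observed
    n_heads = rng.binomial(10, p[z])      # n_H ~ Binomial(10, P_z)
    return n_heads

counts = sample_coin_counts(5)
```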
Algorithm
Goal of learning:
Given iid examples: $x_1,…x_n \sim p(x)$
$p(x)$ has latent structure: $p(x)=\int p(x|z)p(z)dz$
If we directly optimize the log-likelihood:
$$
\begin{aligned}
log P_\theta(x_1,…,x_n) &= \sum_{i=1}^n log P_\theta(x_i) \\
& = \sum_{i=1}^n log \int P_\theta(x_i|z_i)\, p_\theta(z_i)\, dz_i
\end{aligned}
$$
Suppose we want to learn a GMM:
$$
\begin{aligned}
log \int P_\theta(x_i|z_i)\, p_\theta(z_i)\, dz_i
&= log \left[\sum_{j=1}^k P_\theta(x_i|z_i=j)\, \pi_j\right] \\
&= log \left[\sum_{j=1}^k \frac{1}{C(\sigma_j^2)} exp\left(-\frac{1}{2\sigma_j^2}(x_i-\mu_j)^2\right) \pi_j\right]
\end{aligned}
$$
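This log-of-a-sum structure is exactly what makes direct optimization awkward. For reference, a sketch (my own helper, not from the notes) that simply evaluates this objective for a 1-D GMM, using a log-sum-exp for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(x, pi, mu, sigma):
    """sum_i log sum_j pi_j * N(x_i; mu_j, sigma_j^2), for 1-D data."""
    x = np.asarray(x)[:, None]                             # shape (n, 1) against (k,) params
    log_phi = (-0.5 * ((x - mu) / sigma) ** 2
               - np.log(sigma) - 0.5 * np.log(2 * np.pi))  # log N(x_i; mu_j, sigma_j^2)
    return logsumexp(log_phi + np.log(pi), axis=1).sum()   # log of the inner sum, then sum over i
```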
Suppose we want to learn the binomial (two-coin) model:
$$
\begin{aligned}
log \int P_\theta(x_i|z_i)\, p_\theta(z_i)\, dz_i
&= log\left[P_\theta(x_i|z_i=1) \cdot \pi_1+ P_\theta(x_i|z_i=2) \cdot \pi_2\right]\\
&= log\left[\binom{10}{n_H} P_1^{n_H}(1-P_1)^{10-n_H}\cdot \pi_1 + \binom{10}{n_H} P_2^{n_H}(1-P_2)^{10-n_H}\cdot \pi_2 \right]
\end{aligned}
$$
Expectation-maximization (EM)
This is an iterative algorithm that produces a sequence of parameters $\theta_0, \theta_1, …$
E-step:
$$Q(\theta, \theta_t) = \underset{P_{\theta_t}(Z_{1:n}|X_{1:n})}{\mathbb{E}} \left[ log P_\theta(X_{1:n}, Z_{1:n}) \right]$$
M-step:
$$
\theta_{t+1} = \underset{\theta}{argmax} \quad Q(\theta, \theta_t)
$$
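Structurally the algorithm is just an alternation of these two steps. A minimal, model-agnostic sketch of the loop (the `e_step`/`m_step` functions are placeholders to be filled in by a concrete model, like the ones worked out below):

```python
def em(theta0, e_step, m_step, iters=100):
    """Generic EM loop: theta_{t+1} = argmax_theta Q(theta, theta_t).

    e_step(theta_t) should return the posterior quantities needed to form Q;
    m_step(...) should return the maximizer of Q(., theta_t), ideally in closed form.
    """
    theta = theta0
    for _ in range(iters):
        posterior_stats = e_step(theta)   # E-step: expectations under P_{theta_t}(Z | X)
        theta = m_step(posterior_stats)   # M-step: maximize Q(theta, theta_t)
    return theta
```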
Thm
For any data $X_{1:n}$ :
$$
logP_{\theta_{t+1}}(X_{1:n}) \geq logP_{\theta_{t}}(X_{1:n})
$$
Example: applying EM to the binomial model
E-step
$\theta_t = (P_{t,1}, P_{t,2}) \in (0,1)^2$
Assume each coin has probability 1/2 of being selected ($\pi_1 = \pi_2 = 1/2$)
$Z \in \{1,2\}$
$X \in \{0,1,…,10\}$
$$
P_{\theta_t}(Z|X) = \frac{P_{\theta_t}(Z,X)}{P_{\theta_t}(X)}
$$
small trick
$$
P(X,Z) = P(X)P(Z|X)=P(Z)P(X|Z)\\
P(Z|X) = \frac{P(Z,X)}{P(X)}
$$
$$
\begin{aligned}
\text{from above} &=\frac{P_{\theta_t}(Z,X)}{P_{\theta_t}(Z=1,X)+P_{\theta_t}(Z=2,X)} \\
&=\frac{P_{\theta_t}(Z)P_{\theta_t}(X|Z)}{P_{\theta_t}(Z=1,X)+P_{\theta_t}(Z=2,X)}\\
&=\frac{P_{\theta_t}(Z)P_{\theta_t}(X|Z)}{P_{\theta_t}(Z=1)P_{\theta_t}(X|Z=1)+P_{\theta_t}(Z=2)P_{\theta_t}(X|Z=2)}
\end{aligned}
$$
small trick
$$
P(Z=1)=P(Z=2)=\frac{1}{2}
$$
$$
\begin{aligned}
\text{from above} &= \frac{P_{\theta_t}(X|Z)}{P_{\theta_t}(X|Z=1)+P_{\theta_t}(X|Z=2)}
\end{aligned}
$$
small trick: PMF of a binomial(10, p)
$$
P_{\theta_t}(X|Z=i) = \binom{10}{X} P_i^X(1-P_i)^{10-X}
$$
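A sketch of this E-step in code, using SciPy's binomial PMF to stand in for the formula above (note the binomial coefficient and the equal priors cancel in the ratio anyway):

```python
import numpy as np
from scipy.stats import binom

def coin_e_step(n_heads, p):
    """Posterior P_{theta_t}(Z = j | X = n_H) for each observation, with pi_1 = pi_2 = 1/2.

    n_heads: observed head counts in {0, ..., 10};  p: current (P_{t,1}, P_{t,2}).
    """
    lik = binom.pmf(np.asarray(n_heads)[:, None], 10, np.asarray(p))  # shape (n, 2)
    return lik / lik.sum(axis=1, keepdims=True)                       # responsibilities r
```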
M-step
$$
\begin{aligned}
Q(\theta,\theta_t)&=\underset{P_{\theta_t}(Z_{1:n}|X_{1:n})}{\mathbb{E}}[log P_\theta(Z_{1:n},X_{1:n})]\\
&=\sum_{i=1}^n \underset{P_{\theta_t}(Z_i|X_i)}{\mathbb{E}}[log P_\theta(Z_i,X_i)]\\
&=\sum_{i=1}^n \underset{P_{\theta_t}(Z_i|X_i)}{\mathbb{E}}[log(\tfrac{1}{2}) + log P_\theta(X_i|Z_i)]\\
&=-n\, log 2+\sum_{i=1}^n \underset{P_{\theta_t}(Z_i|X_i)}{\mathbb{E}}[log P_\theta(X_i|Z_i)]\\
&=-n\, log 2+\sum_{i=1}^n \sum_{j=1}^2 log P_\theta(X_i|Z_i=j)\cdot P_{\theta_t}(Z_i=j|X_i)\\
\end{aligned}
$$
small trick
$$
\begin{aligned}
logP_\theta(X_i|Z_i=j) &= log\left[\binom{10}{X_i} P_j^{X_i}(1-P_j)^{10-X_i}\right]\\
&=log\binom{10}{X_i}+X_i\, logP_j+(10-X_i)\, log(1-P_j)
\end{aligned}
$$
We denote $P_{\theta_t}(Z_i=j|X_i)$ by $r_{\theta_t}(j:X_i)$.
If we now optimize $Q(\theta, \theta_t)$ we get:
$$
P_{t+1, j} = \frac{\sum_{i=1}^n X_i r_{\theta_t}(j:X_i)}{10 \sum_{i=1}^n r_{\theta_t}(j:X_i)} \quad , \quad j \in \{1,2\}
$$
$$
r_{\theta_t}(j:X_i) = P_{\theta_t}(X_i \text{ was sampled from the } j^{\text{th}} \text{ coin})
$$
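Combining the two steps gives the full iteration. A sketch under the same assumptions, reusing the hypothetical `coin_e_step` helper above (the start `p0` is arbitrary but deliberately asymmetric, since a perfectly symmetric start keeps the two coins identical forever):

```python
import numpy as np

def coin_m_step(n_heads, r):
    """P_{t+1, j}: responsibility-weighted fraction of heads for each coin."""
    n_heads = np.asarray(n_heads)
    return (n_heads[:, None] * r).sum(axis=0) / (10 * r.sum(axis=0))

def em_coins(n_heads, p0=(0.4, 0.6), iters=50):
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        r = coin_e_step(n_heads, p)   # E-step: responsibilities under theta_t
        p = coin_m_step(n_heads, r)   # M-step: closed-form argmax of Q(theta, theta_t)
    return p
```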
Example: applying EM to a Gaussian mixture model
$Z \sim Cat(\pi_1,…,\pi_k)$
$X|(Z=i) \sim N(\mu_i, \Sigma_i)$
$[X \in \mathbb{R}^d, Z \in \{1,…,k\}]$
E-step
We want to compute $P_{\theta_t}(Z|X)$.
The parameters we care about are:
$$\theta_t = ( \{\mu_i, \Sigma_i\}_{i=1}^k, \{\pi_i \}_{i=1}^k )$$
Define the following:
$$
\phi(X;\mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det\Sigma}} exp\left(-\frac{1}{2}\|X-\mu\|_{\Sigma^{-1}}^2\right)
$$
Remember
$X^T\Sigma^{-1} X = \|X\|_{\Sigma^{-1}}^2$
$$
\begin{aligned}
P_{\theta_t}(Z=i|X) &= \frac{P_{\theta_t}(Z=i,X)}{\sum_{j=1}^k P_{\theta_t}(Z=j,X)}\\
&=\frac{\pi_i \phi(X;\mu_i,\Sigma_i)}{\sum_{j=1}^k\pi_j \phi(X;\mu_j,\Sigma_j)}
\end{aligned}
$$
As before, we write $r_{\theta_t}(j:X_i)$ for $P_{\theta_t}(Z_i=j|X_i)$.
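A sketch of this responsibility computation for $d$-dimensional data, relying on SciPy's multivariate normal density (array shapes are my own convention):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_e_step(X, pi, mu, Sigma):
    """r[i, j] = pi_j phi(X_i; mu_j, Sigma_j) / sum_l pi_l phi(X_i; mu_l, Sigma_l).

    X: (n, d) data, pi: (k,) weights, mu: (k, d) means, Sigma: (k, d, d) covariances.
    """
    k = len(pi)
    dens = np.column_stack(
        [multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j]) for j in range(k)]
    )                                   # dens[i, j] = phi(X_i; mu_j, Sigma_j)
    unnorm = pi * dens                  # numerator pi_j * phi(X_i; mu_j, Sigma_j)
    return unnorm / unnorm.sum(axis=1, keepdims=True)
```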
$$
\begin{aligned}
Q(\theta, \theta_t) &= \sum_{i=1}^n \underset{P_{\theta_t}(Z_i|X_i)}{\mathbb{E}}[log p_\theta(Z_i,X_i)]\\
&=\sum_{i=1}^n \sum_{j=1}^k log P_\theta (Z_i=j,X_i) \cdot P_{\theta_t}(Z_i=j|X_i)\\
&=\sum_{i=1}^n \sum_{j=1}^k log P_\theta (Z_i=j,X_i) \cdot r_{\theta_t}(j:X_i)
\end{aligned}
$$
$log P_\theta (Z_i=j,X_i) = log \pi_j + log \phi(X_i; \mu_j, \Sigma_j)$
$$
\begin{aligned}
\text{from above} &= \sum_{i=1}^n \sum_{j=1}^k log P_\theta (Z_i=j,X_i) \cdot r_{\theta_t}(j:X_i) \\
&= \sum_{i=1}^n \sum_{j=1}^k [log \pi_j + log \phi(X_i; \mu_j, \Sigma_j)] \cdot r_{\theta_t}(j:X_i) \\
&= \sum_{j=1}^k\left[\sum_{i=1}^n r_{\theta_t}(j:X_i)\right] \cdot log \pi_j + \sum_{i=1}^n \sum_{j=1}^k log \phi(X_i;\mu_j,\Sigma_j) \cdot r_{\theta_t}(j:X_i)
\end{aligned}
$$
For the means and covariances, only the second term of $Q(\theta, \theta_t)$ matters: $\sum_{i=1}^n \sum_{j=1}^k log \phi(X_i;\mu_j,\Sigma_j) \cdot r_{\theta_t}(j:X_i)$
M-step
- prior parameter $\pi_j$:
$$
\begin{aligned}
\pi_j^{t+1} &= \frac{\sum_{i=1}^n r_{\theta_t}(j:X_i)}{n} \quad ,j\in[k] \\
&=\frac{1}{n}\, r_{\theta_t}(j) \quad \text{where } r_{\theta_t}(j) := \sum_{i=1}^n r_{\theta_t}(j:X_i)
\end{aligned}
$$
- means $\mu_j$:
$$
\mu_j^{t+1} = \sum_{i=1}^n \frac{r_{\theta_t}(j:X_i)}{r_{\theta_t}(j)}X_i
$$
- covariance $\Sigma_j$:
$$
\Sigma_j^{t+1} = \sum_{i=1}^n \frac{r_{\theta_t}(j:X_i)}{r_{\theta_t}(j)}(X_i-\mu_j^{t+1})(X_i-\mu_j^{t+1})^T
$$
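The three updates together, as a sketch using the same array conventions as the hypothetical `gmm_e_step` above:

```python
import numpy as np

def gmm_m_step(X, r):
    """Closed-form maximizers of Q: mixture weights, means, covariances."""
    n, d = X.shape
    Nj = r.sum(axis=0)                         # r_{theta_t}(j) = sum_i r_{theta_t}(j : X_i)
    pi = Nj / n                                # pi_j^{t+1}
    mu = (r.T @ X) / Nj[:, None]               # mu_j^{t+1}: responsibility-weighted mean
    Sigma = np.empty((len(Nj), d, d))
    for j in range(len(Nj)):
        diff = X - mu[j]
        Sigma[j] = (r[:, j, None] * diff).T @ diff / Nj[j]   # weighted outer products
    return pi, mu, Sigma
```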
Monotonic improvement:
EM satisfies:
For any data $X_{1:n}$ :
$$
logP_{\theta_{t+1}}(X_{1:n}) \geq logP_{\theta_{t}}(X_{1:n})
$$
Proof
Abbreviate $X =X_{1:n}$, $Z=Z_{1:n}$
OBS: Bayes rule:
$$
P_{\theta_{t+1}}(X) = \frac{P_{\theta_{t+1}}(Z,X)}{P_{\theta_{t+1}}(Z|X)}
$$
Taking logs, then taking the expectation over $Z \sim P_{\theta_t}(Z|X)$ (the left-hand side does not depend on $Z$, so it is unchanged):
$$
\begin{aligned}
logP_{\theta_{t+1}}(X) &= logP_{\theta_{t+1}}(Z,X)-logP_{\theta_{t+1}}(Z|X) \\
&=\underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}[logP_{\theta_{t+1}}(Z,X)-logP_{\theta_{t+1}}(Z|X)]\\
&=Q(\theta_{t+1}, \theta_t)-\underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}[logP_{\theta_{t+1}}(Z|X)]\\
&\geq Q(\theta_t, \theta_t)-\underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}[logP_{\theta_{t+1}}(Z|X)]\\
&= Q(\theta_t, \theta_t)-\underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}[log[P_{\theta_{t+1}}(Z|X) \cdot \frac{P_{\theta_t}(Z|X)}{P_{\theta_t}(Z|X)}]]\\
&= Q(\theta_t, \theta_t)-\underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}logP_{\theta_t}(Z|X)- \underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}log[\frac{P_{\theta_{t+1}}(Z|X)}{P_{\theta_t}(Z|X)}]\\
&= Q(\theta_t, \theta_t) - \underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}logP_{\theta_t}(Z|X) + \underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}[log\frac{P_{\theta_t}(Z|X)}{P_{\theta_{t+1}}(Z|X)}]\\
\end{aligned}
$$
KL
For two distributions $p$ and $q$:
$$KL(p\|q) = \underset{p}{\mathbb{E}}\, log\, \frac{p}{q}$$
The KL divergence is nonnegative!
$$
\underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}[log\frac{P_{\theta_t}(Z|X)}{P_{\theta_{t+1}}(Z|X)}] = KL(P_{\theta_t}(Z|X)||P_{\theta_{t+1}}(Z|X)) \geq 0
$$
$$
\begin{aligned}
\text{from above} &\geq Q(\theta_t, \theta_t) - \underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}logP_{\theta_t}(Z|X)\\
&= \underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}[logP_{\theta_t}(X,Z)-logP_{\theta_t}(Z|X)]\\
&= \underset{Z \sim P_{\theta_t}(Z|X)}{\mathbb{E}}[logP_{\theta_t}(X)]\\
&= logP_{\theta_t}(X)\\
\end{aligned}
$$
Thus:
$$
logP_{\theta_{t+1}}(X) \geq logP_{\theta_t}(X)
$$
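As an informal sanity check of the theorem (not part of the notes), one can run the two-coin EM sketch from earlier and watch the marginal log-likelihood: it should never decrease. This assumes the hypothetical `coin_e_step` / `coin_m_step` helpers defined above.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import binom

def coin_log_likelihood(n_heads, p):
    """log P_theta(X_{1:n}) = sum_i log [ (1/2) Binom(X_i; 10, P_1) + (1/2) Binom(X_i; 10, P_2) ]."""
    logp = binom.logpmf(np.asarray(n_heads)[:, None], 10, np.asarray(p)) + np.log(0.5)
    return logsumexp(logp, axis=1).sum()

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.8])                       # made-up ground truth
data = rng.binomial(10, true_p[rng.choice(2, size=200)])

p = np.array([0.45, 0.55])                          # arbitrary asymmetric start
prev = coin_log_likelihood(data, p)
for _ in range(30):
    p = coin_m_step(data, coin_e_step(data, p))     # one EM update theta_t -> theta_{t+1}
    cur = coin_log_likelihood(data, p)
    assert cur >= prev - 1e-9                       # monotone improvement (up to float error)
    prev = cur
```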