Unsupervised Learning
Setup: observe data $x_1,\dots,x_n \in \mathbb{R}^d$
$x_i \sim \mathbb{D}$
No labels $y_1,\dots,y_n$
- Representation statistics
  a. principal component analysis (PCA)
  b. covariance estimation
- Clustering
  a. k-means clustering
  b. expectation-maximization
- Density estimation (learn the density $p(x)$ of $\mathbb{D}$)
- Sampling / novel example generation
  a. how to sample novel instances from $\mathbb{D}$
- Not a complete list of unsupervised learning tasks
Energy-Based Models (EBM)
An algorithm to learn a data sampler.
Input:
- $x_1,…,x_n \sim \mathbb{D}$
Output:
- a distribution $\hat{\mathbb{D}}$ that we can sample from, such that $dist(\mathbb{D}, \hat{\mathbb{D}})$ is small.
Key idea:
- forget about learning the density exactly
- learn it only up to normalization
$$
P_\theta(x) \propto \exp(f_\theta(x))
$$
Writing $P_\theta(x) = c \exp(f_\theta(x))$, we know that $\int P_\theta(x)\,dx = 1 = c\int \exp(f_\theta(x))\,dx$,
so $c=\frac{1}{\int \exp(f_\theta(x))\,dx}$.
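As a quick sanity check, the normalizer can be approximated numerically in one dimension. This is a minimal sketch with a made-up energy $f_\theta(x) = -x^2/2$ (not from these notes), whose exact normalizer is $\sqrt{2\pi}$:
```python
import numpy as np

# Hypothetical toy energy f_theta(x) = -x^2 / 2; its exact normalizer is sqrt(2*pi).
def f_theta(x):
    return -0.5 * x ** 2

xs = np.linspace(-10.0, 10.0, 100_001)   # grid wide enough that exp(f_theta) is ~0 outside it
dx = xs[1] - xs[0]
Z = np.sum(np.exp(f_theta(xs))) * dx      # Riemann-sum estimate of Z[theta] = int exp(f_theta(x)) dx
c = 1.0 / Z                               # normalizing constant, so P_theta(x) = c * exp(f_theta(x))

print(Z, np.sqrt(2 * np.pi))              # both ~2.5066
```
In high dimensions a grid like this is hopeless, which is why we want to avoid computing $Z[\theta]$ explicitly.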
Learning EBMs
We need a loss function.
- principle of maximum likelihood estimation
Pick $\theta$ to maximize
$$
L[\theta] = \underset{x \sim \mathbb{D}}{\mathbb{E}}[\log P_\theta(x)]
$$
$P_\theta(x) = \frac{\exp(f_\theta(x))}{Z[\theta]}$, $Z[\theta] = \int \exp(f_\theta(x))\,dx$
What does maximum likelihood estimation (MLE) optimize for?
Define:
$KL(p,q) = \underset{x \sim p}{\mathbb{E}} \left[\log \frac{p(x)}{q(x)}\right]$
(Kullback–Leibler divergence)
- KL-div is not a true metric on probability distributions! $(KL(p,q) \neq KL(q,p))$
- but $KL(p,q)=0 \Leftrightarrow p=q$
Thus, the framework of max likelihood estimation is actually learning a distribution that minimizes this KL divergence.
maximize likelihood $\to$ minimize KL divergence
$$KL(P_{true}, \hat{P}) = \underset{x \sim P_{true}}{\mathbb{E}} \left[\log \frac{P_{true}(x)}{\hat{P}(x)}\right]$$
$$= \underset{x \sim P_{true}}{\mathbb{E}} [\log P_{true}(x)] - \underset{x \sim P_{true}}{\mathbb{E}} [\log \hat{P}(x)]$$
Now, we need to minimize KL-div:
$$
\begin{aligned}
\underset{\hat{P}}{\operatorname{argmin}}\, KL(P_{true}, \hat{P}) &= \underset{\hat{P}}{\operatorname{argmin}} \left[-\underset{x \sim P_{true}}{\mathbb{E}} \log \hat{P}(x)\right] \quad \text{(the first term does not depend on } \hat{P}) \\
&= \underset{\hat{P}}{\operatorname{argmax}} \left[\underset{x \sim P_{true}}{\mathbb{E}} \log \hat{P}(x)\right] \\
&= \text{the MLE objective}
\end{aligned}
$$
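A quick numeric illustration of the asymmetry, with two made-up two-point distributions:
```python
import numpy as np

def kl(p, q):
    """KL(p, q) = E_{x~p}[log(p(x)/q(x))] for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(kl(p, q))   # ~0.368
print(kl(q, p))   # ~0.511  -> KL(p,q) != KL(q,p); both equal 0 only when p == q
```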
Why is $P_{true}$ first in the KL? (Remember: KL-div is not a true metric on probability distributions.)
$$
\begin{aligned}
KL(\hat{P}, P_{true}) &= \underset{x \sim \hat{P}}{\mathbb{E}} \left[\log \frac{\hat{P}(x)}{P_{true}(x)}\right] \\
&= \underset{x \sim \hat{P}}{\mathbb{E}} [\log \hat{P}(x)] - \underset{x \sim \hat{P}}{\mathbb{E}} [\log P_{true}(x)]
\end{aligned}
$$
With $P_{true}$ second, both expectations are taken under $\hat{P}$, and the second term requires evaluating the unknown density $P_{true}(x)$; neither can be estimated from samples $x_1,\dots,x_n \sim P_{true}$ alone. With $P_{true}$ first, the only term that depends on $\hat{P}$ is $-\underset{x \sim P_{true}}{\mathbb{E}}[\log \hat{P}(x)]$, which we can estimate from data: exactly the MLE objective.
Back to training: maximize
$$
L[\theta] = \underset{x \sim \mathbb{D}}{\mathbb{E}}[\log P_\theta(x)]
$$
$P_\theta(x) = \frac{\exp(f_\theta(x))}{Z[\theta]}$, $Z[\theta] = \int \exp(f_\theta(x))\,dx$
We can optimize with gradients, but $Z[\theta]$ will not simply disappear:
$$
\begin{aligned}
\nabla_\theta L[\theta] &= \nabla_\theta \underset{x \sim \mathbb{D}}{\mathbb{E}}[\log P_\theta(x)] \\
&= \nabla_\theta \underset{x \sim \mathbb{D}}{\mathbb{E}}[\log(\exp(f_\theta(x)))-\log Z[\theta]] \\
&= \nabla_\theta \underset{x \sim \mathbb{D}}{\mathbb{E}}[f_\theta(x)-\log Z[\theta]] \\
&= \underset{x \sim \mathbb{D}}{\mathbb{E}}[\nabla_\theta f_\theta(x)-\nabla_\theta \log Z[\theta]] \\
\end{aligned}
$$
The problem is how to compute $\nabla_\theta \log Z[\theta]$.
$$
\nabla_\theta \log Z[\theta] = \frac{1}{Z[\theta]} \nabla_\theta Z[\theta]
$$
Using the definition $Z[\theta] = \int \exp(f_\theta(x))\,dx$:
$$
\nabla_\theta \log Z[\theta] = \frac{1}{Z[\theta]} \nabla_\theta \int \exp(f_\theta(x))\,dx
$$
$$
\nabla_\theta \log Z[\theta] = \frac{1}{Z[\theta]} \int \nabla_\theta \exp(f_\theta(x))\,dx
$$
$$
\nabla_\theta \log Z[\theta] = \frac{1}{Z[\theta]} \int \exp(f_\theta(x)) \cdot \nabla_\theta f_\theta(x)\,dx
$$
$$
\nabla_\theta \log Z[\theta] = \int \frac{\exp(f_\theta(x))}{Z[\theta]} \cdot \nabla_\theta f_\theta(x)\,dx
$$
Using the definition $P_\theta(x) = \frac{\exp(f_\theta(x))}{Z[\theta]}$:
$$
\nabla_\theta \log Z[\theta] = \int P_\theta(x) \cdot \nabla_\theta f_\theta(x)\,dx
$$
$$
\nabla_\theta \log Z[\theta] = \underset{x \sim P_\theta}{\mathbb{E}}[\nabla_\theta f_\theta(x)]
$$
Plugging this back into $\nabla_\theta L[\theta]$:
$$
\begin{aligned}
\nabla_\theta L[\theta] &= \underset{x \sim \mathbb{D}}{\mathbb{E}}[\nabla_\theta f_\theta(x)] - \underset{x \sim \mathbb{D}}{\mathbb{E}}\left[\underset{x \sim P_\theta}{\mathbb{E}}[\nabla_\theta f_\theta(x)]\right]\\
&= \underset{x \sim \mathbb{D}}{\mathbb{E}}[\nabla_\theta f_\theta(x)] - \underset{x \sim P_\theta}{\mathbb{E}}[\nabla_\theta f_\theta(x)]\\
\end{aligned}
$$
- the first term is an expectation under the true data distribution
- the second term is an expectation under the current EBM
Thus, if these two distributions are similar (or equal), the gradient is close to zero.
Define a finite-sample gradient:
$$
\nabla_\theta \hat{L}_n[\theta] = \frac{1}{n} \sum^n_{i=1} \nabla_\theta f_\theta(x_i) - \frac{1}{n} \sum^n_{i=1} \nabla_\theta f_\theta(\bar{x}_i)
$$
- $x_1, \dots, x_n \sim \mathbb{D}$ come from the (finite) true data set
- $\bar{x}_1, \dots, \bar{x}_n \sim P_\theta(\cdot)$ come from the model
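A minimal sketch of one step with this estimator. The names `grad_f`, `data_batch`, and `model_batch` are placeholders: `grad_f(theta, x)` should return $\nabla_\theta f_\theta(x)$, and the model samples $\bar{x}_i$ would come from a sampler such as the Langevin procedure described in the next section:
```python
import numpy as np

def ebm_gradient(grad_f, theta, data_batch, model_batch):
    """Finite-sample estimate of grad_theta L[theta]:
    (1/n) sum_i grad f_theta(x_i)  -  (1/n) sum_i grad f_theta(xbar_i)."""
    pos = np.mean([grad_f(theta, x) for x in data_batch], axis=0)     # x_i ~ D (training data)
    neg = np.mean([grad_f(theta, xb) for xb in model_batch], axis=0)  # xbar_i ~ P_theta (model samples)
    return pos - neg

# One gradient-ascent step on theta (eta is a step size):
#   theta = theta + eta * ebm_gradient(grad_f, theta, data_batch, model_batch)
```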
Sampling from $P_\theta(\cdot)$
How does one draw samples from $P_\theta(\cdot)$?
If $X$ is discrete, this is trivial:
Compute $P_\theta(1), \dots, P_\theta(n)$
Sample from the weighted distribution $\{P_\theta(i)\}^n_{i=1}$
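A sketch of the discrete case; the values passed in stand for $f_\theta(i)$ and are made up for illustration:
```python
import numpy as np

def sample_discrete_ebm(f_values, n_samples, seed=0):
    """Sample from P_theta(i) proportional to exp(f_theta(i)) over a finite state space."""
    f = np.asarray(f_values, float)
    probs = np.exp(f - f.max())          # subtract max for numerical stability
    probs /= probs.sum()                 # P_theta(i) = exp(f_i) / sum_j exp(f_j)
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=n_samples, p=probs)

samples = sample_discrete_ebm([1.0, 0.0, -1.0], n_samples=5)   # indices drawn with the softmax weights
```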
If $X=\mathbb{R}^d$, then what?
Use Langevin sampling.
Langevin sampling
Construct a process $\{Z_t\}^T_{t=0}$
$$
Z_{t+1} = Z_t + \eta \nabla_x \log p_\theta(Z_t) + \sqrt{2\eta} \cdot W_t
$$
- $W_t \sim N(0, I_d)$ is a random Gaussian vector
- $\eta \gt 0$ is the step size
- $Z_0$ is an arbitrary initialization (e.g. $Z_0 = 0$)
$\log p_\theta(x) = f_\theta(x)-\log Z[\theta]$
If we want to compute $\nabla_x \log p_\theta(x)$,
remember that $\log Z[\theta]$ does not depend on $x$.
Thus,
$\nabla_x \log p_\theta(x) = \nabla_x f_\theta(x)$
Output: $Z_T$
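A minimal sketch of the sampler, assuming a function `grad_f` that returns $\nabla_x f_\theta(x)$; the quadratic test energy at the end is a made-up example whose target is simply $N(\mu, I)$:
```python
import numpy as np

def langevin_sample(grad_f, d, eta=1e-2, T=5000, seed=0):
    """Run Z_{t+1} = Z_t + eta * grad_x log p_theta(Z_t) + sqrt(2*eta) * W_t.

    Since log p_theta(x) = f_theta(x) - log Z[theta], grad_x log p_theta = grad_x f_theta."""
    rng = np.random.default_rng(seed)
    z = np.zeros(d)                                     # Z_0 = 0 (any initialization works)
    for _ in range(T):
        w = rng.standard_normal(d)                      # W_t ~ N(0, I_d)
        z = z + eta * grad_f(z) + np.sqrt(2 * eta) * w
    return z                                            # approximate sample from P_theta after T steps

# Example: f_theta(x) = -||x - mu||^2 / 2, so grad_x f_theta(x) = mu - x and P_theta = N(mu, I).
mu = np.array([1.0, -2.0])
x = langevin_sample(lambda z: mu - z, d=2)              # x should land near mu, up to unit-variance noise
```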
Calculation for discrete EBMs
$X = [m]$
$P_\theta(x=i) \propto \exp(\theta_i)$, $i \in [m]$.
$$
P_\theta(x=i) = \frac{\exp(\theta_i)}{\sum^m_{j=1} \exp(\theta_j)}
$$
Why bother? Let's just parameterize $P_\theta(x=i) = p_i$ directly, subject to
- $\sum^m_{i=1} p_i = 1$
- $p_i \geq 0$, $\forall i \in [m]$
Computing the gradient $\nabla_\theta L[\theta]$ for discrete EBMs:
$$
\nabla_\theta L[\theta] = p_*-p_\theta
$$
- $p_*$ is the true distribution, $p_* \in \mathbb{R}^m$
- $p_\theta \in \mathbb{R}^m$, where $P_\theta(x=i) = \frac{\exp(\theta_i)}{\sum^m_{j=1} \exp(\theta_j)}$
Look at gradient ascent:
$\theta_{t+1} = \theta_t + \eta \nabla_\theta L[\theta_t]$
$$
P_{\theta_{t+1}}(i) = \frac{P_{\theta_t}(i) \cdot \exp(\eta \nabla_t(i))}{\sum^m_{j=1}P_{\theta_t}(j) \cdot \exp(\eta \nabla_t(j))}
$$
where $\nabla_t = p_*-p_{\theta_t}$
Now let's compute the gradient:
$$
\begin{aligned}
\nabla_\theta L[\theta] &= \underset{x \sim P_{true}}{\mathbb{E}}[\nabla_\theta f_\theta(x)] - \underset{x \sim P_\theta}{\mathbb{E}}[\nabla_\theta f_\theta(x)]\\
&=\underset{i \sim p_*}{\mathbb{E}}[e_i] - \underset{i \sim p_\theta}{\mathbb{E}}[e_i]\\
&=p_* - p_\theta
\end{aligned}
$$
- $P_\theta(i) \propto \exp(\theta_i)$
- $\nabla_\theta f_\theta(x=i) = \nabla_\theta \theta_i = e_i$, the $i$-th standard basis vector
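A small sketch of this calculation in action; the particular $p_*$ below is made up. Gradient ascent with $\nabla_\theta L[\theta] = p_* - p_\theta$ drives $p_\theta$ toward $p_*$:
```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()                  # P_theta(i) = exp(theta_i) / sum_j exp(theta_j)

p_star = np.array([0.7, 0.2, 0.1])      # hypothetical true distribution p_*
theta = np.zeros(3)                     # initial parameters (p_theta starts uniform)
eta = 0.5                               # step size

for _ in range(2000):
    grad = p_star - softmax(theta)      # grad_theta L[theta] = p_* - p_theta
    theta = theta + eta * grad          # gradient ascent; equivalent to the multiplicative
                                        # update on P_theta(i) shown in the display above

print(softmax(theta))                   # ~[0.7, 0.2, 0.1]: p_theta converges to p_*
```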
Calculation for Gaussian EBMs
Assume that $P_{true} = N(\mu, I)$
$P_\theta(x) = \frac{\exp(f_\theta(x))}{Z[\theta]}$
$f_\theta(x) = -\frac{1}{2} \|x-\theta\|^2$
$\theta = \mu$ recovers $P_{true}$
$\mathcal{F}=\left\{f_\theta(x) = -\frac{1}{2} \|x-\theta\|^2 \,\middle|\, \theta \in \mathbb{R}^d\right\}$
$\nabla_\theta f_\theta(x) = x-\theta$
Now let's compute the gradient:
$$
\begin{aligned}
\nabla_\theta L[\theta] &= \underset{x \sim P_{true}}{\mathbb{E}}[\nabla_\theta f_\theta(x)] - \underset{x \sim P_\theta}{\mathbb{E}}[\nabla_\theta f_\theta(x)]\\
& = \underset{x \sim P_{true}}{\mathbb{E}}[x-\theta]-\underset{x \sim P_\theta}{\mathbb{E}}[x-\theta]\\
&=\mu - \theta
\end{aligned}
$$
- The second term is zero: $P_\theta = N(\theta, I)$, so $\underset{x \sim P_\theta}{\mathbb{E}}[x-\theta] = 0$; the first term gives $\underset{x \sim P_{true}}{\mathbb{E}}[x] - \theta = \mu - \theta$.
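A finite-sample sketch of the same calculation on synthetic data (the mean `mu` below is made up). Since $\nabla_\theta L[\theta] = \mu - \theta$ and the model term vanishes, gradient ascent drives $\theta$ to the (empirical) data mean:
```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0, 0.5])                            # hypothetical true mean of P_true = N(mu, I)
data = rng.normal(loc=mu, scale=1.0, size=(10_000, 3))     # x_1, ..., x_n ~ P_true

theta = np.zeros(3)
eta = 0.1
for _ in range(500):
    # Finite-sample first term: mean over the data of (x - theta).  The second (model) term
    # E_{x ~ P_theta}[x - theta] is exactly zero because P_theta = N(theta, I).
    grad = (data - theta).mean(axis=0)
    theta = theta + eta * grad

print(theta, data.mean(axis=0))                            # theta ends up at the empirical mean, ~mu
```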