Unsupervised Learning

Setup: we observe data $x_1,…,x_n \in \mathbb{R}^d$
$x_i \sim D$
No labels $y_1,…,y_n$

  1. Representation statistics
    a. principal component analysis (PCA)
    b. covariance estimation
  2. Clustering
    a. k-means clustering
    b. expectation-maximization
  3. Density estimation (learn the density $p(x)$ of $\mathbb{D}$)
  4. Sampling / novel example generation
    a. how to sample novel instances from $\mathbb{D}$
  • Not a complete list of unsupervised learning tasks

Energy-Based Models (EBMs)

An algorithm to learn a data sampler.

Input:

  • $x_1,…,x_n \sim \mathbb{D}$

Output:

  • $\hat{\mathbb{D}}$ that we can sample from, such that $dist(\mathbb{D}, \hat{\mathbb{D}})$ is small.

Key idea:

  • forget about learning the density exactly
  • only learn it up to normalization

$$
P_\theta(x) \propto \exp(f_\theta(x))
$$

We know that $\int P_\theta(x)\,dx = 1 = c\int \exp(f_\theta(x))\,dx$.
So $c=\frac{1}{\int \exp(f_\theta(x))\,dx}$.
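As a sanity check on the normalization constant, here is a small numerical sketch. The quadratic energy $f(x) = -x^2/2$ is a hypothetical choice for which the true value of $Z$ is $\sqrt{2\pi}$:

```python
import numpy as np

# Hypothetical 1D energy: f(x) = -x^2/2, so exp(f(x)) is an
# unnormalized standard Gaussian and the true Z is sqrt(2*pi).
def f(x):
    return -0.5 * x**2

# Riemann-sum approximation of Z = ∫ exp(f(x)) dx over a wide grid.
xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
Z = np.exp(f(xs)).sum() * dx
c = 1.0 / Z  # the normalization constant from above
```

With a fine enough grid, `Z` agrees with $\sqrt{2\pi} \approx 2.5066$ to several decimal places.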

Learning EBMs

We need a loss function.

  • principle: maximum likelihood estimation

Pick $\theta$ to maximize
$$
L[\theta] = \underset{x \sim D}{\mathbb{E}}[\log P_\theta(x)]
$$
where $P_\theta(x) = \frac{\exp(f_\theta(x))}{Z[\theta]}$, $Z[\theta] = \int \exp(f_\theta(x))\,dx$.

What does maximum likelihood estimation (MLE) optimize for?
Define:
$KL(p,q) = \underset{x \sim p}{\mathbb{E}} [\log \frac{p(x)}{q(x)}]$
(the Kullback–Leibler divergence)

  • KL-div is not a true metric on probability distributions!
    $(KL(p,q) \neq KL(q,p))$
  • but $KL(p,q)=0 \Leftrightarrow p=q$

Thus, maximum likelihood estimation is actually learning the distribution that minimizes this KL divergence.

max likelihood → minimize KL divergence

$$KL(P_{true}, \hat{P}) = \underset{x \sim P_{true}}{\mathbb{E}} \log\left(\frac{P_{true}(x)}{\hat{P}(x)}\right)$$

$$= \underset{x \sim P_{true}}{\mathbb{E}} \log(P_{true}(x)) - \underset{x \sim P_{true}}{\mathbb{E}} \log(\hat{P}(x))$$

Now, minimize the KL divergence (the first term above does not depend on $\hat{P}$):
$$
\begin{aligned}
\underset{\hat{P}}{argmin}\, KL(P_{true}, \hat{P}) &= \underset{\hat{P}}{argmin} [-\underset{x \sim P_{true}}{\mathbb{E}} \log(\hat{P}(x))] \\
&= \underset{\hat{P}}{argmax} [\underset{x \sim P_{true}}{\mathbb{E}} \log(\hat{P}(x))] \\
&= \text{the MLE objective}
\end{aligned}
$$
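A quick numerical sanity check of this equivalence, using a hypothetical 3-outcome distribution and random candidate distributions:

```python
import numpy as np

# Over random candidate distributions q, the q minimizing KL(p, q)
# is the same q maximizing E_{x~p}[log q(x)], since they differ
# only by the constant E_{x~p}[log p(x)].
p = np.array([0.5, 0.3, 0.2])  # hypothetical "true" distribution

rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=1000)  # 1000 random q's

kl = (p * np.log(p / candidates)).sum(axis=1)    # KL(p, q) for each q
loglik = (p * np.log(candidates)).sum(axis=1)    # E_{x~p}[log q(x)] for each q
# argmin of kl and argmax of loglik pick the same candidate
```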

Why is $P_{true}$ first in the KL? (Remember: KL divergence is not symmetric.) Consider the other order:
$$
KL(\hat{P}, P_{true}) = \underset{x \sim \hat{P}}{\mathbb{E}} \log\left(\frac{\hat{P}(x)}{P_{true}(x)}\right) = \underset{x \sim \hat{P}}{\mathbb{E}} \log(\hat{P}(x)) - \underset{x \sim \hat{P}}{\mathbb{E}} \log(P_{true}(x))
$$
Here both expectations are over $x \sim \hat{P}$, and the second term requires evaluating the unknown density $P_{true}(x)$. With $P_{true}$ first, the objective reduces to an expectation of $\log \hat{P}(x)$ over data samples, which we can estimate.

Back to training: maximize

$$
L[\theta] = \underset{x \sim D}{\mathbb{E}}[\log P_\theta(x)]
$$
with $P_\theta(x) = \frac{\exp(f_\theta(x))}{Z[\theta]}$, $Z[\theta] = \int \exp(f_\theta(x))\,dx$.

We can use gradient ascent; conveniently, the intractable $Z[\theta]$ term turns into an expectation under the model.

$$
\begin{aligned}
\nabla_\theta L[\theta] &= \nabla_\theta \underset{x \sim D}{\mathbb{E}}[\log P_\theta(x)] \\
&= \nabla_\theta \underset{x \sim D}{\mathbb{E}}[\log(\exp(f_\theta(x)))-\log(Z[\theta])] \\
&= \nabla_\theta \underset{x \sim D}{\mathbb{E}}[f_\theta(x)-\log(Z[\theta])] \\
&= \underset{x \sim D}{\mathbb{E}}[\nabla_\theta f_\theta(x)-\nabla_\theta \log(Z[\theta])] \\
\end{aligned}
$$

The problem is how to compute $\nabla_\theta \log(Z[\theta])$:
$$
\nabla_\theta \log(Z[\theta]) = \frac{1}{Z[\theta]} \nabla_\theta Z[\theta]
$$
Use the definition $Z[\theta] = \int \exp(f_\theta(x))\,dx$:
$$
\begin{aligned}
\nabla_\theta \log(Z[\theta]) &= \frac{1}{Z[\theta]} \nabla_\theta \int \exp(f_\theta(x))\,dx \\
&= \frac{1}{Z[\theta]} \int \nabla_\theta \exp(f_\theta(x))\,dx \\
&= \frac{1}{Z[\theta]} \int \exp(f_\theta(x)) \cdot \nabla_\theta f_\theta(x)\,dx \\
&= \int \frac{\exp(f_\theta(x))}{Z[\theta]} \cdot \nabla_\theta f_\theta(x)\,dx
\end{aligned}
$$
Use the definition $P_\theta(x) = \frac{\exp(f_\theta(x))}{Z[\theta]}$:
$$
\begin{aligned}
\nabla_\theta \log(Z[\theta]) &= \int P_\theta(x) \cdot \nabla_\theta f_\theta(x)\,dx \\
&= \underset{x \sim P_\theta}{\mathbb{E}}[\nabla_\theta f_\theta(x)]
\end{aligned}
$$

Plugging back into $\nabla_\theta L[\theta]$ (the inner expectation does not depend on the outer $x$, so the outer expectation drops):
$$
\begin{aligned}
\nabla_\theta L[\theta] &= \underset{x \sim D}{\mathbb{E}}[\nabla_\theta f_\theta(x)] - \underset{x \sim D}{\mathbb{E}}[\underset{x \sim P_\theta}{\mathbb{E}}[\nabla_\theta f_\theta(x)]]\\
&= \underset{x \sim D}{\mathbb{E}}[\nabla_\theta f_\theta(x)] - \underset{x \sim P_\theta}{\mathbb{E}}[\nabla_\theta f_\theta(x)]\\
\end{aligned}
$$

  • the first term is an expectation under the true data distribution
  • the second term is an expectation under the current EBM model

Thus, if these two distributions are similar (or identical), the gradient is close to zero.

Define a finite-sample gradient:
$$
\nabla_\theta \hat{L}_n[\theta] = \frac{1}{n} \sum^n_{i=1} \nabla_\theta f_\theta(x_i) - \frac{1}{n} \sum^n_{i=1} \nabla_\theta f_\theta(\bar{x}_i)
$$

  • $x_1, …, x_n \sim \mathbb{D}$, drawn from the finite true data set
  • $\bar{x}_1, …, \bar{x}_n \sim P_\theta(\cdot)$, drawn from the model
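A minimal sketch of this estimator. Everything here is an illustrative assumption: we take $f_\theta(x) = -\frac{1}{2}\|x-\theta\|^2$ (so $\nabla_\theta f_\theta(x) = x - \theta$), and draw the "model" samples exactly rather than via MCMC:

```python
import numpy as np

# Finite-sample EBM gradient: mean of ∇_θ f_θ over data samples
# minus mean of ∇_θ f_θ over model samples.
def ebm_grad_estimate(grad_f, theta, data_samples, model_samples):
    pos = np.mean([grad_f(theta, x) for x in data_samples], axis=0)   # x_i ~ D
    neg = np.mean([grad_f(theta, x) for x in model_samples], axis=0)  # x̄_i ~ P_θ
    return pos - neg

# Toy instance: f_θ(x) = -½‖x − θ‖², so ∇_θ f_θ(x) = x − θ.
grad_f = lambda theta, x: x - theta

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=(5000, 1))   # pretend D = N(2, 1)
model = rng.normal(0.0, 1.0, size=(5000, 1))  # pretend P_θ = N(0, 1) at θ = 0
g = ebm_grad_estimate(grad_f, np.zeros(1), data, model)
# g should be close to 2, the gap between the data mean and θ
```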

Sampling from $P_\theta(\cdot)$

How does one draw samples from $P_\theta(\cdot)$?

If $X$ is discrete, this is trivial:
Compute $P_\theta(1), …, P_\theta(n)$
Sample from the weighted distribution $\{P_\theta(i)\}^n_{i=1}$
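In code this is one call to a weighted sampler (the $\theta$ values here are a hypothetical example):

```python
import numpy as np

# Discrete EBM sampling: compute each probability by normalizing
# exp(θ_i), then draw indices with those weights.
theta = np.array([1.0, 2.0, 0.5])
p = np.exp(theta) / np.exp(theta).sum()  # softmax normalization

rng = np.random.default_rng(0)
samples = rng.choice(len(theta), size=100_000, p=p)

# Empirical frequencies should approach p as the sample count grows.
freq = np.bincount(samples, minlength=len(theta)) / len(samples)
```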

If $X=\mathbb{R}^d$, then what?
Use Langevin sampling.

Langevin sampling

Construct a process $\{Z_t\}^T_{t=0}$:
$$
Z_{t+1} = Z_t + \eta \nabla_x \log p_\theta(Z_t) + \sqrt{2\eta} \cdot W_t
$$

  • $W_t \sim N(0, I_d)$, a random Gaussian vector
  • $\eta \gt 0$ is the step size
  • $Z_0 = 0$ (or any initialization)

$\log p_\theta(x) = f_\theta(x)-\log Z[\theta]$
If we want to compute $\nabla_x \log p_\theta(x)$,
remember that $\log Z[\theta]$ does not depend on $x$.
Thus,
$\nabla_x \log p_\theta(x) = \nabla_x f_\theta(x)$

Output: $Z_T$
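A sketch of the Langevin iteration, assuming the Gaussian energy $f_\theta(x) = -\frac{1}{2}\|x-\theta\|^2$ so that $\nabla_x \log p_\theta(x) = \theta - x$ and the target is (up to discretization error) $N(\theta, I)$. The step size, horizon, and number of chains are arbitrary choices:

```python
import numpy as np

# Langevin update Z_{t+1} = Z_t + η ∇_x log p_θ(Z_t) + sqrt(2η) W_t,
# run on many independent chains in parallel.
def langevin_sample(theta, eta=0.1, T=2000, n_chains=2000, seed=0):
    rng = np.random.default_rng(seed)
    d = len(theta)
    z = np.zeros((n_chains, d))  # Z_0 = 0 for every chain
    for _ in range(T):
        grad = theta - z                            # ∇_x log p_θ(Z_t) for this f_θ
        noise = rng.standard_normal((n_chains, d))  # W_t ~ N(0, I_d)
        z = z + eta * grad + np.sqrt(2 * eta) * noise
    return z  # Z_T, approximately distributed as P_θ

theta = np.array([1.0, -2.0])
samples = langevin_sample(theta)
# The sample mean across chains should be close to θ = (1, -2).
```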

Calculation for discrete EBMs

$X = [m]$
$P_\theta(x=i) \propto \exp(\theta_i)$, $i \in [m]$.
$$
P_\theta(x=i) = \frac{\exp(\theta_i)}{\sum^m_{j=1} \exp(\theta_j)}
$$

Why bother? Let’s just parameterize $P_\theta(x=i) = p_i$ directly:

  • $\sum^m_{i=1} p_i = 1$
  • $p_i \geq 0$, $\forall i \in [m]$

Computing the gradient $\nabla_\theta L[\theta]$ under discrete EBMs:
$$
\nabla_\theta L[\theta] = p_*-p_\theta
$$

  • $p_*$ is the true distribution, $p_* \in \mathbb{R}^m$
  • $p_\theta \in \mathbb{R}^m$, with $P_\theta(x=i) = \frac{\exp(\theta_i)}{\sum^m_{j=1} \exp(\theta_j)}$

Look at gradient ascent:
$\theta_{t+1} = \theta_t + \eta \nabla_\theta L[\theta_t]$
$$
P_{\theta_{t+1}}(i) = \frac{P_{\theta_t}(i) \cdot \exp(\eta \nabla_t(i))}{\sum^m_{j=1}P_{\theta_t}(j) \cdot \exp(\eta \nabla_t(j))}
$$

$\nabla_t = p_*-p_{\theta_t}$, so this is a multiplicative-weights update on the probabilities.

Now compute the gradient:
$$
\begin{aligned}
\nabla_\theta L[\theta] &= \underset{x \sim P_{true}}{\mathbb{E}}[\nabla_\theta f_\theta(x)] - \underset{x \sim P_\theta}{\mathbb{E}}[\nabla_\theta f_\theta(x)]\\
&=\underset{i \sim p_*}{\mathbb{E}}[e_i] - \underset{i \sim p_\theta}{\mathbb{E}}[e_i]\\
&=p_* - p_\theta
\end{aligned}
$$

  • $P_\theta(i) \propto \exp(\theta_i)$, so $f_\theta(x=i) = \theta_i$
  • $\nabla_\theta f_\theta(x=i) = \nabla_\theta \theta_i = e_i$, the $i$-th standard basis vector
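Since the gradient has this closed form, we can run the ascent with exact probabilities and watch $p_\theta$ converge to $p_*$. The 3-outcome $p_*$ and the step size are hypothetical choices:

```python
import numpy as np

# Gradient ascent on the discrete EBM: θ_{t+1} = θ_t + η (p_* − p_θ),
# using exact probabilities (no sampling) on a small example.
def softmax(theta):
    e = np.exp(theta - theta.max())  # subtract max for numerical stability
    return e / e.sum()

p_star = np.array([0.5, 0.3, 0.2])  # true distribution
theta = np.zeros(3)                  # start at the uniform distribution
eta = 0.5

for _ in range(2000):
    theta = theta + eta * (p_star - softmax(theta))

# softmax(theta) should now match p_star to high precision
```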

Calculation for EBMs on a Gaussian

Assume that $P_{true} = N(\mu, I)$.
$P_\theta(x) = \frac{\exp(f_\theta(x))}{Z[\theta]}$
$f_\theta(x) = -\frac{1}{2} \|x-\theta\|^2$
$\theta = \mu$ recovers $P_{true}$ (indeed, $P_\theta = N(\theta, I)$)

$F=\{f_\theta(x) = -\frac{1}{2} \|x-\theta\|^2 \mid \theta \in \mathbb{R}^d\}$

$\nabla_\theta f_\theta(x) = x-\theta$

Now compute the gradient:
$$
\begin{aligned}
\nabla_\theta L[\theta] &= \underset{x \sim P_{true}}{\mathbb{E}}[\nabla_\theta f_\theta(x)] - \underset{x \sim P_\theta}{\mathbb{E}}[\nabla_\theta f_\theta(x)]\\
& = \underset{x \sim P_{true}}{\mathbb{E}}[x-\theta]-\underset{x \sim P_\theta}{\mathbb{E}}[x-\theta]\\
&=\mu - \theta
\end{aligned}
$$

  • The second term is zero because $P_\theta = N(\theta, I)$, so $\underset{x \sim P_\theta}{\mathbb{E}}[x-\theta] = \theta - \theta = 0$.
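So gradient ascent here is just $\theta_{t+1} = \theta_t + \eta(\mu - \theta_t)$, which contracts to $\mu$. A tiny sketch (the $\mu$, step size, and iteration count are arbitrary):

```python
import numpy as np

# For the Gaussian EBM, ∇_θ L[θ] = μ − θ exactly, so gradient ascent
# is θ_{t+1} = θ_t + η(μ − θ_t); the error shrinks by (1 − η) each step.
mu = np.array([3.0, -1.0, 0.5])
theta = np.zeros(3)
eta = 0.1

for _ in range(500):
    theta = theta + eta * (mu - theta)

# theta converges to mu
```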