ERM

Find $\hat{f}_{erm} \in F$ such that $\hat{f}_{erm} = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)$

(Stochastic) Gradient methods

Assume parametric form for $F$.
$$F= \{ f_\theta(x) | \theta \in \Theta\}$$
$$\Theta \subseteq \mathbb{R}^p$$
$\theta$ is a $p$-dimensional vector that parameterizes the weights of our model.

Here are some examples:

(linear classes) $f_\theta(x) = \langle \theta, x \rangle$

(One hidden layer of a neural network) $f_\theta(x)= W_1 \cdot \sigma(W_2 x)$, where $\sigma$ is an elementwise nonlinearity
$\theta = (W_1, W_2)$
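The two classes above can be sketched in a few lines of numpy (the function names and the choice of ReLU for $\sigma$ are illustrative, not from the notes):

```python
import numpy as np

def f_linear(theta, x):
    # Linear class: f_theta(x) = <theta, x>
    return np.dot(theta, x)

def f_one_hidden(W1, W2, x):
    # One-hidden-layer network: f_theta(x) = W1 . sigma(W2 x), theta = (W1, W2)
    return W1 @ np.maximum(W2 @ x, 0.0)  # sigma = ReLU (an assumption)
```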

Gradient descent

In general:
$F: \mathbb{R}^p \mapsto \mathbb{R}$
$F$ is at least once differentiable.
We want to find $\min_{\theta \in \mathbb{R}^p} F(\theta)$.
Basic algorithm:
$\theta_{t+1} = \theta_t - \eta \nabla F(\theta_t)$
$\eta \gt 0$ is the step size.

A minimum does not always exist.
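The basic update can be written as a short sketch; the objective $F(\theta) = \frac{1}{2}\|\theta\|^2$, whose gradient is $\theta$, is an illustrative choice:

```python
import numpy as np

# Basic GD update: theta_{t+1} = theta_t - eta * grad F(theta_t)
def gradient_descent(grad, theta0, eta, steps):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Run on F(theta) = 0.5 * ||theta||^2, grad F(theta) = theta
theta = gradient_descent(grad=lambda th: th,
                         theta0=np.array([4.0, -2.0]),
                         eta=0.1, steps=200)
# each step contracts the iterate by (1 - eta), so theta approaches the minimizer 0
```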

Applying to ERM
$$F(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(f_\theta(x_i), y_i)$$

Q:

  1. What does $\theta_t$ converge to?
  2. How should we set $\eta$?

A:

Use quadratic functions as a case study for gradient descent.
Pick $F(\theta) = \frac{1}{2} \theta^T Q \theta - \theta^T p$,
where $Q$ is positive definite and $p \in \mathbb{R}^d$.
What is $\arg\min_{\theta \in \mathbb{R}^d} F(\theta)$?
$\nabla F(\theta) = Q \theta - p = 0 \Rightarrow \theta_* = Q^{-1} p$
$Q$ is invertible because it is positive definite: all its eigenvalues are strictly positive, so it has full rank and is invertible.
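A quick numerical sanity check of $\theta_* = Q^{-1}p$; building $Q = AA^T + 3I$ is just one convenient way to get a positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Q = A @ A.T + 3 * np.eye(3)        # positive definite by construction
p = rng.standard_normal(3)

theta_star = np.linalg.solve(Q, p)  # Q^{-1} p, without forming the inverse
grad_at_star = Q @ theta_star - p   # gradient at theta_*; should be (numerically) zero
```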

What does Gradient Descent do?
$$
\begin{aligned}
\theta_{t+1} &= \theta_t - \eta \cdot \nabla F(\theta_t) \\
&=\theta_t-\eta \cdot [Q \cdot \theta_t-p] \\
\end{aligned}
$$

Let $\delta_t$ denote the difference between $\theta_t$ and $\theta_*$:
$\delta_t = \theta_t - \theta_*$
$\delta_{t+1} = \theta_{t+1} - \theta_*$
$\delta_{t+1} = \theta_t-\eta \cdot [Q \cdot \theta_t-p] - \theta_*$
$\delta_{t+1} = \theta_t-\eta \cdot Q[ \theta_t-Q^{-1} \cdot p] - \theta_*$
Because $Q^{-1} \cdot p = \theta_*$:
$\delta_{t+1} = \theta_t - \theta_*-\eta \cdot Q[ \theta_t-\theta_*] $
$\delta_{t+1} = \delta_t-\eta \cdot Q \delta_t $
$\delta_{t+1} = [I-\eta \cdot Q] \cdot \delta_t $

Unrolling this recursion:
$\delta_{t} = [I-\eta \cdot Q]^t \cdot \delta_0 $

What are the eigenvalues of $I-\eta \cdot Q$?
Let $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$ be the eigenvalues of $Q$.
Then $\{ 1-\eta \lambda_i\}^d_{i=1}$ are the eigenvalues of $I-\eta \cdot Q$.
We need every $|1-\eta \lambda_i|$ to be strictly less than 1; then as $t$ increases, $\delta_{t}$ shrinks exponentially.
If we additionally require $ 1-\eta \lambda_i \geq 0$,
this means $\eta \leq \frac{1}{\lambda_i} \quad \forall i$, i.e. $\eta \leq \frac{1}{\lambda_{max}(Q)}$.

To make each $1-\eta \lambda_i$ as small as possible, we want $\eta$ as large as possible, so $\eta$ should saturate this constraint: $\eta = \frac{1}{\lambda_{max}(Q)}$.
With this choice, the largest eigenvalue of $I-\eta Q$ is $1-\frac{\lambda_{min}(Q)}{\lambda_{max}(Q)}$.
Thus, $eigval(I-\eta Q) \in [0, 1-\frac{\lambda_{min}(Q)}{\lambda_{max}(Q)}]$.
Since $I-\eta Q$ is symmetric, its largest singular value equals its largest eigenvalue, so:
$\forall v, \quad \|(I-\eta Q)v\| \leq (1-\frac{\lambda_{min}(Q)}{\lambda_{max}(Q)})\|v\|$
Thus, $\|\delta_t\| \leq (1-\frac{\lambda_{min}(Q)}{\lambda_{max}(Q)})^t \cdot \|\delta_0\|$,
where $\delta_0$ is the initial error.
Thus, as $t \to \infty $, the error goes to zero.
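A numerical sketch of this contraction, with $\eta = 1/\lambda_{max}(Q)$ and a diagonal $Q$ chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = np.diag([1.0, 2.0, 10.0])       # eigenvalues 1, 2, 10 (illustrative)
p = rng.standard_normal(3)
theta_star = np.linalg.solve(Q, p)

eta = 1.0 / 10.0                    # 1 / lambda_max
rate = 1.0 - 1.0 / 10.0             # 1 - lambda_min / lambda_max
theta = np.zeros(3)
for t in range(100):
    # bound: ||theta_t - theta_*|| <= rate^t * ||theta_0 - theta_*||
    assert np.linalg.norm(theta - theta_star) <= rate**t * np.linalg.norm(theta_star) + 1e-12
    theta = theta - eta * (Q @ theta - p)   # GD step on the quadratic
```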

GD converges to a stationary point ($\nabla F = 0$).

We are going to show $|\nabla F(\theta_t)| \to 0$.
Note that a stationary point may be a local minimum or a saddle point; we will not distinguish these cases here.

Assume $F$ is “L-smooth”:
$\forall u,v \in \mathbb{R}^d$
$F(v) \leq F(u)+\nabla F(u) \cdot (v-u)+\frac{L}{2} \|u-v\|^2$
This holds whenever the gradient $\nabla F$ is L-Lipschitz:
$\|\nabla F(u) - \nabla F(v)\| \leq L\|u-v\|$

Now suppose that $F$ is L-smooth and $F \geq 0$. Then for any $0 \lt \eta \leq \frac{1}{L}$,
GD satisfies:
$$
\min_{t \in \{0,\dots,T-1\}}|\nabla F(\theta_t)|^2 \leq \frac{2F(\theta_0)}{\eta \cdot T}
$$

Proof:
Apply L-smooth
$$
F(\theta_{t+1}) \leq F(\theta_t)+\nabla F(\theta_t) \cdot (\theta_{t+1}-\theta_t)+\frac{L}{2} |\theta_{t+1}-\theta_t|^2
$$
We know that $\theta_{t+1}-\theta_t = -\eta \nabla F(\theta_t)$
Thus:
$$
F(\theta_{t+1}) \leq F(\theta_t)-\eta |\nabla F(\theta_t)|^2 +\frac{L \eta^2}{2} |\nabla F(\theta_t)|^2
$$

$-\eta |\nabla F(\theta_t)|^2$ is the descent term on $F$;
$\frac{L \eta^2}{2} |\nabla F(\theta_t)|^2$ is the error of the quadratic approximation.
$$
F(\theta_{t+1}) \leq F(\theta_t)-\eta(1-\frac{L \eta}{2}) |\nabla F(\theta_t)|^2
$$
We require $1-\frac{L \eta}{2} \geq \frac{1}{2}$,
which holds when $\eta \leq \frac{1}{L}$.
thus
$$
F(\theta_{t+1}) \leq F(\theta_t)-\frac{\eta}{2} |\nabla F(\theta_t)|^2
$$

This means GD makes progress every iteration unless $\nabla F=0$.

Applying this bound twice (telescoping):
$$
F(\theta_{t+1}) \leq F(\theta_{t-1})-\frac{\eta}{2} |\nabla F(\theta_{t-1})|^2-\frac{\eta}{2} |\nabla F(\theta_t)|^2
$$

$$
F(\theta_{t}) \leq F(\theta_0) - \frac{\eta}{2} \sum^{t-1}_{i=0}|\nabla F(\theta_i)|^2
$$

$$
\sum^{t-1}_{i=0}|\nabla F(\theta_i)|^2 \leq \frac{2(F(\theta_0) - F(\theta_t))}{\eta}
$$

due to $F \geq 0$

$$
\sum^{t-1}_{i=0}|\nabla F(\theta_i)|^2 \leq \frac{2F(\theta_0)}{\eta}
$$

$$
\sum^{t-1}_{i=0}|\nabla F(\theta_i)|^2 \geq t \cdot \min_{i \in \{0,\dots,t-1\}} | \nabla F(\theta_i)|^2
$$

Then:

$$
\min_{i \in \{0,\dots,t-1\}}|\nabla F(\theta_i)|^2 \leq \frac{2F(\theta_0)}{\eta \cdot t}
$$
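A numerical check of this guarantee, on the illustrative choice $F(\theta) = 1 - \cos\theta$, which is nonnegative and L-smooth with $L = 1$, so we may take $\eta = 1/L = 1$:

```python
import math

# F(theta) = 1 - cos(theta), grad F(theta) = sin(theta); F >= 0, L-smooth with L = 1
F = lambda th: 1.0 - math.cos(th)
grad = lambda th: math.sin(th)

eta, T = 1.0, 50      # eta = 1/L
theta0 = 2.0
theta, grads_sq = theta0, []
for _ in range(T):
    grads_sq.append(grad(theta) ** 2)   # record ||grad F(theta_t)||^2
    theta = theta - eta * grad(theta)   # GD step

# the guarantee: min_t ||grad F(theta_t)||^2 <= 2 F(theta_0) / (eta * T)
assert min(grads_sq) <= 2 * F(theta0) / (eta * T)
```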

Stochastic GD (SGD)

GD on ERM:
$$
\nabla_\theta \hat{L}_n[\theta] = \frac{1}{n} \sum^n_{i=1}\nabla_\theta L(f_\theta(x_i), y_i)
$$

Computational complexity per step is $O(n)$.

Build a stochastic approximation to $\nabla_\theta \hat{L}_n[\theta]$ that only takes $O(1)$ computation.

$i_t \sim \mathrm{Unif}\{1,\dots,n\}$

$g_t$ is the gradient estimate:
$$
g_t = \nabla_\theta L(f_{\theta_t}(x_{i_t}), y_{i_t})
$$

$$
\mathbb{E}[g_t \mid i_1,\dots,i_{t-1}] = \frac{1}{n} \sum^n_{i=1} \nabla_\theta L(f_{\theta_t}(x_i), y_i) = \nabla_\theta \hat{L}_n[\theta_t]
$$
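The unbiasedness claim can be checked directly: averaging the single-sample gradients over all $i$ recovers the full empirical-risk gradient. The sketch below assumes squared-error loss on a linear model (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta = rng.standard_normal(d)

def grad_single(theta, i):
    # g_t for index i: gradient of 0.5 * (theta . x_i - y_i)^2
    return (X[i] @ theta - y[i]) * X[i]

full_grad = X.T @ (X @ theta - y) / n                          # gradient of the empirical risk
avg_of_singles = np.mean([grad_single(theta, i) for i in range(n)], axis=0)
assert np.allclose(full_grad, avg_of_singles)   # E[g_t] over uniform i = full gradient
```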

Key modifications in practice:

  1. mini-batching
  2. epochs + reshuffling

Step size

GD: constant step size (e.g. $\eta = 1/L$, as above).
SGD: decaying (time-varying) step sizes.
Classical SGD theory: $\eta_t = \frac{1}{\sqrt{t}}$ or $\eta_t = \frac{1}{t}$.
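A sketch of SGD with both practical modifications, mini-batching and a decaying step size, on an illustrative noiseless least-squares problem; the constant $0.1$ scaling the step size is an assumption for stability, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, batch = 200, 5, 8
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true                   # noiseless targets (illustrative)

theta = np.zeros(d)
for t in range(2000):
    idx = rng.integers(0, n, size=batch)               # sample a mini-batch uniformly
    g = X[idx].T @ (X[idx] @ theta - y[idx]) / batch   # mini-batch gradient estimate
    eta_t = 0.1 / np.sqrt(t + 1)                       # decaying step size ~ 1/sqrt(t)
    theta = theta - eta_t * g
```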