ERM
Find $\hat{f}_{\mathrm{erm}} \in F$ such that $\hat{f}_{\mathrm{erm}} = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^n l(f(x_i), y_i)$
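As a concrete illustration, here is a minimal sketch of ERM by brute force over a small finite class, assuming squared loss; the toy data and the candidate class are hypothetical choices, not from the notes:

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda p, yi: (p - yi) ** 2):
    """Average loss of predictor f over the sample {(x_i, y_i)}."""
    return np.mean([loss(f(xi), yi) for xi, yi in zip(X, y)])

# Toy 1-d data and a small finite class F = {x -> c*x : c in {0, 0.5, 1, 2}}.
rng = np.random.default_rng(0)
X = rng.normal(size=50)
y = 1.0 * X + 0.1 * rng.normal(size=50)

candidates = [lambda x, c=c: c * x for c in (0.0, 0.5, 1.0, 2.0)]
f_erm = min(candidates, key=lambda f: empirical_risk(f, X, y))
```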
(Stochastic) Gradient methods
Assume parametric form for $F$.
$$F= \{ f_\theta(x) | \theta \in \Theta\}$$
$$\Theta \subseteq \mathbb{R}^p$$
Each $\theta$ is a $p$-dimensional vector that parameterizes the weights of our model.
Here are some examples:
(linear classes) $f_\theta(x) = \langle \theta, x \rangle$
(one-hidden-layer neural network) $f_\theta(x)= W_1 \cdot \sigma(W_2 x) $
$\theta = (W_1, W_2)$
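A minimal numpy sketch of these two classes (the activation $\sigma = \tanh$ and all shapes are illustrative assumptions; the notes leave $\sigma$ unspecified):

```python
import numpy as np

def f_linear(theta, x):
    """Linear class: f_theta(x) = <theta, x>."""
    return np.dot(theta, x)

def f_one_hidden(theta, x, sigma=np.tanh):
    """One-hidden-layer network: f_theta(x) = W1 @ sigma(W2 @ x), theta = (W1, W2)."""
    W1, W2 = theta
    return W1 @ sigma(W2 @ x)

x = np.ones(3)
theta_nn = (np.ones((1, 4)), np.ones((4, 3)))   # W1: 1x4, W2: 4x3
print(f_linear(np.array([1.0, -2.0, 0.5]), x), f_one_hidden(theta_nn, x))
```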
Gradient descent
In general:
$F: \mathbb{R}^p \to \mathbb{R}$
$F$ is at least once differentiable.
We want to find $\min_{\theta \in \mathbb{R}^p} F(\theta)$
Basic algorithm:
$\theta_{t+1} = \theta_t - \eta \nabla F(\theta_t)$
$\eta \gt 0$ is the step size.
Note that a minimum does not always exist.
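A minimal sketch of this update loop (the test objective $F(\theta) = \frac{1}{2}\|\theta\|^2$, whose gradient is $\theta$ itself, and all constants are illustrative):

```python
import numpy as np

def gradient_descent(grad_F, theta0, eta, T):
    """Basic GD: theta_{t+1} = theta_t - eta * grad_F(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        theta = theta - eta * grad_F(theta)
    return theta

# F(theta) = ||theta||^2 / 2  =>  grad_F(theta) = theta; the minimizer is 0.
theta_T = gradient_descent(lambda th: th, theta0=[3.0, -4.0], eta=0.1, T=100)
```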
Applying to ERM
$$F(\theta) = \frac{1}{n} \sum_{i=1}^n l(f_\theta(x_i), y_i)$$
Q:
- What does $\theta_t$ converge to?
- How should we set $\eta$?
A:
We use quadratic functions as a case study for gradient descent.
Pick $F(\theta) = \frac{1}{2} \theta^TQ\theta - \theta^Tp$,
where $Q \in \mathbb{R}^{d \times d}$ is positive definite and $p \in \mathbb{R}^d$.
What is $\arg\min_{\theta \in \mathbb{R}^d} F(\theta)$?
$\nabla F(\theta) = Q \theta -p = 0 \Rightarrow \theta_*= Q^{-1}p$
Since $Q$ is positive definite, all its eigenvalues are strictly positive, so it has full rank and is invertible.
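A quick numerical check of the closed form (building a positive definite $Q$ as $AA^T + I$ is just one illustrative construction):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Q = A @ A.T + np.eye(4)              # positive definite by construction
p = rng.normal(size=4)

theta_star = np.linalg.solve(Q, p)   # theta_* = Q^{-1} p
print(np.linalg.norm(Q @ theta_star - p))  # gradient at theta_*: ~0
```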
What does Gradient Descent do?
$$
\begin{aligned}
\theta_{t+1} &= \theta_t - \eta \cdot \nabla F(\theta_t) \\
&=\theta_t-\eta \cdot [Q \cdot \theta_t-p] \\
\end{aligned}
$$
Let $\Delta_t$ denote the difference between $\theta_t$ and $\theta_*$:
$\Delta_t = \theta_t - \theta_*$
$\Delta_{t+1} = \theta_{t+1} - \theta_*$
$\Delta_{t+1} = \theta_t-\eta \cdot [Q \cdot \theta_t-p] - \theta_*$
$\Delta_{t+1} = \theta_t-\eta \cdot Q[ \theta_t-Q^{-1} \cdot p] - \theta_*$
Because $Q^{-1} \cdot p = \theta_*$:
$\Delta_{t+1} = \theta_t - \theta_*-\eta \cdot Q[ \theta_t-\theta_*] $
$\Delta_{t+1} = \Delta_t-\eta \cdot Q \Delta_t $
$\Delta_{t+1} = [I-\eta \cdot Q] \cdot \Delta_t $
Unrolling this recursion:
$\Delta_{t} = [I-\eta \cdot Q]^t \cdot \Delta_0 $
What are the eigenvalues of $I-\eta \cdot Q$?
Let $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$ be the eigenvalues of $Q$.
Then $\{ 1-\eta \lambda_i\}^d_{i=1}$ are the eigenvalues of $I-\eta \cdot Q$.
We need every $|1-\eta \lambda_i|$ to be strictly less than 1; then as $t$ increases, $\Delta_{t}$ shrinks exponentially fast.
If we additionally require $ 1-\eta \lambda_i \geq 0$,
this means $\eta \leq \frac{1}{\lambda_i} \quad \forall i$, i.e., $\eta \leq \frac{1}{\lambda_1} = \frac{1}{\lambda_{\max}(Q)}$.
To make each $1-\eta \lambda_i$ as small as possible, we want $\eta$ as large as possible.
So $\eta$ should saturate this constraint: set $\eta = \frac{1}{\lambda_{\max}(Q)}$.
With this choice, the largest eigenvalue of $I-\eta Q$ is $1-\frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)}$.
Thus, $\mathrm{eigval}(I-\eta Q) \in [0, 1-\frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)}]$.
Since $I-\eta Q$ is symmetric, this implies:
$\forall v, \quad \|(I-\eta Q)v\| \leq (1-\frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)})\|v\|$
That is, the largest singular value of $I-\eta Q$ is at most $1-\frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)}$.
Thus, $\|\Delta_t\| \leq (1-\frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)})^t \cdot \|\Delta_0\|$,
where $\Delta_0$ is the initial error.
Thus, as $t \to \infty$, the error goes to zero.
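A numerical check of this rate: run the recursion with $\eta = 1/\lambda_{\max}(Q)$ and assert the bound at every step (the random $Q$, horizon, and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
Q = A @ A.T + 0.5 * np.eye(5)        # positive definite
p = rng.normal(size=5)
theta_star = np.linalg.solve(Q, p)

lam = np.linalg.eigvalsh(Q)          # eigenvalues, ascending
eta = 1.0 / lam[-1]                  # eta = 1 / lambda_max
rate = 1.0 - lam[0] / lam[-1]        # 1 - lambda_min / lambda_max

theta = np.zeros(5)
d0 = np.linalg.norm(theta - theta_star)
for t in range(1, 51):
    theta = theta - eta * (Q @ theta - p)          # GD step
    err = np.linalg.norm(theta - theta_star)
    assert err <= rate ** t * d0 + 1e-12           # the proven bound holds
```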
GD converges to a stationary point ($\nabla F = 0$).
We are going to show $\|\nabla F(\theta_t)\| \to 0$.
Note that stationary points include local minima and saddle points; we will not distinguish these here.
Assume $F$ is “L-smooth”:
$\forall u,v \in \mathbb{R}^d$
$F(v) \leq F(u)+\nabla F(u) \cdot (v-u)+\frac{L}{2} \|u-v\|^2$
The gradient $\nabla F$ is $L$-Lipschitz:
$\|\nabla F(u) - \nabla F(v)\| \leq L\|u-v\|$
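For example, the quadratic from the case study above is L-smooth with $L = \lambda_{\max}(Q)$, since
$$
\|\nabla F(u) - \nabla F(v)\| = \|Q(u-v)\| \leq \lambda_{\max}(Q)\,\|u-v\|,
$$
so the step-size constraint $\eta \leq \frac{1}{L}$ below matches the constraint $\eta \leq \frac{1}{\lambda_{\max}(Q)}$ derived earlier.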
Now, suppose that $F$ is “L-smooth” and $F \geq 0$. Then for any $0 \lt \eta \leq \frac{1}{L}$,
GD satisfies:
$$
\min_{t \in \{0,\dots,T-1\}}\|\nabla F(\theta_t)\|^2 \leq \frac{2F(\theta_0)}{\eta \cdot T}
$$
Proof:
Apply L-smooth
$$
F(\theta_{t+1}) \leq F(\theta_t)+\nabla F(\theta_t) \cdot (\theta_{t+1}-\theta_t)+\frac{L}{2} \|\theta_{t+1}-\theta_t\|^2
$$
We know that $\theta_{t+1}-\theta_t = -\eta \nabla F(\theta_t)$
Thus:
$$
F(\theta_{t+1}) \leq F(\theta_t)-\eta \|\nabla F(\theta_t)\|^2 +\frac{L \eta^2}{2} \|\nabla F(\theta_t)\|^2
$$
The $-\eta \|\nabla F(\theta_t)\|^2$ term is the descent on $F$.
The $\frac{L \eta^2}{2} \|\nabla F(\theta_t)\|^2$ term is the error of the quadratic approximation.
$$
F(\theta_{t+1}) \leq F(\theta_t)-\eta(1-\frac{L \eta}{2}) \|\nabla F(\theta_t)\|^2
$$
We need $1-\frac{L \eta}{2} \geq \frac{1}{2}$,
which means $\eta \leq \frac{1}{L}$.
Thus:
$$
F(\theta_{t+1}) \leq F(\theta_t)-\frac{\eta}{2} \|\nabla F(\theta_t)\|^2
$$
This means GD makes progress at every iteration unless $\nabla F(\theta_t)=0$.
Chaining this inequality over two consecutive steps:
$$
F(\theta_{t+1}) \leq F(\theta_{t-1})-\frac{\eta}{2} \|\nabla F(\theta_{t-1})\|^2-\frac{\eta}{2} \|\nabla F(\theta_t)\|^2
$$
Telescoping all the way down to $\theta_0$:
$$
F(\theta_{t}) \leq F(\theta_0) - \frac{\eta}{2} \sum^{t-1}_{i=0}\|\nabla F(\theta_i)\|^2
$$
Rearranging:
$$
\sum^{t-1}_{i=0}\|\nabla F(\theta_i)\|^2 \leq \frac{2(F(\theta_0) - F(\theta_t))}{\eta}
$$
Since $F \geq 0$:
$$
\sum^{t-1}_{i=0}\|\nabla F(\theta_i)\|^2 \leq \frac{2F(\theta_0)}{\eta}
$$
$$
\sum^{t-1}_{i=0}\|\nabla F(\theta_i)\|^2 \geq t \cdot \min_{i \in \{0, \dots,t-1\}} \| \nabla F(\theta_i)\|^2
$$
Then, taking $t = T$:
$$
\min_{i \in \{0,\dots,T-1\}}\|\nabla F(\theta_i)\|^2 \leq \frac{2F(\theta_0)}{\eta \cdot T}
$$
Stochastic GD (SGD)
GD on ERM:
$$
\nabla_\theta \hat{L}_n[\theta] = \frac{1}{n} \sum^n_{i=1} \nabla_\theta L(f_\theta(x_i), y_i)
$$
The computational cost per step is $O(n)$.
Idea: build a stochastic approximation to $\nabla_\theta \hat{L}_n[\theta]$ that takes only $O(1)$ computation per step.
$i_t \sim \mathrm{Unif}\{1,\dots,n\}$
$g_t$ is the gradient estimate:
$$
g_t = \nabla_\theta L(f_{\theta_t}(x_{i_t}), y_{i_t})
$$
$$
\mathbb{E}[g_t \mid i_1, \dots, i_{t-1}] = \frac{1}{n} \sum^n_{i=1} \nabla_\theta L(f_{\theta_t}(x_i), y_i) = \nabla_\theta \hat{L}_n[\theta_t]
$$
That is, $g_t$ is an unbiased estimate of the full gradient.
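A quick check of this unbiasedness for one concrete case, assuming a linear model with squared loss (both illustrative choices): the average of the per-sample gradients equals the full empirical-risk gradient.

```python
import numpy as np

# L(f_theta(x), y) = (theta . x - y)^2 / 2; per-sample gradient: (theta . x - y) * x.
rng = np.random.default_rng(3)
n, d = 100, 5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
theta = rng.normal(size=d)

per_sample_grads = (X @ theta - y)[:, None] * X   # g for each sample i
full_grad = X.T @ (X @ theta - y) / n             # gradient of L_n at theta
# E[g_t] over i_t ~ Unif{1..n} is the average of the per-sample gradients:
assert np.allclose(per_sample_grads.mean(axis=0), full_grad)
```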
Key modifications in practice (see the sketch after this list):
- mini-batching
- epochs + reshuffling
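A minimal sketch of both modifications; `grad_batch` is a hypothetical oracle returning the gradient averaged over an index set:

```python
import numpy as np

def sgd(grad_batch, theta0, n, eta, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD with per-epoch reshuffling.

    grad_batch(theta, idx) should return the gradient averaged over samples idx.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        perm = rng.permutation(n)                  # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]   # one mini-batch
            theta = theta - eta * grad_batch(theta, idx)
    return theta

# Least-squares usage with X, y as in the previous snippet:
# theta_hat = sgd(lambda th, idx: X[idx].T @ (X[idx] @ th - y[idx]) / len(idx),
#                 theta0=np.zeros(5), n=100, eta=0.05)
```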
Step size
GD: constant step size (e.g., $\eta \leq \frac{1}{L}$ from the analysis above).
SGD: decaying (time-varying) step sizes.
Classical SGD theory uses $\eta_t = \frac{1}{\sqrt{t}}$ or $\eta_t = \frac{1}{t}$.
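A sketch of these schedules in code (the base rate $\eta_0 = 0.1$ is an arbitrary illustrative choice; $t$ starts at 1 to avoid division by zero):

```python
import numpy as np

eta0 = 0.1
eta_sqrt = lambda t: eta0 / np.sqrt(t)   # eta_t = eta0 / sqrt(t)
eta_inv = lambda t: eta0 / t             # eta_t = eta0 / t
print([round(eta_sqrt(t), 4) for t in (1, 4, 16, 64)])
```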