Rademacher Complexity
Definition
Let $F$ be a class of functions $f: X \to \mathbb{R}$. The Rademacher complexity $R_n(F)$ is defined as:
$$
R_n(F) = \underset{\epsilon_i, x_i}{\mathbb{E}} \, \underset{f \in F}{\sup} \Big( n^{-1} \sum_{i=1}^n \epsilon_i f(x_i) \Big)
$$
- $n$ is the sample size
- $\epsilon_1, \dots, \epsilon_n$ are i.i.d. Rademacher signs, each uniform on $\{-1, +1\}$ and independent of the $x_i$
- $\sup$ (supremum) is the least upper bound of a set; here it picks out the worst-case function in the function class $F$.
$$
R_n(F) = \mathbb{E} \, \underset{f \in F}{\sup} \big( n^{-1} \langle \epsilon, f \rangle \big)
$$
- $\epsilon = (\epsilon_1, \dots, \epsilon_n)$
- $f = (f(x_1), \dots, f(x_n))$ (overloading $f$ to denote its vector of values on the sample)
This quantity is what governs uniform deviations between population values and training (sample) values.
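As a quick sanity check on the definition, here is a minimal Monte Carlo sketch (not from the original notes) that estimates $R_n(F)$ for a small finite function class; the threshold functions and the Gaussian sampler are purely illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_complexity(fs, sample_x, n, trials=2000):
    """Monte Carlo estimate of R_n(F) for a finite class F.

    fs       -- list of functions f: array of x-values -> array of f(x)
    sample_x -- callable returning n i.i.d. draws of x
    """
    vals = np.empty(trials)
    for t in range(trials):
        x = sample_x(n)                          # x_1, ..., x_n
        eps = rng.choice([-1.0, 1.0], size=n)    # i.i.d. Rademacher signs
        # sup over f in F of (1/n) <eps, f(x)>; a max, since F is finite here
        vals[t] = max(np.mean(eps * f(x)) for f in fs)
    return vals.mean()

# Toy class (illustrative): a handful of threshold functions x -> sign(x - a)
F = [lambda x, a=a: np.sign(x - a) for a in np.linspace(-1, 1, 5)]
print(rademacher_complexity(F, lambda n: rng.normal(size=n), n=200))
```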
Examples:
Two extremes
Worst case
$F = \{\text{all functions mapping } X \to [-1, 1]\}$.
$R_n(F) = 1$: for any signs $\epsilon_i$, the sup is attained by a function with $f(x_i) = \epsilon_i$, so each term equals $1$.
Best case
$F = \{f\}$, i.e. $F$ contains a single function.
$R_n(F) = 0$, since $f$ is fixed and $\mathbb{E}[\epsilon_i] = 0$.
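A small numerical sketch of the two extremes (illustrative only; the $N(0,1)$ data and the choice $f(x) = x$ for the singleton case are assumptions, not from the notes): the worst-case sup pins $f(x_i) = \epsilon_i$ so every draw gives exactly $1$, while a single fixed function averages out to roughly $0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 5000

# Worst case: F = all functions into [-1, 1].  The sup is attained by a
# function with f(x_i) = eps_i, so the inner average is always exactly 1.
worst = np.mean([np.mean(eps * eps) for eps in
                 (rng.choice([-1.0, 1.0], size=n) for _ in range(trials))])

# Best case: F = {f} is a single fixed function, here f(x) = x on N(0, 1) data.
single = np.mean([np.mean(rng.choice([-1.0, 1.0], size=n) * rng.normal(size=n))
                  for _ in range(trials)])

print(worst)   # exactly 1.0
print(single)  # close to 0
```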
A concrete example
The $L_2$ norm (also called the Euclidean norm) is a way to measure the “length” or “magnitude” of a vector in Euclidean space. It is a specific instance of the more general $L_p$ norms used to measure the size of objects in various mathematical spaces.
For a vector $x = (x_1, x_2, \dots, x_n)$, the $L_2$ norm is defined as:
$\lvert x \rvert_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$
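For example, $x = (3, 4)$ gives $\lvert x \rvert_2 = \sqrt{3^2 + 4^2} = 5$.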
Now:
$F = \{x \mapsto \langle w, x \rangle \mid \lvert w \rvert_2 \leq 1\}$, where $w \in \mathbb{R}^d$
$x_1, \dots, x_n \overset{\text{iid}}{\sim} N(0, I)$, i.e. Gaussian with zero mean and identity covariance.
$$
\begin{aligned}
R_n(F) &= \underset{\epsilon_i, x_i}{\mathbb{E}} \, \underset{f \in F}{\sup} \Big(\frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i)\Big) \\
&= \underset{\epsilon_i, x_i}{\mathbb{E}} \, \underset{\lvert w \rvert_2 \leq 1}{\sup} \Big(\frac{1}{n} \sum_{i=1}^n \epsilon_i \langle x_i, w \rangle\Big) \quad (\text{definition of } F)\\
&= \underset{\epsilon_i, x_i}{\mathbb{E}} \, \underset{\lvert w \rvert_2 \leq 1}{\sup} \Big\langle \frac{1}{n} \sum_{i=1}^n \epsilon_i x_i, \, w \Big\rangle \quad (\text{linearity of } x \mapsto \langle x, w \rangle)\\
&= \underset{\epsilon_i, x_i}{\mathbb{E}} \Big\lvert \frac{1}{n} \sum_{i=1}^n \epsilon_i x_i \Big\rvert_2 \quad \Big(\text{dual norm characterization of } L_2: \underset{\lvert x \rvert_2 \leq 1}{\sup} \langle x, q \rangle = \lvert q \rvert_2\Big)\\
&= \underset{\epsilon_i, x_i}{\mathbb{E}} \sqrt{\Big\lvert \frac{1}{n} \sum_{i=1}^n \epsilon_i x_i \Big\rvert_2^2}\\
&\leq \sqrt{\underset{\epsilon_i, x_i}{\mathbb{E}} \Big\lvert \frac{1}{n} \sum_{i=1}^n \epsilon_i x_i \Big\rvert_2^2} \quad (\text{Jensen's inequality: } \sqrt{\cdot} \text{ is concave})\\
&= \sqrt{\frac{1}{n^2} \sum_{i=1}^n \underset{x_i}{\mathbb{E}} \lvert x_i \rvert_2^2} \quad (\text{cross terms vanish: } \mathbb{E}[\epsilon_i \epsilon_j] = 0 \text{ for } i \neq j)\\
&= \sqrt{\frac{d}{n}} \quad (\mathbb{E} \lvert x_i \rvert_2^2 = d \text{ for } x_i \sim N(0, I_d))\\
\end{aligned}
$$
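To sanity-check the derivation, here is a small Monte Carlo sketch (illustrative; the values of $n$, $d$, and the trial count are arbitrary assumptions) that estimates $\mathbb{E}\,\lvert \frac{1}{n}\sum_i \epsilon_i x_i \rvert_2$ directly, using the dual-norm step above, and compares it to the $\sqrt{d/n}$ bound.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, trials = 200, 10, 5000

# For the unit-ball linear class, the sup has the closed form
# || (1/n) * sum_i eps_i x_i ||_2  (dual-norm step above), so we only need to
# average that norm over draws of x_i ~ N(0, I_d) and Rademacher eps_i.
norms = np.empty(trials)
for t in range(trials):
    x = rng.normal(size=(n, d))              # rows are x_1, ..., x_n
    eps = rng.choice([-1.0, 1.0], size=n)
    norms[t] = np.linalg.norm(eps @ x / n)

print(norms.mean())        # Monte Carlo estimate of R_n(F)
print(np.sqrt(d / n))      # the sqrt(d/n) upper bound from the derivation
```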
Main punch line:
Let $F$ be a class of functions mapping $X \to \mathbb{R}$, and let $x_1, \dots, x_n$ be i.i.d. draws from a distribution $\mathbb{D}$.
$$
\mathbb{E}\Big[\underset{f \in F}{\sup}\Big(\mathbb{E}[f(x)] - \frac{1}{n} \sum_{i=1}^n f(x_i)\Big)\Big] \leq 2 R_n(F)
$$
Equivalently, since the $x_i$ are i.i.d. draws from $\mathbb{D}$:
$$
\mathbb{E}\Big[\underset{f \in F}{\sup} \frac{1}{n} \sum_{i=1}^n \big(\mathbb{E}[f(x_i)] - f(x_i)\big)\Big] \leq 2 R_n(F)
$$
- $\mathbb{E}[f(x)]$ is the population-level value
- $\frac{1}{n} \sum_{i=1}^n f(x_i)$ is the sample-level (training) value
This is a statement about uniform deviations: how far sample (training) averages can drift from population averages, uniformly over the function class $F$.
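For the unit-ball linear class above, the uniform deviation has a closed form: $\mathbb{E}[f(x)] = \langle w, \mathbb{E}[x] \rangle = 0$, so the sup over $\lvert w \rvert_2 \leq 1$ equals $\lvert \frac{1}{n}\sum_i x_i \rvert_2$. That makes the punch line easy to check numerically; the sketch below is illustrative, with arbitrary choices of $n$, $d$, and trial count, not code from the notes.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, trials = 200, 10, 5000

# Uniform deviation for the unit-ball linear class on N(0, I_d) data:
# E[f(x)] = <w, E[x]> = 0, so
#   sup_{|w|_2 <= 1} (E[f(x)] - (1/n) sum_i f(x_i)) = || (1/n) sum_i x_i ||_2.
devs = np.empty(trials)
for t in range(trials):
    x = rng.normal(size=(n, d))
    devs[t] = np.linalg.norm(x.mean(axis=0))

print(devs.mean())            # E[sup_f (population value - sample average)]
print(2 * np.sqrt(d / n))     # 2 * R_n(F), using the sqrt(d/n) bound above
```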