Rademacher Complexity
Definition
Let $F$ be a class of functions $f: X \to \mathbb{R}$. The Rademacher complexity $R_n(F)$ is defined as:
$$
R_n(F) = \underset{\epsilon_i, x_i}{\mathbb{E}} \, \underset{f \in F}{\sup} \Big( n^{-1} \sum_{i=1}^n \epsilon_i f(x_i) \Big)
$$
- $n$ is the sample size
- $\epsilon_1, \dots, \epsilon_n$ are i.i.d. Rademacher signs, each uniform on $\{-1, +1\}$ and independent of the $x_i$
- $\sup$ (supremum) is the least upper bound of a set; here it picks out the worst-case function in the function class $F$.
$$
R_n(F) = \mathbb{E} \, \underset{f \in F}{\sup} \big( n^{-1} \langle \epsilon, f \rangle \big)
$$
- $\epsilon = (\epsilon_1, \dots, \epsilon_n)$
- $f = (f(x_1), \dots, f(x_n))$ (overloading $f$ to denote its vector of values on the sample)
This quantity is what governs uniform deviations between population values and training (sample) values.
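As a quick sanity check on the definition, here is a minimal Monte Carlo sketch (not from the original notes) that estimates $R_n(F)$ for a small finite function class; the threshold functions and the Gaussian sampler are purely illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_complexity(fs, sample_x, n, trials=2000):
    """Monte Carlo estimate of R_n(F) for a finite class F.

    fs       -- list of functions f: array of x-values -> array of f(x)
    sample_x -- callable returning n i.i.d. draws of x
    """
    vals = np.empty(trials)
    for t in range(trials):
        x = sample_x(n)                          # x_1, ..., x_n
        eps = rng.choice([-1.0, 1.0], size=n)    # i.i.d. Rademacher signs
        # sup over f in F of (1/n) <eps, f(x)>; a max, since F is finite here
        vals[t] = max(np.mean(eps * f(x)) for f in fs)
    return vals.mean()

# Toy class (illustrative): a handful of threshold functions x -> sign(x - a)
F = [lambda x, a=a: np.sign(x - a) for a in np.linspace(-1, 1, 5)]
print(rademacher_complexity(F, lambda n: rng.normal(size=n), n=200))
```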
Examples:
Two extremes
Worst case
$F = \{\text{all functions mapping } X \to [-1, 1]\}$.
$R_n(F) = 1$: for any signs $\epsilon_i$, the sup is attained by a function with $f(x_i) = \epsilon_i$, so each term equals $1$.
Best case
$F = \{f\}$, i.e. $F$ contains a single function.
$R_n(F) = 0$, since $f$ is fixed and $\mathbb{E}[\epsilon_i] = 0$.
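A small numerical sketch of the two extremes (illustrative only; the $N(0,1)$ data and the choice $f(x) = x$ for the singleton case are assumptions, not from the notes): the worst-case sup pins $f(x_i) = \epsilon_i$ so every draw gives exactly $1$, while a single fixed function averages out to roughly $0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 5000

# Worst case: F = all functions into [-1, 1].  The sup is attained by a
# function with f(x_i) = eps_i, so the inner average is always exactly 1.
worst = np.mean([np.mean(eps * eps) for eps in
                 (rng.choice([-1.0, 1.0], size=n) for _ in range(trials))])

# Best case: F = {f} is a single fixed function, here f(x) = x on N(0, 1) data.
single = np.mean([np.mean(rng.choice([-1.0, 1.0], size=n) * rng.normal(size=n))
                  for _ in range(trials)])

print(worst)   # exactly 1.0
print(single)  # close to 0
```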
A concrete example
The $L_2$ norm (also called the Euclidean norm) is a way to measure the “length” or “magnitude” of a vector in Euclidean space. It is a specific instance of the more general $L_p$ norms used to measure the size of objects in various mathematical spaces.
For a vector $x = (x_1, x_2, \dots, x_n)$, the $L_2$ norm is defined as:
$\lvert x \rvert_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$
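For example, $x = (3, 4)$ gives $\lvert x \rvert_2 = \sqrt{3^2 + 4^2} = 5$.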
Now:
$F = \{x \mapsto \langle w, x \rangle \mid \lvert w \rvert_2 \leq 1\}$, where $w \in \mathbb{R}^d$
$x_1, \dots, x_n \overset{\text{iid}}{\sim} N(0, I)$, i.e. Gaussian with zero mean and identity covariance.
$$
\begin{aligned}
R_n(F) &= \underset{\epsilon_i, x_i}{\mathbb{E}} \, \underset{f \in F}{\sup} \Big(\frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i)\Big) \\
&= \underset{\epsilon_i, x_i}{\mathbb{E}} \, \underset{\lvert w \rvert_2 \leq 1}{\sup} \Big(\frac{1}{n} \sum_{i=1}^n \epsilon_i \langle x_i, w \rangle\Big) \quad (\text{definition of } F)\\
&= \underset{\epsilon_i, x_i}{\mathbb{E}} \, \underset{\lvert w \rvert_2 \leq 1}{\sup} \Big\langle \frac{1}{n} \sum_{i=1}^n \epsilon_i x_i, \, w \Big\rangle \quad (\text{linearity of } x \mapsto \langle x, w \rangle)\\
&= \underset{\epsilon_i, x_i}{\mathbb{E}} \Big\lvert \frac{1}{n} \sum_{i=1}^n \epsilon_i x_i \Big\rvert_2 \quad \Big(\text{dual norm characterization of } L_2: \underset{\lvert x \rvert_2 \leq 1}{\sup} \langle x, q \rangle = \lvert q \rvert_2\Big)\\
&= \underset{\epsilon_i, x_i}{\mathbb{E}} \sqrt{\Big\lvert \frac{1}{n} \sum_{i=1}^n \epsilon_i x_i \Big\rvert_2^2}\\
&\leq \sqrt{\underset{\epsilon_i, x_i}{\mathbb{E}} \Big\lvert \frac{1}{n} \sum_{i=1}^n \epsilon_i x_i \Big\rvert_2^2} \quad (\text{Jensen's inequality: } \sqrt{\cdot} \text{ is concave})\\
&= \sqrt{\frac{1}{n^2} \sum_{i=1}^n \underset{x_i}{\mathbb{E}} \lvert x_i \rvert_2^2} \quad (\text{cross terms vanish: } \mathbb{E}[\epsilon_i \epsilon_j] = 0 \text{ for } i \neq j)\\
&= \sqrt{\frac{d}{n}} \quad (\mathbb{E} \lvert x_i \rvert_2^2 = d \text{ for } x_i \sim N(0, I_d))\\
\end{aligned}
$$
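To sanity-check the derivation, here is a small Monte Carlo sketch (illustrative; the values of $n$, $d$, and the trial count are arbitrary assumptions) that estimates $\mathbb{E}\,\lvert \frac{1}{n}\sum_i \epsilon_i x_i \rvert_2$ directly, using the dual-norm step above, and compares it to the $\sqrt{d/n}$ bound.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, trials = 200, 10, 5000

# For the unit-ball linear class, the sup has the closed form
# || (1/n) * sum_i eps_i x_i ||_2  (dual-norm step above), so we only need to
# average that norm over draws of x_i ~ N(0, I_d) and Rademacher eps_i.
norms = np.empty(trials)
for t in range(trials):
    x = rng.normal(size=(n, d))              # rows are x_1, ..., x_n
    eps = rng.choice([-1.0, 1.0], size=n)
    norms[t] = np.linalg.norm(eps @ x / n)

print(norms.mean())        # Monte Carlo estimate of R_n(F)
print(np.sqrt(d / n))      # the sqrt(d/n) upper bound from the derivation
```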
Main punch line:
Let $F$ be a class of functions mapping $X \to \mathbb{R}$, and let $x_1, \dots, x_n$ be i.i.d. draws from a distribution $\mathbb{D}$.
$$
\mathbb{E}\Big[\underset{f \in F}{\sup}\Big(\mathbb{E}[f(x)] - \frac{1}{n} \sum_{i=1}^n f(x_i)\Big)\Big] \leq 2 R_n(F)
$$
Equivalently, since the $x_i$ are i.i.d. draws from $\mathbb{D}$:
$$
\mathbb{E}\Big[\underset{f \in F}{\sup} \frac{1}{n} \sum_{i=1}^n \big(\mathbb{E}[f(x_i)] - f(x_i)\big)\Big] \leq 2 R_n(F)
$$
- $\mathbb{E}[f(x)]$ is the population-level value
- $\frac{1}{n} \sum_{i=1}^n f(x_i)$ is the sample-level (training) value
This is a statement about uniform deviations: how far sample (training) averages can drift from population averages, uniformly over the function class $F$.
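For the unit-ball linear class above, the uniform deviation has a closed form: $\mathbb{E}[f(x)] = \langle w, \mathbb{E}[x] \rangle = 0$, so the sup over $\lvert w \rvert_2 \leq 1$ equals $\lvert \frac{1}{n}\sum_i x_i \rvert_2$. That makes the punch line easy to check numerically; the sketch below is illustrative, with arbitrary choices of $n$, $d$, and trial count, not code from the notes.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, trials = 200, 10, 5000

# Uniform deviation for the unit-ball linear class on N(0, I_d) data:
# E[f(x)] = <w, E[x]> = 0, so
#   sup_{|w|_2 <= 1} (E[f(x)] - (1/n) sum_i f(x_i)) = || (1/n) sum_i x_i ||_2.
devs = np.empty(trials)
for t in range(trials):
    x = rng.normal(size=(n, d))
    devs[t] = np.linalg.norm(x.mean(axis=0))

print(devs.mean())            # E[sup_f (population value - sample average)]
print(2 * np.sqrt(d / n))     # 2 * R_n(F), using the sqrt(d/n) bound above
```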