Influence Functions

Q: which training data points are the most influential for our model?

Suppose we train on a dataset $\mathbb{D}$ containing $x_i$:

$$\mathbb{D} \underset{ERM}{\to} \hat{f}_{\hat{\theta}}, \quad x_i \in \mathbb{D}$$

and compare against training on $\mathbb{D}'$, the same dataset with $x_i$ removed:

$$\mathbb{D}' \underset{ERM}{\to} \hat{f}_{\hat{\theta}'}, \quad x_i \notin \mathbb{D}'$$

Notation:

$$
\begin{aligned}
& \mathbb{D} = \{(x_i, y_i)\}_{i=1}^n \quad[\text{full training data}] \\
& \mathbb{D}^{\neg i} = \{(x_j, y_j)\}_{j \neq i} \quad[\text{leave-one-out dataset: } \mathbb{D} \text{ with the } i\text{-th point removed}]
\end{aligned}
$$

$$
\begin{aligned}
& \hat{\theta}_{n} \in \underset{\theta \in \Theta}{argmin}\; \underset{z \sim \mathbb{D}}{\hat{\mathbb{E}}} [l(z, \theta)] = \frac{1}{n} \sum_{i=1}^n l(z_i, \theta) \\
& \hat{\theta}^{\neg i} \in \underset{\theta \in \Theta}{argmin}\; \underset{z \sim \mathbb{D}^{\neg i}}{\hat{\mathbb{E}}}[l(z, \theta)] = \frac{1}{n-1} \sum_{j \neq i} l(z_j, \theta)
\end{aligned}
$$
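To make these two estimators concrete, here is a minimal numpy sketch (my example, assuming squared loss $l(z, \theta) = \frac{1}{2}(x^\top \theta - y)^2$, for which ERM reduces to the normal equations):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                                     # features x_i
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)   # targets y_i

# theta_hat_n: ERM on the full dataset D (normal equations for squared loss).
theta_n = np.linalg.solve(X.T @ X, X.T @ y)

# theta_hat^{neg i}: ERM on D^{neg i}, i.e. retrain with the i-th point removed.
i = 7
X_loo, y_loo = np.delete(X, i, axis=0), np.delete(y, i)
theta_loo = np.linalg.solve(X_loo.T @ X_loo, X_loo.T @ y_loo)
```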


Define:
$\epsilon \in \mathbb{R}$, $z = (x,y)$

$$
\hat{\theta}_{\epsilon, z} \in \underset{\theta \in \Theta}{argmin} \frac{1}{n} \sum_{i=1}^n l(z_i, \theta) + \epsilon \cdot l(z, \theta)
$$

  • the first part is the original loss function: $\frac{1}{n} \sum_{i=1}^n l(z_i, \theta) = \underset{\hat{z} \sim \mathbb{D}}{\hat{\mathbb{E}}}\, l(\hat{z}, \theta)$, which gives two special cases (derivation just below):
    $$\hat{\theta}_{0, z} = \hat{\theta}_n$$
    $$\hat{\theta}_{-1/n, z_i} = \hat{\theta}^{\neg i}$$
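Why $\epsilon = -\frac{1}{n}$: subtracting the $i$-th term from the empirical loss gives

$$
\frac{1}{n} \sum_{j=1}^n l(z_j, \theta) - \frac{1}{n}\, l(z_i, \theta) = \frac{1}{n} \sum_{j \neq i} l(z_j, \theta) = \frac{n-1}{n} \cdot \frac{1}{n-1} \sum_{j \neq i} l(z_j, \theta),
$$

a positive multiple of the leave-one-out objective, so both have the same argmin.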

First-order Taylor expansion:
$$\hat{\theta}^{\neg i} \approx \hat{\theta}_n - \frac{1}{n} \cdot \frac{\partial \hat{\theta}_{\epsilon,z_i}}{\partial \epsilon}\Big|_{\epsilon=0}$$
To see this, let $h(\epsilon) = \hat{\theta}_{\epsilon, z_i}$. Then
$$\begin{aligned}
h(-\tfrac{1}{n}) &= h(0) + h'(0) \cdot \left(-\tfrac{1}{n}\right) + O\left(\tfrac{1}{n^2}\right) \\
& \approx h(0) - \tfrac{1}{n}\, h'(0)
\end{aligned}
$$
where $h(-\frac{1}{n}) = \hat{\theta}_{-\frac{1}{n},z_i} = \hat{\theta}^{\neg i}$ and $h(0) = \hat{\theta}_{0,z_i} = \hat{\theta}_n$.

Proposition

Suppose that $\hat{\theta}_n$ is a strict optimizer and $l(z, \theta)$ is twice differentiable in $\theta$. Then
$$
\frac{\partial \hat{\theta}_{\epsilon,z}}{\partial \epsilon}\Big|_{\epsilon=0} = -H(\hat{\theta}_n)^{-1} \nabla_\theta l(z, \hat{\theta}_n)
$$

where $H$ is the empirical Hessian:

$$
H(\theta) = \frac{1}{n} \sum_{i=1}^n \nabla_\theta^2 l(z_i, \theta)
$$

Proof

$\hat{\theta}_{\epsilon,z}$ is an optimal solution to $\frac{1}{n} \sum_{i=1}^n l(z_i, \theta) + \epsilon \cdot l(z, \theta)$

Thus, the gradient of that objective vanishes at $\hat{\theta}_{\epsilon,z}$:
$$
0 = \frac{1}{n} \sum_{i=1}^n \nabla_\theta l(z_i, \hat{\theta}_{\epsilon,z}) + \epsilon \cdot \nabla_\theta l(z, \hat{\theta}_{\epsilon,z})
$$
Call the right-hand side $F(\epsilon) = \frac{1}{n} \sum_{i=1}^n \nabla_\theta l(z_i, \hat{\theta}_{\epsilon,z}) + \epsilon \cdot \nabla_\theta l(z, \hat{\theta}_{\epsilon,z})$.

Since $F(\epsilon) = 0$ holds for every $\epsilon$ in a neighborhood of $0$, differentiating in $\epsilon$ gives $0 = \frac{\partial}{\partial \epsilon} F(\epsilon)$.

Thus:
$$
0 = \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial \epsilon} \nabla_\theta l(z_i, \hat{\theta}_{\epsilon,z}) + \frac{\partial}{\partial \epsilon} \left[\epsilon \cdot \nabla_\theta l(z, \hat{\theta}_{\epsilon,z}) \right]
$$

For the first term, the chain rule gives:
$$
\frac{\partial}{\partial \epsilon} \nabla_\theta l(z_i, \hat{\theta}_{\epsilon,z}) = \nabla_\theta^2 l(z_i, \hat{\theta}_{\epsilon,z}) \cdot \frac{\partial}{\partial \epsilon} \hat{\theta}_{\epsilon,z}
$$

Note:
$\frac{\partial}{\partial \epsilon} \hat{\theta}_{\epsilon,z}$ is the derivative of the solution curve in $\epsilon$, which is exactly the quantity we want!

For the second term, the product rule gives:
$$
\frac{\partial}{\partial \epsilon} \left[\epsilon \cdot \nabla_\theta l(z, \hat{\theta}_{\epsilon,z}) \right] = \epsilon \cdot \frac{\partial}{\partial \epsilon} \nabla_\theta l(z, \hat{\theta}_{\epsilon,z}) + \nabla_\theta l(z, \hat{\theta}_{\epsilon,z})
$$

Now put the first and second terms together:
$$
0 = \left[ \frac{1}{n} \sum_{i=1}^n \nabla_\theta^2 l(z_i, \hat{\theta}_{\epsilon,z}) \right] \cdot \frac{\partial}{\partial \epsilon} \hat{\theta}_{\epsilon,z} + \epsilon \cdot \frac{\partial}{\partial \epsilon} \nabla_\theta l(z, \hat{\theta}_{\epsilon,z}) + \nabla_\theta l(z, \hat{\theta}_{\epsilon,z})
$$

Evaluating the RHS at $\epsilon = 0$ (the middle term vanishes and $\hat{\theta}_{0,z} = \hat{\theta}_n$):
$$
0 = \left[ \frac{1}{n} \sum_{i=1}^n \nabla_\theta^2 l(z_i, \hat{\theta}_n) \right] \cdot \frac{\partial}{\partial \epsilon} \hat{\theta}_{\epsilon,z} \Big|_{\epsilon=0} + \nabla_\theta l(z, \hat{\theta}_n)
$$

This is a linear system of the form $0 = Hx + q$, so $x = -H^{-1}q$; the strict-optimizer assumption guarantees that $H$ is invertible (positive definite).

Thus:
$$
\frac{\partial}{\partial \epsilon} \hat{\theta}_{\epsilon,z} \Big|_{\epsilon=0} = -\left[ \frac{1}{n} \sum_{i=1}^n \nabla_\theta^2 l(z_i, \hat{\theta}_n) \right]^{-1} \nabla_\theta l(z, \hat{\theta}_n) = -H(\hat{\theta}_n)^{-1} \nabla_\theta l(z, \hat{\theta}_n)
$$
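A quick numerical check of the proposition (my example, assuming squared loss so the perturbed argmin has a closed form): the analytic derivative $-H^{-1} \nabla_\theta l$ should match a finite-difference derivative of the solution curve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def perturbed_argmin(eps, x, yv):
    # Minimizer of (1/n) sum_j 0.5*(x_j'theta - y_j)^2 + eps * 0.5*(x'theta - yv)^2.
    A = X.T @ X / n + eps * np.outer(x, x)
    b = X.T @ y / n + eps * x * yv
    return np.linalg.solve(A, b)

z_x, z_y = X[7], y[7]
theta_n = perturbed_argmin(0.0, z_x, z_y)        # eps = 0 recovers theta_hat_n

H = X.T @ X / n                                  # empirical Hessian
grad = z_x * (z_x @ theta_n - z_y)               # grad_theta l(z, theta_hat_n)
analytic = -np.linalg.solve(H, grad)             # the proposition's formula

eps = 1e-6                                       # central finite difference in eps
numeric = (perturbed_argmin(eps, z_x, z_y) - perturbed_argmin(-eps, z_x, z_y)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-4)) # should print True
```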


We have an approximate formula:
$$
\begin{aligned}
\hat{\theta}^{\neg i} & \approx \hat{\theta}_n - \frac{1}{n}\, \frac{\partial}{\partial \epsilon} \hat{\theta}_{\epsilon,z_i} \Big|_{\epsilon=0} \\
&= \hat{\theta}_n + \frac{1}{n}\, H(\hat{\theta}_n)^{-1} \nabla_\theta l(z_i, \hat{\theta}_n)
\end{aligned}
$$

This approximation comes from the first-order Taylor expansion above.

Now consider a function $F$:
$$
F: \Theta \to \mathbb{R}
$$

$$
F(\hat{\theta}^{\neg i}) = F(\hat{\theta}_n) + \langle \nabla F(\hat{\theta}_n),\, \hat{\theta}^{\neg i} - \hat{\theta}_n \rangle + \text{higher-order terms}
$$

Plugging in the approximation for $\hat{\theta}^{\neg i} - \hat{\theta}_n$ from above:
$$
F(\hat{\theta}^{\neg i}) \approx F(\hat{\theta}_n) + \frac{1}{n} \langle \nabla F(\hat{\theta}_n),\, H(\hat{\theta}_n)^{-1} \nabla_\theta l(z_i, \hat{\theta}_n) \rangle
$$

Now choose
$$
F(\theta) = l(z, \theta)
$$
Then, for $z = z_i$:
$$
l(z_i, \hat{\theta}^{\neg i}) \approx l(z_i, \hat{\theta}_n) + \frac{1}{n} \langle \nabla_\theta l(z_i, \hat{\theta}_n),\, H(\hat{\theta}_n)^{-1} \nabla_\theta l(z_i, \hat{\theta}_n) \rangle
$$
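This quantity, the approximate increase in loss on $z_i$ when $z_i$ is removed from training, is the influence of $z_i$ on its own loss. Continuing the squared-loss example, a minimal sketch comparing it against exact leave-one-out retraining (names like `theta_loo` are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

loss = lambda th, x, yv: 0.5 * (x @ th - yv) ** 2
theta_n = np.linalg.solve(X.T @ X, X.T @ y)      # ERM on full data
H = X.T @ X / n                                  # empirical Hessian

i = 7
g_i = X[i] * (X[i] @ theta_n - y[i])             # grad_theta l(z_i, theta_hat_n)
# Influence estimate: l(z_i, theta_loo) - l(z_i, theta_n) ~ (1/n) <g_i, H^{-1} g_i>.
predicted = g_i @ np.linalg.solve(H, g_i) / n

# Ground truth: actually retrain without the i-th point.
X_loo, y_loo = np.delete(X, i, axis=0), np.delete(y, i)
theta_loo = np.linalg.solve(X_loo.T @ X_loo, X_loo.T @ y_loo)
actual = loss(theta_loo, X[i], y[i]) - loss(theta_n, X[i], y[i])

print(predicted, actual)                         # close, and both nonnegative
```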

As a sanity check, we should have:

$$
l(z_i, \hat{\theta}^{\neg i}) \geq l(z_i, \hat{\theta}_n)
$$

  • the left model was trained without $z_i$
  • the right model was trained with $z_i$

The approximation agrees: the correction term $\frac{1}{n} \langle \nabla_\theta l(z_i, \hat{\theta}_n),\, H(\hat{\theta}_n)^{-1} \nabla_\theta l(z_i, \hat{\theta}_n) \rangle$ is a quadratic form in the positive-definite matrix $H(\hat{\theta}_n)^{-1}$, hence nonnegative.

There are still some issues with influence functions:

  1. Computing and storing the Hessian takes $O(d^2)$ memory (and $O(d^3)$ time to invert), which is prohibitive for large models.
  2. The strict-optimizer assumption (an invertible Hessian at an exact minimum) often fails in practice.

Lost some notes here (the last ~25 minutes) about fixing these two problems; a sketch of the standard fix for issue 1 follows.
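The standard remedy for issue 1 (the well-known implicit Hessian-vector-product approach, not reconstructed from the missing notes) is to never materialize $H$: solve $Hv = \nabla_\theta l$ with conjugate gradient, where each iteration needs only a Hessian-vector product. A minimal numpy sketch for logistic regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)   # binary labels
theta = 0.1 * rng.normal(size=d)          # current parameters

def hvp(v):
    # Hessian-vector product for the mean logistic loss:
    # H = (1/n) X^T diag(s*(1-s)) X, but we never form the d x d matrix --
    # only O(n*d) work and O(d) extra memory per product.
    s = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ ((s * (1.0 - s)) * (X @ v)) / n

def cg_solve(matvec, b, iters=100, tol=1e-10):
    # Plain conjugate gradient for H x = b (H symmetric positive definite).
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = matvec(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Inverse-Hessian-vector product needed by the influence formula.
s = 1.0 / (1.0 + np.exp(-X @ theta))
g = X[0] * (s[0] - y[0])          # grad_theta of logistic loss on z_0
ihvp = cg_solve(hvp, g)           # ~= H^{-1} g, without O(d^2) storage
```

For issue 2, a common practical fix is damping: solve $(H + \lambda I) v = g$ by adding `lam * v` inside `hvp`, which keeps the system well-posed even when $H$ is singular.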