## Local-Linear Kernel Regression¶

We give a brief introduction to the mathematics of local-linear kernel regression followed by source code showing how to fit the bandwidth parameter using k-fold cross validation.

### Problem Setup¶

Assume we have data of the form $(y_1, x_1), (y_2, x_2),\ldots,(y_n, x_n)$ where $y_i$ are the response variables and $x_i$ are predictor variables.

Suppose we have a new test point $x$ for which we are trying to estimate the corresponding $y$ value. Local linear estimation assumes the relationship,

$$y = m(x) + u$$

where $u$ is the error.

Then we consider the Taylor series expansion about the test point $x$ for each of the data points $(y_i, x_i)$,

\begin{aligned} y_i &= m(x_i) + u_i \\ &\approx m(x) + (x_i - x) m'(x) + \ldots \text{(higher order terms)} + u_i\\ &= m(x) + (x_i - x) \beta(x) + u_i \end{aligned}

where in the above we renamed the gradient $m'(x)$ to the more familiar $\beta(x)$ as is commonly used in regression terminology.

Now if we ignore the higher order terms in the Taylor series, and treat $m'(x)$ and $\beta(x)$ as the parameters to fit, we get the familiar linear regression setup,

$$y_i = \beta_0 + (x_i - x) \beta_1 + u_i$$

So local linear kernel regression seeks to minimize,

\begin{aligned} &\min_{\{\beta_0,\beta_1\}}\sum_{i=1}^n \left[\underbrace{y_i - \beta_0 - (x_i - x) \beta_1}_{z_i} \right]^2 K_h(x_i, x)\\ &= \min_{\{\beta_0,\beta_1\}}\underbrace{ \begin{bmatrix} z_1 & z_1 & \dots & z_n \\ \end{bmatrix}}_{\mathbf{z}^T}\underbrace{ \begin{bmatrix} K_h(x_1, x) & & \\ & \ddots & \\ & & K_h(x_n, x) \end{bmatrix}}_{\mathbf{K(x)}} \underbrace{\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \\ \end{bmatrix}}_{\mathbf{z}}\\ &= \min_{\{\beta_0,\beta_1\}} \mathbf{z}^T \mathbf{K(x)} \mathbf{z} \end{aligned}

with $K_h(x_i, x)$ as the kernel and $\mathbf{z}=\mathbf{y} - \mathbf{X}\mathbf{\beta}$.

To solve the minimization problem, we take first order conditions and set it to zero.

First we expand the expression,

\begin{aligned} \mathbf{z}^T \mathbf{K(x)} \mathbf{z} &= \left[\mathbf{y} - \mathbf{X}\mathbf{\beta}\right]^T\mathbf{K(x)} \left[\mathbf{y} - \mathbf{X}\mathbf{\beta}\right]\\ &=\left[\mathbf{y} - \mathbf{X}\mathbf{\beta}\right]^T\left[ \mathbf{K(x)}\mathbf{y} - \mathbf{K(x)}\mathbf{X}\mathbf{\beta} \right]\\ &=\mathbf{y}^T\mathbf{K(x)}\mathbf{y} -\mathbf{y}^T\mathbf{K(x)}\mathbf{X}\mathbf{\beta} - \beta^T \mathbf{X}^T\mathbf{K(x)}\mathbf{y} + \beta^T \mathbf{X}^T\mathbf{K(x)}\mathbf{X}\mathbf{\beta} \end{aligned}

Differentiating with respect to the vector $\beta$ (see notes on matrix differentiation here) gives,

\begin{aligned} \frac{d}{d\beta}\left[ \mathbf{z}^T \mathbf{K(x)} \mathbf{z} \right] &= - \mathbf{y}^T\mathbf{K(x)X} -\mathbf{y}^T\mathbf{K(x)X} + 2\mathbf{X}^T\mathbf{K(x)}\mathbf{X}\beta\\ &=-2\mathbf{y}^T\mathbf{K(x)X} + 2\mathbf{X}^T\mathbf{K(x)}\mathbf{X}\beta \end{aligned}

Setting $\frac{d}{d\beta}=0$ and solving for $\beta$ gives, \begin{aligned} \mathbf{y}^T\mathbf{K(x)X} &= \mathbf{X}^T\mathbf{K(x)}\mathbf{X}\beta\\ \beta&= \left(\mathbf{X}^T\mathbf{K(x)}\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{ K(x)y} \end{aligned}

So the optimal $\beta$ that minimizes the weighted squared error is given by the $\hat{\beta}=\left(\mathbf{X}^T\mathbf{K(x)}\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{ K(x)y}$.

Recall that,

$$\hat{\beta}=\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \end{bmatrix}=\begin{bmatrix} \hat{m}(x) \\ \hat{m}'(x) \\ \end{bmatrix}\\$$

The predicted value for $y$ for test point $x$ is thus given by $\hat{m}(x)$.

### Python implementation¶

We now show the source code for the Python implementation of the above local-linear kernel regression. We use a sample test function and generate points from test function with noise added.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from   numpy.linalg import pinv
from   sklearn import model_selection
from   functools import partial

%matplotlib inline

In [2]:
def f(x):
return 3.*np.cos(x/2.) + x**2./5. + 3.

def kernel(x):
''' compute gaussian kernel '''
pdf_const = 1 / np.sqrt( 2. * np.pi )
return pdf_const * np.exp( -0.5 * x ** 2 )

In [3]:
def local_linear_regression( xs, ys, new_points, h ):
''' returns the local kernel weighted average for the set of points in new_points
xs, ys, are the input data. h is the bandwidth
'''

# the logic below is written for simplicity
predicted = []
for point in new_points:
# set up matrices for generalized weighted regression

X = np.vstack(( np.ones(len(xs)), xs - point )).T
Y = ys.T

# can replace this below with scipy sparse matrix if needed
W = np.diag( kernel( (xs - point) / h ) )

A = X.T.dot(W).dot(X)
B = X.T.dot(W).dot(Y)

# use pinv to handle degeneraccies
predicted.append( pinv(A).dot(B)[0] )

return np.array(predicted)

In [4]:
def k_fold_split( X, Y, k, test_size=0.1 ):
''' split the data into k parts '''

datasets = []
for train_idx, test_idx in model_selection.ShuffleSplit( k, test_size ).split(X):
datasets.append((X[train_idx], Y[train_idx], X[test_idx], Y[test_idx]))

return datasets

def k_fold_mse( xdata, ydata, k, h ):
''' perform k-fold cross-validation for given bandwidth value h and return the mse error '''
mse = []
for xtrain, ytrain, xtest, ytest in k_fold_split(xdata, ydata, k):
y_predicted = local_linear_regression(xtrain, ytrain, xtest, h)
mse.append( np.mean( ( ytest - y_predicted )**2 ) )
return np.mean( mse )

In [5]:
def run_optimize():
''' create test data, run optimization, and plot results '''
# create test data
xs = np.random.rand(100) * 10
ys = f(xs) + 2*np.random.randn( *xs.shape )
grid = np.r_[0:10:512j]

kfold = 10
hmin, hmax = 0.05, 2.5
hs = np.linspace( hmin, hmax, 200)

g = partial( k_fold_mse, xs, ys, kfold )
MSEs = map( g, hs )

idx_min = np.argmin( MSEs )

opt_h = hs[idx_min]
opt_mse = MSEs[idx_min]

predicted_ys = local_linear_regression( xs, ys, grid, opt_h )
title = 'Gaussian Kernel Regression - optimal (mse, bandwidth) = (%s, %s)'
title = title % ( opt_mse, opt_h )
plt.figure(figsize=(12,8))
plt.plot(grid, f(grid), 'r--', label='Reference', lw=2.0)
plt.plot(grid, predicted_ys, 'g--', label='Gaussian Kernel Regression', lw=2.0)
plt.plot(xs, ys, 'o', alpha=0.5, label='Data')
plt.title(title)
plt.legend(loc='best')
plt.show()

run_optimize()