Local-Linear Kernel Regression

We give a brief introduction to the mathematics of local-linear kernel regression followed by source code showing how to fit the bandwidth parameter using k-fold cross validation.

Problem Setup

Assume we have data of the form $(y_1, x_1), (y_2, x_2),\ldots,(y_n, x_n)$ where $y_i$ are the response variables and $x_i$ are predictor variables.

Suppose we have a new test point $x$ for which we are trying to estimate the corresponding $y$ value. Local linear estimation assumes the relationship,

$$ y = m(x) + u $$

where $u$ is the error.

Then we consider the Taylor series expansion about the test point $x$ for each of the data points $(y_i, x_i)$,

$$ \begin{aligned} y_i &= m(x_i) + u_i \\ &\approx m(x) + (x_i - x) m'(x) + \ldots \text{(higher order terms)} + u_i\\ &= m(x) + (x_i - x) \beta(x) + u_i \end{aligned} $$

where in the above we renamed the gradient $m'(x)$ to the more familiar $ \beta(x)$ as is commonly used in regression terminology.

Now if we ignore the higher order terms in the Taylor series, and treat $m'(x)$ and $ \beta(x)$ as the parameters to fit, we get the familiar linear regression setup,

$$ y_i = \beta_0 + (x_i - x) \beta_1 + u_i $$

So local linear kernel regression seeks to minimize,

$$ \begin{aligned} &\min_{\{\beta_0,\beta_1\}}\sum_{i=1}^n \left[\underbrace{y_i - \beta_0 - (x_i - x) \beta_1}_{z_i} \right]^2 K_h(x_i, x)\\ &= \min_{\{\beta_0,\beta_1\}}\underbrace{ \begin{bmatrix} z_1 & z_1 & \dots & z_n \\ \end{bmatrix}}_{\mathbf{z}^T}\underbrace{ \begin{bmatrix} K_h(x_1, x) & & \\ & \ddots & \\ & & K_h(x_n, x) \end{bmatrix}}_{\mathbf{K(x)}} \underbrace{\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \\ \end{bmatrix}}_{\mathbf{z}}\\ &= \min_{\{\beta_0,\beta_1\}} \mathbf{z}^T \mathbf{K(x)} \mathbf{z} \end{aligned} $$

with $ K_h(x_i, x)$ as the kernel and $\mathbf{z}=\mathbf{y} - \mathbf{X}\mathbf{\beta}$.

To solve the minimization problem, we take first order conditions and set it to zero.

First we expand the expression,

$$ \begin{aligned} \mathbf{z}^T \mathbf{K(x)} \mathbf{z} &= \left[\mathbf{y} - \mathbf{X}\mathbf{\beta}\right]^T\mathbf{K(x)} \left[\mathbf{y} - \mathbf{X}\mathbf{\beta}\right]\\ &=\left[\mathbf{y} - \mathbf{X}\mathbf{\beta}\right]^T\left[ \mathbf{K(x)}\mathbf{y} - \mathbf{K(x)}\mathbf{X}\mathbf{\beta} \right]\\ &=\mathbf{y}^T\mathbf{K(x)}\mathbf{y} -\mathbf{y}^T\mathbf{K(x)}\mathbf{X}\mathbf{\beta} - \beta^T \mathbf{X}^T\mathbf{K(x)}\mathbf{y} + \beta^T \mathbf{X}^T\mathbf{K(x)}\mathbf{X}\mathbf{\beta} \end{aligned} $$

Differentiating with respect to the vector $\beta$ (see notes on matrix differentiation here) gives,

$$ \begin{aligned} \frac{d}{d\beta}\left[ \mathbf{z}^T \mathbf{K(x)} \mathbf{z} \right] &= - \mathbf{y}^T\mathbf{K(x)X} -\mathbf{y}^T\mathbf{K(x)X} + 2\mathbf{X}^T\mathbf{K(x)}\mathbf{X}\beta\\ &=-2\mathbf{y}^T\mathbf{K(x)X} + 2\mathbf{X}^T\mathbf{K(x)}\mathbf{X}\beta \end{aligned} $$

Setting $\frac{d}{d\beta}=0$ and solving for $\beta$ gives, $$ \begin{aligned} \mathbf{y}^T\mathbf{K(x)X} &= \mathbf{X}^T\mathbf{K(x)}\mathbf{X}\beta\\ \beta&= \left(\mathbf{X}^T\mathbf{K(x)}\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{ K(x)y} \end{aligned} $$

So the optimal $\beta$ that minimizes the weighted squared error is given by the $\hat{\beta}=\left(\mathbf{X}^T\mathbf{K(x)}\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{ K(x)y}$.

Recall that,

$$ \hat{\beta}=\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \end{bmatrix}=\begin{bmatrix} \hat{m}(x) \\ \hat{m}'(x) \\ \end{bmatrix}\\ $$

The predicted value for $y$ for test point $x$ is thus given by $\hat{m}(x)$.

Python implementation

We now show the source code for the Python implementation of the above local-linear kernel regression. We use a sample test function and generate points from test function with noise added.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from   numpy.linalg import pinv
from   sklearn import model_selection
from   functools import partial

%matplotlib inline
In [2]:
def f(x):
    return 3.*np.cos(x/2.) + x**2./5. + 3.

def kernel(x):
    ''' compute gaussian kernel '''
    pdf_const = 1 / np.sqrt( 2. * np.pi )
    return pdf_const * np.exp( -0.5 * x ** 2 )
In [3]:
def local_linear_regression( xs, ys, new_points, h ):
    ''' returns the local kernel weighted average for the set of points in new_points
        xs, ys, are the input data. h is the bandwidth

    # the logic below is written for simplicity
    predicted = []
    for point in new_points:
        # set up matrices for generalized weighted regression

        X = np.vstack(( np.ones(len(xs)), xs - point )).T
        Y = ys.T

        # can replace this below with scipy sparse matrix if needed
        W = np.diag( kernel( (xs - point) / h ) )

        A = X.T.dot(W).dot(X)
        B = X.T.dot(W).dot(Y)

        # use pinv to handle degeneraccies
        predicted.append( pinv(A).dot(B)[0] )

    return np.array(predicted)
In [4]:
def k_fold_split( X, Y, k, test_size=0.1 ):
    ''' split the data into k parts '''

    datasets = []
    for train_idx, test_idx in model_selection.ShuffleSplit( k, test_size ).split(X):
        datasets.append((X[train_idx], Y[train_idx], X[test_idx], Y[test_idx]))

    return datasets

def k_fold_mse( xdata, ydata, k, h ):
    ''' perform k-fold cross-validation for given bandwidth value h and return the mse error '''
    mse = []
    for xtrain, ytrain, xtest, ytest in k_fold_split(xdata, ydata, k):
        y_predicted = local_linear_regression(xtrain, ytrain, xtest, h)
        mse.append( np.mean( ( ytest - y_predicted )**2 ) )
    return np.mean( mse )
In [5]:
def run_optimize():
    ''' create test data, run optimization, and plot results '''
    # create test data
    xs = np.random.rand(100) * 10
    ys = f(xs) + 2*np.random.randn( *xs.shape )
    grid = np.r_[0:10:512j]

    kfold = 10
    hmin, hmax = 0.05, 2.5
    hs = np.linspace( hmin, hmax, 200)

    g = partial( k_fold_mse, xs, ys, kfold )
    MSEs = map( g, hs )

    idx_min = np.argmin( MSEs )

    opt_h = hs[idx_min]
    opt_mse = MSEs[idx_min]

    predicted_ys = local_linear_regression( xs, ys, grid, opt_h )
    title = 'Gaussian Kernel Regression - optimal (mse, bandwidth) = (%s, %s)'
    title = title % ( opt_mse, opt_h )
    plt.plot(grid, f(grid), 'r--', label='Reference', lw=2.0)
    plt.plot(grid, predicted_ys, 'g--', label='Gaussian Kernel Regression', lw=2.0)
    plt.plot(xs, ys, 'o', alpha=0.5, label='Data')