Forward Stepwise Regression

We illustrate stepwise regression using a Auto mpg data-set from the UCI machine learning repository.

Let $\mathbf{X}$ be a $n\times p$ matrix of features and $\mathbf{y}$ a $n\times 1$ vector of reponse variables.

To make the problem slightly more interesting, assume we are only given $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}^T\mathbf{y}$, instead of the actual raw data.

We will show that we are still able to perform stepwise regression using a sensible criteria for choosing the next variable to add to the predictor list.

We also include logic to handling multi-collinearity and exclude variables that are highly collinear from inclusion into the current predictor set.

Linear Regression

Firstly we show the familiar problem setup for least squares linear regression, which is to minimize,

$$ \begin{aligned} &\min_{\{\beta_0,\beta_1\}}\sum_{i=1}^n \left[\underbrace{y_i - \beta_0 - x_{1i} \beta_1 - x_{2i} \beta_2 - \cdots - x_{1p} \beta_p }_{z_i} \right]^2 \\ &= \min_{\{\beta_0,\beta_1\}} \left( \mathbf{y} - \mathbf{X}\mathbf{\beta} \right)^T \left( \mathbf{y} - \mathbf{X}\mathbf{\beta} \right) \end{aligned} $$

Taking first order conditions with respect to the vector $\beta$ and setting it to zero, we get,

$$ \begin{aligned} \frac{d}{d\beta}\left[\left( \mathbf{y} - \mathbf{X}\mathbf{\beta} \right)^T \left( \mathbf{y} - \mathbf{X}\mathbf{\beta} \right)\right] &= 0\\ -2\mathbf{y}^T\mathbf{X} + 2 \mathbf{X}^T\mathbf{X}\mathbf{\beta}&=0\\ \mathbf{\beta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1} \mathbf{X}^T\mathbf{y} \end{aligned} $$

Now we know that the "hat matrix" $\mathbf{H}$, "puts the hat on $\mathbf{y}$", meaning, $$ \hat{\mathbf{y}}=\mathbf{X\beta} = \underbrace{\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1} \mathbf{X}^T}_{\mathbf{H}}\mathbf{y} $$

Given that we need to work under the constraint that we are only given $\mathbf{A} = \mathbf{X}^T\mathbf{X}$ and $\mathbf{B}=\mathbf{X}^T\mathbf{y}$, the only selection criteria we are able to use is to consider an underdetermined version of the SSE.

We know that the SSE is given by,

$$ \begin{aligned} \text{SSE} &= \left(\mathbf{y} - \hat{\mathbf{y}}\right)^T\left(\mathbf{y} - \hat{\mathbf{y}}\right)\\ &= \mathbf{y}^T \mathbf{y} -\mathbf{y}^T \mathbf{Hy}- \mathbf{y}^T\mathbf{H}^T\mathbf{y}+ \mathbf{y}^T\underbrace{\mathbf{H}^T\mathbf{H}}_{\mathbf{H}}\mathbf{y}\\ &=\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{H}^T\mathbf{y}\\ &=\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{H}\mathbf{y}\\ &=\mathbf{y}^T\mathbf{y} - \mathbf{B}^T\mathbf{A}^{-1}\mathbf{B} \end{aligned} $$

Since we have no information to compute $\mathbf{y}^T\mathbf{y}$, our SSE is underdetermined. Nevertheless, we still have information to calculate the term $\mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}$ under varying number of predictors when we perform our stepwise regression. Another point to note is that $\mathbf{y}^T\mathbf{y}$ is common to all sets of regressions we will be running, so it is effectively a constant term.

Handling $k\lt p$ Predictors

Say in our current pass of the stepwise regression, we only have $k\lt p$ predictors. However, our data is always $\mathbf{A} = \mathbf{X}^T\mathbf{X}$, which is a $n\times n$ matrix, and $\mathbf{B}=\mathbf{X}^T\mathbf{y}$, which is a $p\times 1$ matrix. To obtain the reduced matrices for our regression involving only $k$ predictors, we delete all the rows and columns in $\mathbf{A}$ corresponding to the un-used predictors. For $\mathbf{B}$, we delete the rows corresponding predictors which are currently omitted.

With the above, we can then go ahead and perform the regression with the reduced set of parameters.

In [2]:
import pandas as pd
import numpy as np
from   numpy.linalg import pinv, inv, eig
from   sklearn.preprocessing import normalize

def regress_subset( A, B, indices ):
    ''' A : (X'X)
        B : X'y
        indices : list of index for subset of features we want to fit.
        Given A anb B above, return OLS result on only a subset of the data specified by indices.

    indices = sorted( [0] + indices )

    reduced_A = np.take( A, indices, axis=1 )
    reduced_A = np.take( reduced_A, indices, axis=0 )
    reduced_B = np.take( B, indices)

    betas   = pinv(reduced_A).dot(reduced_B)

    # given the constraint that we only have A=(X'X), B=X'y our scoring function
    # is just B'A^-1 B, which is an underdetermined residual sum of squares.
    score   =

    if len( indices ) > 2: # only apply when we have more than 1 feature
                             # 2 in above is because there is a const term.
        w, _    = eig( normalize(reduced_A) )
        cond_no = np.sqrt( max(w) / min(w) )
        cond_no = 0.0

    return score, betas, cond_no

def stepwise_regress( A, B, max_features ):

    MAX_COND_NO = 15.0

    existing_indicies = []
    collinear_indicies = []
    done = False

    all_indices = range(1, len(A))

    while not len(existing_indicies) == max_features:
        best_score = 0
        best_idx = None

        candidate_indices = set(all_indices).difference(existing_indicies)
        candidate_indices = candidate_indices.difference(collinear_indicies)
        if not candidate_indices:

        for idx in candidate_indices:
            cur_indices = existing_indicies + [idx]
            score, betas, cond_no = regress_subset( A, B, cur_indices )
            if cond_no > MAX_COND_NO:
            elif score > best_score:
                best_score = score
                best_idx = idx

        if best_idx is not None:

    return existing_indicies

def readData():

    header=['mpg','cylinders', 'displacement', 'horsepower', 'weight',
             'acceleration', 'model_year', 'origin', 'car_name']

    df = pd.read_csv( '', sep='\s+', names=header)
    df = df[['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']]

    return df

def main():

    df = readData()
    records = np.array( df.to_records(index=False).tolist())

    Y = records[:, 0]
    X = records[:, 1:]
    X = np.column_stack((np.ones(len(records)), X))

    A =
    B =

    features_index = stepwise_regress( A, B, 3 )
    _, betas, _ = regress_subset(A, B, features_index)
    print 'The selected best features are', features_index
    print 'The regression betas are', betas

The selected best features are [3]
The regression betas are [  4.62165245e+01  -7.64734254e-03]

The mpg data is highly multi-collinear, and in this case, our stepwise regression has chosen only one single predictor variable. If we run a full regression using all predictors and study the results, we can see that the condition number of the data matrix is extremely high.