## Forward Stepwise Regression

We illustrate stepwise regression using the Auto MPG data set from the UCI Machine Learning Repository.

Let $\mathbf{X}$ be an $n\times p$ matrix of features and $\mathbf{y}$ an $n\times 1$ vector of response variables.

To make the problem slightly more interesting, assume we are only given $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}^T\mathbf{y}$, instead of the actual raw data.

We will show that we can still perform stepwise regression using a sensible criterion for choosing the next variable to add to the predictor list.

We also include logic to handle multicollinearity, excluding variables that are highly collinear with the current predictor set.

### Linear Regression

First we show the familiar problem setup for least squares linear regression, which is to minimize,

\begin{aligned} &\min_{\beta}\sum_{i=1}^n \left[\underbrace{y_i - \beta_0 - x_{i1} \beta_1 - x_{i2} \beta_2 - \cdots - x_{ip} \beta_p }_{z_i} \right]^2 \\ &= \min_{\beta} \left( \mathbf{y} - \mathbf{X}\mathbf{\beta} \right)^T \left( \mathbf{y} - \mathbf{X}\mathbf{\beta} \right) \end{aligned}

Taking the first order condition with respect to the vector $\beta$ and setting it to zero, we get,

\begin{aligned} \frac{d}{d\beta}\left[\left( \mathbf{y} - \mathbf{X}\mathbf{\beta} \right)^T \left( \mathbf{y} - \mathbf{X}\mathbf{\beta} \right)\right] &= 0\\ -2\mathbf{X}^T\mathbf{y} + 2 \mathbf{X}^T\mathbf{X}\mathbf{\beta}&=0\\ \mathbf{\beta} &= \left(\mathbf{X}^T\mathbf{X}\right)^{-1} \mathbf{X}^T\mathbf{y} \end{aligned}

Now we know that the "hat matrix" $\mathbf{H}$, "puts the hat on $\mathbf{y}$", meaning, $$\hat{\mathbf{y}}=\mathbf{X\beta} = \underbrace{\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1} \mathbf{X}^T}_{\mathbf{H}}\mathbf{y}$$
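As a quick numerical check (a minimal sketch on synthetic data, not part of the original analysis), the solution $\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}$ computed purely from $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}^T\mathbf{y}$ should match a least squares fit on the raw data:

import numpy as np
from numpy.linalg import pinv

rng = np.random.RandomState(0)
X = np.column_stack((np.ones(50), rng.randn(50, 3)))   # intercept + 3 features
y = X.dot([1.0, 2.0, -3.0, 0.5]) + 0.1 * rng.randn(50)

A = X.T.dot(X)   # X'X
B = X.T.dot(y)   # X'y

betas_normal = pinv(A).dot(B)                         # (X'X)^{-1} X'y
betas_lstsq  = np.linalg.lstsq(X, y, rcond=None)[0]   # fit on the raw data
print(np.allclose(betas_normal, betas_lstsq))         # True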

Given that we need to work under the constraint that we are only given $\mathbf{A} = \mathbf{X}^T\mathbf{X}$ and $\mathbf{B}=\mathbf{X}^T\mathbf{y}$, the only selection criterion we are able to use is an underdetermined version of the SSE.

We know that the SSE is given by (using the fact that $\mathbf{H}$ is symmetric and idempotent, so $\mathbf{H}^T\mathbf{H}=\mathbf{H}$),

\begin{aligned} \text{SSE} &= \left(\mathbf{y} - \hat{\mathbf{y}}\right)^T\left(\mathbf{y} - \hat{\mathbf{y}}\right)\\ &= \mathbf{y}^T \mathbf{y} -\mathbf{y}^T \mathbf{Hy}- \mathbf{y}^T\mathbf{H}^T\mathbf{y}+ \mathbf{y}^T\underbrace{\mathbf{H}^T\mathbf{H}}_{\mathbf{H}}\mathbf{y}\\ &=\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{H}\mathbf{y}\\ &=\mathbf{y}^T\mathbf{y} - \mathbf{B}^T\mathbf{A}^{-1}\mathbf{B} \end{aligned}

Since we have no information to compute $\mathbf{y}^T\mathbf{y}$, our SSE is underdetermined. Nevertheless, we can still calculate the term $\mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}$ for varying numbers of predictors as we perform our stepwise regression. Moreover, $\mathbf{y}^T\mathbf{y}$ is common to every regression we will run, so it is effectively a constant term: choosing the predictor set that maximizes $\mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}$ is therefore equivalent to choosing the one that minimizes the SSE.
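We can verify this identity on the synthetic sketch above (reusing X, y, A and B from it): the residual sum of squares computed directly from the residuals equals $\mathbf{y}^T\mathbf{y} - \mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}$, so ranking candidate predictor sets by $\mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}$ (larger is better) ranks them exactly as the SSE would (smaller is better):

resid = y - X.dot(pinv(A).dot(B))                 # y - X beta
sse_direct  = resid.dot(resid)
sse_from_AB = y.dot(y) - B.dot(pinv(A)).dot(B)
print(np.allclose(sse_direct, sse_from_AB))       # True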

### Handling $k\lt p$ Predictors

Say in the current pass of the stepwise regression we only have $k\lt p$ predictors. However, our data is always $\mathbf{A} = \mathbf{X}^T\mathbf{X}$, which is a $p\times p$ matrix, and $\mathbf{B}=\mathbf{X}^T\mathbf{y}$, which is a $p\times 1$ vector. To obtain the reduced matrices for a regression involving only $k$ predictors, we delete all the rows and columns in $\mathbf{A}$ corresponding to the unused predictors. For $\mathbf{B}$, we delete the rows corresponding to the predictors that are currently omitted.

With the above, we can then go ahead and perform the regression with the reduced set of predictors; a small illustration follows.
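For example (a sketch; reduce_to_subset is a hypothetical helper, and np.ix_ is simply a convenient alternative to the np.take calls used in the notebook code below):

import numpy as np

def reduce_to_subset(A, B, keep):
    '''Keep the rows/columns of A and the rows of B for the predictor
    indices in `keep`; index 0 (the intercept column) is always kept.'''
    keep = sorted([0] + keep)
    return A[np.ix_(keep, keep)], B[keep]

# with 4 predictors plus an intercept, keep only predictors 2 and 3
A = np.arange(25.0).reshape(5, 5)
B = np.arange(5.0)
A_red, B_red = reduce_to_subset(A, B, [2, 3])
print(A_red.shape, B_red.shape)   # (3, 3) (3,)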

In [2]:
import pandas as pd
import numpy as np
from   numpy.linalg import pinv, inv, eig
from   sklearn.preprocessing import normalize

def regress_subset( A, B, indices ):
    ''' A : (X'X)
        B : X'y
        indices : list of indices for the subset of features we want to fit.

        Given A and B above, return the OLS result using only the subset of
        the data specified by indices.
    '''
    # index 0 is the intercept column, which is always included
    indices = sorted( [0] + indices )

    # drop the rows and columns of A, and the rows of B, that correspond
    # to the omitted predictors
    reduced_A = np.take( A, indices, axis=1 )
    reduced_A = np.take( reduced_A, indices, axis=0 )
    reduced_B = np.take( B, indices )

    betas = pinv(reduced_A).dot(reduced_B)

    # given the constraint that we only have A=(X'X), B=X'y, our scoring
    # function is just B'A^-1 B, which is an underdetermined residual sum
    # of squares (a larger score means a smaller SSE)
    score = reduced_B.T.dot(pinv(reduced_A)).dot(reduced_B)

    if len( indices ) > 2:  # only apply when we have more than 1 feature
                            # (2 because there is also the intercept term)
        # use eigenvalue magnitudes so complex pairs returned by eig on the
        # row-normalised (non-symmetric) matrix do not break the ratio
        w, _    = eig( normalize(reduced_A) )
        cond_no = np.sqrt( np.abs(w).max() / np.abs(w).min() )
    else:
        cond_no = 0.0

    return score, betas, cond_no

def stepwise_regress( A, B, max_features ):

    MAX_COND_NO = 15.0

    existing_indices  = []   # predictors selected so far
    collinear_indices = []   # predictors rejected as too collinear

    # candidate predictors are columns 1..p (column 0 is the intercept)
    all_indices = range(1, len(A))

    while len(existing_indices) < max_features:
        best_score = 0
        best_idx   = None

        candidate_indices = set(all_indices).difference(existing_indices)
        candidate_indices = candidate_indices.difference(collinear_indices)
        if not candidate_indices:
            break

        # greedily try each remaining candidate and keep the best scorer
        for idx in candidate_indices:
            cur_indices = existing_indices + [idx]
            score, betas, cond_no = regress_subset( A, B, cur_indices )
            if cond_no > MAX_COND_NO:
                collinear_indices.append(idx)
            elif score > best_score:
                best_score = score
                best_idx   = idx

        if best_idx is not None:
            existing_indices.append(best_idx)
        else:
            break   # no admissible candidate was found this pass

    return existing_indices

def load_data():
    # NOTE: reconstructed loader, assuming the UCI 'auto-mpg.data' file is
    # available locally ('?' marks missing horsepower values in that file)
    df = pd.read_csv( 'auto-mpg.data', delim_whitespace=True, header=None,
                      na_values='?' ).dropna()
    df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                  'acceleration', 'model_year', 'origin', 'car_name']
    df = df[['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']]
    return df

def main():

    df = load_data()

    records = np.array( df.to_records(index=False).tolist() )

    Y = records[:, 0]
    X = records[:, 1:]
    X = np.column_stack((np.ones(len(records)), X))   # prepend intercept column

    A = X.T.dot(X)
    B = X.T.dot(Y)

    features_index = stepwise_regress( A, B, 3 )
    _, betas, _ = regress_subset( A, B, features_index )
    print( 'The selected best features are', features_index )
    print( 'The regression betas are', betas )

main()

The selected best features are [3]
The regression betas are [  4.62165245e+01  -7.64734254e-03]


The MPG data is highly multicollinear, and in this case our stepwise regression has selected only a single predictor, index 3, which corresponds to the weight column. If we run a full regression using all the predictors and study the results, we can see that the condition number of the data matrix is extremely high.
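We can check this directly with the same condition-number criterion used in regress_subset (a sketch, reusing load_data and the imports from the cell above; the exact value depends on the data file used):

df = load_data()
X = np.column_stack((np.ones(len(df)), df.values[:, 1:]))   # intercept + all predictors
A = X.T.dot(X)
w, _ = eig( normalize(A) )
print( np.sqrt( np.abs(w).max() / np.abs(w).min() ) )   # far above MAX_COND_NO = 15.0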