Model Pipelines in Sklearn

Model Pipelines in Sklearn#

Main Idea

Version control is a system that tracks changes to files over time.

# This code spits out lots of warnings. We turn them off for the purposes of this tutorial. 
# This is bad practice in general. Only use if you already know your code is correct. 
import warnings
import os
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
os.environ['PYTHONWARNINGS']='ignore'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

!pip install -U scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: scikit-learn in /home/dane2/.local/lib/python3.9/site-packages (1.2.2)
Requirement already satisfied: joblib>=1.1.1 in /home/dane2/.local/lib/python3.9/site-packages (from scikit-learn) (1.2.0)
Requirement already satisfied: numpy>=1.17.3 in /software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/site-packages (from scikit-learn) (1.21.5)
Requirement already satisfied: threadpoolctl>=2.0.0 in /software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/site-packages (from scikit-learn) (2.2.0)
Requirement already satisfied: scipy>=1.3.2 in /software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/site-packages (from scikit-learn) (1.7.3)

import sklearn
assert sklearn.__version__ > '1.2'

from sklearn import set_config
set_config(transform_output = "pandas")

The Data#

df = pd.read_csv('/zfs/citi/workshop_data/python_ml/ames_train.csv')
df.shape

(1460, 81)

features = pd.read_csv('/zfs/citi/workshop_data/python_ml/ames_features.csv')
features

	variable	type	description
0	SalePrice	numeric	the property's sale price in dollars. This is ...
1	MSSubClass	categorical	The building class
2	MSZoning	categorical	The general zoning classification
3	LotFrontage	numeric	Linear feet of street connected to property
4	LotArea	numeric	Lot size in square feet
...	...	...	...
75	MiscVal	numeric	$Value of miscellaneous feature
76	MoSold	numeric	Month Sold
77	YrSold	numeric	Year Sold
78	SaleType	categorical	Type of sale
79	SaleCondition	categorical	Condition of sale

80 rows × 3 columns

Exploratory Analysis#

# it's always a good idea to look at your response variable
df.hist('SalePrice', bins=30)

array([[<AxesSubplot:title={'center':'SalePrice'}>]], dtype=object)

../_images/4539219f02d3a718d72045403a678ef13e3e33fa3b7f0dc21f492349bb9b2eeb.png

It wouldn’t be a bad idea to apply Box-Cox if using a linear model.

# we should make sure that no SalePrice values are missing
df.SalePrice.isna().any()

False

# it's also important to look at the degree of missingness in each of your features
column_missingness = df.isna().sum().sort_values(ascending=False) / len(df)
column_missingness.head(15)

PoolQC          0.995205
MiscFeature     0.963014
Alley           0.937671
Fence           0.807534
FireplaceQu     0.472603
LotFrontage     0.177397
GarageYrBlt     0.055479
GarageCond      0.055479
GarageType      0.055479
GarageFinish    0.055479
GarageQual      0.055479
BsmtFinType2    0.026027
BsmtExposure    0.026027
BsmtQual        0.025342
BsmtCond        0.025342
dtype: float64

Let’s go ahead and drop PoolQC, MiscFeature, and Alley as these are almost always missing. Note that some of these should be treated with more nuance. Take Fence for instance. I expect Fence takes the value NA when no fence is present.

features_to_drop = ['PoolQC', 'MiscFeature', 'Alley']
df = df.drop(features_to_drop, axis=1)
features = features[~features.variable.isin(features_to_drop)]
df.shape, features.shape

((1460, 78), (77, 3))

# Let's see if any rows are missing a large portion of their data
row_missingness = df.isna().sum(axis=1).sort_values(ascending=False) / len(df.columns)
row_missingness

    0.153846
   0.153846
   0.153846
  0.153846
  0.153846
          ...   
   0.000000
    0.000000
  0.000000
   0.000000
   0.000000
Length: 1460, dtype: float64

# let's take a closer look at some of these
index=39
df.loc[index, df.loc[index].isna()]

BsmtQual        NaN
BsmtCond        NaN
BsmtExposure    NaN
BsmtFinType1    NaN
BsmtFinType2    NaN
FireplaceQu     NaN
GarageType      NaN
GarageYrBlt     NaN
GarageFinish    NaN
GarageQual      NaN
GarageCond      NaN
Fence           NaN
Name: 39, dtype: object

Nothing too concerning here. Most of these features are very niche.

# get lists of categorical and numeric features
categorical_features = features.variable[features.type=='categorical'].tolist()
numeric_features = features.variable[features.type=='numeric'].tolist()[1:]
len(categorical_features), len(numeric_features)

(42, 34)

# let's take a look at our categorical features
categorical_features

['MSSubClass',
 'MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'Fireplaces',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'Fence',
 'SaleType',
 'SaleCondition']

# Let's look at the distribution of features
plot_features = ['Neighborhood', 'HouseStyle', 'GarageType', 'SaleType']
fig, axes = plt.subplots(2,2)
fig.set_size_inches(10,6)
axes = axes.flatten()
for ix, feature in enumerate(plot_features):
    df[feature].value_counts().plot(kind='bar', ax=axes[ix])
    axes[ix].set_title(feature)
plt.tight_layout()

../_images/db4df86a8f82b9012b23ae3e16db532baf7c522da1fe59c1578443ffda1feaae.png

# let's take a look at our categorical features
numeric_features

['LotFrontage',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'MoSold',
 'YrSold']

# since these are numeric, we can compute some basic stats:
df[numeric_features].describe()

	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	BsmtFinSF2	BsmtUnfSF	...	GarageArea	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	MiscVal	MoSold	YrSold
count	1201.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1452.000000	1460.000000	1460.000000	1460.000000	...	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000
mean	70.049958	10516.828082	6.099315	5.575342	1971.267808	1984.865753	103.685262	443.639726	46.549315	567.240411	...	472.980137	94.244521	46.660274	21.954110	3.409589	15.060959	2.758904	43.489041	6.321918	2007.815753
std	24.284752	9981.264932	1.382997	1.112799	30.202904	20.645407	181.066207	456.098091	161.319273	441.866955	...	213.804841	125.338794	66.256028	61.119149	29.317331	55.757415	40.177307	496.123024	2.703626	1.328095
min	21.000000	1300.000000	1.000000	1.000000	1872.000000	1950.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	2006.000000
25%	59.000000	7553.500000	5.000000	5.000000	1954.000000	1967.000000	0.000000	0.000000	0.000000	223.000000	...	334.500000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	5.000000	2007.000000
50%	69.000000	9478.500000	6.000000	5.000000	1973.000000	1994.000000	0.000000	383.500000	0.000000	477.500000	...	480.000000	0.000000	25.000000	0.000000	0.000000	0.000000	0.000000	0.000000	6.000000	2008.000000
75%	80.000000	11601.500000	7.000000	6.000000	2000.000000	2004.000000	166.000000	712.250000	0.000000	808.000000	...	576.000000	168.000000	68.000000	0.000000	0.000000	0.000000	0.000000	0.000000	8.000000	2009.000000
max	313.000000	215245.000000	10.000000	9.000000	2010.000000	2010.000000	1600.000000	5644.000000	1474.000000	2336.000000	...	1418.000000	857.000000	547.000000	552.000000	508.000000	480.000000	738.000000	15500.000000	12.000000	2010.000000

8 rows × 34 columns

Probably, most of these should be rescaled with standard scaling.

# Let's look at the distribution of features
plot_features = ['LotFrontage', 'BsmtFinSF1', '1stFlrSF', 'ScreenPorch']
fig, axes = plt.subplots(2,2)
fig.set_size_inches(10,6)
axes = axes.flatten()
for ix, feature in enumerate(plot_features):
    df[feature].value_counts().plot(kind='hist', ax=axes[ix], bins=20, logy=True)
    axes[ix].set_title(feature)
plt.tight_layout()

../_images/36164e6c9c8d666d7f77890a222b79f4621b09a34dcb0479b78e203494fbae7a.png

Some of these features have very low variance. They should probably be removed. Others should possibly be transformed using Box-Cox or similar.

# Let's check to see if any of our numeric features are highly correlated with one another
import seaborn as sns
plt.gcf().set_size_inches(10,8)
sns.clustermap(df[numeric_features].corr())

<seaborn.matrix.ClusterGrid at 0x15448283b5e0>

<Figure size 720x576 with 0 Axes>

../_images/c43efda79b538ad47208c372d2b3ff8bf7c0d979cf16964b7516862fe7c7e6e8.png

Overall, our features are relatively uncorrelated. Some features are highly correlated. For instance:

TotalBsmtSF, 1stFlrSF
GarageCars, GarageArea
BsmtFnSF1, BsmtFullBath

We could consider removing some of these near duplicate features.

First pass#

It’s usually a good idea to first perform a quick and dirty analysis ignoring the subtleties that we discussed above. We can then try to incorporate these ideas to improve our results.

# define input/target
inputs = features.variable.iloc[1:].tolist()
X, y = df[inputs], df['SalePrice']
X.shape, y.shape

((1460, 76), (1460,))

# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
X_train.shape, X_test.shape

((1168, 76), (292, 76))

Preprocessing numeric and categorical features#

We will use scikit-learn’s pipeline feature to streamline the model fitting and evaluation process.

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer, ColumnTransformer

# preprocessing pipeline for numeric quantities
preproc_numeric = Pipeline([
    ('fill_na', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])

# preprocessing pipeline for categorical quantities
preproc_categorical = Pipeline([
    ('fill_na', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse_output=False, drop='first', min_frequency=5, handle_unknown='infrequent_if_exist'))
])

# full preprocessing pipeline
preproc = ColumnTransformer(
    transformers=[
    ('numeric', preproc_numeric, numeric_features),
    ('categorical', preproc_categorical, categorical_features)
], verbose_feature_names_out=False)

preproc.fit(X_train)

ColumnTransformer(transformers=[('numeric',
                                 Pipeline(steps=[('fill_na',
                                                  SimpleImputer(strategy='median')),
                                                 ('scale', StandardScaler())]),
                                 ['LotFrontage', 'LotArea', 'OverallQual',
                                  'OverallCond', 'YearBuilt', 'YearRemodAdd',
                                  'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
                                  'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
                                  '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
                                  'BsmtFullBath', 'BsmtHal...
                                  'LotShape', 'LandContour', 'Utilities',
                                  'LotConfig', 'LandSlope', 'Neighborhood',
                                  'Condition1', 'Condition2', 'BldgType',
                                  'HouseStyle', 'RoofStyle', 'RoofMatl',
                                  'Exterior1st', 'Exterior2nd', 'MasVnrType',
                                  'ExterQual', 'ExterCond', 'Foundation',
                                  'BsmtQual', 'BsmtCond', 'BsmtExposure',
                                  'BsmtFinType1', 'BsmtFinType2', 'Heating',
                                  'HeatingQC', 'CentralAir', 'Electrical', ...])],
                  verbose_feature_names_out=False)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

ColumnTransformer

ColumnTransformer(transformers=[('numeric',
                                 Pipeline(steps=[('fill_na',
                                                  SimpleImputer(strategy='median')),
                                                 ('scale', StandardScaler())]),
                                 ['LotFrontage', 'LotArea', 'OverallQual',
                                  'OverallCond', 'YearBuilt', 'YearRemodAdd',
                                  'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
                                  'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
                                  '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
                                  'BsmtFullBath', 'BsmtHal...
                                  'LotShape', 'LandContour', 'Utilities',
                                  'LotConfig', 'LandSlope', 'Neighborhood',
                                  'Condition1', 'Condition2', 'BldgType',
                                  'HouseStyle', 'RoofStyle', 'RoofMatl',
                                  'Exterior1st', 'Exterior2nd', 'MasVnrType',
                                  'ExterQual', 'ExterCond', 'Foundation',
                                  'BsmtQual', 'BsmtCond', 'BsmtExposure',
                                  'BsmtFinType1', 'BsmtFinType2', 'Heating',
                                  'HeatingQC', 'CentralAir', 'Electrical', ...])],
                  verbose_feature_names_out=False)

numeric

['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']

SimpleImputer

SimpleImputer(strategy='median')

StandardScaler

StandardScaler()

categorical

['MSSubClass', 'MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'Fence', 'SaleType', 'SaleCondition']

SimpleImputer

SimpleImputer(strategy='most_frequent')

OneHotEncoder

OneHotEncoder(drop='first', handle_unknown='infrequent_if_exist',
              min_frequency=5, sparse_output=False)

preproc.transform(X_test)

	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	BsmtFinSF2	BsmtUnfSF	...	Fence_MnWw	SaleType_ConLD	SaleType_New	SaleType_WD	SaleType_infrequent_sklearn	SaleCondition_Alloca	SaleCondition_Family	SaleCondition_Normal	SaleCondition_Partial	SaleCondition_infrequent_sklearn
892	-0.012468	-0.211594	-0.088934	2.165000	-0.259789	0.873470	-0.597889	0.472844	-0.285504	-0.391317	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
1105	1.234520	0.145643	1.374088	-0.524174	0.751222	0.487465	1.498567	1.276986	-0.285504	-0.312872	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
413	-0.635963	-0.160826	-0.820445	0.372217	-1.433867	-1.683818	-0.597889	-0.971996	-0.285504	0.980347	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
522	-0.903175	-0.529035	-0.088934	1.268609	-0.781602	-1.683818	-0.597889	-0.102477	-0.285504	0.077111	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
1036	0.833703	0.205338	2.105599	-0.524174	1.175195	1.114724	-0.192497	1.255193	-0.285504	0.061422	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
479	-0.903175	-0.443026	-1.551955	1.268609	-1.107734	0.728718	1.921333	-0.605882	-0.285504	0.377443	...	0.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0	0.0
1361	2.392439	0.508459	0.642577	-0.524174	1.109968	0.969972	-0.505228	1.804363	-0.285504	-0.705096	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
802	-0.324216	-0.231585	0.642577	-0.524174	1.109968	0.969972	-0.597889	0.440155	-0.285504	-1.099561	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
651	-0.457822	-0.149296	-1.551955	-0.524174	-1.009895	-1.683818	-0.597889	-0.971996	-0.285504	0.413303	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
722	-0.012468	-0.238931	-1.551955	1.268609	-0.031496	-0.718804	-0.597889	-0.555760	-0.285504	0.229518	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0

292 rows × 226 columns

Model fitting#

# create the model pipeline
from sklearn.linear_model import Ridge

pipe = Pipeline([
    ('preproc', preproc), 
    ('estimator', Ridge(alpha=1))
])

# fit
pipe.fit(X_train, y_train)

Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('fill_na',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scale',
                                                                   StandardScaler())]),
                                                  ['LotFrontage', 'LotArea',
                                                   'OverallQual', 'OverallCond',
                                                   'YearBuilt', 'YearRemodAdd',
                                                   'MasVnrArea', 'BsmtFinSF1',
                                                   'BsmtFinSF2', 'BsmtUnfSF',
                                                   'TotalBsmtSF', '1stFlrSF',
                                                   '2ndFlrSF', 'LowQualFinSF',
                                                   'GrLivAr...
                                                   'Neighborhood', 'Condition1',
                                                   'Condition2', 'BldgType',
                                                   'HouseStyle', 'RoofStyle',
                                                   'RoofMatl', 'Exterior1st',
                                                   'Exterior2nd', 'MasVnrType',
                                                   'ExterQual', 'ExterCond',
                                                   'Foundation', 'BsmtQual',
                                                   'BsmtCond', 'BsmtExposure',
                                                   'BsmtFinType1',
                                                   'BsmtFinType2', 'Heating',
                                                   'HeatingQC', 'CentralAir',
                                                   'Electrical', ...])],
                                   verbose_feature_names_out=False)),
                ('estimator', Ridge(alpha=1))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Pipeline

Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('fill_na',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scale',
                                                                   StandardScaler())]),
                                                  ['LotFrontage', 'LotArea',
                                                   'OverallQual', 'OverallCond',
                                                   'YearBuilt', 'YearRemodAdd',
                                                   'MasVnrArea', 'BsmtFinSF1',
                                                   'BsmtFinSF2', 'BsmtUnfSF',
                                                   'TotalBsmtSF', '1stFlrSF',
                                                   '2ndFlrSF', 'LowQualFinSF',
                                                   'GrLivAr...
                                                   'Neighborhood', 'Condition1',
                                                   'Condition2', 'BldgType',
                                                   'HouseStyle', 'RoofStyle',
                                                   'RoofMatl', 'Exterior1st',
                                                   'Exterior2nd', 'MasVnrType',
                                                   'ExterQual', 'ExterCond',
                                                   'Foundation', 'BsmtQual',
                                                   'BsmtCond', 'BsmtExposure',
                                                   'BsmtFinType1',
                                                   'BsmtFinType2', 'Heating',
                                                   'HeatingQC', 'CentralAir',
                                                   'Electrical', ...])],
                                   verbose_feature_names_out=False)),
                ('estimator', Ridge(alpha=1))])

preproc: ColumnTransformer

ColumnTransformer(transformers=[('numeric',
                                 Pipeline(steps=[('fill_na',
                                                  SimpleImputer(strategy='median')),
                                                 ('scale', StandardScaler())]),
                                 ['LotFrontage', 'LotArea', 'OverallQual',
                                  'OverallCond', 'YearBuilt', 'YearRemodAdd',
                                  'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
                                  'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
                                  '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
                                  'BsmtFullBath', 'BsmtHal...
                                  'LotShape', 'LandContour', 'Utilities',
                                  'LotConfig', 'LandSlope', 'Neighborhood',
                                  'Condition1', 'Condition2', 'BldgType',
                                  'HouseStyle', 'RoofStyle', 'RoofMatl',
                                  'Exterior1st', 'Exterior2nd', 'MasVnrType',
                                  'ExterQual', 'ExterCond', 'Foundation',
                                  'BsmtQual', 'BsmtCond', 'BsmtExposure',
                                  'BsmtFinType1', 'BsmtFinType2', 'Heating',
                                  'HeatingQC', 'CentralAir', 'Electrical', ...])],
                  verbose_feature_names_out=False)

numeric

['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']

SimpleImputer

SimpleImputer(strategy='median')

StandardScaler

StandardScaler()

categorical

['MSSubClass', 'MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'Fence', 'SaleType', 'SaleCondition']

SimpleImputer

SimpleImputer(strategy='most_frequent')

OneHotEncoder

OneHotEncoder(drop='first', handle_unknown='infrequent_if_exist',
              min_frequency=5, sparse_output=False)

Ridge

Ridge(alpha=1)

Model evaluation#

# evaluate
# For random forest regressor, score returns the R^2 value
print("Train R^2:", pipe.score(X_train, y_train))
print("Test R^2:", pipe.score(X_test, y_test))

Train R^2: 0.9028881797596011
Test R^2: 0.8714648323729995

# MAE is a more intuitive metric for price estimation
from sklearn.metrics import mean_absolute_error
y_train_pred_1stpass = pipe.predict(X_train)
y_test_pred_1stpass = pipe.predict(X_test)
print("Train MAE:", mean_absolute_error(y_train, y_train_pred_1stpass))
print("Test MAE:", mean_absolute_error(y_test, y_test_pred_1stpass))

Train MAE: 15288.339516024189
Test MAE: 20544.401602914677

# more robust classification with cross validation
from sklearn.model_selection import KFold, cross_validate
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_validate(pipe, X, y, n_jobs=4, return_train_score=True, scoring='neg_mean_absolute_error')
cv_scores

{'fit_time': array([0.16545057, 0.14652419, 0.1494379 , 0.14467359, 0.12987018]),
 'score_time': array([0.03685045, 0.03611279, 0.03650355, 0.03635812, 0.03367758]),
 'test_score': array([-18480.22849655, -20702.55584883, -20390.70685371, -17442.23199009,
        -19762.00113844]),
 'train_score': array([-15920.87447649, -14995.07168281, -15245.2950543 , -15810.8980323 ,
        -13704.36598872])}

import scipy
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, (m-h, m+h)

mean_confidence_interval(-cv_scores['test_score'])

(19355.544865524786, (17657.804810833954, 21053.284920215618))

Assignment#

If you didn’t already, apply the methodology described above to your own dataset. Explore additional ways to improve the model. For instance, we saw that GradientBoostingRegressor had an edge on Ridge regression without any hyperparameter tuning. Explore the hyperparameters for GradientBoostingRegressor on the sklearn documentation page. Define a hyperparameter search using GridSearchCV and use our evaluation methodology to compare the two methods.

Model Pipelines in Sklearn

Contents

Model Pipelines in Sklearn#

The Data#

Exploratory Analysis#

First pass#

Preprocessing numeric and categorical features#

Model fitting#

Model evaluation#

Optimizing the pipeline#

Tuning estimator parameters#

Searching over multiple parameters simultaneously#

Searching over different transformers#

Searching different estimators#

Dimension reduction#

Final Evaluation#

Assignment#