Data partition: training and testing#

In machine learning, it is standard practice to split the data into a training set and a testing set; sometimes a separate validation set is also recommended. Here are some of the functions scikit-learn provides for creating these partitions:

  • train_test_split: creates a single random train/test partition

  • KFold: splits the data into k folds

  • StratifiedKFold: splits the data into k folds while preserving the proportions of a grouping factor in each fold

  • RepeatedKFold: repeats k-fold splitting several times with different randomization

  • LeaveOneOut: holds out a single sample per split

  • LeavePOut: holds out every possible combination of p samples

We focus on train_test_split, KFold and StratifiedKFold; the remaining splitters are sketched below for reference.
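All of these splitters expose the same split interface. A minimal sketch of how the remaining ones are constructed (the parameter values here are illustrative, not recommendations):

from sklearn.model_selection import RepeatedKFold, LeaveOneOut, LeavePOut

# repeat 5-fold CV 3 times, reshuffling between repeats
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

# hold out one sample per split: as many splits as samples
loo = LeaveOneOut()

# hold out every combination of p samples (combinatorial, so keep p small)
lpo = LeavePOut(p=2)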

We will use the airquality dataset to demonstrate these splitters.

Dataset#

import pandas as pd
df = pd.read_csv('/zfs/citi/workshop_data/python_ml/r_airquality.csv')
df
Ozone Solar.R Wind Temp Month Day
0 41.0 190.0 7.4 67 5 1
1 36.0 118.0 8.0 72 5 2
2 12.0 149.0 12.6 74 5 3
3 18.0 313.0 11.5 62 5 4
4 NaN NaN 14.3 56 5 5
... ... ... ... ... ... ...
148 30.0 193.0 6.9 70 9 26
149 NaN 145.0 13.2 77 9 27
150 14.0 191.0 14.3 75 9 28
151 18.0 131.0 8.0 76 9 29
152 20.0 223.0 11.5 68 9 30

153 rows × 6 columns

# the dataset contains missing values (NaN); impute them with the column mean
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df = pd.DataFrame(imputer.fit_transform(df), columns = df.columns)
df
Ozone Solar.R Wind Temp Month Day
0 41.00000 190.000000 7.4 67.0 5.0 1.0
1 36.00000 118.000000 8.0 72.0 5.0 2.0
2 12.00000 149.000000 12.6 74.0 5.0 3.0
3 18.00000 313.000000 11.5 62.0 5.0 4.0
4 42.12931 185.931507 14.3 56.0 5.0 5.0
... ... ... ... ... ... ...
148 30.00000 193.000000 6.9 70.0 9.0 26.0
149 42.12931 145.000000 13.2 77.0 9.0 27.0
150 14.00000 191.000000 14.3 75.0 9.0 28.0
151 18.00000 131.000000 8.0 76.0 9.0 29.0
152 20.00000 223.000000 11.5 68.0 9.0 30.0

153 rows × 6 columns

# X: predictors (Solar.R, Wind, Temp, Month, Day); y: target (Ozone)
X, y = df.iloc[:, 1:], df.iloc[:, 0]
X.shape, y.shape
((153, 5), (153,))

Train-test split#

Here we use train_test_split to randomly assign 60% of the data to the training set and the rest to the testing set:


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=123)
X_train.shape, y_train.shape
((91, 5), (91,))
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

model = LinearRegression()

model.fit(X_train, y_train)  # fit the model on the training set
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"r^2 on the test set: {r2: 0.2f}")
r^2 on the test set:  0.31

Question: What are some of the limitations with this approach?
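One such limitation: the score depends on which rows happen to land in the test set. A quick sketch (reusing the X, y, and imports above; the seeds are arbitrary) makes this visible by repeating the split with different random states:

# the r^2 from a single split varies with the seed,
# i.e. with which rows end up in the test set
for seed in (0, 1, 2, 3, 4):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=seed)
    m = LinearRegression().fit(X_tr, y_tr)
    print(f"seed={seed}: r^2 = {r2_score(y_te, m.predict(X_te)):0.2f}")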

Cross validation#

  • Cross validation (CV) is a resampling procedure used to evaluate an ML model on a limited data sample.

  • The general procedure:

    • Shuffle the data randomly

    • Split the data into k groups; then, for each group:

      • Take the group as the hold-out testing set and the remaining groups as the training set

      • Fit a model on the training set and evaluate it on the testing set

      • Retain the evaluation score

    • Summarize the skill of the model using the retained scores


See the scikit-learn documentation on the split method.

Question: How does this procedure address some of the limitations with a simple train/test split?

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

cv = KFold(n_splits=5, shuffle=True, random_state=20)

# initialize the model
model = LinearRegression()
r2s = []
for ix, (train_index, test_index) in enumerate(cv.split(X.values)):
    X_train = X.values[train_index]
    y_train = y.values[train_index]
    X_test = X.values[test_index]
    y_test = y.values[test_index]
    model.fit(X_train, y_train)  # fit on this fold's training set
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    r2s.append(r2)
    print(f"r^2 for the fold no. {ix+1} on the test set: {r2:0.2f}")
    
print(f"Mean r^2: {np.mean(r2s):0.2f}")
print(f"Std. dev. r^2: {np.std(r2s):0.2f}")
r^2 for the fold no. 1 on the test set: 0.62
r^2 for the fold no. 2 on the test set: 0.42
r^2 for the fold no. 3 on the test set: 0.49
r^2 for the fold no. 4 on the test set: 0.29
r^2 for the fold no. 5 on the test set: 0.38
Mean r^2: 0.44
Std. dev. r^2: 0.11
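The loop above spells out the mechanics; scikit-learn can also run the whole procedure in one call. A minimal sketch using cross_val_score (reusing the cv object defined above):

from sklearn.model_selection import cross_val_score

# fits and scores a fresh clone of the model on each of the 5 folds;
# scoring='r2' matches the metric computed in the manual loop
scores = cross_val_score(LinearRegression(), X.values, y.values, cv=cv, scoring='r2')
print(f"Mean r^2: {scores.mean():0.2f}, std. dev.: {scores.std():0.2f}")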

Stratified k-fold CV#

“Stratifying” means sampling within each group (stratum) so that the \(k\) folds preserve the group proportions of the full dataset. For example, we may wish that the distribution of Ozone levels in the CV partitions looks similar to the distribution in the full dataset. This is especially important when classes are imbalanced.

df.Ozone.hist()
<AxesSubplot:>
[histogram of Ozone]

It looks like we have a group of low-Ozone (<50) samples and a group of high-Ozone (>=50) samples. Let’s define a new variable indicating whether each sample is low- or high-ozone using this threshold:

thresh = 50
df['high_ozone'] = df['Ozone']>=thresh
df.high_ozone.value_counts() / len(df)
False    0.771242
True     0.228758
Name: high_ozone, dtype: float64

Stratified k-fold will attempt to preserve the share of high/low ozone samples in each split.

# let's first look at the distribution in each split w/out stratification
for ix, (train_index, test_index) in enumerate(cv.split(X.values)):
    y_train = y.values[train_index]
    print(pd.Series(y_train>thresh).value_counts()/len(y_train))
False    0.803279
True     0.196721
dtype: float64
False    0.778689
True     0.221311
dtype: float64
False    0.762295
True     0.237705
dtype: float64
False    0.788618
True     0.211382
dtype: float64
False    0.756098
True     0.243902
dtype: float64
from sklearn.model_selection import StratifiedKFold
cv_strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)

# now let's look at the distribution in each split with stratification
for ix, (train_index, test_index) in enumerate(cv_strat.split(X.values, y.values>=thresh)):
    y_train = y.values[train_index]
    print(pd.Series(y_train>thresh).value_counts()/len(y_train))
False    0.778689
True     0.221311
dtype: float64
False    0.778689
True     0.221311
dtype: float64
False    0.770492
True     0.229508
dtype: float64
False    0.780488
True     0.219512
dtype: float64
False    0.780488
True     0.219512
dtype: float64
# initialize the model
model = LinearRegression()
r2s = []
for ix, (train_index, test_index) in enumerate(cv_strat.split(X.values, y.values >= thresh)):
    X_train = X.values[train_index]
    y_train = y.values[train_index]
    X_test = X.values[test_index]
    y_test = y.values[test_index]
    model.fit(X_train, y_train)  # fit on this fold's training set
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    r2s.append(r2)
    print(f"r^2 for the fold no. {ix+1} on the test set: {r2: 0.2f}")
    
print(f"Mean r^2: {np.mean(r2s):0.2f}")
print(f"Std. dev. r^2: {np.std(r2s):0.2f}")
r^2 for the fold no. 1 on the test set:  0.45
r^2 for the fold no. 2 on the test set:  0.43
r^2 for the fold no. 3 on the test set:  0.31
r^2 for the fold no. 4 on the test set:  0.56
r^2 for the fold no. 5 on the test set:  0.64
Mean r^2: 0.48
Std. dev. r^2: 0.11

Notice the mean performance is somewhat better when stratifying (0.48 vs. 0.44), though given the fold-to-fold standard deviation of 0.11 the difference is modest; the more dependable benefit of stratification is that every fold preserves the high/low ozone proportions of the full dataset.
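The same idea extends beyond a binary indicator: for a continuous target, one common trick (not covered in this workshop, so treat this as a hypothetical sketch) is to stratify on quantile bins of the target:

# hypothetical: bin Ozone into quartiles and stratify the folds on the bins,
# so each fold sees a similar spread of low-to-high Ozone values
bins = pd.qcut(y, q=4, labels=False)
for train_index, test_index in cv_strat.split(X.values, bins):
    pass  # fit and evaluate as in the loops above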