Data Preparation#
So far we’ve assumed that we already have access to a dataset that is ready to be used for training a machine learning model. However, in practice, this is rarely the case. Most of the time, the dataset will need to be cleaned, transformed, and prepared before it can be used for training.
Stages of data preparation:
Data collection
Data loading
Data exploration
Data cleaning
Feature selection and engineering
Encoding categorical variables
Feature scaling
Data splitting
1. Data Collection#
Data collection is the process of gathering data from various sources: databases, files, APIs, web scraping, and so on. It is outside the scope of this notebook, but it is an important step in the data preparation process.
2. Data Loading#
In this step, we load the data into the working environment. Your data might be in a CSV file, a JSON file, a SQL database, or any other format. We will use the pandas library to load the data into a DataFrame. Let’s use the “Heart Failure Clinical Records” dataset from the UCI Machine Learning Repository. This dataset contains the medical records of patients who had heart failure, collected during their follow-up period.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv"
df = pd.read_csv(url)
Note that we’re lucky in this dataset, in that it is small enough to fit into memory. For larger datasets, we would need to use more advanced techniques to load the data in chunks. For example, we might use Dask rather than pandas, which would allow us to work with larger-than-memory datasets. (Dask is designed to be a drop-in replacement for pandas for larger-than-memory datasets, so the code would look very similar.)
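To give a flavor of what that looks like (a minimal sketch, assuming the dask package is installed; we do not actually use Dask in this notebook), the pandas call above translates roughly to:
# Sketch: loading the same CSV lazily with Dask (assumes the `dask` package is installed)
# import dask.dataframe as dd
# ddf = dd.read_csv(url)                 # lazy: builds a task graph, reads nothing yet
# older_patients = ddf[ddf['age'] > 60]  # operations stay lazy
# result = older_patients.compute()      # .compute() materializes a pandas DataFrame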
For very large datasets, when performing data exploration and when producing the code that will be used to prepare your data, you might want to work with a sample of the data rather than the full dataset. This will allow you to iterate more quickly and avoid long wait times. Once you have the code working with a sample, you can then run it on the full dataset.
# Load only a subset of the dataset
small_df = pd.read_csv(url, nrows=20)
# Display the shape of the datasets
print(df.shape)
print(small_df.shape)
# We'll work with the full dataset, so let's delete the small_df
del small_df
3. Data Exploration#
Data exploration is the process of getting to know the data. We look at the structure of the data, the summary statistics, and the distribution of the data. We also look for missing values, outliers, and anomalies in the data. This step is crucial for understanding the data and making decisions about how to clean and transform it.
It cannot be overemphasized that there is no one-size-fits-all approach to data exploration. The process will depend on the dataset, the problem you are trying to solve, and the questions you are trying to answer. A thorough understanding of the data – its sources, its structure, its quality – is essential for building a successful machine learning model.
# Get some basic information about the dataset
# Display the column names
print(df.columns)
If you were really working with this dataset for research purposes, you should know what each of these columns represents, as well as the units in which they are measured. That knowledge is crucial both for knowing how best to make use of the data, as well as for detecting problems in the data.
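One lightweight habit is to keep a small data dictionary next to the loading code. The notes below are illustrative, based on the dataset description on the UCI repository page; verify them against the original source.
# Illustrative data dictionary for a few columns (verify against the UCI documentation)
column_notes = {
    'age': 'patient age in years',
    'ejection_fraction': 'percentage of blood leaving the heart at each contraction (%)',
    'serum_creatinine': 'level of serum creatinine in the blood (mg/dL)',
    'time': 'follow-up period (days)',
    'DEATH_EVENT': '1 if the patient died during the follow-up period, 0 otherwise',
}
for col, note in column_notes.items():
    print(f"{col}: {note}")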
# Display basic information (df.info() prints its summary directly)
df.info()
print("\nSample data:")
df.head()
# Summary statistics
print("\nSummary statistics:")
print(df.describe())
# The real data is very clean already. Let's introduce some missing values, and also some outliers.
# We'll do this by randomly setting some values to NaN, and by multiplying some values by 10 to turn them into outliers.
# Fix the random seed so the corruption is reproducible
np.random.seed(42)
# Randomly select 10% of the rows and set a few columns to NaN
df_nan = df.copy()
nan_indices = np.random.choice(df.index, size=int(len(df)*0.1), replace=False)
df_nan.loc[nan_indices, 'age'] = np.nan
df_nan.loc[nan_indices, 'serum_creatinine'] = np.nan
df_nan.loc[nan_indices, 'ejection_fraction'] = np.nan
# Randomly select 2% of the data and make them outliers
df_noisy = df_nan.copy()
noisy_indices = np.random.choice(df.index, size=int(len(df)*0.02), replace=False)
df_noisy.loc[noisy_indices, 'serum_creatinine'] = df_noisy.loc[noisy_indices, 'serum_creatinine'] * 10
df_noisy.loc[noisy_indices, 'ejection_fraction'] = df_noisy.loc[noisy_indices, 'ejection_fraction'] * 10
# Display the first few rows of the noisy dataset
print("\nNoisy dataset:")
df_noisy.head()
# Check for missing values
print("\nMissing values:")
print(df_noisy.isnull().sum())
Suppose that our target variable is the “DEATH_EVENT” column, which indicates whether the patient died during the follow-up period. We will explore the data to understand the relationships between the features and the target variable.
plt.figure(figsize=(4, 2))
df_noisy['DEATH_EVENT'].value_counts().plot(kind='bar')
plt.title('Distribution of Death Events')
plt.xlabel('Death Event')
plt.ylabel('Count')
plt.xticks([0, 1], ['Survived', 'Died'])
plt.show()
print("Percentage of deaths:", (df['DEATH_EVENT'].sum() / len(df)) * 100, "%")
Let’s look at the age distribution across the dataset. While we’re at it, let’s break it down by our target variable, DEATH_EVENT.
plt.figure(figsize=(6, 3))
sns.histplot(data=df_noisy, x='age', hue='DEATH_EVENT', kde=True, multiple="stack")
plt.title('Age Distribution by Outcome')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
print("Average age of survivors:", df_noisy[df_noisy['DEATH_EVENT'] == 0]['age'].mean())
print("Average age of non-survivors:", df_noisy[df_noisy['DEATH_EVENT'] == 1]['age'].mean())
Very frequently, a correlation heatmap is a good way to get a quick overview of the relationships between the features in the dataset. Note that this assumes that the features are all numeric. If you have categorical features, you will need to encode them as numbers before you can use a correlation heatmap.
plt.figure(figsize=(12, 9))
sns.heatmap(df_noisy.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
You might want to look at scatterplots to see if you pick up on any patterns in the data (or verify that patterns you expect to be there are really there).
plt.figure(figsize=(5, 3))
sns.scatterplot(data=df_noisy, x='ejection_fraction', y='serum_creatinine', hue='DEATH_EVENT')
plt.title('Ejection Fraction vs. Serum Creatinine')
plt.xlabel('Ejection Fraction (%)')
plt.ylabel('Serum Creatinine (mg/dL)')
plt.show()
What! There are some data points that look like bad outliers, even before we check what they represent. In this case, ejection fraction is a percentage, and some values are well above 100%. Something is definitely wrong with those data points.
4. Data Cleaning#
First we’ll check for and handle outliers. We need to be thoughtful about this step! Every choice we make says something about how we expect the future data to look, and what we think is the reason why we have outliers in our data.
# Check for outliers using the IQR method
def detect_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers
# Example: Check outliers in 'serum_creatinine'
creatinine_outliers = detect_outliers(df_noisy, 'serum_creatinine')
print("Outliers in serum_creatinine:")
print(creatinine_outliers[['age', 'sex', 'serum_creatinine', 'DEATH_EVENT']])
# Remove outliers (be cautious with this step in real-world scenarios!)
df_cleaned = df_noisy[~df_noisy.index.isin(creatinine_outliers.index)]
print("\nDataset shape after cleaning:", df_cleaned.shape)
Similarly, we must handle missing values in a way that is thoughtful about why the values might be missing. Are they missing at random? Are they missing because they are not applicable? Are they missing because they were never recorded? The answers to these questions will affect how we handle the missing values.
# Handle missing values
# Fill missing values with the mean
df_cleaned = df_cleaned.fillna(df_cleaned.mean())
# Alternatively, fill missing values with the median
# df_cleaned = df_cleaned.fillna(df_cleaned.median())
# Or, for categorical data, fill missing values with the mode
# df_cleaned = df_cleaned.fillna(df_cleaned.mode().iloc[0])
# We could also fit a machine learning model to predict missing values, but this is more complex and not always necessary.
# Or simply drop the rows with missing values. Which strategy makes the most sense depends crucially on your particular problem; there is no one-size-fits-all solution.
# Check if there are any missing values left
print("\nMissing values after imputation:")
print(df_cleaned.isnull().sum())
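For the model-based option mentioned in the comments above, scikit-learn provides imputers that go beyond a single summary statistic. A minimal sketch using KNN imputation on the noisy data (shown for illustration only; we continue with the mean-imputed df_cleaned below):
# Illustration: impute each missing value from the 5 nearest rows (not used below)
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df_noisy),
                      columns=df_noisy.columns, index=df_noisy.index)
print(df_knn.isnull().sum().sum(), "missing values remain after KNN imputation")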
5. Feature Selection and Engineering#
Feature selection is the process of selecting a subset of relevant features for use in model training. Feature engineering is the process of creating new features from the existing features in the dataset. Both of these processes are crucial for building a successful machine learning model.
# Create age groups
df_cleaned['age_group'] = pd.cut(df_cleaned['age'], bins=[30, 50, 70, 100], labels=['Middle-aged', 'Senior', 'Elderly'])
# Create a feature flagging patients with more than one condition/risk factor
df_cleaned['multiple_conditions'] = ((df_cleaned['diabetes'] + df_cleaned['high_blood_pressure'] + df_cleaned['anaemia'] + df_cleaned['smoking']) > 1).astype(int)
# Log transform skewed features
df_cleaned['log_creatinine'] = np.log1p(df_cleaned['creatinine_phosphokinase'])
# Interaction terms
df_cleaned['ef_creatinine_interaction'] = df_cleaned['ejection_fraction'] * df_cleaned['serum_creatinine']
print("New features added:")
print(df_cleaned[['age_group', 'multiple_conditions', 'log_creatinine', 'ef_creatinine_interaction']].head())
# Visualize the effect of a new feature
plt.figure(figsize=(6,4))
sns.boxplot(x='age_group', y='serum_creatinine', hue='DEATH_EVENT', data=df_cleaned)
plt.title('Serum Creatinine by Age Group and Outcome')
plt.show()
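The code above was all feature engineering. For the selection side, a quick first pass is a univariate score of each numeric feature against the target. A minimal sketch using scikit-learn’s SelectKBest (keeping the top 5 is an arbitrary choice here):
# Score numeric features against the target and keep the top 5 (illustration only)
from sklearn.feature_selection import SelectKBest, f_classif

feature_cols = df_cleaned.drop(columns=['DEATH_EVENT', 'age_group']).columns
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(df_cleaned[feature_cols], df_cleaned['DEATH_EVENT'])
print("Top features by ANOVA F-score:", list(feature_cols[selector.get_support()]))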
6. Encoding Categorical Variables#
Most machine learning algorithms require the input data to be in numerical format. If the dataset contains categorical variables, we need to encode them as numbers. One issue to be particularly sensitive to is whether a categorical variable should be treated as ordinal or nominal. If it is ordinal, we should encode it in a way that preserves the order. If it is nominal, we should use “one-hot encoding”.
For example, suppose that our “age group” column were the only information about age that we have. If we treat it as ordinal, then we are saying that the different age groups are ordered in some way. If we treat it as nominal, then we are saying that the different age groups are not ordered in any way. Which seems appropriate here?
# Ordinal-encode the `age_group` feature (the category codes preserve the bin order)
df_encoded = df_cleaned.copy()
df_encoded['age_group_ordinal'] = df_cleaned['age_group'].cat.codes
# One-hot encode the `age_group` feature
df_encoded = pd.get_dummies(df_encoded, columns=['age_group'], drop_first=False)
df_encoded[[col for col in df_encoded.columns if 'age_group' in col]]
7. Feature Scaling#
Feature scaling is the process of standardizing the range of the features in the data; it is also known as data normalization and is generally performed during preprocessing. It is helpful for algorithms that rely on the magnitude of values, such as distance-based algorithms. Even when it doesn’t help the algorithm, it rarely hurts.
# Scale the numerical features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_cols = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets',
                'serum_creatinine', 'serum_sodium', 'time', 'log_creatinine',
                'ef_creatinine_interaction']
# Start from the encoded DataFrame so we keep the encoded age-group columns
df_scaled = df_encoded.copy()
# Note: in a real workflow, fit the scaler on the training split only to avoid data leakage
df_scaled[numeric_cols] = scaler.fit_transform(df_scaled[numeric_cols])
df_scaled.head()
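StandardScaler is not the only choice. If a bounded range is preferable, min-max scaling to [0, 1] is a common alternative; here is a brief sketch using the same numeric_cols list as above (for illustration only, not used below):
# Alternative: scale the same columns to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()
df_minmax = df_encoded.copy()
df_minmax[numeric_cols] = minmax_scaler.fit_transform(df_minmax[numeric_cols])
df_minmax[numeric_cols].describe().loc[['min', 'max']]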
8. Data Splitting#
On day 1 we discussed train/test splits and cross-validation. This is the final step of data preparation, and is often integrated into our model training/tuning process, especially when we are using cross-validation.
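As a reminder of what that looks like in code, here is a minimal sketch using scikit-learn. The 80/20 split and the random seed are arbitrary choices; stratifying on the target keeps the class balance similar in both splits.
# Split the prepared data into training and test sets
from sklearn.model_selection import train_test_split

X = df_scaled.drop(columns=['DEATH_EVENT'])
y = df_scaled['DEATH_EVENT']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Training set:", X_train.shape, "Test set:", X_test.shape)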