Machine Learning in Python using Clemson High Performance Computing#
Instructor: Carl Ehrett
Email: cehrett@clemson.edu
1. Welcome and Overview#
In this workshop, we will introduce the basics of machine learning using Python. We will focus on machine learning with “tabular data” (i.e. spreadsheet-style data), as opposed to images or unstructured text, though most of what we discuss also applies to those domains. Image-related ML and generative AI both typically use deep learning neural networks, which are not the focus of this workshop but will be the focus of other upcoming CCIT workshops. In this workshop, we will emphasize the use of Clemson’s Palmetto Cluster for running machine learning algorithms on large datasets. We will cover the following topics:
What is machine learning?
What are some of the Python tools that facilitate machine learning?
What are the different types of machine learning?
What are some of the common machine learning algorithms?
How do we evaluate the performance of machine learning algorithms?
How do we explore and clean data?
How do we prepare data for machine learning?
How do we make use of Clemson’s Palmetto Cluster to efficiently run our machine learning code?
How can we run code that is too complex, or use data that is too large, for a Jupyter notebook?
What sorts of Palmetto resources should we request to allocate for our machine learning jobs?
1.1 Getting started#
You can download this notebook and its contents as follows.
In the terminal, run the following command: wget https://raw.githubusercontent.com/clemsonciti/rcde_workshops/master/python_sklearn/download.sh
This downloads a script, download.sh, which, when run, copies the full workshop files to your drive space. Now that you have the script, run the command: bash download.sh
You should now have a folder, python_sklearn, which contains this notebook and the rest of the workshop files.
You can run most of this notebook using the default kernel, though some of the code cells will only run if you have created an environment with specialized libraries installed.
1.2 What is machine learning?#
People use the term “machine learning” in a variety of ways. Some people use it more or less synonymously with “artificial intelligence.” And these days, AI does indeed usually work under the paradigm of machine learning. But “machine learning” refers to the use of algorithms to learn from data. The contrast here is with traditional programming, where a programmer writes code that tells the computer exactly what to do. In machine learning, the programmer writes code that tells the computer how to learn from data to make decisions.
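To make the contrast concrete, here is a minimal sketch (using made-up toy data and a hypothetical spam-filtering task) of a hand-coded rule versus a rule learned from examples:
# Traditional programming: the programmer hard-codes the decision rule.
def is_spam_rule(num_exclamation_marks):
    return num_exclamation_marks > 3
# Machine learning: the algorithm learns a rule from labeled examples.
from sklearn.linear_model import LogisticRegression
X = [[0], [1], [2], [5], [7], [10]]  # feature: number of '!' in a message
y = [0, 0, 0, 1, 1, 1]               # label: 1 = spam, 0 = not spam
model = LogisticRegression().fit(X, y)
print(model.predict([[4]]))          # the learned rule decides for new data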
2. Setting up the Environment#
2.1 Creating a Conda Environment#
Why we use conda for ML environments:
Simplified package management and dependency resolution
Easy creation and management of isolated environments
Cross-platform compatibility (Windows, macOS, Linux)
Support for multiple programming languages (not just Python)
Ability to specify and replicate exact environment configurations
Large repository of pre-built packages optimized for different systems
# Commands for creating and activating a conda environment
conda create -n hpc_ml -c rapidsai -c conda-forge -c nvidia cudf cuml numpy pandas scikit-learn matplotlib seaborn rapids jupyterlab python=3.11 'cuda-version>=12.0,<=12.5'
# Warning: the above command may take a while (~20 minutes) to run! Next:
source activate hpc_ml
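You can verify that the environment was created successfully with:
# List all conda environments; hpc_ml should now appear
conda env list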
2.2 Registering as a jupyter kernel#
In addition to installing JupyterLab, we need to register our environment as a Jupyter kernel in order for it to show up as an option for us when running a notebook.
# Register the env as a kernel
python -m ipykernel install --user --name hpc_ml --display-name "HPC_ML"
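To confirm that the registration worked, you can list the kernels Jupyter knows about:
# List registered Jupyter kernels; hpc_ml should appear
jupyter kernelspec list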
3. Example end-to-end ML: Forest Covertypes Dataset#
First, we will use the Forest Covertypes dataset, which contains information about forest cover types in the Roosevelt National Forest of northern Colorado. The dataset contains 581,012 samples and 54 features. The goal is to predict the forest cover type from the cartographic features provided.
from sklearn.datasets import fetch_covtype
cov_type = fetch_covtype()
As with any dataset we work with, we should poke around it a bit to see what it looks like. It’s always good to know the datatype of the object we’re working with.
type(cov_type)
In this case we’ve got a scikit-learn “Bunch” object. A quick Google search turns up this documentation page: sklearn.utils.Bunch Documentation, where we learn that a Bunch is a dictionary-like object that exposes its keys as attributes. So, let’s see what keys are in this Bunch object.
# What keys does the Bunch contain?
cov_type.keys()
# DESCR holds a full text description of the dataset
print(cov_type.DESCR)
# The feature matrix and its shape
cov_type.data
cov_type.data.shape
# The target (the cover type class of each sample) and its shape
cov_type.target
cov_type.target.shape
# frame is None unless fetch_covtype was called with as_frame=True
cov_type.frame
type(cov_type.frame)
# The name of the target column, and the feature names
cov_type.target_names
cov_type.feature_names
import pandas as pd
# Create a DataFrame using the feature names and data from cov_type
df_cov_type = pd.DataFrame(data=cov_type.data, columns=cov_type.feature_names)
# Display the first few rows of the DataFrame
print("First few rows of the Forest Covertypes Dataset:")
df_cov_type.head()
# Summary statistics for the one-hot soil-type indicator columns
df_cov_type[[col for col in df_cov_type.columns if 'Soil_Type' in col]].describe()
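The soil-type columns are binary (0/1) indicators, so their means show the fraction of samples in each soil type. Another quick check worth doing before modeling is the class balance of the target; for example:
# How many samples fall into each cover type class?
pd.Series(cov_type.target).value_counts().sort_index()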
Let’s use the K-nearest neighbors (KNN) algorithm to try to predict covertype using the information contained in the dataset. KNN classifies each sample by taking a majority vote among the k closest training samples.
# import libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# Split the data into features (X) and target (y)
X, y = cov_type.data, cov_type.target
# Print the shape of the data
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print()
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=355)
# Print the shapes of the training and testing sets
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
Now it’s time to train the model. Note that for KNN, “training” is cheap: the model essentially just stores the training data, and the real work of computing distances happens at prediction time.
# Initialize and train the classifier
kn_classifier = KNeighborsClassifier(n_neighbors=5)
kn_classifier.fit(X_train, y_train)
Now let’s see how the model performed.
import time
# Start a timer
start_time = time.time()
# Make predictions on the test set
y_pred = kn_classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Print results
print(f"Accuracy: {accuracy:.2f}")
# Get unique class labels
unique_labels = np.unique(y)
target_names = [f"Class {label}" for label in unique_labels]
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))
# See how long that took
print(f"\nTotal time taken: {time.time() - start_time:.4f} seconds")
The above code works fine, but it takes a long time to run: on the order of 5–10 minutes, depending on how fast your machine is. And by machine learning standards, that’s not even bad.
Now let’s look at a version of the same machine learning approach that makes better use of the resources available to us, using both parallelization across CPU cores and the GPU we’ve provisioned to speed things up.
First, we’ll load the data and prepare it for the KNN algorithm.
import cudf
from cuml import KNeighborsClassifier as cuKNN
# Use cuML's train_test_split so the split happens on the GPU
from cuml.model_selection import train_test_split
from dask.distributed import Client, LocalCluster
# Initialize Dask client for distributed computing
n_workers = 4 # Adjust based on your HPC resources
cluster = LocalCluster(n_workers=n_workers)
client = Client(cluster)
# Convert data to cuDF DataFrames for GPU processing (float32 is the dtype cuML expects)
X = cudf.DataFrame(cov_type.data, columns=cov_type.feature_names).astype('float32')
y = cudf.Series(cov_type.target)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=355)
Now we fit the KNN model.
# Initialize and train the classifier
kn_classifier_cuml = cuKNN(n_neighbors=5)
kn_classifier_cuml.fit(X_train, y_train)
Now let’s see how this model performed.
# Start a timer
start_time = time.time()
# Make predictions on the test set
y_pred = kn_classifier_cuml.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test.to_cupy().get(), y_pred.to_cupy().get())
# Print results
print(f"Accuracy: {accuracy:.2f}")
# Convert predictions to numpy for classification report
y_test_np = y_test.to_numpy()
y_pred_np = y_pred.to_numpy()
# Get unique class labels
unique_labels = np.unique(y_test_np)
target_names = [f"Class {label}" for label in unique_labels]
print("\nClassification Report:")
print(classification_report(y_test_np, y_pred_np, target_names=target_names))
# Clean up
client.close()
cluster.close()
# See how long that took
print(f"\nTotal time taken: {time.time() - start_time:.4f} seconds")
Notice how much faster inference is in the second case! And this is just a simple case of getting predictions for 116k samples. Imagine if we needed to get predictions for millions of samples. In that case, the second approach would be much more feasible.
4. Hyperparameter optimization: searching for the best version of the model to maximize performance#
In the above example, I used a KNN model with a k value of 5. But how do we know that k=5 is the best value? We don’t. We need to search for the best value of k. This is an example of hyperparameter optimization: a search over the model’s settings (its “hyperparameters”) for the version of the model that maximizes performance.
Hyperparameter optimization is inherently computationally expensive. It involves training many models with different hyperparameters and evaluating their performance. This is a perfect use case for the Palmetto 2 Cluster. We can use the cluster to train many models in parallel, which speeds up the optimization process. And if we submit our hyperparameter search as a batch job, then we don’t need to tie up our local machine for hours – or days! – while it runs.
import cudf
import numpy as np
from cuml.neighbors import KNeighborsClassifier
from cuml.model_selection import train_test_split
from cuml.metrics import accuracy_score
# Convert data to cuDF DataFrames for GPU processing and ensure float32 dtype
X = cudf.DataFrame(cov_type.data, columns=cov_type.feature_names).astype('float32')
y = cudf.Series(cov_type.target)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=355)
# Use a smaller subset for hyperparameter tuning
X_tune, _, y_tune, _ = train_test_split(X_train, y_train, train_size=0.3, random_state=355)
# Define the parameter space for KNN
param_dist = {
    'n_neighbors': np.arange(1, 21, dtype=int),
    'p': [1, 2],  # 1 for Manhattan distance, 2 for Euclidean distance
}
# Function to perform k-fold cross-validation with batched prediction
def cross_validate(X, y, model, n_splits=3, batch_size=10000):
    fold_size = len(X) // n_splits
    scores = []
    for i in range(n_splits):
        start = i * fold_size
        end = (i + 1) * fold_size
        X_val = X.iloc[start:end]
        y_val = y.iloc[start:end]
        X_train = cudf.concat([X.iloc[:start], X.iloc[end:]])
        y_train = cudf.concat([y.iloc[:start], y.iloc[end:]])
        model.fit(X_train, y_train)
        # Batched prediction to limit GPU memory use
        y_pred = cudf.Series()
        for j in range(0, len(X_val), batch_size):
            X_batch = X_val.iloc[j:j+batch_size]
            y_pred = cudf.concat([y_pred, model.predict(X_batch)])
        score = accuracy_score(y_val, y_pred)
        scores.append(score)
    return np.mean(scores)
# Perform manual randomized search
n_iter = 10
best_score = 0
best_params = {}
for _ in range(n_iter):
    params = {k: np.random.choice(v) for k, v in param_dist.items()}
    knn = KNeighborsClassifier(metric='minkowski', **params)
    score = cross_validate(X_tune, y_tune, knn)
    if score > best_score:
        best_score = score
        best_params = params
    print(f"Iteration {_+1}/{n_iter} - Score: {score:.4f} - Params: {params}")
print("Best parameters found:")
for param, value in best_params.items():
    print(f"{param}: {value}")
# Train the best model on the full training set
# (pass metric='minkowski' again so the tuned value of 'p' is actually used)
best_knn = KNeighborsClassifier(metric='minkowski', **best_params)
best_knn.fit(X_train, y_train)
# Make predictions on the test set in batches
batch_size = 10000
y_pred = cudf.Series()
for i in range(0, len(X_test), batch_size):
    X_batch = X_test.iloc[i:i+batch_size]
    y_pred = cudf.concat([y_pred, best_knn.predict(X_batch)])
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nBest model accuracy: {accuracy:.4f}")
# Save the best model
import pickle
with open("best_knn_model.pkl", "wb") as f:
    pickle.dump(best_knn, f)
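To reuse the saved model later (for example, in a separate inference script), it can be loaded back with pickle:
# Load the saved model back in a later session or script
import pickle
with open("best_knn_model.pkl", "rb") as f:
    loaded_knn = pickle.load(f)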
All of the above, being run in a Jupyter notebook, is fine for something that only takes a few minutes. But if we want to, e.g., try hundreds of different settings, or tune a model that takes longer to fit, we should make a script that we can submit as a SLURM batch job. We just need to copy our code into a .py file and write a SLURM script.
# This can be what we put in our .py file (named hyperparameter_optimization.py, to match the SLURM script below):
import cudf
import numpy as np
from cuml.neighbors import KNeighborsClassifier
from cuml.model_selection import train_test_split
from cuml.metrics import accuracy_score
from sklearn.datasets import fetch_covtype
import pickle
# Load the covertype dataset
cov_type = fetch_covtype()
# Convert data to cuDF DataFrames for GPU processing and ensure float32 dtype
X = cudf.DataFrame(cov_type.data, columns=cov_type.feature_names).astype('float32')
y = cudf.Series(cov_type.target)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=355)
# Use a smaller subset for hyperparameter tuning
X_tune, _, y_tune, _ = train_test_split(X_train, y_train, train_size=0.3, random_state=355)
# Define the parameter space for KNN
param_dist = {
    'n_neighbors': np.arange(1, 21, dtype=int),
    'p': [1, 2],  # 1 for Manhattan distance, 2 for Euclidean distance
}
# Function to perform k-fold cross-validation with batched prediction
def cross_validate(X, y, model, n_splits=3, batch_size=10000):
    fold_size = len(X) // n_splits
    scores = []
    for i in range(n_splits):
        start = i * fold_size
        end = (i + 1) * fold_size
        X_val = X.iloc[start:end]
        y_val = y.iloc[start:end]
        X_train = cudf.concat([X.iloc[:start], X.iloc[end:]])
        y_train = cudf.concat([y.iloc[:start], y.iloc[end:]])
        model.fit(X_train, y_train)
        # Batched prediction to limit GPU memory use
        y_pred = cudf.Series()
        for j in range(0, len(X_val), batch_size):
            X_batch = X_val.iloc[j:j+batch_size]
            y_pred = cudf.concat([y_pred, model.predict(X_batch)])
        score = accuracy_score(y_val, y_pred)
        scores.append(score)
    return np.mean(scores)
# Perform manual randomized search
n_iter = 10
best_score = 0
best_params = {}
for _ in range(n_iter):
    params = {k: np.random.choice(v) for k, v in param_dist.items()}
    knn = KNeighborsClassifier(metric='minkowski', **params)
    score = cross_validate(X_tune, y_tune, knn)
    if score > best_score:
        best_score = score
        best_params = params
    print(f"Iteration {_+1}/{n_iter} - Score: {score:.4f} - Params: {params}")
print("Best parameters found:")
for param, value in best_params.items():
    print(f"{param}: {value}")
# Train the best model on the full training set
# (pass metric='minkowski' again so the tuned value of 'p' is actually used)
best_knn = KNeighborsClassifier(metric='minkowski', **best_params)
best_knn.fit(X_train, y_train)
# Make predictions on the test set in batches
batch_size = 10000
y_pred = cudf.Series()
for i in range(0, len(X_test), batch_size):
    X_batch = X_test.iloc[i:i+batch_size]
    y_pred = cudf.concat([y_pred, best_knn.predict(X_batch)])
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nBest model accuracy: {accuracy:.4f}")
# Save the best model
with open("best_knn_model.pkl", "wb") as f:
    pickle.dump(best_knn, f)
And our SLURM script (saved in a separate file, e.g. hyperparam_opt.sh) can be:
#!/bin/bash
#SBATCH --job-name=hyperparam_opt
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=16:00:00
#SBATCH --gres=gpu:v100:1
#SBATCH --output=hyperparam_opt_%j.out
# Load the modules we need
module load anaconda3
module load cuda
# Activate the environment we created to work in
source activate hpc_ml
# Change to the directory where the .py script is
cd /home/[username]/dir/where/the/py/script/is/
# And run the script!
python hyperparameter_optimization.py
Then, on the command line, we can submit the job with: sbatch hyperparam_opt.sh
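Once the job is submitted, we can monitor it with standard SLURM commands (replace <jobid> with the actual job ID):
# Check the status of your queued and running jobs
squeue -u $USER
# After the job finishes, inspect its output log; the %j in the SLURM
# script is replaced by the job ID
cat hyperparam_opt_<jobid>.out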