Scripting Your Code#
Once you have drafted the code to prepare your data, split it, and train and evaluate your model, you should package that code into a script that you can run via SLURM.
This will allow you to submit your job to the cluster and have it run asynchronously, without needing to keep your notebook open. It will also allow you to run your code on a larger dataset, or with more iterations, than you could do interactively.
What is different about running your code as a script vs. in a notebook? You need to ensure each of the following:
Your code runs from top to bottom without needing any manual intervention
Your code loads the full dataset, not just a sample
Your code checks for unexpected conditions and handles them gracefully
Your code logs information about what it is doing, so you can debug it later if needed
For long jobs, your code checkpoints its progress so that it can resume where it left off if it is interrupted
Your code saves any desired outputs (plots, model files, etc.) to disk, rather than merely displaying them in the notebook
You have figured out what resources you need to request
Let’s look at a couple of ways of converting code you developed in Jupyter into a SLURM-submittable script, and then dive into some of the above considerations.
1. Running your notebook directly as a script#
Probably the easiest route to converting your notebook is just to run it directly as a job on the cluster. This is possible with the command:
jupyter nbconvert --to notebook --execute --inplace [notebook_filename].ipynb
where you would replace [notebook_filename] with your notebook's filename. The --inplace flag writes the executed outputs back into the same notebook file. In a full SLURM script, this command might appear as follows:
#!/bin/bash
#SBATCH --job-name my-job-name
#SBATCH --nodes 1
#SBATCH --cpus-per-task 4
#SBATCH --gpus-per-node v100:1
#SBATCH --mem 8gb
#SBATCH --time 08:00:00
module load anaconda
cd /path/to/your/notebook
jupyter nbconvert --to notebook --execute --inplace [notebook_filename].ipynb
Notice that in this case I’ve determined that my code can make use of 4 cores on a single node, as well as one V100 GPU on that node. I’ve also requested 8 GB of memory and 8 hours of runtime. You should adjust these values based on your needs.
2. Converting your notebook to a script#
The preferred coding practice would be to convert your notebook into a script yourself. If you have worked with Jupyter notebooks but not with .py scripts, you can think of the latter as being one big cell in a notebook. In fact, you can even make sure your code runs in a single Jupyter cell (including checkpoints, logging, etc.), and then simply copy that cell into a .py file.
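As a rough illustration of what such a .py file might look like, here is a minimal skeleton that covers several of the points listed earlier: it takes its inputs as command-line arguments, checks that the dataset exists, logs what it is doing, and saves its results to disk. The argument names, the `train_job` logger name, the `results.json` file, and the line-counting placeholder are all assumptions made for this sketch, not part of the workshop code.

```python
import argparse
import json
import logging
from pathlib import Path

logger = logging.getLogger("train_job")  # hypothetical logger name

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train and evaluate a model (sketch).")
    parser.add_argument("--data", type=Path, required=True, help="path to the full dataset")
    parser.add_argument("--out-dir", type=Path, default=Path("outputs"), help="where results are saved")
    return parser.parse_args(argv)

def run(data_path, out_dir):
    """Check inputs, do the work, and save results to disk instead of displaying them."""
    if not data_path.exists():
        raise FileNotFoundError(f"Dataset not found: {data_path}")
    out_dir.mkdir(parents=True, exist_ok=True)
    with data_path.open() as f:
        n_rows = sum(1 for _ in f)  # placeholder for real loading and training
    logger.info("Loaded %d rows from %s", n_rows, data_path)
    results_path = out_dir / "results.json"
    results_path.write_text(json.dumps({"n_rows": n_rows}))
    logger.info("Saved results to %s", results_path)
    return results_path

def main(argv=None):
    """Return 0 on success and 1 on failure, logging the traceback if something goes wrong."""
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    args = parse_args(argv)
    try:
        run(args.data, args.out_dir)
    except Exception:
        logger.exception("Job failed")
        return 1
    return 0
```

In a real script you would end the file with `if __name__ == "__main__": sys.exit(main())` (after `import sys`), so that a failure produces a non-zero exit code and SLURM records the job as failed rather than completed.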
3. Checkpointing#
If your code is going to take a long time to run, you should consider checkpointing it. This means saving the state of your code at regular intervals, so that if it is interrupted, you can resume from the last checkpoint rather than starting over from the beginning.
What exactly this looks like will depend on what you are doing. If you are searching over possible hyperparameters to find the best ones, then you should keep track of which hyperparameters you have tried and what the results were. The toy example below does this by appending each result to a CSV file and skipping any combination already recorded there.
from utils import create_answer_box
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import csv
import os
X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Define hyperparameters to test
params = [(10, 5), (10, None), (50, 5), (50, None)]
# Define the results file, where we'll save the information about which hyperparameters we've already tested
results_file = 'results.csv'
# Load existing results we've already computed
if os.path.exists(results_file):
    with open(results_file, 'r') as f:
        done = set(tuple(row[:2]) for row in csv.reader(f))
else:
    done = set()
# Train models and save results for each set of hyperparameters not already tested
with open(results_file, 'a', newline='') as f:
    writer = csv.writer(f)
    if not done:
        writer.writerow(['n_estimators', 'max_depth', 'accuracy'])
    for n_estimators, max_depth in params:
        # Build the key as strings: csv.writer would write None as an empty string,
        # which would never match str(None) when the file is read back
        key = (str(n_estimators), str(max_depth))
        if key not in done:
            model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
            accuracy = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
            writer.writerow([*key, accuracy])
            f.flush()  # write each result to disk right away, so an interrupted job keeps it
            print(f"n_estimators={n_estimators}, max_depth={max_depth}, accuracy={accuracy:.4f}")
print(f"Results saved to {results_file}")
n_estimators=10, max_depth=None, accuracy=0.5300
n_estimators=50, max_depth=None, accuracy=0.4350
Results saved to results.csv
For a long training run, we also want to save the model itself at regular intervals, so that an interruption costs at most the work done since the last checkpoint. The example below saves the best model seen so far whenever the loss improves, and writes a full checkpoint every 5 epochs.
import copy
import os
import torch
import torch.nn as nn
# Define a simple neural network model
class NNModel(nn.Module):
    def __init__(self, inputs=10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(inputs, 200),
            nn.ReLU(),
            nn.Linear(200, 10),
            nn.ReLU(),
            nn.Linear(10, 1)
        )
    def forward(self, x):
        return self.layers(x)
model = NNModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Make some random data
x = torch.randn(10000, 10) # Input data
y = torch.randn(10000) # Target data
best_loss = float('inf')
best_model = None
for epoch in range(20):
    print(f"Epoch {epoch+1}")
    # Forward pass
    output = model(x).squeeze(1)
    loss = nn.functional.mse_loss(output, y)
    print(f"Loss: {loss.item()}")
    # Track best model. This happens before the optimizer step, so the saved
    # weights are the ones that actually produced this loss.
    if loss.item() < best_loss:
        best_loss = loss.item()
        # Copy the weights: state_dict() returns references that later updates overwrite
        best_model = copy.deepcopy(model.state_dict())
        print(f"New best model found with loss: {best_loss}")
        torch.save({
            'model': best_model,
            'optimizer': optimizer.state_dict(),
            'epoch': epoch,
            'loss': best_loss
        }, 'best_model.pt')
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Save a full checkpoint every 5 epochs
    if (epoch + 1) % 5 == 0:
        torch.save({
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch,
            'loss': loss.item()
        }, f'checkpoint_{epoch+1}.pt')
print("Training complete. Best model saved.")
Epoch 1
Loss: 1.0510294437408447
New best model found with loss: 1.0510294437408447
Epoch 2
Loss: 1.0472089052200317
New best model found with loss: 1.0472089052200317
Epoch 3
Loss: 1.001285433769226
New best model found with loss: 1.001285433769226
Epoch 4
Loss: 0.9997748732566833
New best model found with loss: 0.9997748732566833
Epoch 5
Loss: 1.0021926164627075
Epoch 6
Loss: 1.0016573667526245
Epoch 7
Loss: 0.999336302280426
New best model found with loss: 0.999336302280426
Epoch 8
Loss: 0.9966704845428467
New best model found with loss: 0.9966704845428467
Epoch 9
Loss: 0.994911253452301
New best model found with loss: 0.994911253452301
Epoch 10
Loss: 0.994596004486084
New best model found with loss: 0.994596004486084
Epoch 11
Loss: 0.9952884912490845
Epoch 12
Loss: 0.9958664178848267
Epoch 13
Loss: 0.9955121874809265
Epoch 14
Loss: 0.9943965077400208
New best model found with loss: 0.9943965077400208
Epoch 15
Loss: 0.9931568503379822
New best model found with loss: 0.9931568503379822
Epoch 16
Loss: 0.9923498034477234
New best model found with loss: 0.9923498034477234
Epoch 17
Loss: 0.9921700954437256
New best model found with loss: 0.9921700954437256
Epoch 18
Loss: 0.9922705888748169
Epoch 19
Loss: 0.9923542141914368
Epoch 20
Loss: 0.9921954870223999
Training complete. Best model saved.
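Saving checkpoints only pays off if the script also looks for one when it starts. The sketch below shows that resume logic with a stand-in training loop and a JSON state file, so it runs without PyTorch. The file name `train_state.json`, the helper names, and the `stop_after` argument (which simulates an interruption) are all hypothetical. In the PyTorch example above, the same pattern would read the checkpoint with `torch.load` and call `load_state_dict` on the model and optimizer before continuing from the saved epoch.

```python
import json
import os
import tempfile

def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "best_loss": None}

def save_state(state, path):
    """Write to a temporary file, then rename it, so a kill mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic on POSIX when both paths are on the same file system

def train(path, total_epochs, stop_after=None):
    """Run (or resume) training; stop_after simulates the job being interrupted."""
    state = load_state(path)
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            return state  # simulated interruption
        loss = 1.0 / (epoch + 1)  # stand-in for a real training step
        if state["best_loss"] is None or loss < state["best_loss"]:
            state["best_loss"] = loss
        state["epoch"] = epoch + 1  # number of completed epochs
        save_state(state, path)
    return state

if __name__ == "__main__":
    checkpoint = os.path.join(tempfile.mkdtemp(), "train_state.json")
    first = train(checkpoint, total_epochs=20, stop_after=7)
    print(f"Interrupted after {first['epoch']} epochs")
    second = train(checkpoint, total_epochs=20)
    print(f"Resumed and finished at epoch {second['epoch']}")
```

Because the second call starts from the epoch stored in the file, resubmitting the same job after a timeout or node failure picks up where the previous run stopped instead of repeating finished epochs.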
Now create a Python script that you can submit as a SLURM job. You’ll need two files: a Python file, fit_nn.py, and a SLURM batch script. The following can serve as your batch script:
#!/bin/bash
#SBATCH --job-name=workshop_fit_nn
# Log files go in logs/; create that directory before submitting (mkdir -p logs), since SLURM will not create it
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --time=00:05:00
#SBATCH --partition=work1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
# Load your environment
module load anaconda3
source activate hpc_ml # or your preferred env
# Run the Python script
python fit_nn.py
To make the fit_nn.py file, develop your code in the code cell below. I’ve already put in the extra import statements you’ll need. Copy my code from the cell above, and modify it so that it uses the California housing data instead of random x, y tensors. To do this, you’ll need to:
Copy/paste the above code and add it below the import statements
Load the California housing data (into regressors X and target y)
(Optionally) use StandardScaler() to scale the X data (very advisable for training neural networks!)
Convert X and y to torch tensors using e.g. x = torch.tensor(X, dtype=torch.float32)
Make the neural network take 8 inputs (the code I wrote assumes 10 inputs)
Run the code cell to make sure it works here
Copy it to a new file, fit_nn.py
Create a new file, run_fit_nn.slurm, and copy the SLURM batch script code above into it
Open a terminal window
Submit your neural network training job using sbatch run_fit_nn.slurm
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split # optional
from sklearn.preprocessing import StandardScaler # optional
# Copy/paste the above code cell here, and modify it to use the California housing data.
create_answer_box("Were you able to successfully submit your job? If so, put your jobid here! If not, please describe what difficulties you encountered.", "06-01")
4. Resource allocation#
In order to effectively run your code on the cluster, you need to request the appropriate resources. This includes the number of nodes, the number of cores per node, the amount of memory, the amount of time, and the type of GPU.
When multiple cores help:
When you are running multiple independent jobs
When you are running a single job that can be parallelized (check the documentation!)
When the parallelism is not built into your code, but the work splits into independent pieces (e.g. you launch multiple instances of the code with different hyperparameters)
When multiple cores don’t help:
When you are running a single job that cannot be parallelized
When you haven’t written your code to take advantage of multiple cores
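Requesting cores only helps if the code is told to use them. A minimal sketch, assuming the job was submitted with --cpus-per-task so that SLURM sets the `SLURM_CPUS_PER_TASK` environment variable; the helper name `allocated_cpus` is hypothetical:

```python
import os

def allocated_cpus(default=1):
    """Return the number of cores SLURM gave this task, or `default` outside a SLURM job."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        return default  # variable missing or malformed

n_workers = allocated_cpus()
print(f"Using {n_workers} worker(s)")
```

The result can then be passed to whatever controls parallelism in your library, for example `n_jobs=n_workers` in `RandomForestClassifier` or `torch.set_num_threads(n_workers)` in PyTorch.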
When a GPU helps:
When you are running a deep learning model
When you are running a model that can be accelerated by a GPU (check the documentation!)
When a GPU doesn’t help:
When you are running a model that is not accelerated by a GPU
When you haven’t written your code to take advantage of a GPU
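In PyTorch, using a GPU is explicit: both the model and the data have to be moved to the device. The sketch below assumes PyTorch is the framework in use. The helper `pick_device` is hypothetical, and the import is guarded only so that the snippet also runs where PyTorch is not installed.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Running on: {device}")
```

In a training script you would then call `model.to(device)` and move each tensor with `x.to(device)`. Without these calls the job runs on the CPU even if a GPU was allocated.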
How much memory to request:
This depends on the size of your dataset and the size of your model, as well as whether you are using a GPU. Note that the memory request covers system memory (RAM), which is separate from the memory on the GPU. E.g., if you are using a large language model, you might need a big GPU but not much system memory. If you are using a large dataset, you might need a lot of system memory but not a big GPU.
Be sure to use the jobperf command to check how much memory your job is using. E.g., watch -n 2 jobperf [jobid] will refresh the display every 2 seconds, showing how much memory your job is currently using.
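As a starting point before any measurement, the memory a dense array needs is its number of elements times the bytes per element. The sketch below works that out for an array the size of the random input data used earlier, and also reports the peak memory the process has actually used, via the standard-library `resource` module (Unix only; this assumes Linux, where `ru_maxrss` is reported in kilobytes). The helper names are hypothetical.

```python
import resource

BYTES_PER_ELEMENT = {"float32": 4, "float64": 8}

def array_mib(rows, cols, dtype="float32"):
    """Memory needed to hold a dense rows x cols array, in MiB (2**20 bytes)."""
    return rows * cols * BYTES_PER_ELEMENT[dtype] / 2**20

def peak_memory_mib():
    """Peak resident memory of this process so far (assumes Linux, where ru_maxrss is in kilobytes)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(f"10000 x 10 float32 array: {array_mib(10000, 10):.2f} MiB")
print(f"Peak memory so far: {peak_memory_mib():.1f} MiB")
```

Printing the peak at the end of a script puts it in the job's log file, which helps you right-size the --mem request on the next submission.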
Wrapping up#
Thank you for joining the workshop today! Please take a moment now to answer the below questions.
create_answer_box("Are there areas of machine learning that you wish were represented more in this workshop? If so, what are they?", "06-02")
create_answer_box("Are there other changes you would suggest for this workshop?", "06-03")
create_answer_box("What other workshop topics related to AI or machine learning would you like to see from CCIT?", "06-04")
create_answer_box("Please leave any additional comments or questions here.", "06-05")