Scripting Your Code
Once you have drafted the code to prepare your data, split it, and train and evaluate your model, you should package that code into a script that you can run via SLURM.
This will allow you to submit your job to the cluster and have it run asynchronously, without needing to keep your notebook open. It will also allow you to run your code on a larger dataset, or with more iterations, than you could do interactively.
What is different about running your code as a script vs. in a notebook? You need to ensure each of the following:
Your code runs from top to bottom without any manual intervention
Your code loads the full dataset, not just a sample
Your code checks for unexpected conditions and handles them gracefully
Your code logs information about what it is doing, so you can debug it later if needed (a minimal sketch follows this list)
For long jobs, your code checkpoints its progress so that it can resume where it left off if it is interrupted
Your code saves any desired outputs (plots, model files, etc.) to disk, rather than merely displaying them in the notebook
You have figured out what resources you need to request
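On the logging point: Python's built-in logging module is a simple way to produce timestamped messages from a script. Here is a minimal sketch (the log filename and messages are just placeholders):
import logging

logging.basicConfig(
    filename='train.log',  # placeholder name; messages accumulate in this file
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s'
)

logging.info("Loading data...")
logging.info("Starting training...")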
Let’s look at a couple of ways of converting code you developed in Jupyter into a SLURM-submittable script, and then dive into some of the above considerations.
1. Running your notebook directly as a script
Probably the easiest route to converting your notebook is just to run it directly as a job on the cluster. This is possible with the command:
jupyter nbconvert --to notebook --execute --inplace [notebook_filename].ipynb
where you would replace [notebook_filename] with your notebook's filename. In a full SLURM script, this command might appear as follows:
#!/bin/bash
#SBATCH --job-name my-job-name
#SBATCH --nodes 1
#SBATCH --cpus-per-task 4
#SBATCH --gpus-per-node v100:1
#SBATCH --mem 8gb
#SBATCH --time 08:00:00
module load anaconda
cd /path/to/your/notebook
jupyter nbconvert --to notebook --execute --inplace [notebook_filename].ipynb
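You would save this script to a file (e.g. run_notebook.sh, a name chosen just for this example) and submit it with sbatch run_notebook.sh. The --execute flag runs every cell in order, and --inplace writes the resulting outputs back into the notebook file itself, so you can open the notebook afterwards and inspect the results.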
Notice that in this case I’ve determined that my code can make use of 4 cores on a single node, as well as one V100 GPU. I’ve also requested 8GB of memory and 8 hours of runtime. You should adjust these values based on your needs.
2. Converting your notebook to a script
The preferred coding practice would be to convert your notebook into a script yourself. If you have worked with Jupyter notebooks but not with .py scripts, you can think of the latter as being one big cell in a notebook. In fact, you can even make sure your code runs in a single Jupyter cell (including checkpoints, logging, etc.), and then simply copy that cell into a .py file.
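Jupyter can also do a rough first pass of this conversion for you:
jupyter nbconvert --to script [notebook_filename].ipynb
This writes out a .py file containing the code from each cell, which you can then clean up by hand.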
3. Checkpointing
If your code is going to take a long time to run, you should consider checkpointing it. This means saving the state of your code at regular intervals, so that if it is interrupted, you can resume from the last checkpoint rather than starting over from the beginning.
What exactly this looks like will depend on what you are doing. If you are searching over possible hyperparameters to find the best ones, then you should keep track of which hyperparameters you have tried and what the results were. See the toy example below.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import csv
import os

X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Define hyperparameters to test: (n_estimators, max_depth) pairs
params = [(10, 5), (10, None), (50, 5), (50, None)]

# Define the results file, where we'll save the information about which
# hyperparameters we've already tested
results_file = 'results.csv'

# Load existing results we've already computed, skipping the header row
done = set()
if os.path.exists(results_file):
    with open(results_file, 'r') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header
        done = set(tuple(row[:2]) for row in reader)

# Train models and save results for each set of hyperparameters not already tested
with open(results_file, 'a', newline='') as f:
    writer = csv.writer(f)
    if f.tell() == 0:  # brand-new file: write the header first
        writer.writerow(['n_estimators', 'max_depth', 'accuracy'])
    for n_estimators, max_depth in params:
        if (str(n_estimators), str(max_depth)) not in done:
            model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
            accuracy = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
            writer.writerow([n_estimators, max_depth, accuracy])
            f.flush()  # make sure this row is on disk before the next (slow) fit
            print(f"n_estimators={n_estimators}, max_depth={max_depth}, accuracy={accuracy:.4f}")

print(f"Results saved to {results_file}")
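If this job is killed partway through the loop and resubmitted, the script re-reads results.csv, skips every hyperparameter combination that already has a row, and picks up with the remaining ones; the f.flush() call ensures each finished row is on disk before the next model starts training.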
If we have a long training run, we might want to save the model itself at regular intervals. This matters most when each epoch is expensive to recompute, e.g. when the model is large or the dataset is big.
import torch
import torch.nn as nn

# Define a simple neural network model
class NNModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 200),
            nn.ReLU(),
            nn.Linear(200, 10),
            nn.ReLU(),
            nn.Linear(10, 1)
        )

    def forward(self, x):
        return self.layers(x)

model = NNModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Make some random data
x = torch.randn(10000, 10)  # Input data
y = torch.randn(10000, 1)   # Target data

best_loss = float('inf')

for epoch in range(20):
    print(f"Epoch {epoch+1}")

    # Forward pass
    output = model(x)
    loss = nn.functional.mse_loss(output, y)
    print(f"Loss: {loss.item()}")

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Track the best model seen so far and save it whenever it improves
    if loss.item() < best_loss:
        best_loss = loss.item()
        print(f"New best model found with loss: {best_loss}")
        torch.save({
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch,
            'loss': best_loss
        }, 'best_model.pt')

    # Save a regular checkpoint every 5 epochs
    if (epoch + 1) % 5 == 0:
        torch.save({
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch,
            'loss': loss.item()
        }, f'checkpoint_{epoch+1}.pt')

print("Training complete. Best model saved.")
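To actually resume after an interruption, the start of the script can look for the most recent checkpoint and restore the model and optimizer before the training loop. A minimal sketch, assuming the checkpoint_*.pt filenames used above:
import glob

start_epoch = 0
checkpoints = glob.glob('checkpoint_*.pt')
if checkpoints:
    # Pick the checkpoint with the highest epoch number in its filename
    latest = max(checkpoints, key=lambda p: int(p.split('_')[1].split('.')[0]))
    state = torch.load(latest)
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    start_epoch = state['epoch'] + 1
    print(f"Resuming from {latest}")

for epoch in range(start_epoch, 20):
    ...  # training loop exactly as above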
4. Resource allocation
In order to effectively run your code on the cluster, you need to request the appropriate resources. This includes the number of nodes, the number of cores per node, the amount of memory, the amount of time, and the type of GPU.
When multiple cores help:
When you are running multiple independent jobs
When you are running a single program whose code can use multiple cores (check the documentation! many libraries expose this as a parameter; see the sketch after these lists)
When your workload splits into independent pieces even though the code itself is serial (e.g. running several instances of the script, each with different hyperparameters)
When multiple cores don’t help:
When you are running a single job that cannot be parallelized
When you haven’t written your code to take advantage of multiple cores
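For instance, many scikit-learn estimators accept an n_jobs parameter controlling how many cores they use. A minimal sketch that matches the worker count to the SLURM allocation (SLURM sets the SLURM_CPUS_PER_TASK environment variable when you request --cpus-per-task):
import os
from sklearn.ensemble import RandomForestClassifier

# Use however many CPUs SLURM allocated to this job, falling back to 1
n_cpus = int(os.environ.get('SLURM_CPUS_PER_TASK', 1))

# n_jobs asks scikit-learn to fit trees on n_cpus cores in parallel
model = RandomForestClassifier(n_estimators=100, n_jobs=n_cpus)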
When a GPU helps:
When you are running a deep learning model
When you are running a model that can be accelerated by a GPU (check the documentation!)
When a GPU doesn’t help:
When you are running a model that is not accelerated by a GPU
When you haven’t written your code to take advantage of a GPU
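In PyTorch, for example, GPU use is opt-in: both the model and the data have to be moved onto the GPU explicitly, and it is good practice to fall back to the CPU when no GPU is available. A minimal sketch, reusing NNModel, x, and y from the checkpointing example above:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = NNModel().to(device)  # move the model's parameters to the GPU (if present)
x = x.to(device)              # data must live on the same device as the model
y = y.to(device)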
How much memory to request:
This depends on the size of your dataset and the size of your model, as well as whether you are using a GPU (GPU memory is requested separately from main memory). E.g., if you are running a large language model, you might need a big GPU but relatively little main memory; if you are loading a large dataset into RAM, you might need a lot of memory but only a modest GPU, or none at all.
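For in-memory datasets, a quick back-of-the-envelope estimate: a dense array of 64-bit floats occupies rows × columns × 8 bytes, and you should request headroom beyond that for copies made during preprocessing. You can also just ask NumPy:
import numpy as np

X = np.random.rand(1000, 5)   # 1000 rows x 5 columns of float64
print(X.nbytes / 1e6, "MB")   # 1000 * 5 * 8 bytes = 0.04 MB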
Be sure to use the jobperf command to check how much memory your job is actually using. E.g., watch -n 2 jobperf [jobid] will show you your job's memory usage, refreshed every 2 seconds.