GPU Acceleration with Python#
GPU vs. CPU: When and Why to Use GPUs#
Understanding the differences:
A CPU (Central Processing Unit) focuses on versatility and efficiency with a few powerful cores optimized for sequential execution and complex tasks. It excels at operations requiring decision-making, such as control flow or branching.
In contrast, a GPU (Graphics Processing Unit) contains thousands of smaller cores designed to perform many tasks simultaneously. Its strength lies in parallel execution, which makes it ideal for workloads that involve processing large datasets with repetitive operations, like matrix multiplications or element-wise computations.
Use cases for GPUs:
Training neural networks: Deep learning frameworks, such as TensorFlow and PyTorch, leverage GPUs to train models faster by distributing operations over many cores.
Data analysis and scientific computing: GPUs shine when dealing with large datasets that require linear algebra operations, such as matrix inversions or eigenvalue decompositions.
Image and video processing: Tasks like image classification, object detection, and video encoding rely on pixel-level operations, which map well to GPU architectures.
Limitations of GPUs:
GPUs struggle with tasks that are inherently sequential, such as algorithms with many conditional branches or unpredictable memory access patterns.
Some workloads, such as database queries or text processing, may not benefit from GPUs because these tasks are I/O-bound or latency-sensitive and require more control logic than raw parallelism.
Performance considerations:
While GPUs can massively accelerate computation, data transfer between CPU and GPU memory can become a bottleneck. Minimizing these transfers is essential for performance.
On-GPU computation should be maximized by batching operations and keeping as much data in GPU memory as possible. Use libraries that optimize memory management, such as CUDA for Nvidia GPUs or ROCm for AMD GPUs, to avoid unnecessary overhead.
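As a small illustration of this point (a sketch that assumes CuPy, introduced in the next section, and an Nvidia GPU are available), compare a loop that copies data back and forth on every step with one that keeps intermediate results in GPU memory and copies back only once:
import numpy as np
import cupy as cp
x_host = np.random.rand(4096, 4096)
# Slower pattern: transfer to the GPU and back on every iteration
y_host = x_host
for _ in range(5):
    y_host = cp.asnumpy(cp.asarray(y_host) * 2.0)  # host-to-device and device-to-host copy each time
# Faster pattern: transfer once, keep intermediates on the GPU, transfer back once
y_dev = cp.asarray(x_host)
for _ in range(5):
    y_dev = y_dev * 2.0  # stays in GPU memory
y_host = cp.asnumpy(y_dev)  # single copy back to the host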
Introduction to GPU Libraries#
CuPy#
A library that provides GPU acceleration for NumPy operations by using CUDA. CUDA (Compute Unified Device Architecture) is a parallel computing platform and API developed by Nvidia that allows developers to run code directly on GPUs. CuPy uses CUDA to allow you to accelerate existing Python scripts with minimal code changes.
Example use-case: Consider a research problem involving large-scale linear algebra computations, such as solving systems of equations or performing matrix factorization. CuPy can be used to accelerate these operations by utilizing the GPU.
# Perform a simple operation in numpy
import numpy as np
# Perform non-GPU-accelerated array operations
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
result = np.add(x, y)
# Take a look at the result
result
# Perform same operations as in numpy but with cupy
import cupy as cp
# Perform GPU-accelerated array operations
x = cp.array([1, 2, 3])
y = cp.array([4, 5, 6])
result = cp.add(x, y)
# Take a look at the result
result
# Did it really land on the GPU? Let's check
result.device
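Once you are done computing on the GPU, results often need to come back to host memory, for example to hand them to NumPy-based plotting or file I/O. A minimal sketch using the result array from the cell above:
# Copy the GPU array back to host memory as a NumPy array
result_host = cp.asnumpy(result)
# result.get() is an equivalent way to obtain a NumPy copy
type(result_host)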
Timed comparison: CuPy vs. NumPy for matrix multiplication. CuPy can be much faster when your code spends most of its time on large matrix operations.
We can see this by multiplying a 16000x16000 matrix by itself with both NumPy and CuPy.
import numpy as np
import cupy as cp
import time
# NumPy (CPU) computation
start = time.time()
x_cpu = np.random.rand(16000, 16000)
result_cpu = np.dot(x_cpu, x_cpu)
end = time.time()
cpu_time = end - start
print(f"CPU Time: {cpu_time} seconds")
# CuPy (GPU) computation
start = time.time()
x_gpu = cp.random.rand(16000, 16000)
result_gpu = cp.dot(x_gpu, x_gpu)
cp.cuda.Device().synchronize()  # CuPy kernels run asynchronously; wait for the GPU to finish before stopping the timer
end = time.time()
gpu_time = end - start
print(f"GPU Time: {gpu_time} seconds")
# How much faster was the GPU?
speedup = cpu_time / gpu_time
print(f"GPU was {speedup} times faster than CPU")
PyTorch#
A deep learning framework that provides GPU acceleration and automatic differentiation for building and training neural networks.
When to use CuPy vs. PyTorch: Use CuPy for general-purpose GPU-accelerated array operations (like large-scale linear algebra or scientific computing). In contrast, PyTorch is more appropriate when working with machine learning models and neural networks that require features like automatic differentiation, model training loops, and GPU optimization.
import torch
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Even better: let's throw an error if GPU is not available
assert device.type == "cuda", "GPU not available!"
# Define a tensor
tensor = torch.tensor([1.0, 2.0, 3.0])
# Move tensor to GPU
tensor = tensor.to(device)
# Perform operations on GPU
result = tensor * 2
result
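When the data originates inside your program, you can also create the tensor directly on the GPU and skip the extra host-to-device copy; a small sketch:
# Create the tensor on the GPU directly instead of building it on the CPU and moving it
tensor_gpu = torch.tensor([1.0, 2.0, 3.0], device=device)
tensor_gpu.device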
Using neural networks: PyTorch makes it easy to move models and data between CPU and GPU.
model = torch.nn.Linear(10, 1).to(device)
input_data = torch.randn(5, 10).to(device)
output = model(input_data)
output
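To inspect or post-process a GPU result with NumPy, detach it from the autograd graph and copy it back to the CPU first; a minimal sketch:
# Detach from the computation graph, copy to host memory, and convert to NumPy
output_host = output.detach().cpu().numpy()
output_host.shape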
Timed comparison: Below we compare the execution time of a single training step of a simple linear model on the CPU versus the GPU.
import time
import torch
input_dim = 100000
num_training_examples = 10000
output_dim = 10
# CPU Training
device_cpu = torch.device("cpu")
model_cpu = torch.nn.Linear(input_dim, output_dim).to(device_cpu)
optimizer_cpu = torch.optim.SGD(model_cpu.parameters(), lr=0.01) # Add optimizer
input_cpu = torch.randn(num_training_examples, input_dim).to(device_cpu)
target_cpu = torch.randn(num_training_examples, output_dim).to(device_cpu) # Dummy target
loss_fn = torch.nn.MSELoss() # Define loss function
start = time.time()
optimizer_cpu.zero_grad() # Zero gradients
output_cpu = model_cpu(input_cpu) # Forward pass
loss = loss_fn(output_cpu, target_cpu) # Compute loss
loss.backward() # Backward pass (compute gradients)
optimizer_cpu.step() # Update weights
end = time.time()
cpu_train_time = end - start
print(f"CPU Training Time: {cpu_train_time} seconds")
# GPU Training
device_gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_gpu = torch.nn.Linear(input_dim, output_dim).to(device_gpu)
optimizer_gpu = torch.optim.SGD(model_gpu.parameters(), lr=0.01)
input_gpu = torch.randn(num_training_examples, input_dim).to(device_gpu)
target_gpu = torch.randn(num_training_examples, output_dim).to(device_gpu)
loss_fn = torch.nn.MSELoss()
start = time.time()
optimizer_gpu.zero_grad()
output_gpu = model_gpu(input_gpu)
loss = loss_fn(output_gpu, target_gpu)
loss.backward()
optimizer_gpu.step()
torch.cuda.synchronize()  # GPU kernels run asynchronously; wait for them to finish before stopping the timer
end = time.time()
gpu_train_time = end - start
print(f"GPU Training Time: {gpu_train_time} seconds")
# How much faster was the GPU?
speedup = cpu_train_time / gpu_train_time
print(f"GPU was {speedup} times faster than CPU")
# Clear GPU memory
del model_gpu
del input_gpu
del target_gpu
del optimizer_gpu
del x_gpu
del x_cpu
del model_cpu
del input_cpu
del target_cpu
del loss
del result_gpu
del result_cpu
# Release memory cached by CuPy's memory pool (for the CuPy arrays deleted above)
cp.get_default_memory_pool().free_all_blocks()
# Release memory cached by PyTorch's allocator
torch.cuda.empty_cache()
Running Python Code on GPUs via SLURM#
SLURM basics for GPU jobs: To leverage GPUs on Palmetto, you need to request GPUs in your SLURM job script. Specify the number of GPUs required using the --gpus option, as in the script below.
#!/bin/bash
#SBATCH --job-name=gpu_job # Job name
#SBATCH --time=01:00:00 # Time limit hrs:min:sec
#SBATCH --mem=8G # Memory required per node
#SBATCH --gpus=v100:1 # Request 1 V100 GPU
#SBATCH --cpus-per-task=4 # Request 4 CPU cores per task
# Load necessary modules
module load anaconda3
module load cuda
# Run the Python script
python your_script.py
Resource considerations: Ensure your code properly manages GPU resources, especially when using multiple GPUs, to avoid resource contention. Use tools like nvidia-smi in the terminal to monitor GPU utilization.
# Check GPU usage
nvidia-smi
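From inside a running Python job you can query similar information through PyTorch, which is convenient when you cannot easily open a terminal on the compute node; a small sketch:
import torch
# Name of the first visible GPU and memory currently held by PyTorch's allocator
print(torch.cuda.get_device_name(0))
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")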
Multi-GPU programming: For workloads that can leverage multiple GPUs, libraries like PyTorch and TensorFlow provide easy-to-use APIs for distributing computations across multiple GPUs.
PyTorch’s DataParallel splits input data across the specified GPUs, replicates the model on each one, and runs the computations in parallel. For more information about different ways to use PyTorch on multiple GPUs, check the PyTorch documentation.
import torch
import torch.nn as nn
# Define a simple linear model
model = nn.Linear(10, 2) # Input size 10, output size 2
# Wrap the model with DataParallel to use both GPUs (IDs 0 and 1)
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()
# Create dummy input data: batch of 64 samples, each with 10 features
x = torch.randn(64, 10).cuda()
# Perform a forward pass with the input data
output = model(x)
# Print the output to verify multi-GPU execution
print(output)
If your model is too large to fit on a single GPU, you’ll need to use model parallelism instead of DataParallel. In model parallelism, different layers or parts of the model are distributed across multiple GPUs, allowing each GPU to hold a portion of the model’s parameters. PyTorch supports this through manual partitioning or torch.distributed APIs, though it requires more complex coding compared to DataParallel.
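As a minimal sketch of manual model parallelism (assuming two GPUs are available; the layer split here is chosen purely for illustration):
import torch
import torch.nn as nn
class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # The first part of the model lives on GPU 0, the second part on GPU 1
        self.part1 = nn.Linear(10, 50).to("cuda:0")
        self.part2 = nn.Linear(50, 2).to("cuda:1")
    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))  # move the activation between GPUs
        return x
mp_model = TwoGPUModel()
mp_output = mp_model(torch.randn(64, 10))
print(mp_output.device)  # the output lives on cuda:1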