PyTorch GPU support#

The Palmetto cluster has many GPU compute nodes. A crucial feature of PyTorch is its support for GPUs (short for Graphics Processing Units). A GPU can perform many thousands of small operations in parallel, making it ideal for the large matrix operations inside neural networks. You do not need to know anything about GPU programming to use PyTorch on the GPU!

CPUs and GPUs have different strengths and weaknesses (for a classic side-by-side comparison, see Kevin Krewell, 2009), which is why many computers contain both components and use them for different tasks. If you are not familiar with GPUs, NVIDIA's blog posts on the topic offer a good introduction.

GPUs can accelerate the training of your network by up to a factor of \(100\), which is essential for large neural networks. PyTorch implements a lot of functionality for supporting GPUs (mostly NVIDIA's, via the CUDA and cuDNN libraries). First, let's check whether you have a GPU available:

import torch
gpu_avail = torch.cuda.is_available()
print(f'Is the GPU available? {"Yes" if gpu_avail else "No"}')
Is the GPU available? Yes

You can get information about your GPU usage by opening a terminal and running the nvidia-smi command.

!nvidia-smi
Mon Jun 12 17:03:06 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   16C    P0    28W / 250W |     24MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3839      G   /usr/libexec/Xorg                  22MiB |
+-----------------------------------------------------------------------------+

By default, all tensors you create are stored on the CPU. We can push a tensor to the GPU with the function .to(...) or .cuda(). However, it is good practice to define a device object in your code that points to the GPU if you have one, and to the CPU otherwise. If you then write your code with respect to this device object, the same code runs on both a CPU-only system and one with a GPU. We can specify the device as follows:

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print("Device", device)
Device cuda

Let's create a large tensor and push it to the device.

x = torch.randn(1000, 1000, 1000).to(device) 
x.dtype, x.device
(torch.float32, device(type='cuda', index=0))

Question

Can you estimate how much VRAM this tensor should occupy?

!nvidia-smi
Mon Jun 12 17:03:14 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   16C    P0    36W / 250W |   4455MiB / 12288MiB |     80%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3839      G   /usr/libexec/Xorg                  22MiB |
|    0   N/A  N/A   3850647      C   ...torch_workshop/bin/python     4431MiB |
+-----------------------------------------------------------------------------+
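
Answer, worked out: the tensor holds \(1000^3 = 10^9\) float32 elements at \(4\) bytes each, i.e. \(4 \times 10^9\) bytes \(\approx 3815\) MiB. The extra \(\sim 600\) MiB that nvidia-smi reports on top of that is most likely the CUDA context and caching state that PyTorch keeps on the device.
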
del x

# The GPU memory may not be freed right away;
# empty_cache() releases PyTorch's cached blocks back to the driver
torch.cuda.empty_cache()
!nvidia-smi
Mon Jun 12 17:03:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   16C    P0    38W / 250W |    639MiB / 12288MiB |     34%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3839      G   /usr/libexec/Xorg                  22MiB |
|    0   N/A  N/A   3850647      C   ...torch_workshop/bin/python      615MiB |
+-----------------------------------------------------------------------------+

Tensors must be on the same device before you can combine them:

a = torch.randn(3, 3, device=torch.device('cpu'))   # the default
b = torch.randn(3, 3, device=torch.device('cuda'))  # gpu

try:
    print(a+b)
except RuntimeError as e:
    print(e)
Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
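
The fix is to move the tensors onto a common device before combining them, for example:

print(a.to(b.device) + b)  # move a onto the GPU so both operands share a device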

Another error you will run into a lot is running out of GPU memory:

try:
    # tensor is too big to fit on our GPU!
    x = torch.randn(int(3e10), device=torch.device('cuda'))
except Exception as e:
    print(e)
CUDA out of memory. Tried to allocate 111.76 GiB (GPU 0; 11.91 GiB total capacity; 512 bytes already allocated; 11.29 GiB free; 2.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The solution: find ways to use less memory, for example smaller batches, smaller data types, or freeing tensors you no longer need.
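
As one illustration (a sketch of just one option): storing a tensor in float16 instead of the default float32 halves its memory footprint.

x32 = torch.randn(1000, 1000, device=device)                        # float32 by default: 4 bytes per element
x16 = torch.randn(1000, 1000, device=device, dtype=torch.float16)   # float16: 2 bytes per element
print(x32.element_size() * x32.nelement())  # 4000000 bytes
print(x16.element_size() * x16.nelement())  # 2000000 bytes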

How much does the GPU actually help?#

Let’s test by multiplying a large matrix with itself.

import time

x = torch.randn(10000, 10000)

## CPU version
start_time = time.time()
_ = torch.matmul(x, x)
end_time = time.time()
print(f"CPU time: {(end_time - start_time):6.5f}s")

## GPU version
x = x.to(device)
_ = torch.matmul(x, x)  # First operation to 'burn in' GPU
# CUDA is asynchronous, so we need to use different timing functions
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
_ = torch.matmul(x, x)
end.record()
torch.cuda.synchronize()  # Waits for everything to finish running on the GPU
print(f"GPU time: {0.001 * start.elapsed_time(end):6.5f}s")  # Milliseconds to seconds
CPU time: 2.38639s
GPU time: 0.24208s

Computation Graph and Backpropagation#

One of the main reasons for using PyTorch in deep learning projects is that we can automatically get gradients/derivatives of functions that we define. We will mainly use PyTorch for implementing neural networks, which are just fancy functions. If our function contains weight matrices that we want to learn, these are called the parameters, or simply the weights. The ability to compute gradients is essential for optimizing (a.k.a. training) our networks.
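
To make "fancy functions" concrete, here is a toy sketch (purely illustrative; the names W, inp, and out are made up, and later sections use torch.nn for this):

W = torch.randn(4, 2)    # a weight matrix: the parameters we would want to learn
inp = torch.randn(1, 4)  # a batch with one input example
out = inp @ W            # the "network" is just a function of inp and W
print(out.shape)         # torch.Size([1, 2])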

Tensors have a requires_grad attribute

x = torch.ones((3,))
print(x.requires_grad)
False

We can change this for an existing tensor using the function requires_grad_() (the underscore indicating that this is an in-place operation). Alternatively, when creating a tensor, you can pass the argument requires_grad=True to most of the initializers we have seen above.

x.requires_grad_(True)
print(x.requires_grad)
True

In order to get familiar with the concept of a computation graph, we will create one for the following function:

\[y = \frac{1}{|x|}\sum_i \left[(x_i + 2)^2 + 3\right]\]

You could imagine that \(x\) are our parameters, and we want to optimize (either maximize or minimize) the output \(y\). For this, we want to obtain the gradients \(\partial y / \partial \mathbf{x}\). For our example, we’ll use \(\mathbf{x}=[0,1,2]\) as our input.

x = torch.arange(3, dtype=torch.float32, requires_grad=True) # Only float tensors can have gradients
print("X", x)
X tensor([0., 1., 2.], requires_grad=True)

Now let’s build the computation graph step by step. You can combine multiple operations in a single line, but we will separate them here to get a better understanding of how each operation is added to the computation graph.

a = x + 2
b = a ** 2
c = b + 3
y = c.mean()
print("Y", y)
Y tensor(12.6667, grad_fn=<MeanBackward0>)

Using the statements above, we have created a computation graph that chains these operations: \(x \rightarrow a \rightarrow b \rightarrow c \rightarrow y\).

We calculate \(a\) based on the input \(x\) and the constant \(2\), \(b\) is \(a\) squared, and so on. The graph is an abstraction of the dependencies between the inputs and outputs of the operations we have applied.

Each node in the computation graph automatically defines a function, grad_fn, for calculating the gradients with respect to its inputs. PyTorch can use the chain rule to automatically compute the gradients of \(y\) with respect to any of the inputs that have requires_grad=True. PyTorch does this by traversing the computation graph backward, applying the grad_fn defined at each operation. This is called "backpropagation".
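
You can peek at these grad_fn objects yourself (purely illustrative; the exact repr may differ between PyTorch versions):

print(y.grad_fn)                 # the MeanBackward0 node created by c.mean()
print(y.grad_fn.next_functions)  # the AddBackward0 node that produced c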

We can perform backpropagation on the computation graph by calling the function backward() on the last output:

y.backward()

x.grad will now contain the gradient \(\partial y/ \partial \mathbf{x}\), and this gradient indicates how a change in \(\mathbf{x}\) will affect the output \(y\) given the current input \(\mathbf{x}=[0,1,2]\):

print(x.grad)
tensor([1.3333, 2.0000, 2.6667])

We can also verify these gradients by hand. We will calculate the gradients using the chain rule, in the same way as PyTorch did it:

\[ \frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial c_i}\frac{\partial c_i}{\partial b_i}\frac{\partial b_i}{\partial a_i}\frac{\partial a_i}{\partial x_i} \]

Note that we have simplified this equation to index notation, using the fact that all operations besides the mean do not combine the elements of the tensor. The partial derivatives are:

\[ \frac{\partial a_i}{\partial x_i} = 1,\hspace{1cm} \frac{\partial b_i}{\partial a_i} = 2\cdot a_i\hspace{1cm} \frac{\partial c_i}{\partial b_i} = 1\hspace{1cm} \frac{\partial y}{\partial c_i} = \frac{1}{3} \]

Hence, with the input being \(\mathbf{x}=[0,1,2]\), our gradients are \(\partial y/\partial \mathbf{x}=[4/3,2,8/3]\). The previous code cell should have printed the same result.
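
We can also let PyTorch check this for us. Combining the partials above gives \(\partial y/\partial x_i = \frac{2}{3}(x_i + 2)\), which we can compare against the autograd result (a quick sanity check, assuming the cells above were run in order):

manual_grad = 2 * (x.detach() + 2) / 3      # hand-derived gradient 2*(x_i + 2)/3
print(torch.allclose(x.grad, manual_grad))  # True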