PyTorch GPU support#

The Palmetto cluster has many GPU compute nodes. A crucial feature of PyTorch is its support for GPUs (short for Graphics Processing Units). A GPU can perform many thousands of small operations in parallel, making it ideal for the large matrix operations inside neural networks. You do not need to know anything about GPU programming to use PyTorch on the GPU!

CPUs and GPUs have different strengths and weaknesses (for a classic side-by-side comparison, see Kevin Krewell, 2009), which is why many computers contain both components and use them for different tasks. If you are not familiar with GPUs, NVIDIA's blog posts on the topic offer a good introduction.

GPUs can accelerate the training of your network by up to a factor of \(100\), which is essential for large neural networks. PyTorch implements a lot of functionality for supporting GPUs (mostly NVIDIA's, via the CUDA and cuDNN libraries). First, let's check whether you have a GPU available:

import torch
gpu_avail = torch.cuda.is_available()
print(f'Is the GPU available? {"Yes" if gpu_avail else "No"}')
Is the GPU available? Yes

You can get information about your GPU usage by opening a terminal and running the nvidia-smi command.

!nvidia-smi
Mon Jun 12 17:03:06 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   16C    P0    28W / 250W |     24MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3839      G   /usr/libexec/Xorg                  22MiB |
+-----------------------------------------------------------------------------+

By default, all tensors you create are stored on the CPU. We can push a tensor to the GPU with the function .to(...) or .cuda(). However, it is good practice to define a device object in your code that points to the GPU if you have one, and to the CPU otherwise. If you then write your code with respect to this device object, the same code runs on both a CPU-only system and one with a GPU. We can specify the device as follows:

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print("Device", device)
Device cuda

Let's create a large tensor and push it to the device.

x = torch.randn(1000, 1000, 1000).to(device) 
x.dtype, x.device
(torch.float32, device(type='cuda', index=0))

Question

Can you estimate how much VRAM this tensor should occupy?

!nvidia-smi
Mon Jun 12 17:03:14 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   16C    P0    36W / 250W |   4455MiB / 12288MiB |     80%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3839      G   /usr/libexec/Xorg                  22MiB |
|    0   N/A  N/A   3850647      C   ...torch_workshop/bin/python     4431MiB |
+-----------------------------------------------------------------------------+
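
Answer, worked out: the tensor holds \(1000^3 = 10^9\) float32 elements at \(4\) bytes each, i.e. \(4 \times 10^9\) bytes \(\approx 3815\) MiB. The extra \(\sim 600\) MiB that nvidia-smi reports on top of that is most likely the CUDA context and caching state that PyTorch keeps on the device.
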
del x

# The GPU memory may not be freed right away;
# empty_cache() releases PyTorch's cached blocks back to the driver
torch.cuda.empty_cache()
!nvidia-smi
Mon Jun 12 17:03:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   16C    P0    38W / 250W |    639MiB / 12288MiB |     34%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3839      G   /usr/libexec/Xorg                  22MiB |
|    0   N/A  N/A   3850647      C   ...torch_workshop/bin/python      615MiB |
+-----------------------------------------------------------------------------+

Tensors must be on the same device before you can combine them:

a = torch.randn(3, 3, device=torch.device('cpu'))   # the default
b = torch.randn(3, 3, device=torch.device('cuda'))  # gpu

try:
    print(a+b)
except RuntimeError as e:
    print(e)
Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
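
The fix is to move the tensors onto a common device before combining them, for example:

print(a.to(b.device) + b)  # move a onto the GPU so both operands share a device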

Another error you will run into a lot is running out of GPU memory:

try:
    # tensor is too big to fit on our GPU!
    x = torch.randn(int(3e10), device=torch.device('cuda'))
except Exception as e:
    print(e)
CUDA out of memory. Tried to allocate 111.76 GiB (GPU 0; 11.91 GiB total capacity; 512 bytes already allocated; 11.29 GiB free; 2.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The solution: find ways to use less memory, for example smaller batches, smaller data types, or freeing tensors you no longer need.
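
As one illustration (a sketch of just one option): storing a tensor in float16 instead of the default float32 halves its memory footprint.

x32 = torch.randn(1000, 1000, device=device)                        # float32 by default: 4 bytes per element
x16 = torch.randn(1000, 1000, device=device, dtype=torch.float16)   # float16: 2 bytes per element
print(x32.element_size() * x32.nelement())  # 4000000 bytes
print(x16.element_size() * x16.nelement())  # 2000000 bytes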

How much does the GPU actually help?#

Let’s test by multiplying a large matrix with itself.

import time

x = torch.randn(10000, 10000)

## CPU version
start_time = time.time()
_ = torch.matmul(x, x)
end_time = time.time()
print(f"CPU time: {(end_time - start_time):6.5f}s")

## GPU version
x = x.to(device)
_ = torch.matmul(x, x)  # First operation to 'burn in' GPU
# CUDA is asynchronous, so we need to use different timing functions
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
_ = torch.matmul(x, x)
end.record()
torch.cuda.synchronize()  # Waits for everything to finish running on the GPU
print(f"GPU time: {0.001 * start.elapsed_time(end):6.5f}s")  # Milliseconds to seconds
CPU time: 2.38639s
GPU time: 0.24208s

Computation Graph and Backpropagation#

One of the main reasons for using PyTorch in deep learning projects is that we can automatically get gradients/derivatives of functions that we define. We will mainly use PyTorch for implementing neural networks, which are just fancy functions. If our function contains weight matrices that we want to learn, these are called the parameters, or simply the weights. The ability to compute gradients is essential for optimizing (a.k.a. training) our networks.
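
To make "fancy functions" concrete, here is a toy sketch (purely illustrative; the names W, inp, and out are made up, and later sections use torch.nn for this):

W = torch.randn(4, 2)    # a weight matrix: the parameters we would want to learn
inp = torch.randn(1, 4)  # a batch with one input example
out = inp @ W            # the "network" is just a function of inp and W
print(out.shape)         # torch.Size([1, 2])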

Tensors have a requires_grad attribute

x = torch.ones((3,))
print(x.requires_grad)
False

We can change this for an existing tensor using the function requires_grad_() (the underscore indicating that this is an in-place operation). Alternatively, when creating a tensor, you can pass the argument requires_grad=True to most of the initializers we have seen above.

x.requires_grad_(True)
print(x.requires_grad)
True

In order to get familiar with the concept of a computation graph, we will create one for the following function:

\[y = \frac{1}{|x|}\sum_i \left[(x_i + 2)^2 + 3\right]\]

You could imagine that \(x\) are our parameters, and we want to optimize (either maximize or minimize) the output \(y\). For this, we want to obtain the gradients \(\partial y / \partial \mathbf{x}\). For our example, we’ll use \(\mathbf{x}=[0,1,2]\) as our input.

x = torch.arange(3, dtype=torch.float32, requires_grad=True) # Only float tensors can have gradients
print("X", x)
X tensor([0., 1., 2.], requires_grad=True)

Now let’s build the computation graph step by step. You can combine multiple operations in a single line, but we will separate them here to get a better understanding of how each operation is added to the computation graph.

a = x + 2
b = a ** 2
c = b + 3
y = c.mean()
print("Y", y)
Y tensor(12.6667, grad_fn=<MeanBackward0>)

Using the statements above, we have created a computation graph that chains these operations: \(x \rightarrow a \rightarrow b \rightarrow c \rightarrow y\).

We calculate \(a\) based on the input \(x\) and the constant \(2\), \(b\) is \(a\) squared, and so on. The graph is an abstraction of the dependencies between the inputs and outputs of the operations we have applied.

Each node in the computation graph automatically defines a function, grad_fn, for calculating the gradients with respect to its inputs. PyTorch can use the chain rule to automatically compute the gradients of \(y\) with respect to any of the inputs that have requires_grad=True. PyTorch does this by traversing the computation graph backward, applying the grad_fn defined at each operation. This is called "backpropagation".
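
You can peek at these grad_fn objects yourself (purely illustrative; the exact repr may differ between PyTorch versions):

print(y.grad_fn)                 # the MeanBackward0 node created by c.mean()
print(y.grad_fn.next_functions)  # the AddBackward0 node that produced c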

We can perform backpropagation on the computation graph by calling the function backward() on the last output:

y.backward()

x.grad will now contain the gradient \(\partial y/ \partial \mathbf{x}\), and this gradient indicates how a change in \(\mathbf{x}\) will affect the output \(y\) given the current input \(\mathbf{x}=[0,1,2]\):

print(x.grad)
tensor([1.3333, 2.0000, 2.6667])

We can also verify these gradients by hand. We will calculate the gradients using the chain rule, in the same way as PyTorch did it:

\[ \frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial c_i}\frac{\partial c_i}{\partial b_i}\frac{\partial b_i}{\partial a_i}\frac{\partial a_i}{\partial x_i} \]

Note that we have simplified this equation to index notation, using the fact that all operations besides the mean do not combine the elements of the tensor. The partial derivatives are:

\[ \frac{\partial a_i}{\partial x_i} = 1,\hspace{1cm} \frac{\partial b_i}{\partial a_i} = 2\cdot a_i\hspace{1cm} \frac{\partial c_i}{\partial b_i} = 1\hspace{1cm} \frac{\partial y}{\partial c_i} = \frac{1}{3} \]

Hence, with the input being \(\mathbf{x}=[0,1,2]\), our gradients are \(\partial y/\partial \mathbf{x}=[4/3,2,8/3]\). The previous code cell should have printed the same result.
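
We can also let PyTorch check this for us. Combining the partials above gives \(\partial y/\partial x_i = \frac{2}{3}(x_i + 2)\), which we can compare against the autograd result (a quick sanity check, assuming the cells above were run in order):

manual_grad = 2 * (x.detach() + 2) / 3      # hand-derived gradient 2*(x_i + 2)/3
print(torch.allclose(x.grad, manual_grad))  # True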