Debugging and Performance Tuning#

Common Performance Bottlenecks in HPC Environments#

  • Memory limitations: Request enough memory in your SLURM script with #SBATCH --mem (a minimal batch-script sketch follows below). Profile your code to identify its peak memory demand: requesting too little will cause your job to fail, while requesting too much ties up cluster resources inefficiently. Tools like jobstats can help analyze memory usage.

      jobstats <job_id>
    

    OpenOnDemand provides a web-based interface for monitoring job statistics, including memory usage.
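For reference, here is a minimal batch-script sketch showing the memory request (the job name, memory value, and script name are placeholders, not recommendations):

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --time=01:00:00
#SBATCH --mem=8G              # total memory per node; size this from profiling
#SBATCH --output=myjob_%j.out

python my_script.py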

  • I/O bottlenecks: When reading or writing large files, poor I/O performance can limit overall job efficiency. Strategies like parallel I/O, caching frequently used data, and reducing the frequency of file access can help.

    Text-based formats like CSV are particularly slow because each data point must be converted to text and parsed back, leading to high overhead.

    • Parallel I/O libraries: Use libraries like HDF5 with parallel I/O support to improve file read/write performance. HDF5 stores data in a binary format, minimizing serialization overhead. Unlike CSV, which is written and parsed row by row and can grow quite large, HDF5 supports chunked storage and partial reads. This enables fast, scalable reads and writes, even for large, multidimensional datasets, without loading the entire file into memory.

# Example: HDF5 with concurrent read support (SWMR: single writer, multiple readers)
import h5py

# Create a large dataset; chunks=True lets HDF5 choose a chunk layout,
# and libver='latest' enables the file features SWMR needs
with h5py.File('data.h5', 'w', libver='latest') as f:
    dset = f.create_dataset('dataset', (10000, 10000), dtype='f', chunks=True)

# Open the file in SWMR mode, which allows concurrent readers
# even while a single writer is still appending
with h5py.File('data.h5', 'r', libver='latest', swmr=True) as f:
    dset = f['dataset']
    print(dset.shape)
    print(dset[...])
(10000, 10000)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
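The example above uses SWMR mode (one writer, many concurrent readers). For truly parallel writes across MPI ranks, h5py must be built with MPI support; the following is a minimal sketch assuming an MPI-enabled h5py and mpi4py are installed (run with, e.g., mpirun -n 4 python script.py):

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# All ranks open the same file collectively using the MPI-IO driver
with h5py.File('parallel.h5', 'w', driver='mpio', comm=comm) as f:
    dset = f.create_dataset('dataset', (comm.Get_size(), 100), dtype='f')
    # Each rank writes its own row independently, with no coordination needed
    dset[rank, :] = rank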

Timed comparison: Let’s compare the performance of reading and writing data in CSV and HDF5 formats.

import h5py
import pandas as pd
import numpy as np
import time

# Create toy data
datasize = 5000
data = np.random.rand(datasize, datasize)

# Perform CSV read/write and measure total time
csv_filename = 'data.csv'
df = pd.DataFrame(data)

start = time.time()
df.to_csv(csv_filename, index=False)  # Write to CSV
df_loaded = pd.read_csv(csv_filename)  # Read from CSV
csv_total_time = time.time() - start
del df, df_loaded

print(f"Total CSV Time (write + read): {csv_total_time:.2f} seconds")

# Perform HDF5 read/write and measure total time
h5_filename = 'data.h5'

start = time.time()
with h5py.File(h5_filename, 'w') as f:  # Write to HDF5
    f.create_dataset('dataset', data=data)

with h5py.File(h5_filename, 'r') as f:  # Read from HDF5
    h5_data = f['dataset'][:]
h5_total_time = time.time() - start
del h5_data
del data

print(f"Total HDF5 Time (write + read): {h5_total_time:.2f} seconds")

# How much faster was HDF5?
speedup = csv_total_time / h5_total_time
print(f"HDF5 was {speedup:.2f}x faster than CSV.")
Total CSV Time (write + read): 37.24 seconds
Total HDF5 Time (write + read): 1.05 seconds
HDF5 was 35.56x faster than CSV.

Tools and Strategies for Debugging on the Cluster#

Debugging tools: Use available debugging tools such as gdb for C/C++ code or Python debuggers like pdb. SLURM commands such as squeue, together with utilities like jobstats, can provide useful diagnostics for HPC jobs.

squeue -u <username>
jobstats <job_id>

SLURM job logs: Check the output and error logs generated by SLURM for details on why a job might have failed.
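By default, sbatch writes both stdout and stderr to a file named slurm-<jobid>.out in the submission directory; the #SBATCH --output and --error directives redirect them. A quick look at the end of the log often reveals the failure (the job ID below is a placeholder):

tail -n 50 slurm-123456.out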

Debugging Python code: The pdb debugger is useful for interactive debugging. Import the pdb module and place pdb.set_trace() (or, since Python 3.7, the built-in breakpoint()) in your code where you want to start debugging. Execution pauses there and drops you into the debugger prompt (a short worked sketch follows this list). Once you’re there:

  • Step through code: Use n (next) to execute the next line of code. This allows you to walk through the function line by line.

  • Step into functions: Use s (step) to step into a function call. This is useful when you want to dive into the details of a function.

  • Inspect variables: Use p (print) followed by a variable name to inspect its value at the current point in the code. This helps identify unexpected values.

  • Set breakpoints: Use b <line_number> to set a breakpoint at a specific line. Execution will pause when the code reaches that point.

  • Continue execution: Use c (continue) to resume running the code until the next breakpoint or the end of the program.

  • Exit the debugger: Use q (quit) to exit the debugger when you’re done or if you’ve found the issue.
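For example, a minimal sketch of this workflow (the function and values here are illustrative):

import pdb

def running_total(values):
    total = 0
    for v in values:
        pdb.set_trace()  # execution pauses here each iteration; try p total, p v, then c
        total += v
    return total

running_total([1, 2, 3])

To see what a debugger (or a careful read of the traceback) would reveal, consider the following function with an intentional bug: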

def broken_function(data):
    # This function attempts to compute the sum of each row in the input matrix.
    row_sums = []
    for row in data:
        row_sums.append(sum(row))
    # Introduce a bug: trying to index into a non-existent element
    return row_sums[10000]  # Intentional out-of-bounds error

# Create a small toy dataset to trigger the error
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

broken_function(data)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[9], line 4
      1 # Create a small toy dataset to trigger the error
      2 data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
----> 4 broken_function(data)

Cell In[8], line 7, in broken_function(data)
      5     row_sums.append(sum(row))
      6 # Introduce a bug: trying to index into a non-existent element
----> 7 return row_sums[10000]

IndexError: list index out of range

As an alternative to adding pdb.set_trace() calls to your code, you can use pdb.run() to start the debugger at a specific function call. This is useful for debugging functions that are called from multiple places in your codebase.

import pdb

# Run the call under pdb; the debugger stops before the first line executes
pdb.run('broken_function(data)')

Another option in Jupyter is to use the %debug magic command. This will drop you into the debugger at the point where an exception was raised. You can then inspect variables, step through code, and identify the source of the error.
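For instance, after the IndexError above you could run %debug in the next cell; a minimal sketch of the resulting session (the prompt output shown in the comments is illustrative):

%debug
# ipdb> p len(row_sums)
# 3
# ipdb> q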

This is convenient when you’re working in Jupyter anyway, but it is not as powerful or as flexible as using pdb directly in a script or module.

Finally, the most powerful debugging setup for Python is arguably the graphical debugger built into an IDE such as PyCharm or VS Code. These tools provide a rich debugging environment with variable inspection, call-stack visualization, and interactive breakpoints. You can use the VS Code debugger in an instance of VS Code on your local machine or through OpenOnDemand. See the VS Code documentation for more information on setting up and using the debugger.

Profiling tools#

Profiling is crucial for identifying performance bottlenecks.

  • cProfile: A built-in Python module to profile code and identify slow functions.

# Example of profiling with cProfile
import cProfile

def your_function():
    # Placeholder function for profiling
    pass

cProfile.run('your_function()')

Below is a more involved example: a main function that generates data (with a deliberately inefficient sort), computes the sum of squares, and simulates a slow task. We’ll profile it to see how much time each part of the code takes.

import numpy as np
import cProfile
import time

# Sub-function 1: Generate data with unnecessary transformations
def generate_data(size=1_000_000):
    data = np.random.rand(size)
    # Inefficient sorting operation to simulate overhead
    sorted_data = sorted(data, reverse=True)
    return np.array(sorted_data)

# Sub-function 2: Compute the sum of squares (with multiple layers of computation)
def sum_of_squares(data):
    squares = [x**2 for x in data]  # First, create a list of squares
    total_sum = sum(squares)  # Then, sum them up
    return total_sum

# Sub-function 3: Simulate a slow task with unnecessary work
def simulate_slow_task():
    total = 0
    for i in range(10):
        total += i % 3  
        # Sleep for 0.1 seconds to simulate a slow computation
        time.sleep(0.1)
    return total

# Main function that calls all components
def complex_function():
    data = generate_data(1_000_000)  # Generate data with overhead
    result1 = sum_of_squares(data)   # Compute sum of squares
    result2 = simulate_slow_task()   # Simulate slow task
    return result1 + result2

# Profile the complex function
cProfile.run('complex_function()')
         22 function calls in 1.943 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.170    0.170 581806734.py:12(sum_of_squares)
        1    0.124    0.124    0.124    0.124 581806734.py:13(<listcomp>)
        1    0.000    0.000    1.001    1.001 581806734.py:18(simulate_slow_task)
        1    0.033    0.033    1.943    1.943 581806734.py:27(complex_function)
        1    0.000    0.000    0.739    0.739 581806734.py:5(generate_data)
        1    0.000    0.000    1.943    1.943 <string>:1(<module>)
        1    0.000    0.000    1.943    1.943 {built-in method builtins.exec}
        1    0.597    0.597    0.597    0.597 {built-in method builtins.sorted}
        1    0.046    0.046    0.046    0.046 {built-in method builtins.sum}
        1    0.135    0.135    0.135    0.135 {built-in method numpy.array}
       10    1.001    0.100    1.001    0.100 {built-in method time.sleep}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.006    0.006    0.006    0.006 {method 'rand' of 'numpy.random.mtrand.RandomState' objects}

Interpreting the cProfile Output#

Key Metrics:#

  • ncalls: Number of calls to the function.

  • tottime: Time spent in the function, excluding sub-calls.

  • cumtime: Total time spent, including sub-calls.

  • percall: Time per call (tottime/ncalls or cumtime/ncalls).

  • filename:lineno(function): Location of the function in your code.
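When the report gets long, sorting and truncating it helps. A minimal sketch using the standard pstats module (reusing complex_function from above):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
complex_function()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumtime')  # slowest call paths first
stats.print_stats(5)         # show only the five most expensive entries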

Use this workflow to quickly identify and optimize the slowest parts of your code!