Accelerating Data Processing with Grid Stride Loops in CUDA
As the demand for processing large datasets increases, achieving high performance becomes critical. GPUs excel at parallel computation, and CUDA provides developers with the tools to leverage this power. One essential technique for efficiently working with large datasets in CUDA is the grid stride loop.
What Are Grid Stride Loops?
Grid stride loops are a design pattern that extends the functionality of CUDA kernels to process large datasets efficiently. In contrast to simple kernels where each thread processes only one element, grid stride loops enable threads to iterate over multiple elements in a dataset. This allows for better utilization of the GPU's parallel processing capabilities while simplifying the handling of datasets that exceed the thread count.
How Grid Stride Loops Work
In CUDA, threads are grouped into blocks, which in turn form a grid. Each thread in the grid has a unique index (idx), which determines the portion of the dataset it processes. When the dataset size exceeds the total number of threads in the grid, a grid stride loop lets each thread cover more than one element.
A grid stride loop ensures that each thread processes elements at regular intervals, defined by the grid stride:
- Thread Index: Each thread starts with an index (idx = cuda.grid(1)).
- Grid Stride: The stride is the total number of threads in the grid (stride = cuda.gridsize(1)).
- Looping: Threads iterate over the dataset, processing every stride-th element starting from their own index.
Here's a simple example of a grid stride loop in a CUDA kernel:
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    # Start at this thread's global index and step by the total thread count
    idx = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(idx, x.size, stride):
        out[i] = x[i] + y[i]
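To show how this kernel might be launched, here is a minimal sketch (the array size and the 32 x 128 launch configuration are illustrative choices, not part of the original example). Because of the grid stride loop, the kernel remains correct even when the grid contains far fewer threads than there are elements:

import numpy as np

n = 100_000                          # illustrative array size
x = np.arange(n, dtype=np.float32)
y = 2 * x
out = np.zeros_like(x)

# A fixed grid of 32 blocks x 128 threads = 4,096 threads in total;
# each thread therefore loops over roughly n / 4,096 elements.
add_kernel[32, 128](x, y, out)       # Numba copies the NumPy arrays to and from the device

assert np.allclose(out, x + y)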
Benefits of Grid Stride Loops
- Flexibility: Grid stride loops adapt to any dataset size without requiring specific grid or block configurations.
- Memory Coalescing: On each loop iteration, adjacent threads access adjacent elements, which keeps global memory accesses coalesced (see the sketch after this list).
- Scalability: They allow kernels to utilize all available GPU resources effectively, even for very large datasets.
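To make the coalescing point concrete, here is a small sketch, assuming two hypothetical copy kernels (copy_stride and copy_chunked are illustrative names, not from the article), that contrasts the grid stride access pattern with a per-thread chunk partition:

from numba import cuda

@cuda.jit
def copy_stride(src, dst):
    # On iteration k, thread t reads src[k * gridsize + t]: threads in a warp
    # touch consecutive addresses, so accesses coalesce into few transactions.
    idx = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(idx, src.size, stride):
        dst[i] = src[i]

@cuda.jit
def copy_chunked(src, dst, chunk):
    # Thread t reads src[t * chunk], src[t * chunk + 1], ...: threads in a warp
    # touch addresses that are chunk elements apart, so accesses do not coalesce.
    t = cuda.grid(1)
    start = t * chunk
    end = min(start + chunk, src.size)
    for i in range(start, end):
        dst[i] = src[i]

Both kernels copy the same data; the difference is purely in which addresses neighboring threads touch on a given iteration.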
A Practical Example: Hypotenuse Calculation
Consider calculating the hypotenuse for pairs of numbers stored in arrays. Using a grid stride loop, the kernel can process arrays of arbitrary size:
from numba import cuda
from math import hypot
import numpy as np

@cuda.jit
def hypot_stride(a, b, c):
    # Each thread computes the hypotenuse for every stride-th pair of elements
    idx = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(idx, a.size, stride):
        c[i] = hypot(a[i], b[i])
# Initialize data
n = 1000000
a = np.random.uniform(-10, 10, n).astype(np.float32)
b = np.random.uniform(-10, 10, n).astype(np.float32)
c = np.zeros_like(a)
# Transfer to GPU
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.device_array_like(c)
# Kernel launch: this grid covers every element one-to-one, but thanks to the
# grid stride loop the kernel would remain correct with a smaller grid as well
threads_per_block = 128
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
hypot_stride[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
# Retrieve results
result = d_c.copy_to_host()
This approach ensures that all elements in the arrays are processed efficiently, regardless of their size.
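As a quick sanity check (not part of the original walkthrough), the result pulled back from the device can be compared against NumPy's own hypot on the host:

np.testing.assert_allclose(result, np.hypot(a, b), rtol=1e-6)
print(result[:5])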
Conclusion
Grid stride loops are a cornerstone of efficient CUDA programming, enabling developers to handle datasets that exceed the capacity of a single grid. By combining grid stride loops with techniques like memory coalescing and atomic operations, you can harness the full power of the GPU for high-performance data processing.
Whether you're working on numerical simulations, image processing, or scientific computing, grid stride loops provide a scalable and elegant solution to parallelize your computations on the GPU.