Custom CUDA kernels in Python with Numba
Contains notes on Part 2 of the course Introduction to CUDA Programming with Numba.
Content covered:
Launching arbitrary functions, not just elementwise numerically focused ones, in parallel on the GPU by writing custom CUDA kernels.
How an NVIDIA GPU executes code in parallel.
Fundamental parallel programming techniques: how to coordinate the work of parallel threads and address race conditions.
Outcomes:
Writing custom CUDA kernels in Python and launching them with an execution configuration.
Utilizing grid stride loops to work in parallel over large datasets and leverage memory coalescing.
Using atomic operations to avoid race conditions when working in parallel.
Elementwise approaches can GPU-accelerate an incredible range of numerically focused functions on 1D datasets, but custom kernels can go further.
Custom CUDA kernels
These kernels can run in parallel on CUDA GPUs. They are more challenging to write, but in exchange they offer tremendous flexibility in the types of functions that can be sent to run in parallel on the GPU.
Introduction to CUDA kernels
When programming in CUDA, developers write functions for the GPU called kernels, which are executed in parallel threads. Kernels are launched with a special syntax called the execution configuration, which tells the GPU how many blocks of threads, and how many threads per block, to launch.
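As a minimal sketch of this (the kernel and array names here are illustrative, not from the course):

```python
from numba import cuda
import numpy as np

@cuda.jit
def add_kernel(x, y, out):
    # Each thread computes its own unique index within the grid.
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if idx < x.size:               # guard: the grid may be larger than the data
        out[idx] = x[idx] + y[idx]

n = 4096
x = np.arange(n, dtype=np.float32)
y = 2 * x
out = np.zeros_like(x)

threads_per_block = 128
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block

# The bracketed [blocks, threads] part is the execution configuration.
add_kernel[blocks_per_grid, threads_per_block](x, y, out)
```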
Note: jit and cuda.jit are different. While both are offered by Numba, @jit optimizes code for CPU-only workflows, whereas @cuda.jit compiles functions to run on the GPU.
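A quick sketch of the contrast (both function names are made up for this example):

```python
from numba import jit, cuda

@jit(nopython=True)           # compiles an optimized function for the CPU
def square_cpu(x):
    return x * x

@cuda.jit                     # compiles a kernel that runs on the GPU
def square_gpu(x, out):
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] * x[i]  # kernels cannot return values; they write to output arrays
```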
Hiding latency and execution config choices
CUDA-enabled NVIDIA GPUs consist of streaming multiprocessors, or SMs, on a die, with attached DRAM. SMs contain all the resources required for the execution of kernel code, including many CUDA cores. When a kernel is launched, each block is assigned to a single SM, and potentially many blocks are assigned to each SM.
Therefore, to utilize the full potential of the GPU and write performant accelerated applications, it is essential to give SMs the ability to hide latency by providing them with a sufficient number of warps. This is accomplished most simply by executing kernels with sufficiently large grid and block dimensions.
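A rough sketch of how an execution configuration might be sized for latency hiding; the 32-blocks-per-SM figure is an assumed heuristic, not a rule from the course:

```python
from numba import cuda

device = cuda.get_current_device()
sm_count = device.MULTIPROCESSOR_COUNT   # number of SMs on this GPU

threads_per_block = 256                  # a multiple of the 32-thread warp size
blocks_per_grid = sm_count * 32          # many blocks per SM -> many resident warps to switch between
```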
What is cuda.grid()?
Calling cuda.grid(1) gives a thread's unique index within the entire one-dimensional grid as a single value. It is equivalent to computing cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x.
Launching one thread per data element works well for small datasets, but with large datasets the number of required threads can become a problem, and a poorly sized grid can leave the GPU under-utilized. Block sizes should also be defined as a multiple of 32 (the warp size), with typical block sizes between 128 and 512 threads per block.
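A small sketch of cuda.grid(1) replacing the manual index arithmetic (the kernel name is illustrative):

```python
from numba import cuda
import numpy as np

@cuda.jit
def scale_kernel(x, out):
    # cuda.grid(1) is shorthand for
    # cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    idx = cuda.grid(1)
    if idx < x.size:
        out[idx] = 2 * x[idx]

x = np.arange(10_000, dtype=np.float32)
out = np.zeros_like(x)

threads_per_block = 256   # a multiple of 32, within the typical 128-512 range
blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block
scale_kernel[blocks_per_grid, threads_per_block](x, out)
```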
What is a grid stride loop and when to use it
A grid stride loop is used when there are more data elements than there are threads in the grid; without it, some elements would never be visited by a thread and work would be left undone. In a grid stride loop, each thread's first element index is calculated as usual with cuda.grid(1), and the thread then strides forward by the total number of threads in the grid. Numba provides another function for this calculation, cuda.gridsize(1), which returns the number of threads in the grid. Each thread continues looping until its data index is greater than or equal to the number of data elements.
Working on large datasets with grid stride loops
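A minimal grid stride loop sketch, assuming a 1D array much larger than the grid (names are illustrative):

```python
from numba import cuda
import numpy as np

@cuda.jit
def double_kernel(x, out):
    start = cuda.grid(1)         # this thread's first data index
    stride = cuda.gridsize(1)    # total number of threads in the grid
    # Each thread handles elements start, start + stride, start + 2*stride, ...
    for i in range(start, x.size, stride):
        out[i] = 2 * x[i]

n = 1_000_000
x = np.arange(n, dtype=np.float32)
out = np.zeros_like(x)

# The grid (256 * 256 = 65,536 threads) is much smaller than the dataset.
double_kernel[256, 256](x, out)
```

Because consecutive threads access consecutive elements on each pass of the loop, this pattern also preserves memory coalescing.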
Summary
Write custom CUDA kernels in Python and launch them with an execution configuration.
Utilize grid stride loops for working in parallel over large data sets and leveraging memory coalescing.
Use atomic operations to avoid race conditions when working in parallel (see the sketch below).
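As a hedged sketch of that last point, using Numba's cuda.atomic.add; the kernel and counter names are illustrative:

```python
from numba import cuda
import numpy as np

@cuda.jit
def count_positive(x, counter):
    i = cuda.grid(1)
    if i < x.size and x[i] > 0:
        # Atomic read-modify-write: safe even when many threads
        # update counter[0] at the same time.
        cuda.atomic.add(counter, 0, 1)

x = np.random.randn(10_000).astype(np.float32)
counter = np.zeros(1, dtype=np.int32)
count_positive[40, 256](x, counter)   # 40 * 256 = 10,240 threads >= 10,000 elements
```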