Introduction to CUDA Python with Numba
Notes for the NVIDIA course "Introduction to CUDA Python with Numba"
Contents of this chapter
Compiling functions for the CPU and understanding the inner workings of the Numba compiler.
Learning how the GPU accelerates elementwise NumPy array functions, and how to move data effectively between the CPU host and the GPU device.
Expected outcome of this chapter
GPU-accelerate Python code that performs elementwise operations on NumPy arrays.
What is CUDA?
CUDA is the compute platform that enables application acceleration by letting developers execute code in a massively parallel fashion on NVIDIA GPUs.
What is Numba?
Numba is a Python function compiler with a simple interface, focused on accelerating Python functions. It can be used to accelerate Python functions on the CPU as well as on NVIDIA GPUs.
Function compiler: Numba compiles Python functions, not entire applications and not parts of functions. It turns Python functions into faster Python functions.
Type specializing: Numba speeds up a function by generating a specialized implementation for the specific data types in use. Python functions are designed to operate on generic data types, which makes them flexible but slow.
Just-in-time: Numba translates functions when they are called, which ensures the compiler knows the argument types you will be using. It can be used interactively in a Jupyter notebook just as easily as in a traditional application.
Numerically focused: Numba focuses on numerical data types such as int, float, and complex. There is very limited support for string processing.
Compiling code for the CPU:
The Numba compiler is enabled by applying a function decorator to a Python function. Decorators are functions that transform other Python functions; Numba's is @jit.
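A minimal sketch of enabling the compiler with @jit; the function body here (a hypotenuse calculation) is illustrative and may differ from the course notebook.

```python
import math
from numba import jit

@jit
def hypot(x, y):
    # Compiled the first time it is called with a given set of argument types.
    x = abs(x)
    y = abs(y)
    t = min(x, y)
    x = max(x, y)
    t = t / x
    return x * math.sqrt(1 + t * t)

hypot(3.0, 4.0)  # first call triggers JIT compilation, returns 5.0
```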
Trying out Numba with a Monte Carlo pi simulation
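A hedged sketch of the Monte Carlo pi estimate accelerated with @jit; the sample count and exact structure are assumptions.

```python
import random
from numba import jit

@jit
def monte_carlo_pi(nsamples):
    # Count random points that fall inside the unit quarter circle.
    acc = 0
    for _ in range(nsamples):
        x = random.random()
        y = random.random()
        if x ** 2 + y ** 2 <= 1.0:
            acc += 1
    return 4.0 * acc / nsamples

monte_carlo_pi(10_000_000)  # compare timing against the pure-Python version with %timeit
```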
Using Numba on the GPU with NumPy universal functions (ufuncs)
To apply a scalar operation elementwise over NumPy arrays we can use @vectorize, a decorator that compiles the scalar operation into a ufunc with broadcasting support on the CPU.
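A minimal sketch of a CPU ufunc built with @vectorize; the scalar function (elementwise addition) is illustrative.

```python
import numpy as np
from numba import vectorize

@vectorize
def add_ufunc(x, y):
    # Scalar operation; Numba turns it into a ufunc applied elementwise.
    return x + y

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
add_ufunc(a, b)  # elementwise add with NumPy-style broadcasting
```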
Vectorize on GPU
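The same kind of scalar function can be compiled for the GPU by supplying explicit type signatures and target='cuda'; this sketch assumes a CUDA-capable GPU is available.

```python
import numpy as np
from numba import vectorize

@vectorize(['int64(int64, int64)'], target='cuda')
def add_ufunc(x, y):
    # Compiled into a CUDA kernel that runs once per element on the GPU.
    return x + y

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
add_ufunc(a, b)  # executes on the device, result returned as a NumPy array
```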
For such a simple function call, a lot of things just happened! Numba just automatically:
Compiled a CUDA kernel to execute the ufunc operation in parallel over all the input elements.
Allocated GPU memory for the inputs and the output.
Copied the input data to the GPU.
Executed the CUDA kernel (GPU function) with the correct kernel dimensions given the input sizes.
Copied the result back from the GPU to the CPU.
Returned the result as a NumPy array on the host.
Misuse of the GPU
We have misused the GPU in several ways in this example. Understanding how will help clarify what kinds of problems are well-suited for GPU computing, and which are best left on the CPU:
Our inputs are too small: the GPU achieves performance through parallelism, operating on thousands of values at once. Our test inputs have only 4 and 16 integers, respectively. We need a much larger array to even keep the GPU busy.
Our calculation is too simple: Sending a calculation to the GPU involves quite a bit of overhead compared to calling a function on the CPU. If our calculation does not involve enough math operations (often called "arithmetic intensity"), then the GPU will spend most of its time waiting for data to move around.
We copy the data to and from the GPU: While in some scenarios paying the cost of copying data to and from the GPU can be worth it for a single function, it is often preferable to run several GPU operations in sequence. In those cases, it makes sense to send data to the GPU and keep it there until all of our processing is complete.
Sending NumPy arrays to the GPU:
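A sketch of moving arrays to the device explicitly with numba.cuda so they stay on the GPU across several calls; the array sizes and ufunc are illustrative.

```python
import numpy as np
from numba import cuda, vectorize

@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
    return x + y

n = 100_000
x = np.arange(n).astype(np.float32)
y = 2 * x

x_device = cuda.to_device(x)   # copy host array to the GPU once
y_device = cuda.to_device(y)
out_device = cuda.device_array(shape=(n,), dtype=np.float32)  # allocate on the GPU, no copy

add_ufunc(x_device, y_device, out=out_device)  # no host/device copies during the call
result = out_device.copy_to_host()             # copy back only when the result is needed
```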
Course link: