Introduction to CUDA Python with Numba

Notes for the NVIDIA course on CUDA Python with Numba.

Contents of this chapter

  • Compile functions for the CPU and review the inner workings of the Numba compiler.

  • Learn how to GPU-accelerate elementwise NumPy array functions and how to move data efficiently between the CPU host and the GPU device.

Expected outcome of this chapter

  • GPU-accelerate Python code that performs elementwise operations on NumPy arrays.

What is CUDA?

CUDA is NVIDIA's compute platform for application acceleration: it lets developers execute code in a massively parallel fashion on NVIDIA GPUs.

What is Numba?

Numba is a Python function compiler with a simple interface focused on accelerating individual Python functions. It can be used to accelerate Python functions on the CPU as well as on NVIDIA GPUs.

  • Function compiler: Numba compiles Python functions, not entire applications and not parts of functions. It turns Python functions into faster Python functions.

  • Type specializing: Numba speeds up a function by generating a specialized implementation for the specific data types you use. Plain Python functions are designed to operate on generic data types, which makes them flexible but slow.

  • Just-in-time: Numba translates functions when they are called, which ensures the compiler knows what argument types you'll be using. This lets it be used interactively in Jupyter notebooks just as easily as in a traditional application.

  • Numerically focused: Numba focuses on numerical data types like int, float, and complex. There is very limited string processing support.

Compiling code for the CPU:

The Numba compiler is enabled by applying one of its decorators to a Python function. Decorators are functions that transform other functions; Numba's core decorator for CPU compilation is @jit.

from numba import jit
import math

# Wrapping the function with Numba's @jit decorator compiles it for the CPU
@jit
def hypot(x, y):
    x = abs(x)
    y = abs(y)
    t = min(x, y)
    x = max(x, y)
    t = t / x
    return x * math.sqrt(1 + t * t)
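
A quick sanity check (a minimal sketch; the arguments are illustrative) shows the compiled function is called like any other Python function, and the original interpreted version stays available as hypot.py_func:

print(hypot(3.0, 4.0))          # 5.0, compiled by Numba on the first call
print(hypot.py_func(3.0, 4.0))  # 5.0, the original pure-Python implementation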

Trying out Numba with a Monte Carlo pi simulation
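
A minimal sketch of such a simulation (the function name and sample count are illustrative):

import random
from numba import jit

# Estimate pi by sampling random points in the unit square
@jit
def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(nsamples):
        x = random.random()
        y = random.random()
        if x ** 2 + y ** 2 <= 1.0:
            acc += 1
    return 4.0 * acc / nsamples

print(monte_carlo_pi(1_000_000))  # approximately 3.14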

Numba cannot optimize code that works on generic Python objects such as dictionaries. Such code will still run, but it is not compiled to fast machine code; it effectively executes at ordinary Python speed (object mode).
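
A minimal sketch of this limitation, using the forceobj=True option to explicitly request object mode (the function and inputs are illustrative); the code runs, but gets no real speedup:

from numba import jit

# forceobj=True explicitly asks for object mode
@jit(forceobj=True)
def double_values(record):
    result = {}                      # plain Python dicts are generic objects Numba cannot specialize
    for key in record:
        result[key] = record[key] * 2
    return result

print(double_values({"a": 1, "b": 2}))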

Using Numba on the GPU with NumPy universal functions (ufuncs)

Universal functions (ufuncs) are functions such as np.add and the other built-in NumPy functions that operate elementwise on arrays and support broadcasting.
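
For example, a minimal illustration of ufunc broadcasting with plain NumPy, before Numba is involved:

import numpy as np

a = np.array([1, 2, 3, 4])
print(np.add(a, 10))                      # the scalar 10 is broadcast: [11 12 13 14]
print(np.add(a, np.array([[0], [10]])))   # a (2, 1) column broadcasts against a (4,) row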

## Making ufuncs for the GPU

Numba can create compiled ufuncs, something that is normally a not-so-straightforward process involving C code. With Numba you simply implement a scalar function to be performed on all the inputs, decorate it with @vectorize, and Numba will figure out the broadcast rules for you. If you have used NumPy's vectorize, Numba's vectorize decorator will feel very familiar.

To apply a scalar operation across numeric arrays we can use @vectorize, a decorator that compiles the scalar operation and handles broadcasting, targeting the CPU by default.

Numba's vectorize can then be used with broadcasting to add a constant value, multiply arrays, and much more, as in the sketch below.
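
A minimal sketch of a CPU-targeted ufunc (the function name and inputs are illustrative):

import numpy as np
from numba import vectorize

# With no target argument, @vectorize compiles a ufunc for the CPU
@vectorize
def cpu_add(x, y):
    return x + y

a = np.arange(4)
print(cpu_add(a, 10))   # scalar broadcast: [10 11 12 13]
print(cpu_add(a, a))    # elementwise: [0 2 4 6]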

Vectorize on the GPU

from numba import vectorize

@vectorize(['int64(int64, int64)'], target='cuda')  # Type signature and target are required for the GPU
def add_ufunc(x, y):
    return x + y
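
Calling it looks just like calling a NumPy ufunc (a minimal example; the array contents are illustrative):

import numpy as np

x = np.arange(4, dtype=np.int64)
y = np.arange(4, dtype=np.int64) * 10
print(add_ufunc(x, y))   # computed on the GPU: [ 0 11 22 33]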

For such a simple function call, a lot of things just happened! Numba just automatically:

  • Compiled a CUDA kernel to execute the ufunc operation in parallel over all the input elements.

  • Allocated GPU memory for the inputs and the output.

  • Copied the input data to the GPU.

  • Executed the CUDA kernel (GPU function) with the correct kernel dimensions given the input sizes.

  • Copied the result back from the GPU to the CPU.

  • Returned the result as a NumPy array on the host.

Misuse of the GPU

We misused the GPU in several ways in this example. Looking at how will help clarify what kinds of problems are well-suited for GPU computing, and which are best left to be performed on the CPU:

  • Our inputs are too small: the GPU achieves performance through parallelism, operating on thousands of values at once. Our test inputs have only 4 and 16 integers, respectively. We need a much larger array to even keep the GPU busy.

  • Our calculation is too simple: Sending a calculation to the GPU involves quite a bit of overhead compared to calling a function on the CPU. If our calculation does not involve enough math operations (often called "arithmetic intensity"), then the GPU will spend most of its time waiting for data to move around.

  • We copy the data to and from the GPU: While in some scenarios paying the cost of copying data to and from the GPU can be worth it for a single function, it is often preferable to run several GPU operations in sequence. In those cases, it makes sense to send data to the GPU and keep it there until all of our processing is complete.

  • Our data types are larger than necessary: Our example uses int64 when we probably don't need it. Scalar code using 32-bit and 64-bit data types runs at basically the same speed on the CPU, and for integer types the difference on the GPU may not be drastic, but 64-bit floating point data types may have a significant performance cost on the GPU, depending on the GPU type. Basic arithmetic on 64-bit floats can be anywhere from 2x (Pascal-architecture Tesla) to 24x (Maxwell-architecture GeForce) slower than 32-bit floats. If you are using more modern GPUs (Volta, Turing, Ampere), this could be far less of a concern. NumPy defaults to 64-bit data types when creating arrays, so it is important to set the dtype attribute or use the ndarray.astype() method to pick 32-bit types when you need them, as in the sketch after this list.
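
A minimal sketch of explicitly requesting 32-bit types (the array sizes are illustrative):

import numpy as np

# NumPy defaults to 64-bit types, so ask for 32-bit explicitly
a = np.ones(100_000, dtype=np.float32)       # created directly as float32
b = np.arange(100_000).astype(np.float32)    # converted from the default integer dtype
print(a.dtype, b.dtype)                      # float32 float32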

Sending NumPy arrays to the GPU:

import numpy as np
from numba import cuda

# Copy the host array x (defined earlier) to the GPU once, instead of on every call
x_device = cuda.to_device(x)

# Allocate an output array directly on the GPU so the result is not copied back to the host every time
out_device = cuda.device_array(shape=(n,), dtype=np.float32)
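
A minimal sketch of using those device arrays with the ufunc defined above (assuming y_device was also created with cuda.to_device and that the ufunc's signature matches the arrays' dtype):

add_ufunc(x_device, y_device, out=out_device)   # runs on the GPU; the result stays in GPU memory via out=
out_host = out_device.copy_to_host()            # explicit copy back to a NumPy array on the host, only when needed
print(out_host[:10])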
