This workshop teaches you the fundamental tools and techniques for running GPU-accelerated Python applications using CUDA® GPUs and the Numba compiler. You’ll work through dozens of hands-on coding exercises, and at the end of the training, you’ll implement a new workflow to accelerate a fully functional linear algebra program originally designed for CPUs, observing impressive performance gains. After the workshop ends, you’ll have additional resources to help you create new GPU-accelerated applications on your own.


Learning Objectives

By participating in this workshop, you’ll:
  • Create GPU-accelerated NumPy ufuncs with a few lines of code
  • Configure code parallelization using the CUDA thread hierarchy
  • Write custom CUDA device kernels for maximum performance and flexibility
  • Use memory coalescing and on-device shared memory to increase CUDA kernel bandwidth
  • Generate random numbers on the GPU
  • Learn intermediate GPU memory management techniques


Workshop Outline

Introduction
(15 mins)
  • Meet the instructor.
  • Create an account at
Introduction to CUDA Python with Numba
(120 mins)
  • Begin working with the Numba compiler and CUDA programming in Python.
  • Use Numba decorators to GPU-accelerate numerical Python functions.
  • Optimize host-to-device and device-to-host memory transfers.
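As a rough illustration of the topics in this module, the sketch below compiles a NumPy-style ufunc for the GPU with Numba's vectorize decorator and keeps intermediate data on the device so that only one result copy returns to the host. The names add_arrays, add_ufunc, and GPU_OK are hypothetical and not taken from the workshop materials. The NumPy fallback is an assumption added so the sketch still runs on a machine without Numba or a CUDA GPU.

```python
import numpy as np

try:
    from numba import cuda, vectorize
    GPU_OK = cuda.is_available()
except Exception:  # Numba not installed or CUDA driver unusable
    GPU_OK = False


def add_arrays(x, y):
    """Element-wise x + y, on the GPU when one is usable, else with NumPy."""
    if not GPU_OK:
        return x + y  # CPU fallback (assumption, for illustration only)

    # Compile a scalar function into a ufunc that runs on the GPU.
    @vectorize(["float32(float32, float32)"], target="cuda")
    def add_ufunc(a, b):
        return a + b

    x_dev = cuda.to_device(x)  # explicit host-to-device copy
    y_dev = cuda.to_device(y)
    # Allocate the output on the GPU so the result is not copied back implicitly.
    out_dev = cuda.device_array(shape=x.shape, dtype=x.dtype)
    add_ufunc(x_dev, y_dev, out=out_dev)  # inputs and output stay on the device
    return out_dev.copy_to_host()  # single device-to-host copy


x = np.arange(8, dtype=np.float32)
y = np.full(8, 2.0, dtype=np.float32)
result = add_arrays(x, y)
```

Passing device arrays and an explicit out argument is one way to avoid repeated transfers when several GPU calls are chained; whether it helps in practice depends on the array sizes involved.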
Break (60 mins)
Custom CUDA Kernels in Python with Numba
(120 mins)
  • Learn CUDA’s parallel thread hierarchy and how it extends the possibilities for parallel programs.
  • Launch massively parallel custom CUDA kernels on the GPU.
  • Utilize CUDA atomic operations to avoid race conditions during parallel execution.
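The ideas in this module can be sketched with a small histogram kernel: each thread derives its global index from the thread hierarchy, a bounds check idles surplus threads, and an atomic add prevents lost updates when many threads increment the same bin. The names histogram_kernel, histogram, blocks_needed, and THREADS_PER_BLOCK are hypothetical, and the NumPy fallback is an assumption added so the sketch runs without a GPU.

```python
import numpy as np

try:
    from numba import cuda
    GPU_OK = cuda.is_available()
except Exception:  # Numba not installed or CUDA driver unusable
    GPU_OK = False

THREADS_PER_BLOCK = 128  # illustrative choice, a multiple of the 32-thread warp


def blocks_needed(n, threads_per_block=THREADS_PER_BLOCK):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


if GPU_OK:
    @cuda.jit
    def histogram_kernel(values, counts):
        # Global thread index: blockIdx.x * blockDim.x + threadIdx.x
        i = cuda.grid(1)
        if i < values.size:  # surplus threads in the last block do nothing
            # Atomic read-modify-write: concurrent increments are not lost.
            cuda.atomic.add(counts, values[i], 1)


def histogram(values, n_bins):
    """Count occurrences of each integer in [0, n_bins)."""
    if not GPU_OK:
        # CPU fallback (assumption, for illustration only)
        return np.bincount(values, minlength=n_bins).astype(np.int32)
    values_dev = cuda.to_device(values)
    counts_dev = cuda.to_device(np.zeros(n_bins, dtype=np.int32))
    # Launch configuration: [blocks per grid, threads per block]
    histogram_kernel[blocks_needed(values.size), THREADS_PER_BLOCK](
        values_dev, counts_dev
    )
    return counts_dev.copy_to_host()


values = np.arange(1000, dtype=np.int32) % 4
counts = histogram(values, 4)
```

With 1000 elements and 128 threads per block, 8 blocks launch 1024 threads, so the last 24 threads fail the bounds check and exit without touching memory.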
Break (15 mins)
RNG, Multidimensional Grids, and Shared Memory for CUDA Python with Numba
(120 mins)
  • Use xoroshiro128+ random number generation (RNG) to support GPU-accelerated Monte Carlo methods.
  • Learn multidimensional grid creation and how to work in parallel on 2D matrices.
  • Leverage on-device shared memory to promote memory coalescing while reshaping 2D matrices.
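A minimal sketch of the last two topics, assuming a row-major float32 matrix: a 2D grid assigns one thread per matrix element, and a shared-memory tile lets each block read rows of the input and write rows of the output, so both global-memory accesses are contiguous. The names transpose_kernel, transpose, and TILE are hypothetical, this is not the workshop's own exercise code, and the NumPy fallback is an assumption added so the sketch runs without a GPU.

```python
import numpy as np

try:
    from numba import cuda, float32
    GPU_OK = cuda.is_available()
except Exception:  # Numba not installed or CUDA driver unusable
    GPU_OK = False

TILE = 32  # tile edge; a 32 x 32 block uses 1024 threads

if GPU_OK:
    @cuda.jit
    def transpose_kernel(a, out):
        # Per-block scratch space in on-chip shared memory.
        tile = cuda.shared.array((TILE, TILE), dtype=float32)
        tx = cuda.threadIdx.x
        ty = cuda.threadIdx.y

        # Read phase: consecutive tx values read consecutive columns of a.
        col = cuda.blockIdx.x * TILE + tx
        row = cuda.blockIdx.y * TILE + ty
        if row < a.shape[0] and col < a.shape[1]:
            tile[ty, tx] = a[row, col]

        cuda.syncthreads()  # the whole tile must be loaded before any reads

        # Write phase: block indices are swapped, and consecutive tx values
        # write consecutive columns of out.
        out_col = cuda.blockIdx.y * TILE + tx
        out_row = cuda.blockIdx.x * TILE + ty
        if out_row < out.shape[0] and out_col < out.shape[1]:
            out[out_row, out_col] = tile[tx, ty]


def transpose(a):
    """Return the transpose of a 2D float32 array."""
    if not GPU_OK:
        # CPU fallback (assumption, for illustration only)
        return np.ascontiguousarray(a.T)
    rows, cols = a.shape
    a_dev = cuda.to_device(np.ascontiguousarray(a))
    out_dev = cuda.device_array((cols, rows), dtype=np.float32)
    # 2D grid: x spans the columns of a, y spans its rows.
    blocks = ((cols + TILE - 1) // TILE, (rows + TILE - 1) // TILE)
    transpose_kernel[blocks, (TILE, TILE)](a_dev, out_dev)
    return out_dev.copy_to_host()


a = np.arange(12, dtype=np.float32).reshape(3, 4)
a_t = transpose(a)
```

Without the tile, either the read or the write would stride through global memory one row apart per thread; staging through shared memory moves that strided access on chip.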
Final Review
(15 mins)
  • Review key learnings and answer any remaining questions.
  • Complete the assessment to earn a certificate.
  • Take the workshop survey.

Workshop Details

Duration: 8 hours

Price: $500 for public workshops; contact us for enterprise workshops.


Prerequisites:
  • Basic Python competency, including familiarity with variable types, loops, conditional statements, functions, and array manipulations
  • NumPy competency, including the use of ndarrays and ufuncs
  • No previous knowledge of CUDA programming is required

Technologies: Numba, NumPy

Certificate: Upon successful completion of the assessment, participants will receive an NVIDIA DLI certificate to recognize their subject matter competency and support professional career growth.

Hardware Requirements: Desktop or laptop computer capable of running the latest version of Chrome or Firefox. Each participant will be provided with dedicated access to a fully configured, GPU-accelerated server in the cloud.

Languages: English, Simplified Chinese, Traditional Chinese

Upcoming Workshops

Upcoming Public Workshops

North America / Latin America

Wednesday, July 14, 2021
9:00 a.m.–5:00 p.m. PDT

Europe / Middle East / Africa

Thursday, August 26, 2021
9:00 a.m.–5:00 p.m. CEST

If your organization is interested in boosting and developing key skills in AI, accelerated data science, or accelerated computing, you can request instructor-led training from the NVIDIA DLI.