Join us live for the NVIDIA and Colfax joint webinar series kickoff to demystify CUTLASS and learn how to go from fundamentals to optimized implementation, whether you’re a beginner or looking to sharpen your skills.
This webinar is a deep dive into the fundamentals of sound GEMM kernel design on NVIDIA Hopper with CUTLASS, using the CUTLASS FP8 GEMM kernel with blockwise scaling as a case study. We first cover the techniques of warp specialization and software pipelining for overlapping copy (TMA) with compute (WGMMA), and show how to use CUTLASS pipeline objects to ensure correct synchronization between the two. We then go over the CUTLASS tile scheduler abstraction for assigning work to Streaming Multiprocessors. Next, we show how to integrate blockwise and groupwise scaling into the warp-specialized design pattern. Finally, we explain the technique of periodic NVIDIA® CUDA® core accumulation, which adds precision to FP8 GEMM.
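To give a feel for two of the ideas above, here is a small, hedged numerical sketch (plain Python, not CUTLASS code, and not the webinar's actual kernel): each block of the inputs gets its own quantization scale (the "blockwise scaling" idea), and low-precision partial sums are periodically flushed into a full-precision accumulator, mimicking the periodic CUDA core accumulation trick used with FP8 tensor core GEMM. The function names, the block size of 4, and the 8-bit-style integer range are illustrative choices, not CUTLASS APIs.

```python
def quantize_block(block, levels=256):
    """Scale a block so its largest magnitude maps to the top of a small
    integer range (a stand-in for an FP8 format); return the integer codes
    and the per-block scale needed to dequantize."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / (levels / 2 - 1)          # e.g. 127 positive codes
    codes = [round(x / scale) for x in block]
    return codes, scale

def blockwise_scaled_dot(a, b, block=4, promote_every=2):
    """Dot product where each `block`-sized chunk of a and b is quantized
    with its own scale. Low-precision partial sums are promoted into a
    full-precision accumulator every `promote_every` blocks, limiting how
    much rounding error can accumulate in the narrow format."""
    assert len(a) == len(b)
    full_acc = 0.0    # high-precision accumulator ("CUDA core" side)
    partial = 0.0     # running block-local sum ("tensor core" side)
    blocks_done = 0
    for i in range(0, len(a), block):
        codes_a, scale_a = quantize_block(a[i:i + block])
        codes_b, scale_b = quantize_block(b[i:i + block])
        # Integer dot product of the codes, rescaled by both block scales.
        partial += scale_a * scale_b * sum(
            x * y for x, y in zip(codes_a, codes_b))
        blocks_done += 1
        if blocks_done % promote_every == 0:
            full_acc += partial              # periodic promotion
            partial = 0.0
    return full_acc + partial
```

On real hardware the promotion step moves WGMMA accumulator registers into FP32 registers on the CUDA cores; here the two floats simply stand in for the two precision domains.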
5 Minutes: Introduction
40 Minutes: Topic Presentation
10 Minutes: Question and Answer Session (with NVIDIA moderator)
5 Minutes: Wrap Up, Final Thoughts
This webinar is planned as the first in an ongoing series dedicated to teaching CUTLASS. By bringing experts and users together, this webinar series will empower CUTLASS developers to: