PyTorch is an open-source machine learning framework known for its flexibility, ease of use, and performance in modern AI applications. This is enabled in part by its tight integration with Python, the popular high-level programming language favored by machine learning developers and AI researchers.
PyTorch supports the two main AI workloads: training and inference.
During training, PyTorch teaches a neural network by orchestrating a continuous cycle in which input data, stored as tensors, flows through the model to generate predictions. The framework’s defining feature is its automatic differentiation engine, autograd, which calculates exactly how the model’s parameters need to change to reduce errors (backpropagation). This process involves massive computational workloads that are parallelized across NVIDIA GPUs, allowing the model to iterate and improve its accuracy at scale.
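For illustration, a single pass through that cycle might look like the following sketch, where the tiny linear model, random batch, and learning rate are placeholders standing in for a real workload:

```python
import torch
from torch import nn

# Placeholder setup: a tiny linear model, random data, and a standard optimizer.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

inputs = torch.randn(32, 10, device=device)   # a batch of input tensors
targets = torch.randn(32, 1, device=device)   # the values the model should predict

# One iteration of the training cycle described above.
predictions = model(inputs)               # forward pass
loss = loss_fn(predictions, targets)      # measure the error
optimizer.zero_grad()                     # clear gradients from the previous step
loss.backward()                           # autograd computes gradients (backpropagation)
optimizer.step()                          # update parameters to reduce the error
```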
Once the model is trained, it enters the AI inference phase, where it applies its learned patterns to process new, real-world data. In this stage, PyTorch shifts focus from learning to execution, allowing the model to generate content or make predictions. This streamlined execution is critical for deployment, enabling developers to take complex generative AI models and run them efficiently on accelerated computing hardware to deliver real-time, low-latency responses.
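A minimal sketch of that shift, assuming a model has already been trained (the linear model here is just a stand-in): wrapping the forward pass in torch.inference_mode() turns off gradient tracking so the model simply executes.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 1).to(device)   # stands in for an already-trained model
model.eval()                          # switch layers such as dropout to inference behavior

new_data = torch.randn(1, 10, device=device)  # placeholder for real-world input

with torch.inference_mode():          # no gradients are tracked, reducing overhead
    prediction = model(new_data)
```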
Tensors are the fundamental building block of PyTorch. Similar to multidimensional arrays or NumPy’s ndarrays, tensors store and manipulate model inputs, outputs, and parameters. Crucially, PyTorch tensors are designed to run on NVIDIA GPUs, enabling massive parallel computation to accelerate training and inference.
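For example, tensors can be created from Python lists or NumPy arrays and, when an NVIDIA GPU is available, moved to the device with the same API (the shapes below are arbitrary):

```python
import numpy as np
import torch

# Tensors can be created from Python lists or NumPy arrays.
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.from_numpy(np.ones((2, 2), dtype=np.float32))

c = a @ b                 # matrix multiplication, just like ndarray math
print(c.shape)            # torch.Size([2, 2])

# The same tensor API runs on an NVIDIA GPU when one is available.
if torch.cuda.is_available():
    c = c.to("cuda")
    print(c.device)       # cuda:0
```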
Neural networks transform input data through nested functions defined by parameters (weights and biases). Training optimizes these parameters by computing gradients (partial derivatives of the loss with respect to each parameter) via backpropagation.
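As a toy example of that process, autograd can compute the partial derivatives of a simple parameterized function; the scalar values below are arbitrary and stand in for a full network:

```python
import torch

# Parameters that require gradients (analogous to a weight and a bias).
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

x = torch.tensor(3.0)        # input
y = w * x + b                # nested function: y = w*x + b
loss = (y - 10.0) ** 2       # squared error against a target of 10

loss.backward()              # backpropagation populates .grad
print(w.grad)                # d(loss)/dw = 2*(y - 10)*x = -18
print(b.grad)                # d(loss)/db = 2*(y - 10)   = -6
```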
Historically, PyTorch used dynamic computation graphs (eager mode), building the graph on the fly as code executed. This “define-by-run” approach made debugging easy and handled dynamic model structures naturally.
With the release of PyTorch 2.0, the framework introduced torch.compile. This lets users capture the model’s computation graph as a static structure when needed and optimize it for performance on NVIDIA GPUs without changing the underlying model code. It brings the best of both worlds: the flexibility of dynamic graphs for development and the speed of static graphs for production.
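Using it can be as simple as the following sketch; the small sequential model is a placeholder, and real models are wrapped the same way:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# torch.compile captures and optimizes the model; the call site stays the same.
compiled_model = torch.compile(model)

x = torch.randn(64, 128)
out = compiled_model(x)   # first call triggers compilation; later calls reuse it
```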
PyTorch has evolved from a research-focused framework into the industry standard for generative AI and production deployment. Originally developed by Meta AI, it is now governed by the independent PyTorch Foundation (part of the Linux Foundation), with NVIDIA as a founding premier member.
(Image source: https://pytorch.org/features/)
Flexible and Fast: PyTorch is built on an intuitive Python frontend that focuses on readability and rapid iteration. However, modern PyTorch is also built for speed. With features like TorchScript and TorchDynamo, developers can seamlessly transition from eager experimentation to high-performance production deployment.
The Generative AI Standard: PyTorch is the native language of the generative AI revolution. It is the framework of choice for building large language models (LLMs) and diffusion models due to its support for distributed training across thousands of GPUs using FSDP (Fully Sharded Data Parallel); a minimal FSDP sketch follows below.
Developer Ecosystem: The PyTorch API has remained consistent and user-friendly, making it accessible for beginners while offering deep control for experts. Its massive ecosystem includes libraries for computer vision, natural language processing (NLP), and reinforcement learning, ensuring that if a new AI technique exists, there is likely already a PyTorch implementation for it.
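As a rough sketch of the FSDP workflow referenced above, assuming the script is launched with torchrun and one GPU is visible per process (the nn.Transformer model here is only a stand-in for a much larger LLM):

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via torchrun, which sets the rank/world-size environment variables.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Transformer(d_model=512, nhead=8).cuda()  # stand-in for a real LLM

# FSDP shards parameters, gradients, and optimizer state across processes,
# so each GPU holds only a slice of the full model state.
sharded_model = FSDP(model)
optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-4)
```

Each process then runs an ordinary training loop; FSDP gathers the parameter shards it needs around each forward and backward pass and reshards them afterward to keep per-GPU memory low.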
The PyTorch framework is the engine behind the world’s most advanced AI, spanning computer vision, reinforcement learning, and, most notably, the boom in generative AI.
From real-time translation to intelligent coding assistants, PyTorch powers modern NLP. While earlier approaches relied on recurrent neural networks (RNNs), the field has shifted to Transformer architectures. PyTorch provides the essential building blocks for training Transformers, enabling the creation of massive foundation models like gpt-oss and NVIDIA Nemotron™. These models understand context, generate text, and reason across complex domains.
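For example, Transformer building blocks ship directly in torch.nn; the layer sizes below are arbitrary, chosen only to show the stacking pattern foundation models are built from:

```python
import torch
from torch import nn

# A single Transformer encoder layer, the kind of block foundation models stack.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

tokens = torch.randn(2, 128, 512)   # (batch, sequence length, embedding dimension)
contextualized = encoder(tokens)
print(contextualized.shape)         # torch.Size([2, 128, 512])
```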
PyTorch remains the standard for foundational computer vision tasks like object detection and segmentation, but it is also the engine behind the rise of multimodal AI (AI that processes and understands multiple data types such as text, images, audio, and video). By enabling models to process text, images, and audio simultaneously, PyTorch powers the latest visual language models (VLMs) and diffusion models. This allows developers to build systems that can generate photorealistic images from text descriptions or reason about visual data in real time, leveraging NVIDIA GPUs to handle the immense computational requirements of these mixed-modality workloads.
State-of-the-art AI models span billions to trillions of parameters, exposing massive parallelism across dense linear algebra, attention, and communication primitives. Realizing this parallelism efficiently requires extreme codesign across the framework, compiler, kernels, and hardware. PyTorch maps high-level model graphs onto fused GPU kernels via its dispatcher, autograd engine, and compiler stack, while GPUs provide the execution model, memory hierarchy, and interconnects these kernels are designed against. This tight PyTorch–GPU co-evolution is what enables scalable, high-throughput training and inference for modern AI models.
NVIDIA GPUs use thousands of cores to handle massive parallel workloads simultaneously. Large language models, which consist of billions or trillions of parameters, map naturally to this architecture, yielding dramatically faster training and inference than CPU-only execution.
PyTorch features best-in-class support for NVIDIA hardware, providing native CUDA® support (via torch.cuda) that lets developers move tensors to the GPU with a single line of code.
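A brief sketch of that workflow using torch.cuda (the tensor sizes are arbitrary):

```python
import torch

if torch.cuda.is_available():                 # query the CUDA runtime
    print(torch.cuda.get_device_name(0))      # name of the installed NVIDIA GPU
    x = torch.randn(1024, 1024).to("cuda")    # the single line that moves a tensor to the GPU
    y = x @ x                                 # this matmul now executes on the GPU
    print(y.device)                           # cuda:0
```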
Optimized Ecosystem:
While PyTorch serves as the cornerstone of the modern open source software (OSS) deep learning ecosystem, unlocking its full potential for scalable scientific computing requires a symbiotic relationship with optimized hardware acceleration. The following diagram maps out this critical synergy, illustrating how NVIDIA’s comprehensive stack acts as the engine beneath PyTorch’s flexible interface.
These layers of the stack are essential for high-performance AI: starting from the foundational mathematical primitives of CUDA-X™ (such as cuBLAS and cuDNN), moving up through specialized NVIDIA Deep Learning Frameworks like TensorRT, NeMo, and the Transformer Engine, and finally resting on a foundation of accelerated infrastructure that spans from data center GPUs to cloud environments.