The standard for HPC and AI orchestration.
Slurm is an open source workload manager built to efficiently manage nearly any workload and deliver proven throughput at massive scale. It uses a hierarchical structure consisting of a controller, nodes, and partitions to allocate jobs based on policies and resources, optimizing workload distribution, maximizing cluster utilization, and ensuring efficient job execution. Developed and maintained by engineers at SchedMD (now part of NVIDIA) with deep high-performance computing (HPC) and AI expertise, Slurm is the scheduler of choice for over half of the top 100 systems in the TOP500.
Slurm is the market-leading open source workload manager for HPC and AI, trusted by many of the world’s largest supercomputing and AI environments.
Slurm allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. It then provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, Slurm arbitrates conflicting requests for resources by managing a queue of pending work.
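As a rough illustration of what such a request looks like in practice, the sketch below renders a minimal batch script that asks for a number of nodes, a time limit, and optionally exclusive node access. The helper name `render_batch_script` and the job values are hypothetical; the `#SBATCH` directives shown (`--job-name`, `--nodes`, `--time`, `--exclusive`) are standard `sbatch` options.

```python
def render_batch_script(job_name, nodes, time_limit, command, exclusive=False):
    """Render a minimal Slurm batch script as a string.

    Hypothetical helper for illustration only; it formats text and
    does not talk to a Slurm controller.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",       # how many compute nodes to allocate
        f"#SBATCH --time={time_limit}",   # wall-clock limit, HH:MM:SS
    ]
    if exclusive:
        # Ask for whole nodes that are not shared with other jobs.
        lines.append("#SBATCH --exclusive")
    lines.append("")
    # srun launches the command on the allocated nodes.
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


script = render_batch_script("demo", 2, "00:30:00", "hostname", exclusive=True)
print(script)
```

A script like this would typically be saved to a file and submitted with `sbatch`; while it waits for resources, it appears as pending work in the output of `squeue`.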
The workload manager for the world’s top supercomputers.
Slurm is fully open source and hardware agnostic, providing complete transparency and flexibility for resource management and job scheduling. Deploy Slurm, contribute to its growth, and seamlessly integrate it into your infrastructure stack.
Check it out on GitHub and join the community!
At its core, Slurm allocates resources, manages pending work, and executes jobs, but it is the details of its architecture that make it the leading workload manager for HPC and AI.
Find out how you can manage compute resources using the open source workload manager trusted by research labs and frontier AI leaders.
Managing hundreds of thousands of cores, millions of jobs, and diverse hardware simultaneously requires more than basic scheduling. Slurm handles extreme concurrency with hierarchical job queues, topology-aware routing, and intelligent job packing that maximizes throughput. Built-in power management, policy enforcement, and detailed reporting keep massive deployments running efficiently and accountably at any scale.
When training large AI models or running multi-physics simulations, job placement matters as much as raw compute. Slurm's topology-aware scheduling plans for multi-node workloads on multi-layered interconnects by assigning jobs to nodes that are physically closest in the network fabric, increasing performance by reducing the communication overhead. Combined with GPU-aware and policy-driven resource allocation, teams can run distributed workloads predictably without waiting on lower-priority or poorly placed jobs.
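The idea of placing a job on nodes that sit close together in the fabric can be sketched with a toy model. This is not Slurm's actual algorithm; it is a simplified illustration that assumes a single layer of leaf switches, and the function name `pick_nodes` and the switch and node names are hypothetical. It prefers the smallest switch that can hold the whole job, and only spans switches when no single switch has enough free nodes.

```python
def pick_nodes(free_nodes_by_switch, nodes_needed):
    """Choose nodes for a job, preferring a single leaf switch.

    free_nodes_by_switch maps a switch name to its list of free nodes.
    Returns a list of node names, or None if the job cannot fit.
    """
    # Switches that can host the whole job on their own.
    fitting = [
        (len(nodes), switch)
        for switch, nodes in free_nodes_by_switch.items()
        if len(nodes) >= nodes_needed
    ]
    if fitting:
        # Tightest fit: leave larger switches free for larger jobs.
        _, switch = min(fitting)
        return sorted(free_nodes_by_switch[switch])[:nodes_needed]

    # Otherwise span switches, taking the emptiest ones first so the
    # job touches as few switches as possible.
    chosen = []
    by_free_count = sorted(
        free_nodes_by_switch,
        key=lambda s: (-len(free_nodes_by_switch[s]), s),
    )
    for switch in by_free_count:
        for node in sorted(free_nodes_by_switch[switch]):
            chosen.append(node)
            if len(chosen) == nodes_needed:
                return chosen
    return None  # not enough free nodes anywhere


free = {
    "leaf1": ["n01", "n02"],
    "leaf2": ["n03", "n04", "n05", "n06"],
    "leaf3": ["n07"],
}
print(pick_nodes(free, 3))  # fits entirely under leaf2
print(pick_nodes(free, 5))  # must span leaf2 and leaf1
```

In a real deployment the fabric is described to Slurm through its topology configuration rather than computed by user code; the sketch only shows why keeping a job under one switch reduces the number of hops its traffic has to cross.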
Slinky is a toolkit of components that enables Slurm operation in Kubernetes environments, bridging the gap between traditional HPC and cloud-native environments. Teams can run Slurm and Kubernetes workloads on shared node pools, with Kubernetes resource requests translated into Slurm jobs. Researchers and developers keep their familiar Kubernetes workflows while gaining Slurm's batch scheduling and resource governance.
FAQs
An open source workload manager is software that automates the scheduling, execution, and monitoring of computing jobs across shared infrastructure such as clusters or cloud environments. Because it is open source, organizations can freely use, customize, and extend it to fit their performance, scalability, and operational needs without subscriptions or enterprise licenses.
The TOP500 is a ranking of the world's most powerful non-distributed computer systems. Slurm is the scheduler of choice for over half of the top 100 systems in the TOP500 list, which highlights its proven scalability and throughput at massive scale.
Yes, Slurm offers industry-leading GPU resource management. Users can request GPU and CPU resources together, so jobs run quickly and efficiently while cluster utilization stays high.
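As a hedged example of requesting GPUs and CPUs together, the sketch below assembles an `srun` command line. The helper `build_srun_command` and the training command are hypothetical; `--nodes`, `--gres=gpu:N`, and `--cpus-per-task` are standard Slurm options, and the example assumes the cluster has GPUs configured as a generic resource named `gpu`.

```python
import shlex


def build_srun_command(program, gpus_per_node, cpus_per_task, nodes=1):
    """Build an srun argument list requesting GPUs and CPU cores.

    Hypothetical helper for illustration; it only formats arguments
    and does not launch anything.
    """
    args = [
        "srun",
        f"--nodes={nodes}",
        f"--gres=gpu:{gpus_per_node}",       # GPUs per node (assumes a 'gpu' GRES)
        f"--cpus-per-task={cpus_per_task}",  # CPU cores for each task
    ]
    args.extend(shlex.split(program))
    return args


cmd = build_srun_command("python train.py --epochs 1", gpus_per_node=2, cpus_per_task=8)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` on a login node; whether the request is satisfiable depends on how GPUs are configured on the cluster.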
Official quick-start guides for users and administrators, release notes, and other detailed documentation are available on the SchedMD (now part of NVIDIA) website. NVIDIA also provides technical blog posts and on-demand videos related to Slurm integration and features.
Support tickets can be submitted through the support portal on the SchedMD (now part of NVIDIA) website. An email address with your organization's domain is required to validate your support entitlement. Slurm and Slinky support, training, and consultation services are available from NVIDIA, giving customers direct-to-engineering help with implementation and customization.
Slurm leverages its understanding of complex network and system topologies to enable efficient workload placement on multi-tier interconnects. This minimizes latency, maximizes bandwidth, and improves end-to-end job performance, which is especially critical for HPC and AI training workloads.
SchedMD (now part of NVIDIA) developed Slinky as an open source toolkit of components that enables Slurm operation in Kubernetes environments, bridging the gap between traditional HPC and cloud-native environments. It allows teams to run Slurm and Kubernetes workloads on shared node pools, with Kubernetes resource requests translated into Slurm jobs.
Slurm is optimized for queue-based batch scheduling of large, parallel jobs, prioritizing throughput and hardware efficiency. Kubernetes is designed for declarative, event-driven orchestration of containerized microservices.