GPU-ACCELERATED GROMACS
The fastest, easiest way to improve simulation performance by up to 3X.

How to Run Jobs in GROMACS

GROMACS provides scripts to set up the environment for different shells. To set up the PATH and other environment variables, use the command below.

$ source <GROMACS_INSTALL_DIR>/bin/GMXRC

If you built both MPI and non-MPI binaries as described above, you will find both “gmx” and “gmx_mpi” in your installation's bin folder. The examples described in this guide use “water” data sets from the GROMACS FTP site. Running GROMACS with these data sets first requires a data preparation step described below. Additional details and options related to preparing input for GROMACS can be found here.

To run GROMACS, two steps are required:

Step 1. Prepare input with grompp (GROMACS preprocessor)
a. In case of the single node version: $ gmx grompp -f <mdp-file>

Step 2. Start mdrun
a. In case of the single node version: $ gmx mdrun
b. In case of the MPI version (np = #GPUs): $ mpirun -np <np> gmx_mpi mdrun
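Putting the two steps together, a complete run of the MPI version on, for example, a node with 4 GPUs might look like the following sketch (the mdp file name and the GPU count are illustrative and should be adapted to your case):

$ gmx grompp -f pme.mdp
$ mpirun -np 4 gmx_mpi mdrun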

For small node counts, these settings usually deliver good performance. However, some tuning will typically improve GROMACS simulation performance. This hand tuning becomes more important for higher node counts. Please consult the GROMACS manual, the gmx-users mailing list, and published papers for details.

The GPU-related performance tuning options are described in this document. Input sets for benchmarking are available for download.

Benchmarks Execution Options

When running GROMACS benchmarks to measure performance, use the following command line options (a combined example command follows the list):

  1. -resethway: At the start of each simulation, GROMACS tunes the domain decomposition and balances the load between the available CPUs and GPUs. This slows down the first few hundred iterations. Because real simulations run for a very long time, this has no impact on the achieved performance. To minimize the runtime needed to obtain stable results while benchmarking, the option -resethway should be specified. -resethway resets all performance counters once half of the iterations have been executed and thus allows realistic performance to be measured without many time steps. Note that if the reset occurs while the PME load balancer is still active at the beginning of the run, the following error may occur: "PME tuning was still active when attempting to reset mdrun counters at step xxxxxxx". To avoid this, increase the running time by raising the -maxh value or add the -nsteps parameter to increase the number of time steps of the simulation.
  2. -maxh: Controls the maximum time the simulation should run. mdrun executes enough time steps to run for at least the specified time in hours. This should be set high enough to obtain stable performance results. A reasonable value would typically be 5 minutes = 0.08333 hours. Either this option or the -nsteps option described below can be used to limit the running time of the simulation.
  3. -noconfout: Disables the output of confout.gro, which can take a considerable amount of time, e.g. on parallel file systems. As this is done very infrequently in real simulations, it should be disabled during benchmarking.
  4. -v: Prints more information to the command line and the produced log file md.log. The contained information is very helpful for tuning the performance of GROMACS.
  5. -nb: Tells GROMACS whether to use the "gpu" or the "cpu" for the short-range non-bonded calculations.
  6. -nsteps: Number of time steps to run. Overrides the default value in the mdp file. This option can also be used instead of -maxh to control the overall running time of a simulation.
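Putting these options together, a typical benchmark invocation might look like the following sketch; the step count and the choice of "gpu" for -nb are illustrative and should be adapted to your system:

$ gmx mdrun -resethway -noconfout -nsteps 4000 -v -nb gpu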

The performance is reported at the end of the produced log file (md.log) as well as on the console output in ns/day (higher is better). Visit the GROMACS documentation page for additional details about the command line parameters.

Tuning Performance

To understand the performance behavior of GROMACS it is helpful to have a basic understanding of the tasks it executes. The following is a simplified view. From a high-level perspective, GROMACS executes four tasks:

  1. PP: Calculate short-range non-bonded forces, i.e. particle-particle (PP) interactions (compute bound; requires only nearest-neighbor communication)
  2. PME: Calculate an approximation of the long-range part of the non-bonded forces with the particle mesh Ewald (PME) method (communication intensive in multi-node runs)
  3. Bonded (B): Compute bonded forces
  4. Other: Enforce bonded constraints, advance atom positions, compute neighbor lists, and other work.

The GPU version of GROMACS accelerates the most time-consuming task, PP, on the GPU, while the three other tasks (PME, Bonded, and Other) can only be executed on the CPU. The PP task is independent of the Bonded and PME tasks and can be executed concurrently with them, while the Other task mostly depends on the output of PP, PME, and Bonded. Therefore the GPU version runs the PME and Bonded tasks on the CPU while the GPU executes the PP task:

GROMACS Offloads Compute-heavy PP tasks on GPUs

By scaling the electrostatics cutoff (the distance above which a force is handled by the long-range part) GROMACS can shift work from the CPU (PME task) to the GPU (PP task). This is enabled automatically and can be disabled with the -notunepme command line switch. Due to accuracy restrictions this is not possible the other way around. For example, if the GPU completes the PP task quicker than the CPU (see the figure below), GROMACS shifts work from the CPU to the GPU, which results in a shorter application runtime.

GROMACS Offloads Compute-heavy PP tasks on GPUs

The PP task has a complexity of O(m*n) and the PME task of O(n log(n)), where n is the number of particles and m is the size of the neighbor list. Typical values for m are 200 – 400, which is much smaller than n.

The following settings can be used to tune the performance of GROMACS.

Launch configuration

GROMACS uses MPI and OpenMP to utilize all available GPU and CPU resources in a cluster. Depending on the input set and the capability of the network, different launch configurations are optimal. These range from one MPI process per logical core with one OpenMP thread each, to one MPI process per node that uses all available cores in that node. To choose the best launch configuration, the following guidelines can be used:

  • If only a single (or sometimes dual) CPU socket is used, OpenMP parallelization is usually more efficient than MPI.
  • If multiple CPU sockets or nodes are used, hybrid MPI and OpenMP parallelization with 2-4 OpenMP threads per MPI rank is usually more efficient than using MPI alone.
  • With a large number of nodes it might be beneficial to use more OpenMP threads per MPI rank to reduce the amount of required communication. The amount of time spent in communication is reported in the output file md.log (see Script 1, Time accounting for communication in md.log).
 Computing:          Num     Num      Call     Wall time     Giga-Cycles
                     Ranks   Threads  Count    (s)           total sum       %
 […]
 Comm. coord.        2       10       252681   63.467        3808.202        3.6
 […]
 Comm. energies      2       10       25917    1.108         66.457          0.1
 Total                                         1782.537      106957.655      100.0

 Breakdown of PME mesh computation
 […]
 PME 3D-FFT Comm.    2       10       518322   126.829       7610.103        7.1
 […]

Script 1 Time accounting for communication in md.log

For the MPI version, the number of MPI ranks can be controlled with the launcher of the MPI implementation used (-np parameter), and the number of OpenMP threads can be controlled with the mdrun command line option -ntomp or the environment variable OMP_NUM_THREADS. The single node version of mdrun uses thread-MPI (an internal, threading-based MPI implementation) and OpenMP. For this version the number of OpenMP threads can also be controlled with the command line option -ntomp, and the number of MPI ranks can be controlled with -ntmpi. For the GPU-accelerated version of GROMACS it is necessary to have at least one MPI rank per GPU to utilize all available GPUs.
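For example, on a hypothetical node with 2 GPUs and 16 CPU cores, the same configuration of 4 ranks with 4 OpenMP threads each could be launched with either binary (the counts are purely illustrative and should be tuned as described above):

$ gmx mdrun -ntmpi 4 -ntomp 4                            # single node build: 4 thread-MPI ranks, 4 OpenMP threads each
$ OMP_NUM_THREADS=4 mpirun -np 4 gmx_mpi mdrun           # MPI build: 4 MPI ranks, 4 OpenMP threads each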

Thread/Process Pinning

GROMACS OpenMP threads and MPI processes should be pinned correctly to the cores/threads of the system. This can be done either by the MPI launcher/batch system used or by GROMACS itself. By default GROMACS tries to pin the threads automatically; however, this is turned off if some pinning has already been done by the MPI launcher or the OpenMP implementation, or if the number of logical cores available does not match the total number of threads used on a node. In that case pinning can be turned on with the option -pin on, provided the pinning is not done by the MPI launcher or the batch system.
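As a sketch, with Open MPI you can disable the launcher's own binding and let GROMACS handle the pinning (the rank and thread counts here are illustrative):

$ mpirun -np 4 --bind-to none gmx_mpi mdrun -ntomp 8 -pin on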

SMT/Hyperthreading

Using all logical cores in a system (SMT, aka Hyperthreading) usually gives a small performance gain.

GPU Boost

The GPU kernels of GROMACS do not exploit the full power and thermal budget of a GPU. It is therefore safe and recommended to increase the GPU clocks to the supported maximum via application clocks. Since GROMACS 5.1 this is done automatically by the gmx executable if application clock permissions are set to UNRESTRICTED, so you should see an output similar to this:

Changing GPU clock rates by setting application clocks for Tesla K80 to (2505,875)

In case application clock permissions are set to RESTRICTED, GROMACS warns about it with an output similar to this:

Not possible to change GPU clocks to optimal value because of insufficient permissions to set application clocks for Tesla K80. Current values are (2505,562). Max values are (2505,875) Use sudo nvidia-smi -acp UNRESTRICTED or contact your admin to change application clock permissions.

As indicated in the warning message you can use

$ sudo nvidia-smi -acp UNRESTRICTED

to change application clock permissions to UNRESTRICTED.

If your GROMACS build did not include the GDK, then GROMACS will print a warning similar to this:

GROMACS was configured without NVML support hence it can not exploit
application clocks of the detected Tesla K80 GPU to improve performance.
Recompile with the NVML library (compatible with the driver used) or set application clocks manually.

In this case it is possible to configure the maximum boost clock manually with a command similar to the one below. Note that the specific maximum clock settings differ between GPUs.

$ nvidia-smi -ac 2505,875       # these are max clocks for K80 GPU
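To look up the clock pairs your GPU actually supports, you can query them with nvidia-smi (standard nvidia-smi usage, independent of GROMACS):

$ nvidia-smi -q -d SUPPORTED_CLOCKS       # list supported memory,graphics clock pairs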

For additional details about GPU boost, please refer to Increase Performance with GPU Boost and K80 Autoboost.

Neighbor List Updates (nstlist)

When the Verlet cutoff scheme is used (which is required for GPU acceleration), the update interval for the neighbor lists can be chosen freely. If the neighbor list search takes a lot of time, the update interval for the neighbor lists can be increased. The time for the neighbor list search is given in md.log (see Script 2, Time accounting for neighbor list search in md.log). A possibly required increase of the cutoff radius is applied automatically. Since the neighbor lists are calculated on the CPU and a longer cutoff results in more force calculations, increasing nstlist shifts some work from the CPU to the GPU. In addition, the neighbor list computation requires communication between MPI ranks, so this also trades communication for computation (an example follows Script 2 below).

 Computing:          Num     Num      Call     Wall time     Giga-Cycles
                     Nodes   Threads  Count    (s)           total sum       %
 […]
 Neighbor search     2       10       6480     32.464        1947.963        1.8
 […]
 Total                                         1782.537      106957.655      100.0

Script 2 Time accounting for neighbor list search in md.log
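The update interval can also be overridden on the mdrun command line without editing the mdp file; for example, to search for neighbors only every 40 steps (the value 40 is illustrative and should be tuned for your input):

$ gmx mdrun -nstlist 40 -resethway -noconfout -nsteps 4000 -v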

Tune PME – PP load balance

[…]
Force evaluation time GPU/CPU: 4.972 ms/5.821 ms = 0.854
For optimal performance this ratio should be close to 1!
[…]

Script 3 PME PP load balance output in md.log

As described above, the load between the PME and PP tasks can be balanced. This is enabled by default and can be switched off with the -notunepme parameter. For single-process GPU-accelerated runs the PP (GPU) / PME (CPU) load balance is reported in the output file md.log (see Script 3, PME PP load balance output in md.log).

Separate PME ranks

 Breakdown of PME mesh computation
 PME redist. X/F     2       10       518322   167.004       10020.730       9.4
 PME spread/gather   2       10       518322   429.501       25771.381       24.1
 PME 3D-FFT          2       10       518322   123.600       7416.369        6.9
 PME 3D-FFT Comm.    2       10       518322   126.829       7610.103        7.1
 PME solve Elec      2       10       259161   13.470        808.259         0.8

Script 4 Time accounting for PME task in md.log

Handling the long-range non-bonded interactions with PME is a communication-intensive task because it requires a 3D FFT. Since the PP and PME tasks are independent, GROMACS can start so-called separate PME ranks (processes which only execute the PME task) to reduce the pressure on the network in an MPMD fashion. By default the number of PME ranks is determined automatically only when GPUs are not used; it can be fine-tuned with the parameter -npme. If the communication time of the PME task reported in the output file md.log is high, separate PME ranks should be considered. This typically happens with 3-4 or more nodes.

There are two different options to map MPI ranks to PP and PME ranks which are interesting for GPU-accelerated runs of GROMACS (an example command follows the list):

  • The PME ranks are interleaved with the so-called PP ranks (the ranks executing the PP/force task). This puts more stress on the network but avoids idle resources. The command line option is -ddorder interleave (default for GPU-accelerated runs).
  • The PME ranks are grouped close to each other. This puts less stress on the network but leaves some GPUs idle. The command line option is -ddorder pp_pme (default for CPU-only runs).
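As an illustrative sketch, a multi-node run with 8 MPI ranks could dedicate 2 of them to PME and keep the default interleaved placement (all counts are hypothetical and need tuning for your system and network):

$ mpirun -np 8 gmx_mpi mdrun -npme 2 -ddorder interleave -resethway -noconfout -v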

Sharing a GPU between multiple processes and the Multi Process Service (MPS)

In many situations, sharing a GPU between multiple MPI ranks can increase performance.

For the single node case with a non-MPI build of GROMACS, this can be achieved simply via thread-MPI by setting (or allowing GROMACS to set for you) -ntmpi to run 2 or more MPI ranks per GPU.

For multi-node runs with an MPI build of GROMACS, sharing a GPU is possible on GPUs with compute capability 3.5 or higher via the Multi Process Service (MPS). This requires setting the GPUs used to exclusive-process mode and starting an MPS control daemon on each node (nvidia-cuda-mps-control):

#!/bin/bash
# Set all GPUs on this node to exclusive-process compute mode (-c 3 = EXCLUSIVE_PROCESS)
sudo nvidia-smi -c 3
# Start the MPS control daemon as a background process
nvidia-cuda-mps-control -d

echo " -started nvidia-cuda-mps-control on `hostname`"

Example script to start MPS on single node

#!/bin/bash
# Shut down the MPS control daemon
echo quit | nvidia-cuda-mps-control
# Restore the default compute mode (-c 0 = DEFAULT)
sudo nvidia-smi -c 0
echo " stopping MPS on `hostname`"

Example script to stop MPS on single node

More information about MPS can be found here.

Advanced Options

There are more advanced domain decomposition settings available. For inhomogeneous systems (e.g. with a varying number of atoms per domain), tuning these can further improve performance. Please consult the GROMACS manual or the GROMACS users mailing list for further information.

Disclaimer

GROMACS offers many options to tune performance for different inputs and on different architectures/systems. This guide is geared to cover the aspects that are important for GPU-accelerated GROMACS.

Step-by-Step Examples

A few examples of running GROMACS are described in this section. The input data sets used for these examples are available here.

To get started, download and extract the archive

$ wget ftp://ftp.gromacs.org/pub/benchmarks/water_GMX50_bare.tar.gz
$ tar -zxvf water_GMX50_bare.tar.gz

This archive includes example “water” data sets of different sizes, in folders with names like 0384, 0768, and 1536. The name of the folder corresponds to the number of atoms in thousands, i.e. the folder 0768 contains a case with 256 thousand water molecules and 768 thousand atoms. The first step is to prepare the data using the grompp tool.

For example, to run case 1536, use the following commands. Use pme.mdp if you want to use Particle Mesh Ewald (PME). Running the command below will generate the topol.tpr file, which is used as the input to GROMACS mdrun.

$ cd water-cut1.0_GMX50_bare/1536
$ gmx grompp -f pme.mdp

Below are a few different variations for running GROMACS with and without GPUs on 1 or 2 nodes; they represent a subset of the cases that were run to collect the example results at the end of this section.

The number of MPI ranks used in the GPU examples depends on the number of GPUs in your system and the number of MPI ranks per GPU. In many cases running more than 1 MPI rank per GPU helps performance. This is possible using thread-MPI for the single node example, and it is also made possible by using the CUDA MPS feature when running on multiple nodes. Some additional details on CUDA MPS are described earlier in this guide.

It is normally suggested to use 2 to 6 OMP threads per rank when using GPUs. -ntomp or the environment variable OMP_NUM_THREADS can be used to specify the desired number of OMP threads per rank. Some experimentation will typically be required to balance the number of MPI ranks against the number of threads per rank and see what works best for a given system configuration and input data.

In examples 1, 2, and 4 below, 2 Tesla K80 boards are used per node, which corresponds to 4 GPUs per node (each K80 board contains 2 GPUs).

All 5 examples are run on nodes with 2 Haswell CPU sockets:

  • 32 cores total
  • hyperthreading disabled

Example 1:

The launch parameters (ntmpi and ntomp) are not specified so GROMACS will automatically decide how many MPI ranks per GPU and how many OMP threads per rank to launch.

Run on a single node with 4 GPUs using thread-MPI and allow GROMACS to determine the launch parameters (ntmpi, ntomp) on its own.

$ $GROMACS_BIN_DIR/gmx mdrun -resethway -noconfout -nsteps 4000 -v -pin on -nb gpu

If you have time to experiment, you can try the available combinations of MPI ranks per GPU and OMP threads per rank to see what works best. The available options depend on the total number of CPU cores and the number of GPUs per node. For example, if you have 4 GPUs per node with 32 CPU cores per node, you might try the combinations below to see what gives the best performance.

-ntmpi 4 -ntomp 8 # 1 rank per GPU with 8 threads per rank
-ntmpi 8 -ntomp 4 # 2 MPI ranks per GPU with 4 threads per rank
-ntmpi 16 -ntomp 2 # 4 MPI ranks per GPU with 2 threads per rank

For examples 2-5 below, the selected configurations provided the best result on the benchmark test system.

Example 2:

Run on a single node with 4 GPUs, 4 OMP threads/rank, 8 ranks, 2 MPI ranks per GPU using thread-MPI

$ $GROMACS_BIN_DIR/gmx mdrun -ntmpi 8 -ntomp 4 -resethway -noconfout -nsteps 4000 -v -pin on -nb gpu

Example 3:

Run on a single node with CPU cores only, 1 OMP thread/rank, 32 ranks, using thread-MPI

$ $GROMACS_BIN_DIR/gmx mdrun -ntmpi 32 -ntomp 1 -resethway -noconfout -nsteps 4000 -v -pin on -nb cpu

Example 4:

Run on 2 nodes with 4 GPUs/node, 2 OMP threads/rank, 16 ranks per node, 4 MPI ranks per GPU using CUDA MPS to allow multiple MPI ranks per GPU:

$ OMP_NUM_THREADS=2 mpirun -np 32 $GROMACS_BIN_DIR/gmx_mpi mdrun -resethway -noconfout -nb gpu -nsteps 8000 -v -pin on

Example 5:

Run on 2 nodes with CPU cores only, 1 OMP thread/rank, 32 ranks per node

$ OMP_NUM_THREADS=1 mpirun -np 64 $GROMACS_BIN_DIR/gmx_mpi mdrun -resethway -noconfout -nb cpu -nsteps 8000 -v -pin on
