
LAMMPS Benchmarks

This page lists LAMMPS performance on several benchmark problems, run on various machines, both in serial and parallel and on GPUs. Note that input and sample output files for many of these benchmark tests are provided in the bench directory of the LAMMPS distribution. See the bench/README file for details.

CPU benchmarks

GPU (Kepler) and Intel Xeon Phi benchmarks using all accelerator packages

GPU (Fermi) benchmarks using the GPU and USER-CUDA packages

Interatomic potential benchmarks

Billion atom benchmarks


Thanks to the following individuals for running the various benchmarks:



Machine specifications

These are the parallel machines for which benchmark data is given for the CPU benchmarks below. See the Kokkos, Intel, and GPU sections for machine specifications for those GPU and Phi platforms.

The "Processors" column is the largest number of processors on that machine that LAMMPS was run on. Message-passing bandwidth and latency are in units of Mb/sec and microseconds at the MPI level, i.e. what a program like LAMMPS sees. More information on machine characteristics, including their "birth" year, is given at the bottom of the page.

Vendor/Machine                    Processors  Site       CPU                          Interconnect  Bandwidth (Mb/sec)  Latency (microsecs)
Dell T7500 dual hex-core desktop  12          SNL        3.47 GHz Xeon                on-chip       ??                  ??
Xeon/Myrinet cluster              512         SNL        3.4 GHz dual Xeons (64-bit)  Myrinet       230                 9
IBM p690+                         512         Daresbury  1.7 GHz Power4+              custom        1450                6
IBM BG/L                          65536       LLNL       700 MHz PowerPC 440          custom        150                 3
Cray XT3                          10000       SNL        2.0 GHz Opteron              Cstar         1100                7
Cray XT5                          1920        SNL        2.4 GHz Opteron              Cstar         1100                7

One-processor timings are also listed for some older machines, whose characteristics are given below.

Name      Machine               Processors  Site  CPU                          Interconnect  Bandwidth (Mb/sec)  Latency (microsecs)
Laptop    Mac PowerBook         1           SNL   1 GHz G4 PowerPC             N/A           N/A                 N/A
ASCI Red  Intel                 1500        SNL   333 MHz Pentium III          custom        310                 18
Ross      custom Linux cluster  64          SNL   500 MHz DEC Alpha            Myrinet       100                 65
Liberty   HP Linux cluster      64          SNL   3.0 GHz dual Xeons (32-bit)  Myrinet       230                 9
Cheetah   IBM p690              64          ORNL  1.3 GHz Power4               custom        1490                7

Billion-atom LJ timings are also given for GPU clusters, with more characteristics given below.

Name       Machine               GPUs  Site  GPU          Interconnect  Bandwidth (Mb/sec)  Latency (microsecs)
Keeneland  Intel/NVIDIA cluster  360   ORNL  Tesla M2090  Qlogic QDR    ???                 ???
Lincoln    Intel/NVIDIA cluster  384   NCSA  Tesla C1060  Infiniband    1500                12

For each of the 5 benchmarks, fixed- and scaled-size timings are shown in tables and in comparative plots. Fixed-size means that the same problem with 32,000 atoms was run on varying numbers of processors. Scaled-size means that when run on P processors, the number of atoms in the simulation was P times larger than the one-processor run. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms.
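The problem sizes quoted above follow from simple arithmetic. As a minimal sketch (the function names are illustrative and not part of LAMMPS):

```python
ATOMS_PER_PROC = 32_000  # size of the one-processor benchmark problem


def scaled_size_atoms(procs: int, atoms_per_proc: int = ATOMS_PER_PROC) -> int:
    """Total atoms in a scaled-size run: the problem grows with the processor count."""
    return procs * atoms_per_proc


def fixed_size_atoms_per_proc(procs: int, total_atoms: int = ATOMS_PER_PROC) -> float:
    """Atoms handled by each processor in a fixed-size run of the same problem."""
    return total_atoms / procs


print(scaled_size_atoms(64))          # 2048000
print(scaled_size_atoms(32 * 1024))   # 1048576000, i.e. ~1 billion atoms
print(fixed_size_atoms_per_proc(64))  # 500.0 atoms per processor
```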

All listed CPU times are in seconds for 100 timesteps. Parallel efficiencies refer to the ratio of ideal to actual run time. For example, if perfect speed-up would have given a run-time of 10 seconds, and the actual run time was 12 seconds, then the efficiency is 10/12 or 83.3%. In most cases parallel runs were made on production machines while other jobs were running, which can sometimes degrade performance.
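The efficiency definition above can be written down directly. This is a sketch under the assumption that the ideal time is the one-processor time divided by P for fixed-size runs, and the one-processor time itself for scaled-size runs; the function names are hypothetical:

```python
def parallel_efficiency(ideal_time: float, actual_time: float) -> float:
    """Ratio of ideal to actual run time (1.0 means perfect speed-up)."""
    return ideal_time / actual_time


def fixed_size_efficiency(t1: float, tp: float, procs: int) -> float:
    """Fixed-size run: the same problem on `procs` processors should ideally take t1/procs."""
    return parallel_efficiency(t1 / procs, tp)


def scaled_size_efficiency(t1: float, tp: float) -> float:
    """Scaled-size run: work per processor is constant, so the ideal time equals t1."""
    return parallel_efficiency(t1, tp)


# The example from the text: ideal 10 seconds, actual 12 seconds.
print(f"{parallel_efficiency(10.0, 12.0):.1%}")  # 83.3%
```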

The files needed to run these benchmarks are part of the LAMMPS distribution. If your platform is sufficiently different from the machines listed, you can send your timing results and machine info and we'll add them to this page. Note that the CPU time (in seconds) for a run is what appears in the "Loop time" line of the output log file, e.g.

Loop time of 3.89418 on 8 procs for 100 steps with 32000 atoms 
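If you are collecting timings from many log files, the "Loop time" line can be extracted with a short script. The following is a sketch that assumes the line has exactly the format shown above; other LAMMPS versions may print the line differently, so the pattern may need adjusting. The function name is hypothetical.

```python
import re

# Pattern for the "Loop time" line in the format shown above (an assumption;
# adjust it if your log file prints the line differently).
LOOP_TIME_RE = re.compile(
    r"Loop time of\s+(?P<seconds>[0-9.eE+-]+)\s+on\s+(?P<procs>\d+)\s+procs"
    r"\s+for\s+(?P<steps>\d+)\s+steps\s+with\s+(?P<atoms>\d+)\s+atoms"
)


def parse_loop_time(log_text: str):
    """Return a dict of timing fields from the first 'Loop time' line, or None."""
    match = LOOP_TIME_RE.search(log_text)
    if match is None:
        return None
    return {
        "seconds": float(match.group("seconds")),
        "procs": int(match.group("procs")),
        "steps": int(match.group("steps")),
        "atoms": int(match.group("atoms")),
    }


line = "Loop time of 3.89418 on 8 procs for 100 steps with 32000 atoms"
print(parse_loop_time(line))
```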

These benchmarks are meant to span a range of simulation styles and computational expense for interaction forces. Since LAMMPS run time scales roughly linearly in the number of atoms simulated, you can use the timing and parallel efficiency data to estimate the CPU cost for problems you want to run on a given number of processors. As the data below illustrates, fixed-size problems generally have parallel efficiencies of 50% or better so long as the atoms/processor is a few hundred or more. Scaled-size problems generally have parallel efficiencies of 80% or more across a wide range of processor counts.



One processor comparisons

This is a summary of single-processor LAMMPS performance in CPU secs per atom per timestep for the 5 benchmark problems that follow. The timings are for a Dell Precision T7500 desktop running Red Hat Linux, with dual hex-core 3.47 GHz Intel Xeon processors, using the Intel 11.1 icc compiler. The ratios indicate that if the atomic LJ system has a normalized cost of 1.0, the bead-spring chains and granular systems run 2x and 4x faster, while the EAM metal and solvated protein models run 2.6x and 16x slower respectively. These differences are primarily due to the expense of computing a particular pairwise force field for a given number of neighbors per atom.

Problem:        LJ       Chain    EAM      Chute    Rhodopsin
CPU/atom/step:  7.02E-7  3.15E-7  1.85E-6  1.71E-7  1.14E-5
Ratio to LJ:    1.0      0.45     2.64     0.24     16.2
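The two rows of the table are related by simple normalization. As a sketch (helper name hypothetical), the per-atom cost comes from a one-processor loop time, and the ratios come from dividing by the LJ value:

```python
def cpu_per_atom_step(loop_time: float, atoms: int, steps: int) -> float:
    """CPU seconds per atom per timestep for a one-processor run."""
    return loop_time / (atoms * steps)


# CPU/atom/step values copied from the table above.
cpu = {
    "LJ": 7.02e-7,
    "Chain": 3.15e-7,
    "EAM": 1.85e-6,
    "Chute": 1.71e-7,
    "Rhodopsin": 1.14e-5,
}

# Normalize every problem by the LJ cost to get the "Ratio to LJ" row.
ratios = {name: value / cpu["LJ"] for name, value in cpu.items()}

for name, ratio in ratios.items():
    print(f"{name:10s} {ratio:6.2f}")
```

The computed ratios agree with the table to its printed precision.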

Lennard-Jones liquid benchmark

Input script for this problem.

Atomic fluid:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses.


Polymer chain melt benchmark

Input script for this problem.

Bead-spring polymer melt with 100-mer chains and FENE bonds:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses.


EAM metallic solid benchmark

Input script for this problem.

Cu metallic solid with embedded atom method (EAM) potential:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses.


Granular chute flow benchmark

Input script for this problem.

Chute flow of packed granular particles with frictional history potential:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses.


Rhodopsin protein benchmark

Input script for this problem.

All-atom rhodopsin protein in solvated lipid bilayer with CHARMM force field, long-range Coulombics via PPPM (particle-particle particle mesh), SHAKE constraints. This model contains counter-ions and a reduced amount of water to make a 32K atom system:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses.



Accelerator benchmarks on a GPU cluster with Kepler GPUs

This section benchmarks all the accelerator packages available in LAMMPS on the same machine: GPU, KOKKOS, OPT, USER-CUDA, USER-INTEL, USER-OMP. Unaccelerated runs (using the standard LAMMPS styles) are also included for reference.

This section shows performance results for a GPU cluster at Sandia National Labs called "shannon". It has 32 nodes, each with two 8-core Sandy Bridge Xeon CPUs (E5-2670, 2.6 GHz, HT deactivated), for a total of 512 cores. Twenty-four of the nodes have two NVIDIA Kepler GPUs (K20x, 2688 cores at 732 MHz). LAMMPS was compiled with the Intel icc compiler, using module openmpi/1.8.1/intel/13.1.SP1.106/cuda/6.0.37.

The benchmark problems themselves are described in more detail above in the CPU section. The input scripts and instructions for running these test cases are included in the bench/KEPLER directory of the LAMMPS distribution.

Lennard-Jones liquid

The first set of 5 plots is for running problems of various sizes (2K to 8192K atoms) on a single node (16 cores, 1 or 2 GPUs). Note that the y-axis scale is not the same in the various plots.

The corresponding raw timing data for the 5 plots is in these 5 files. The 1st column is the x-axis, the 2nd is the time for 100 steps; the 3rd is the y-axis.

The first plot shows the performance of all 6 accelerator packages (two variants for KOKKOS). The GPU, USER-CUDA, and KOKKOS/CUDA curves run primarily on the 2 GPUs. They perform better as more atoms are modeled. All the other curves run on the 16 CPU cores. All of these performed best using essentially MPI-only parallelization (a single thread per MPI task, for packages that support multi-threading). The CPU curve is for running without enabling any of the accelerator packages. All the runs were in double precision.

The second plot shows the same data as the first (on a different y-axis scale), but only for the runs on the CPU cores. The Intel package has an option to perform pairwise calculations in double, mixed, or single precision, so 3 curves are shown for it.

The third plot also shows the same data as the first, but only for the runs on the GPUs. For each of the 3 packages, runs were made on one or two GPUs. All the runs were in double precision.

The fourth plot shows the performance effect of running the 3 GPU-based packages with varying precision (double, mixed, single). The KOKKOS package currently allows only double-precision calculations.

The fifth plot shows the effect on KOKKOS GPU performance of transferring data back and forth between the CPUs and GPUs every timestep. The benchmark problem in the previous 4 plots could run continuously on the GPU (e.g. between occasional thermodynamic output in a production run). But a LAMMPS input script may require more frequent communication back to the CPU, e.g. if a diagnostic is invoked that runs on the CPU, if some form of output is triggered every few timesteps, or if a fix or compute style is used that is not yet KOKKOS-enabled. The benchmark run for this plot required data to move back and forth every timestep (worst case). The curve for the GPU package suggests that this performance hit can be partially overcome, since the GPU package also moves data back and forth every step to perform time integration on the CPU.

The next 2 plots are for parallel runs on 1 to 20 nodes of the cluster. The strong scaling plot is for runs of a 2048K atom problem. The weak scaling plot is for runs with 512K atoms per node. Note that performance (y-axis) is normalized by the number of nodes. Thus in both plots, horizontal lines would be 100% parallel efficiency. Note that for the strong scaling curves on GPUs, there is a pseudo-efficiency loss when running on more nodes, because the number of atoms per node shrinks, which moves each node to the left on the single-node GPU performance curves in the plots above.

The corresponding raw timing data is in these 2 files. The 1st column is the x-axis, the 2nd is the time for 100 steps; the 3rd is the y-axis.



GPU and USER-CUDA package benchmarks on Desktop system with Fermi GPUs

This section shows performance results for a desktop system with dual hex-core Xeon processors and 2 NVIDIA Tesla/Fermi GPUs. More system details are given below, for the "Desktop" entry.

The benchmark problems themselves are described in more detail above in the CPU section. The input scripts and instructions for running these GPU test cases are included in the bench/FERMI directory of the LAMMPS distribution.

The performance is plotted as a function of system size (number of atoms), where the size of the benchmark problems was varied. The y-axis is atom-timesteps per second. Thus a y-axis value of 10 million for a 1M atom system means it ran at a rate of 10 timesteps/second.
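The y-axis metric can be computed from any run's atom count, step count, and loop time. A minimal sketch with hypothetical helper names:

```python
def atom_timesteps_per_second(atoms: int, steps: int, loop_time: float) -> float:
    """Throughput metric used on the y-axis: atoms * steps / wall-clock seconds."""
    return atoms * steps / loop_time


def timesteps_per_second(rate: float, atoms: int) -> float:
    """Convert an atom-timesteps/second rate back to timesteps/second for a given system size."""
    return rate / atoms


# The example from the text: 10 million atom-timesteps/sec for a 1M atom system.
print(timesteps_per_second(10e6, 1_000_000))  # 10.0
```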

Results are shown for running in CPU-only mode, and on 1 or 2 GPUs, using either the GPU or the USER-CUDA package.

The CPU-only results are for running on a single core and on all 12 cores, always in double precision.

For the GPU package, the number of CPU cores/node used was whatever gave the fastest performance for a particular problem size. For small problems this is typically fewer than all 12; for large problems it is typically all 12. The precision refers to the portion of the calculation performed on the GPU (pairwise interactions). Results are shown for single precision, double precision, and mixed precision, which means that pairwise interactions are calculated in single precision, with the aggregate per-atom force accumulated in double precision.

For the USER-CUDA package, the number of CPU cores used is always equal to the number of GPUs used, i.e. 1 or 2 for this system. The three precisions have the same meaning as for the GPU package, except that other portions of the calculation are also performed on the GPU, e.g. time integration.



GPU package benchmarks on Titan development supercomputer with Fermi GPUs

This section shows performance results for the Titan development system, each node of which has a 16-core AMD CPU and a single NVIDIA Tesla/Fermi GPU. More system details are given below, for the "Titan Development" entry. Note that the eventual Titan machine will have Tesla/Kepler GPUs, and more of them.

The benchmark problems themselves are described in more detail above in the CPU section. The input scripts and instructions for running these GPU test cases are included in the bench/FERMI directory of the LAMMPS distribution.

For the rhodopsin benchmark, which computes long-range Coulombics via the PPPM option of the kspace_style command, these benchmarks were run with the run_style verlet/split command, to split the real-space versus K-space computations across the CPUs. This makes little difference on small node counts, but on large node counts, it enables better scaling, since the FFTs computed by PPPM are performed on fewer processors. This was done for both the strong- and weak-scaling results below. The ratio of real-to-kspace processors was chosen to give the best performance, and was 7:1 on this 16 core/node machine.
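The 7:1 ratio translates into processor counts as follows. This is only an arithmetic sketch with a hypothetical helper; it does not show how the split is actually requested when launching LAMMPS.

```python
def split_cores(total_cores: int, real_parts: int = 7, kspace_parts: int = 1):
    """Divide cores between real-space and K-space work in a fixed ratio.

    Returns (real_space_cores, kspace_cores). Raises ValueError if the total
    is not evenly divisible by the sum of the ratio terms.
    """
    unit, remainder = divmod(total_cores, real_parts + kspace_parts)
    if remainder:
        raise ValueError("total_cores must be a multiple of the ratio sum")
    return real_parts * unit, kspace_parts * unit


print(split_cores(16))    # (14, 2): one 16-core node
print(split_cores(2048))  # (1792, 256): 128 nodes
```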

For the strong-scaling plots, a fixed-size problem of 256,000 atoms was run for all node counts. The node count varied from 1 to 128, or 16 to 2048 cores. The y-axis is atom-timesteps per second. Thus a value of 10 million for the 256,000 atom system means it ran at a rate of roughly 40 timesteps/second.

Strong-scaling results are shown for running in CPU-only mode, and on the GPU, using the GPU package. The CPU-only results are double precision; the GPU results are for mixed precision, which means that pairwise interactions are calculated in single precision, with the aggregate per-atom force accumulated in double precision. The dotted line indicates the slope for perfect scalability.

For the GPU package, the number of CPU cores/node used was whatever gave the fastest performance for a particular problem size. For the strong-scaling results with the large per-node atom count (256000), this was typically nearly all 16 cores.


For the weak-scaling plots, a scaled-size problem of 32,000 atoms/node was run for all node counts. The node count varied from 1 to 8192, or 16 to 131072 cores; only 960 nodes on the current development machine have GPUs. Thus the largest system on 8192 nodes has ~262 million atoms.

The y-axis is atom-timesteps per second. Thus a value of 100 million for a 1M atom system (32 nodes) means it ran at a rate of 100 timesteps/second.

Weak-scaling results are shown for running in CPU-only mode, in CPU-only mode with the numa option invoked for the processors command, and on the GPU, using the GPU package. The CPU-only and NUMA results are double precision; the GPU results are for mixed precision, which means that pairwise interactions are calculated in single precision, with the aggregate per-atom force accumulated in double precision. The dotted line indicates the slope for perfect scalability.

The NUMA results alter the layout of cores to the logical 3d grid of processors that overlays the simulation domain. The processors numa command does this so that cores within a node and within a NUMA region (inside the node) are close together in the topology of the 3d grid, to reduce off-node communication costs. This can give a speed-up of 10-15% on large node counts, as shown in the plots.

For the GPU package, the number of CPU cores/node used was whatever gave the fastest performance for a particular problem size. For the weak-scaling results with the smaller per-node atom count (32000), this was typically 4-8 cores out of 16.




Interatomic potential comparisons

The following table summarizes the CPU cost of various potentials, as implemented in LAMMPS, each for a system commonly modeled by that potential. The desktop machine these were run on is described below. The last 3 entries are for VASP timings, to give a comparison with DFT calculations. The details for the VASP runs are described below.

The listed timing is CPU seconds per timestep per atom for a one processor (core) run. Note that this is per timestep, as is the ratio to LJ; the timestep size is listed in the table. In each case a short 100-step run of a roughly 32000 atom system was performed. The speed-up is for a 4-processor run of the same 32000-atom system. Speed-ups greater than 4x are due to cache effects.

To first order, the CPU and memory cost for simulations with all these potentials scales linearly with the number of atoms N, and inversely with the number of processors P when running in parallel. This assumes the density doesn't change so that the neighbors per atom stays constant as you change N. This holds for N/P ratios larger than some threshold, say 1000 atoms per processor. Thus you can use this data to estimate the run-time of different size problems on varying numbers of processors.
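The scaling rule above gives a simple first-order estimate. The sketch below is an illustration only: the function name and the `efficiency` parameter are assumptions, and the estimate holds only while N/P stays above the threshold mentioned above.

```python
def estimate_run_time(cpu_per_atom_step: float, atoms: int, steps: int,
                      procs: int = 1, efficiency: float = 1.0) -> float:
    """First-order run-time estimate in seconds.

    cpu_per_atom_step: one-processor CPU seconds per atom per timestep
    efficiency:        assumed parallel efficiency (1.0 = perfect scaling)
    """
    return cpu_per_atom_step * atoms * steps / (procs * efficiency)


LJ_COST = 8.01e-7  # CPU secs/atom/step for the Lennard-Jones entry in the table below

# One-processor, 100-step run of the 32000-atom LJ problem.
print(f"{estimate_run_time(LJ_COST, 32_000, 100):.2f} s")  # 2.56 s

# Hypothetical larger job: 1M atoms, 1000 steps, 64 processors at 90% efficiency.
print(f"{estimate_run_time(LJ_COST, 1_000_000, 1000, procs=64, efficiency=0.9):.1f} s")  # 13.9 s
```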

Potential         System             # Atoms   Timestep     Neighs/atom  Memory     CPU      LJ Ratio  P=4 Speed-up  Input script  Tarball
Granular          chute flow         32000     0.0001 tau   7.2          33 Mb      2.08e-7  0.26x     4.28x         in.granular   bench_granular.tar.gz
FENE bead/spring  polymer melt       32000     0.012 tau    9.7          8.4 Mb     2.86e-7  0.36x     3.78x         in.fene       bench_fene.tar.gz
Lennard-Jones     LJ liquid          32000     0.005 tau    76.9         12 Mb      8.01e-7  1.0x      3.56x         in.lj         bench_lj.tar.gz
DPD               pure solvent       32000     0.04 tau     41.3         9.4 Mb     1.22e-6  1.53x     3.54x         in.dpd        bench_dpd.tar.gz
EAM               bulk Cu            32000     5 fmsec      75.5         13 Mb      1.87e-6  2.34x     3.83x         in.eam        bench_eam.tar.gz
REBO              polyethylene       32640     0.5 fmsec    149          33 Mb      3.18e-6  3.97x     3.61x         in.rebo       bench_rebo.tar.gz
Stillinger-Weber  bulk Si            32000     1 fmsec      30.0         11 Mb      3.28e-6  4.10x     3.83x         in.sw         bench_sw.tar.gz
Tersoff           bulk Si            32000     1 fmsec      16.6         9.2 Mb     3.74e-6  4.67x     3.92x         in.tersoff    bench_tersoff.tar.gz
ADP               bulk Ni            32000     5 fmsec      83.6         25 Mb      5.58e-6  6.97x     3.61x         in.adp        bench_adp.tar.gz
EIM               crystalline NaCl   32000     0.5 fmsec    98.9         14 Mb      5.60e-6  6.99x     3.86x         in.eim        bench_eim.tar.gz
Peridynamics      glass fracture     32000     22.2 nsec    422          144 Mb     7.46e-6  9.31x     3.78x         in.peri       bench_peri.tar.gz
SPC/E             liquid water       36000     2 fmsec      700          86 Mb      8.77e-6  11.0x     3.46x         in.spce       bench_spce.tar.gz
CHARMM + PPPM     solvated protein   32000     2 fmsec      376          124 Mb     1.13e-5  14.1x     3.66x         in.protein    bench_protein.tar.gz
MEAM              bulk Ni            32000     5 fmsec      48.8         54 Mb      1.32e-5  16.5x     3.73x         in.meam       bench_meam.tar.gz
Gay-Berne         ellipsoid mixture  32768     0.002 tau    140          21 Mb      2.20e-5  27.5x     3.63x         in.gb         bench_gb.tar.gz
BOP               bulk CdTe          32000     1 fmsec      4.4          74 Mb      2.51e-5  31.3x     3.88x         in.bop        bench_bop.tar.gz
AIREBO            polyethylene       32640     0.5 fmsec    681          101 Mb     3.25e-5  40.6x     3.66x         in.airebo     bench_airebo.tar.gz
ReaxFF/C          PETN crystal       32480     0.1 fmsec    667          976 Mb     1.09e-4  136x      3.17x         in.reaxc      bench_reaxc.tar.gz
COMB              crystalline SiO2   32400     0.2 fmsec    572          85 Mb      2.00e-4  250x      3.89x         in.comb       bench_comb.tar.gz
eFF               H plasma           32000     0.001 fmsec  5066         365 Mb     2.16e-4  270x      3.71x         in.eff        bench_eff.tar.gz
ReaxFF            PETN crystal       16240     0.1 fmsec    667          425 Mb     2.84e-4  354x      3.78x         in.reax       bench_reax.tar.gz
VASP/small        water              192/512   0.3 fmsec    N/A          320 procs  26.2     17.7e6    100%          N/A           N/A
VASP/medium       CO2                192/1024  0.8 fmsec    N/A          384 procs  252      170e6     100%          N/A           N/A
VASP/large        Xe                 432/3456  2.0 fmsec    N/A          384 procs  1344     908e6     100%          N/A           N/A

Notes:

Details for different systems:



Billion-atom LJ benchmarks

The Lennard-Jones benchmark problem described above (100 timesteps, reduced density of 0.8442, 2.5 sigma cutoff, etc) has been run on different machines for billion-atom tests. For the LJ benchmark LAMMPS requires a little less than 1/2 Terabyte of memory per billion atoms, which is used mostly for neighbor lists.

Machine    # of Atoms   Processors  CPU Time (secs)  Parallel Efficiency  Flop Rate   Date
Keeneland  1 million    1 GPU       2.35             100%                 27.0 Gflop  2012
Keeneland  1 billion    288 GPUs    17.7             46.3%                3.60 Tflop  2012
Lincoln    1 million    1 GPU       4.24             100%                 15.0 Gflop  2011
Lincoln    1 billion    288 GPUs    28.7             51.3%                2.21 Tflop  2011
Cray XT5   1 million    1           148.7            100%                 427 Mflop   2011
Cray XT5   1 billion    1920        103.0            75.1%                616 Gflop   2011
Cray XT3   1 million    1           235.3            100%                 270 Mflop   2006
Cray XT3   1 billion    10000       25.1             93.6%                2.53 Tflop  2006
Cray XT3   10 billion   10000       246.8            95.2%                2.57 Tflop  2006
Cray XT3   40 billion   10000       979.0            96.0%                2.59 Tflop  2006
IBM BG/L   1 million    1           898.3            100%                 70.7 Mflop  2005
IBM BG/L   1 billion    4096        227.6            96.3%                279 Gflop   2005
IBM BG/L   1 billion    32K         30.2             90.7%                2.10 Tflop  2005
IBM BG/L   1 billion    64K         16.0             85.6%                3.97 Tflop  2005
IBM BG/L   10 billion   64K         148.9            92.0%                4.26 Tflop  2005
IBM BG/L   40 billion   64K         585.4            93.6%                4.34 Tflop  2005
ASCI Red   32000        1           62.88            100%                 32.3 Mflop  2004
ASCI Red   750 million  1500        1156             85.0%                41.2 Gflop  2004

The parallel efficiencies are estimated from the per-atom CPU or GPU time for a large single processor (or GPU) run on each machine:

The aggregate flop rate is estimated using the following values for the pairwise interactions, which dominate the run time:

This is a conservative estimate in the sense that flops computed for atom pairs outside the force cutoff, building neighbor lists, and time integration are not counted. For the USER-CUDA package running on GPUs, Newton's 3rd law is not used (because it's faster not to), which doubles the pairwise interaction count, but that is not included in the flop rate either.
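The efficiency estimate from per-atom time, described above, can be sketched as follows. The helper name is hypothetical, and the nominal atom counts from the table are used, so results may differ from the listed percentages in the last digit.

```python
def efficiency_from_per_atom_time(ref_time: float, ref_atoms: float, ref_procs: int,
                                  run_time: float, run_atoms: float, run_procs: int) -> float:
    """Parallel efficiency as the ratio of per-atom cost (processor-seconds per atom)
    in the single-processor reference run to that in the large parallel run."""
    ref_cost = ref_time * ref_procs / ref_atoms
    run_cost = run_time * run_procs / run_atoms
    return ref_cost / run_cost


# Lincoln rows from the table above: 1 million atoms on 1 GPU in 4.24 secs,
# 1 billion atoms on 288 GPUs in 28.7 secs.
lincoln = efficiency_from_per_atom_time(4.24, 1e6, 1, 28.7, 1e9, 288)
print(f"{lincoln:.1%}")  # 51.3%
```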



Machines

This section lists characteristics of machines used in the benchmarking along with options used in compiling LAMMPS. The communication parameters are for bandwidth and latency at the MPI level, i.e. what a program like LAMMPS sees.

Desktop = Dell Precision T7500 desktop workstation running Red Hat Linux

Mac laptop = PowerBook G4 running OS X 10.3

ASCI Red = ASCI Intel Tflops MPP

Ross = CPlant DEC Alpha/Myrinet cluster

Liberty = Intel/Myrinet cluster packaged by HP

Cheetah = IBM p690 cluster

Xeon/Myrinet cluster = Spirit

IBM p690+ cluster = HPCx

IBM BG/L = Blue Gene Light

Cray XT3 = Red Storm

Cray XT5 = xtp

Lincoln = GPU cluster

Keeneland = GPU cluster

Titan Development = GPU-enabled supercomputer (formerly Jaguar)