
LAMMPS Benchmarks

This page lists LAMMPS performance on several benchmark problems, run on various machines, both in serial and parallel.

Additional benchmark data is given here:



These are the parallel machines for which benchmark data is given. The "Processors" column is the largest number of processors on that machine that LAMMPS was run on. Message-passing bandwidth and latency are in units of Mb/sec and microsecs at the MPI level, i.e. what a program like LAMMPS sees. More information on machine characteristics, including their "birth" year, is given below.

Vendor/Machine | Processors | Site | CPU | Interconnect | Bandwidth (Mb/sec) | Latency (microsecs)
Dell dual quad-core Xeon desktop | 8 | SNL | 2.66 GHz Xeon | on-chip | ?? | ??
Xeon/Myrinet cluster | 512 | SNL | 3.4 GHz dual Xeons (64-bit) | Myrinet | 230 | 9
IBM p690+ | 512 | Daresbury | 1.7 GHz Power4+ | custom | 1450 | 6
IBM BG/L | 65536 | LLNL | 700 MHz PowerPC 440 | custom | 150 | 3
Cray XT3 | 10000 | SNL | 2.0 GHz Opteron | Cstar | 1100 | 7
Cray XT5 | 1920 | SNL | 2.4 GHz Opteron | Cstar | 1100 | 7

One-processor timings are also listed for some older machines; their characteristics are likewise given below.

Name | Machine | Processors | Site | CPU | Interconnect | Bandwidth (Mb/sec) | Latency (microsecs)
Laptop | Mac PowerBook | 1 | SNL | 1 GHz G4 PowerPC | N/A | N/A | N/A
ASCI Red | Intel | 1500 | SNL | 333 MHz Pentium III | custom | 310 | 18
Ross | custom Linux cluster | 64 | SNL | 500 MHz DEC Alpha | Myrinet | 100 | 65
Liberty | HP Linux cluster | 64 | SNL | 3.0 GHz dual Xeons (32-bit) | Myrinet | 230 | 9
Cheetah | IBM p690 | 64 | ORNL | 1.3 GHz Power4 | custom | 1490 | 7

A billion-atom LJ timing is also given for a GPU cluster, with characteristics given below.

Name | Machine | GPUs | Site | GPU | Interconnect | Bandwidth (Mb/sec) | Latency (microsecs)
Lincoln | Intel/NVIDIA cluster | 384 | NCSA | Tesla C1060 | Infiniband | 1500 | 12

For each of the 5 benchmarks, fixed-size and scaled-size timings are shown in tables and in comparative plots. Fixed-size means that the same problem with 32,000 atoms was run on varying numbers of processors. Scaled-size means that when run on P processors, the number of atoms in the simulation was P times larger than in the one-processor run. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K-processor run is for ~1 billion atoms.

All listed CPU times are in seconds for 100 timesteps. Parallel efficiencies refer to the ratio of ideal to actual run time. For example, if perfect speed-up would have given a run-time of 10 seconds, and the actual run time was 12 seconds, then the efficiency is 10/12 or 83.3%. In most cases parallel runs were made on production machines while other jobs were running, which can sometimes degrade performance.
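The efficiency definition above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the 80-second one-processor time in the second call is a hypothetical value chosen so that the ideal 8-processor time is 10 seconds.

```python
def fixed_size_efficiency(t_one_proc, t_parallel, nprocs):
    """Fixed-size run: the ideal time is the one-processor time divided by nprocs."""
    ideal = t_one_proc / nprocs
    return ideal / t_parallel


def scaled_size_efficiency(t_one_proc, t_parallel):
    """Scaled-size run: atoms per processor are held constant, so the
    ideal time equals the one-processor time."""
    return t_one_proc / t_parallel


# The example from the text: ideal run time 10 s, actual run time 12 s.
print(f"{scaled_size_efficiency(10.0, 12.0):.1%}")

# Hypothetical fixed-size case: 80 s on one processor, 12 s on 8 processors.
print(f"{fixed_size_efficiency(80.0, 12.0, 8):.1%}")
```

Both calls reduce to the same ratio of ideal to actual run time; only the definition of "ideal" differs between fixed-size and scaled-size runs.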

The files needed to run these benchmarks are part of the LAMMPS distribution. If your platform is sufficiently different from the machines listed, you can send your timing results and machine info and we'll add them to this page. Note that the CPU time (in seconds) for a run is what appears in the "Loop time" line of the output log file, e.g.

Loop time of 3.89418 on 8 procs for 100 steps with 32000 atoms 
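If you are collecting timings from many log files, the "Loop time" line can be extracted with a small script. The following is a minimal sketch that assumes the line has exactly the form shown above; the function name and dictionary keys are illustrative, and other LAMMPS versions may format the line differently.

```python
import re

# Pattern for a line of the form shown above (an assumption about the log format).
LOOP_TIME_RE = re.compile(
    r"Loop time of\s+(?P<secs>[0-9.eE+-]+)\s+on\s+(?P<procs>\d+)\s+procs\s+"
    r"for\s+(?P<steps>\d+)\s+steps\s+with\s+(?P<atoms>\d+)\s+atoms"
)


def parse_loop_time(line):
    """Return the timing fields of a "Loop time" line, or None if it does not match."""
    m = LOOP_TIME_RE.search(line)
    if m is None:
        return None
    return {
        "secs": float(m.group("secs")),
        "procs": int(m.group("procs")),
        "steps": int(m.group("steps")),
        "atoms": int(m.group("atoms")),
    }


sample = "Loop time of 3.89418 on 8 procs for 100 steps with 32000 atoms"
info = parse_loop_time(sample)
print(info)
```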

These benchmarks are meant to span a range of simulation styles and computational expense for interaction forces. Since LAMMPS run time scales roughly linearly in the number of atoms simulated, you can use the timing and parallel efficiency data to estimate the CPU cost for problems you want to run on a given number of processors. As the data below illustrates, fixed-size problems generally have parallel efficiencies of 50% or better so long as the atoms/processor is a few hundred or more. Scaled-size problems generally have parallel efficiencies of 80% or more across a wide range of processor counts.
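As a rough illustration of that kind of estimate, the sketch below scales a one-processor reference timing linearly in atoms and timesteps, then divides by the processor count and an assumed parallel efficiency. The function name is hypothetical, and the reference timing, problem size, and 80% efficiency in the example are illustrative inputs, not measured data; in practice you would read the efficiency off the benchmark data for a comparable machine and atoms/processor count.

```python
def estimate_run_time(ref_secs, ref_atoms, ref_steps, atoms, steps, nprocs, efficiency):
    """Estimate run time in seconds from a one-processor reference timing.

    Assumes run time is linear in atoms and timesteps, and that the
    parallel run achieves the given efficiency (ideal time / actual time).
    """
    cost_per_atom_step = ref_secs / (ref_atoms * ref_steps)
    ideal = cost_per_atom_step * atoms * steps / nprocs
    return ideal / efficiency


# Illustrative inputs: a 4.32 s one-processor run of 32000 atoms for 100 steps,
# scaled to 1 million atoms, 10000 steps, 64 processors, 80% efficiency.
secs = estimate_run_time(4.32, 32000, 100, 1_000_000, 10_000, 64, 0.8)
print(f"estimated run time: {secs:.0f} s")
```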

Thanks to the following individuals for running the various benchmarks:



One processor comparisons

This is a summary of single-processor LAMMPS performance in CPU secs per atom per timestep for the 5 benchmark problems. The runs were made on a Dell 690 desktop Red Hat Linux box with dual quad-core 2.66 GHz Intel Xeon processors, using the Intel icc compiler. The ratios indicate that if the atomic LJ system has a normalized cost of 1.0, the bead-spring chains and granular systems run roughly 2x faster, while the EAM metal and solvated protein models run 2.7x and 18x slower respectively. These differences are primarily due to the expense of computing a particular pairwise force field for a given number of neighbors per atom.

Problem: | LJ | Chain | EAM | Chute | Rhodopsin
CPU/atom/step: | 1.35E-6 | 6.25E-7 | 3.62E-6 | 5.91E-7 | 2.47E-5
Ratio to LJ: | 1.0 | 0.46 | 2.69 | 0.44 | 18.4
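The "Ratio to LJ" row is simply each per-atom, per-timestep cost divided by the LJ cost. A short sketch of that arithmetic, using the values from the table, is given here. Because the listed CPU costs are rounded to three digits, the recomputed ratios can differ from the table in the last digit.

```python
# CPU secs per atom per timestep, copied from the table above.
cpu_per_atom_step = {
    "LJ": 1.35e-6,
    "Chain": 6.25e-7,
    "EAM": 3.62e-6,
    "Chute": 5.91e-7,
    "Rhodopsin": 2.47e-5,
}

# Normalize every cost by the LJ cost.
lj_cost = cpu_per_atom_step["LJ"]
ratios = {name: cost / lj_cost for name, cost in cpu_per_atom_step.items()}

for name, ratio in ratios.items():
    print(f"{name:10s} {ratio:6.2f}")
```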

Lennard-Jones liquid benchmark

Input script for this problem.

Atomic fluid:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses.


Polymer chain melt benchmark

Input script for this problem.

Bead-spring polymer melt with 100-mer chains and FENE bonds:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses.


EAM metallic solid benchmark

Input script for this problem.

Cu metallic solid with embedded atom method (EAM) potential:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses.


Granular chute flow benchmark

Input script for this problem.

Chute flow of packed granular particles with frictional history potential:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses.


Rhodopsin protein benchmark

Input script for this problem.

All-atom rhodopsin protein in solvated lipid bilayer with CHARMM force field, long-range Coulombics via PPPM (particle-particle particle-mesh), and SHAKE constraints. This model contains counter-ions and a reduced amount of water to make a 32K atom system:

Performance data:

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parentheses.



Billion-atom LJ benchmarks

The Lennard-Jones benchmark problem described above (100 timesteps, reduced density of 0.8442, 2.5 sigma cutoff, etc.) has been run on different machines for billion-atom tests. For the LJ benchmark, LAMMPS requires a little less than 1/2 terabyte of memory per billion atoms, most of which is used for neighbor lists.

Machine | # of Atoms | Processors | CPU Time (secs) | Parallel Efficiency | Flop Rate | Date
Lincoln | 1 million | 1 GPU | 4.24 | 100% | 15.0 Gflop | 2011
Lincoln | 1 billion | 288 GPUs | 28.7 | 51.3% | 2.21 Tflop | 2011
Cray XT5 | 1 million | 1 | 148.7 | 100% | 427 Mflop | 2011
Cray XT5 | 1 billion | 1920 | 103.0 | 75.1% | 616 Gflop | 2011
Cray XT3 | 1 million | 1 | 235.3 | 100% | 270 Mflop | 2006
Cray XT3 | 1 billion | 10000 | 25.1 | 93.6% | 2.53 Tflop | 2006
Cray XT3 | 10 billion | 10000 | 246.8 | 95.2% | 2.57 Tflop | 2006
Cray XT3 | 40 billion | 10000 | 979.0 | 96.0% | 2.59 Tflop | 2006
IBM BG/L | 1 million | 1 | 898.3 | 100% | 70.7 Mflop | 2005
IBM BG/L | 1 billion | 4096 | 227.6 | 96.3% | 279 Gflop | 2005
IBM BG/L | 1 billion | 32K | 30.2 | 90.7% | 2.10 Tflop | 2005
IBM BG/L | 1 billion | 64K | 16.0 | 85.6% | 3.97 Tflop | 2005
IBM BG/L | 10 billion | 64K | 148.9 | 92.0% | 4.26 Tflop | 2005
IBM BG/L | 40 billion | 64K | 585.4 | 93.6% | 4.34 Tflop | 2005
ASCI Red | 32000 | 1 | 62.88 | 100% | 32.3 Mflop | 2004
ASCI Red | 750 million | 1500 | 1156 | 85.0% | 41.2 Gflop | 2004

The parallel efficiencies are estimated from the per-atom CPU or GPU time for a large single processor run on each machine:

The aggregate flop rate is estimated using the following values for the pairwise interactions, which dominate the run time:

This is a conservative estimate in the sense that flops computed for atom pairs outside the force cutoff, building neighbor lists, and time integration are not counted. For the USER-CUDA package running on GPUs, Newton's 3rd law is not used (because it's faster not to), which doubles the pairwise interaction count, but that is not included in the flop rate either.
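The parallel-efficiency estimate described above can be reproduced approximately from the table itself. The sketch below assumes the reference per-atom time is taken from the single-processor (or single-GPU) row listed for the same machine; the exact per-atom values used for this page are not reproduced in this text, so that choice is an assumption and the function name is hypothetical. Results computed this way agree with the listed efficiencies to within a few tenths of a percent.

```python
def billion_atom_efficiency(t_single, atoms_single, t_parallel, atoms_parallel, nprocs):
    """Ratio of ideal to actual run time, with the ideal time derived from
    the per-atom cost of a single-processor (or single-GPU) run."""
    per_atom = t_single / atoms_single
    ideal = per_atom * atoms_parallel / nprocs
    return ideal / t_parallel


# (single-run secs, single-run atoms, parallel secs, parallel atoms, processors),
# all copied from the billion-atom table above.
runs = {
    "IBM BG/L, 1 billion atoms, 4096 procs": (898.3, 1e6, 227.6, 1e9, 4096),
    "Lincoln, 1 billion atoms, 288 GPUs": (4.24, 1e6, 28.7, 1e9, 288),
    "ASCI Red, 750 million atoms, 1500 procs": (62.88, 32000, 1156, 750e6, 1500),
}

for label, args in runs.items():
    print(f"{label}: {billion_atom_efficiency(*args):.1%}")
```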



Interatomic potential comparisons

The following table summarizes the CPU cost of various potentials, as implemented in LAMMPS, each for a system commonly modeled by that potential. The exception is the last 3 entries, which are VASP timings, included to give a comparison with DFT calculations. The details for the VASP runs are described more fully below.

The listed timing is CPU seconds per timestep per atom for a one-processor run. Note that this is per timestep, as is the ratio to LJ; the timestep size is listed in the table. In each case a short 100-step run of a roughly 32000-atom system was performed. The speed-up is for a 4-processor run of the same 32000-atom system. Speed-ups greater than 4x are due to cache effects.

To first order, the CPU and memory cost for simulations with all these potentials scales linearly with the number of atoms N, and inversely with the number of processors P when running in parallel. This assumes the density doesn't change, so that the number of neighbors per atom stays constant as you change N. This holds for N/P ratios larger than some threshold, say 1000 atoms per processor. Thus you can use this data to estimate the run time of different size problems on varying numbers of processors.
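As a worked illustration of this first-order scaling, the sketch below converts a per-atom, per-timestep cost and a timestep size into an estimated run time for a given amount of simulated time. The function name and the problem size are hypothetical; the cost (3.59e-6 secs) and timestep (5 fmsec) are the EAM bulk Cu values from the table. The estimate assumes ideal 1/P scaling, so it is only meaningful when N/P is above the threshold mentioned above (taken here as 1000 atoms per processor).

```python
def estimate_secs(cost_per_atom_step, natoms, nprocs, sim_time_fs, timestep_fs,
                  min_atoms_per_proc=1000):
    """First-order estimate: cost * N * steps / P.

    Returns (estimated seconds, whether N/P is at or above the threshold).
    """
    nsteps = round(sim_time_fs / timestep_fs)
    secs = cost_per_atom_step * natoms * nsteps / nprocs
    in_regime = natoms / nprocs >= min_atoms_per_proc
    return secs, in_regime


# Hypothetical problem: 1 million Cu atoms with EAM on 100 processors,
# simulated for 1 nanosecond (1e6 fmsec) with a 5 fmsec timestep.
secs, in_regime = estimate_secs(3.59e-6, 1_000_000, 100, 1e6, 5.0)
print(f"estimated run time: {secs:.0f} s, within linear-scaling regime: {in_regime}")
```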

Potential | System | # Atoms | Timestep | Neighs/atom | Memory | CPU | LJ Ratio | P=4 Speed-up | Input script | Tarball
Granular | chute flow | 32000 | 0.0001 tau | 7.2 | 33 Mb | 5.08e-7 | 0.34x | 5.94x | in.granular | bench_granular.tar.gz
FENE bead/spring | polymer melt | 32000 | 0.012 tau | 9.7 | 8.4 Mb | 5.32e-7 | 0.36x | 4.11x | in.fene | bench_fene.tar.gz
Lennard-Jones* | LJ liquid | 32000 | 0.005 tau | 76.9 | 12 Mb | 1.48e-6 | 1.0x | 3.80x | in.lj | bench_lj.tar.gz
DPD | pure solvent | 32000 | 0.04 tau | 41.3 | 9.4 Mb | 2.16e-6 | 1.46x | 4.09x | in.dpd | bench_dpd.tar.gz
EAM | bulk Cu | 32000 | 5 fmsec | 75.5 | 13 Mb | 3.59e-6 | 2.4x | 3.98x | in.eam | bench_eam.tar.gz
Tersoff | bulk Si | 32000 | 1 fmsec | 16.6 | 9.2 Mb | 6.01e-6 | 4.1x | 4.10x | in.tersoff | bench_tersoff.tar.gz
Stillinger-Weber | bulk Si | 32000 | 1 fmsec | 30.0 | 11 Mb | 6.10e-6 | 4.1x | 4.01x | in.sw | bench_sw.tar.gz
ADP | bulk Ni | 32000 | 5 fmsec | 83.6 | 25 Mb | 9.23e-6 | 6.2x | 3.54x | in.adp | bench_adp.tar.gz
EIM | crystalline NaCl | 32000 | 0.5 fmsec | 98.9 | 14 Mb | 9.69e-6 | 6.5x | 3.90x | in.eim | bench_eim.tar.gz
REBO | polyethylene | 32640 | 0.5 fmsec | 149 | 33 Mb | 1.35e-5 | 9.1x | 3.88x | in.rebo | bench_rebo.tar.gz
SPC/E | liquid water | 36000 | 2 fmsec | 700 | 86 Mb | 1.43e-5 | 9.7x | 3.87x | in.spce | bench_spce.tar.gz
CHARMM + PPPM | solvated protein | 32000 | 2 fmsec | 376 | 124 Mb | 2.01e-5 | 13.6x | 3.82x | in.protein | bench_protein.tar.gz
MEAM | bulk Ni | 32000 | 5 fmsec | 48.8 | 54 Mb | 2.31e-5 | 15.6x | 3.76x | in.meam | bench_meam.tar.gz
Peridynamics | glass fracture | 32000 | 22.2 nsec | 422 | 144 Mb | 2.42e-5 | 16.4x | 4.13x | in.peri | bench_peri.tar.gz
Gay-Berne* | ellipsoid mixture | 32768 | 0.002 tau | 140 | 21 Mb | 4.09e-5 | 28.3x | 3.96x | in.gb | bench_gb.tar.gz
AIREBO | polyethylene | 32640 | 0.5 fmsec | 681 | 101 Mb | 8.09e-5 | 54.7x | 3.71x | in.airebo | bench_airebo.tar.gz
COMB | crystalline SiO2 | 32400 | 0.2 fmsec | 572 | 85 Mb | 4.19e-4 | 284x | 4.05x | in.comb | bench_comb.tar.gz
eFF | H plasma | 32000 | 0.001 fmsec | 5066 | 365 Mb | 4.52e-4 | 306x | 3.86x | in.eff | bench_eff.tar.gz
ReaxFF | PETN crystal | 16240 | 0.1 fmsec | 667 | 425 Mb | 4.99e-4 | 337x | 3.90x | in.reax | bench_reax.tar.gz
ReaxFF/C | PETN crystal | 32480 | 0.1 fmsec | 667 | 976 Mb | 2.73e-4 | 185x | 3.47x | in.reaxc | bench_reaxc.tar.gz
VASP/small | water | 192/512 | 0.3 fmsec | N/A | 320 procs | 26.2 | 17.7e6 | 100% | N/A | N/A
VASP/medium | CO2 | 192/1024 | 0.8 fmsec | N/A | 384 procs | 252 | 170e6 | 100% | N/A | N/A
VASP/large | Xe | 432/3456 | 2.0 fmsec | N/A | 384 procs | 1344 | 908e6 | 100% | N/A | N/A

Notes:

Details for different systems:


Machines

This section lists characteristics of machines used in the benchmarking along with options used in compiling LAMMPS. The communication parameters are for bandwidth and latency at the MPI level, i.e. what a program like LAMMPS sees.

Desktop = Dell 690 desktop workstation running Red Hat Linux

Mac laptop = PowerBook G4 running OS X 10.3

ASCI Red = ASCI Intel Tflops MPP

Ross = CPlant DEC Alpha/Myrinet cluster

Liberty = Intel/Myrinet cluster packaged by HP

Cheetah = IBM p690 cluster

Xeon/Myrinet cluster = Spirit

IBM p690+ cluster = HPCx

IBM BG/L = Blue Gene Light

Cray XT3 = Red Storm

Cray XT5 = xtp

Lincoln = GPU cluster