LAMMPS WWW Site - LAMMPS Documentation - LAMMPS Mailing List Archives
Re: [lammps-users] Improved performance for setup on large numbers of MPI ranks

Re: [lammps-users] Improved performance for setup on large numbers of MPI ranks

From: Steve Plimpton <sjplimp@...24...>
Date: Wed, 23 Aug 2017 13:28:53 -0600

Hi Chris - these sound like useful enhancements for setting up large systems,
and the speed-ups are impressive. We'll just need to look at them more
carefully than we do options that are simply add-ons.

It might be a couple of weeks before I can look at it.


On Mon, Aug 21, 2017 at 1:28 PM, Christopher Knight <cjknight2009@...24...> wrote:
Hi Axel,

That’s fine, I understand; providing the changes as a package was just intended to be a convenient mechanism for folks to look and play. Most of these changes do need careful thought regarding their correctness for the many different ways one can run LAMMPS. If there is sufficient interest in including any of these suggestions, then I can certainly help out.


On Aug 21, 2017, at 2:18 PM, Axel Kohlmeyer <akohlmey@...24...> wrote:


the way LAMMPS is set up, we cannot accept packages that overwrite base classes.

since this touches crucial core elements of LAMMPS, i suggest you work with steve plimpton directly to evaluate how to integrate such changes, either as options or entirely.
at best, i can set up a branch in the official LAMMPS repo to bootstrap testing and simplify integration, but i think it would be best to have steve (copied) review your changes first and then suggest the suitable way to move forward.


On Mon, Aug 21, 2017 at 3:09 PM, Christopher Knight <cjknight2009@...24...> wrote:

For consideration, attached is a small package/patch, COMM_NPROCS, that addresses some scalability issues I’ve observed when setting up large molecular systems on large numbers of MPI ranks (e.g. 100M+ particles on 10K+ MPI ranks). The suggested changes largely target logic in a few places that loops over all processors, which becomes prohibitive when setting up large-scale runs. I suspect most users will see negligible performance impact from these changes, but if you regularly run large systems on 10K+ MPI ranks and observe rather long setup times when replicating systems, building molecular topologies, or creating remap plans for 3D FFTs, then they might be worth a look.

With these changes, I was able to successfully set up a modified replicated rhodo system (without a kspace method) with 36.86 billion particles on 786,432 MPI ranks in 42 seconds, compared to the original projected time of 18 hours. Again, this was just setting up the simulation in LAMMPS. With the changes below, I’m currently projecting the PPPM setup time to be ~6 minutes for the same system, compared to the original projected time of 12 days. And, in this case, the PPPM setup time would improve further if an additional hint could be passed during creation of one of the plans (the last one).

My edits and suggestions certainly don’t cover 100% of the use cases in LAMMPS (everything works with rhodo), but hopefully they can all be adapted and included in the LAMMPS source and, at the very least, activated in certain cases and/or by specific request from a user with additional keywords, as in my implementation. The modifications are based on a recent git pull.

Thanks for your consideration,



*) replicating large systems

For various benchmarking purposes, I routinely take the data file of a small system and replicate it into quite large systems. This becomes prohibitive for large numbers of replicas and ranks with the current implementation, which loops over both replicas and ranks. The use_more_memory logic included here collects the old_avec buffer on all MPI ranks (again, we’re assuming the original system is small), and then each rank tests and adds particles from only those replicas that overlap its local sub-domain.
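The overlap test described above can be sketched as follows. This is a hypothetical illustration, not the actual Replicate code from the patch: it assumes an orthogonal box and computes which replica images of the original box intersect a rank's sub-domain, so only those replicas' atoms need to be tested and added locally.

```cpp
#include <array>
#include <vector>

// Sketch (assumed interface, not the patch itself): given the original
// (unreplicated) box lengths, the replica counts, and a rank's sub-domain
// bounds in the replicated box, return the (ix,iy,iz) indices of replicas
// whose slab overlaps the sub-domain.
std::vector<std::array<int,3>> overlapping_replicas(
    const double box[3],    // original box lengths in x,y,z
    const int nrep[3],      // replica counts in x,y,z
    const double sublo[3],  // sub-domain lower bounds (replicated box)
    const double subhi[3])  // sub-domain upper bounds (replicated box)
{
  int lo[3], hi[3];
  for (int d = 0; d < 3; d++) {
    // first and last replica index whose interval [i*box, (i+1)*box)
    // intersects [sublo, subhi)
    lo[d] = static_cast<int>(sublo[d] / box[d]);
    hi[d] = static_cast<int>(subhi[d] / box[d]);
    if (hi[d] * box[d] >= subhi[d]) hi[d]--;   // exclusive upper edge
    if (lo[d] < 0) lo[d] = 0;
    if (hi[d] >= nrep[d]) hi[d] = nrep[d] - 1;
  }
  std::vector<std::array<int,3>> reps;
  for (int i = lo[0]; i <= hi[0]; i++)
    for (int j = lo[1]; j <= hi[1]; j++)
      for (int k = lo[2]; k <= hi[2]; k++)
        reps.push_back({i, j, k});
  return reps;
}
```

Each rank then loops only over its own handful of replicas instead of all nx*ny*nz of them, which is where the replica-loop cost disappears.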

With the attached code, one just needs to specify “comm_modify ring_neighbor” in the LAMMPS input file before “read_data”, or at the latest before “replicate”.

With this change, the time spent in Replicate::command() for a rhodo replicated system of 3.07 billion particles on 65,536 MPI ranks decreased from 692 to 14 seconds (speedup of 49x).

*) building molecular topologies

In Special::build(), there are several calls to comm->ring() that pass molecular topology info around a ring that includes every processor in the communicator. In most cases, the bonded interactions only span the closest neighboring processors, and only those processors would need to be included in the comm->ring() communication pattern. The rhodo benchmark is an example of this. This probably isn’t true for 100% of use cases (maybe it is), but when it is, significant performance can be gained by communicating only with the closest neighboring processors. The ring_neighbor() function and the associated do_ring_neighbor logic address this by communicating only with the 26 closest neighbors in the case of a uniform layout.
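For a uniform 3D processor grid, the set of 26 nearest neighbors can be enumerated directly. The sketch below is an assumed illustration (the rank-ordering convention is hypothetical, chosen to match a row-major ix + npx*(iy + npy*iz) decomposition); it shows why the neighbor list is O(1) per rank regardless of the total rank count.

```cpp
#include <set>
#include <vector>

// Sketch: ranks of the up-to-26 nearest-neighbor sub-domains of rank 'me'
// in an np[0] x np[1] x np[2] processor grid with periodic wrapping.
// Rank convention assumed here: me = ix + np[0]*(iy + np[1]*iz).
// A ring over this small fixed set replaces the ring over all processors.
std::vector<int> neighbor_ranks(int me, const int np[3])
{
  int ix = me % np[0];
  int iy = (me / np[0]) % np[1];
  int iz = me / (np[0] * np[1]);
  std::set<int> nbrs;   // a set, since small grids alias neighbor cells
  for (int dz = -1; dz <= 1; dz++)
    for (int dy = -1; dy <= 1; dy++)
      for (int dx = -1; dx <= 1; dx++) {
        if (dx == 0 && dy == 0 && dz == 0) continue;
        int jx = (ix + dx + np[0]) % np[0];   // periodic wrap
        int jy = (iy + dy + np[1]) % np[1];
        int jz = (iz + dz + np[2]) % np[2];
        int other = jx + np[0] * (jy + np[1] * jz);
        if (other != me) nbrs.insert(other);
      }
  return std::vector<int>(nbrs.begin(), nbrs.end());
}
```

On a grid with at least 3 processors per dimension this yields exactly 26 neighbors; smaller grids yield fewer because wrapped images coincide.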

With this change, the time spent in Special::build() for a rhodo replicated system of 1.54 billion particles on 32,768 MPI ranks decreased from 2254 to 14 seconds (speedup of 161x).

One minor thing I noticed: the counts for shake clusters printed to the screen should be cast as bigint, not int, when LAMMPS is compiled with LAMMPS_BIGBIG. This happens at the bottom of FixShake::find_clusters(), which is another function that benefits from the comm->ring() improvements.

*) creating remap plans

For most remap plans, like those needed for the 3D FFTs in PPPM, the time to set up the commringlist when collective MPI communication is used is not that problematic, since the subset of ranks included is relatively small. However, the last remap plan created generates a commringlist with all procs in the MPI communicator (at least for rhodo), so its construction scales as the number of processors squared. This quickly becomes prohibitive when setting up a simulation on a large number of MPI ranks, especially since we know at the outset that all MPI ranks will be included in the new communicator anyway. Building the commringlist in the modified remap_3d_create_plan() still scales roughly quadratically with the number of processors, but with a far smaller prefactor, such that large-scale calculations with PPPM should now be possible. Also, if we could confirm that the communicator for the last remap plan created will always be a duplicate of the communicator passed in, then one could pass a hint to remap_3d_create_plan() to simply duplicate the input communicator in that case and skip the commringlist logic entirely.

With this change, the time spent creating the final remap plan for a rhodo replicated system with 1.54 billion particles on 32,768 MPI ranks decreased from 843 to 1.88 seconds (a speedup of 448x). While not ideal, the new logic at least makes it practical to set up a 37 billion particle system on 0.75 million MPI ranks (a projected 6 minutes as opposed to 12 days). Again, the other plans build communicators over relatively small subsets of processors within the input communicator; with these changes, the time needed to create the last plan approaches that of the earlier plans, but is still larger. For this last plan, if appropriate, it would be very efficient if a hint could be passed to remap_3d_create_plan() when it is known that the plan communicator will contain all processors in the input communicator, so that the input communicator is simply duplicated instead of building the commringlist. This is true for how the rhodo benchmark is run by default, but maybe there are cases where it isn’t and more care is needed.
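The proposed hint amounts to a simple decision after the participation flags are known. The sketch below is hypothetical (the PlanComm struct and flag vector stand in for an MPI_Allgather of each rank's participation bit, not the actual remap_3d data structures): if every rank participates, the plan communicator can come from MPI_Comm_dup and the commringlist construction can be skipped entirely.

```cpp
#include <vector>

// Hypothetical sketch of the "duplicate instead of commringlist" hint.
// 'flags[i]' is 1 if rank i participates in the remap plan, 0 otherwise,
// as if gathered with MPI_Allgather.
struct PlanComm {
  bool duplicate_input;        // true -> use MPI_Comm_dup, skip ringlist
  std::vector<int> ringlist;   // ranks to include via MPI_Comm_create
};

PlanComm build_plan_comm(const std::vector<int> &flags)
{
  PlanComm pc;
  pc.duplicate_input = true;
  for (std::size_t i = 0; i < flags.size(); i++) {
    if (flags[i]) pc.ringlist.push_back(static_cast<int>(i));
    else pc.duplicate_input = false;   // at least one rank is excluded
  }
  if (pc.duplicate_input) pc.ringlist.clear();  // list not needed
  return pc;
}
```

The O(P) scan over flags replaces the O(P^2) pairwise-overlap construction in the all-ranks case, which is exactly the situation the last plan hits.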

*) example of modified rhodo benchmark using these improvements; see “comm_modify” and “replicate” commands

# Rhodopsin model

units           real
neigh_modify    delay 5 every 1

atom_style      full
bond_style      harmonic
angle_style     charmm
dihedral_style  charmm
improper_style  harmonic
pair_style      lj/charmm/coul/long 8.0 10.0
pair_modify     mix arithmetic
kspace_style    pppm 1e-4

comm_modify     ring_neighbor
read_data       data.rhodo

replicate       ${vX} ${vY} ${vZ} memory

fix             1 all shake 0.0001 5 0 m 1.0 a 232
fix             2 all npt temp 300.0 300.0 100.0 &
                z 0.0 0.0 1000.0 mtk no pchain 0 tchain 1

special_bonds   charmm

thermo          50
thermo_style    multi
timestep        2.0

run             ${vNSTEPS}


Dr. Axel Kohlmeyer  akohlmey@...43...4...
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste, Italy.
