
Re: [lammps-users] Fwd: Accelerate the simulation-kspace as the bottleneck


From: Azade Yazdan Yar <azade.yazdanyar@...24...>
Date: Sun, 25 Jun 2017 11:30:47 +0200

Hi, 

Thanks a lot, Axel, for all your comments and modifications. To sum up, switching from ewald to pppm and applying your changes accelerated the simulation by about 12x, which is very good.
However, I am now facing the 'Out of range atoms - cannot compute PPPM' error. I will keep reading about it and will start another thread for it.

Sincerely,
Azade 

On Sat, Jun 24, 2017 at 10:03 PM, Axel Kohlmeyer <akohlmey@...24...> wrote:


On Sat, Jun 24, 2017 at 2:09 PM, Azade Yazdan Yar <azade.yazdanyar@...24...> wrote:
Hi,

Here is my input:

the next time, please compress your attached text files with gzip, so they are much smaller and thus less of a bother.
onward to the problem at hand. first off, there is a small issue with your data file: you have atoms outside the box, so your input only works by accident.
also, your simulation box has too much vacuum in z-direction. this extra vacuum is not needed on input, because it will be added automatically by kspace_modify slab 3.0.
because of that, you end up with *much* more vacuum in your box than necessary, and since the kspace solver still has to do work in that vacuum (with slab 3.0, the effective z extent used by kspace is 3x your box height), this costs you time. luckily, LAMMPS can update the box for you with a few small script commands:

units           metal
atom_style      full
boundary        p p p

bond_style      hybrid morse harmonic
angle_style     harmonic

read_data        final-nvt.data

# reset image flags in z-direction and shrink box
set group all image NULL NULL 0
change_box all z final 10 80

write_data       updated.data

but this is only the start. please find below added/modified commands and my comments.

units           metal
atom_style      full

processors * * 1

this changes how LAMMPS subdivides the system among MPI ranks. it usually does this by volume, but splitting in z-direction is a bad idea here (because of the vacuum for the slab), so we enforce a domain decomposition in x and y only. this helps when running in parallel.

 

boundary        p p f                                 # slab in z direction

bond_style      hybrid morse harmonic
angle_style     harmonic

read_data        final-nvt.data

read_data updated.data

this reads in the updated box with the corrected image flags and the reduced dimensions in z direction.

pair_style    hybrid buck/coul/long 12.66 lj/cut/coul/long 9.8 12.66

pair_coeff 1 *   buck/coul/long 0.0 1.0 0.0 # ti
pair_coeff 2 *   buck/coul/long 0.0 1.0 0.0 # ts
pair_coeff 3 *   buck/coul/long 0.0 1.0 0.0 # o
pair_coeff 4 *   buck/coul/long 0.0 1.0 0.0 # ot
pair_coeff 5 *   buck/coul/long 0.0 1.0 0.0 # ht
pair_coeff 6 *   buck/coul/long 0.0 1.0 0.0 # ob
pair_coeff 7 *   buck/coul/long 0.0 1.0 0.0 # hb
pair_coeff 8 *   buck/coul/long 0.0 1.0 0.0 # os

pair_coeff 9 *   buck/coul/long 0.0 1.0 0.0 # ow
pair_coeff 10 *   buck/coul/long 0.0 1.0 0.0 # hw

pair_coeff 1 1 buck/coul/long 31119.34421 0.154 5.249854339 # ti-ti
pair_coeff 1 2 buck/coul/long 31119.34421 0.154 5.249854339 # ti-ts
pair_coeff 2 2 buck/coul/long 31119.34421 0.154 5.249854339 # ts-ts

pair_coeff 1 3 buck/coul/long 16957.06212 0.194 12.58965351 # ti-o
pair_coeff 1 6 buck/coul/long 16957.06212 0.194 12.58965351 # ti-ob
pair_coeff 1 8 buck/coul/long 16957.06212 0.194 12.58965351 # ti-os

pair_coeff 2 3 buck/coul/long 16957.06212 0.194 12.58965351 # ts-o
pair_coeff 2 6 buck/coul/long 16957.06212 0.194 12.58965351 # ts-ob
pair_coeff 2 8 buck/coul/long 16957.06212 0.194 12.58965351 # ts-os

pair_coeff 1 4 buck/coul/long 13680.19393 0.194 12.58965351 # ti-ot
pair_coeff 2 4 buck/coul/long 13680.19393 0.194 12.58965351 # ts-ot

pair_coeff 3 3 buck/coul/long 11782.43392 0.234 30.21916735 # o-o
pair_coeff 3 4 buck/coul/long 11782.43392 0.234 30.21916735 # o-ot
pair_coeff 3 6 buck/coul/long 11782.43392 0.234 30.21916735 # o-ob
pair_coeff 3 8 buck/coul/long 11782.43392 0.234 30.21916735 # o-os

pair_coeff 4 4 buck/coul/long 11782.43392 0.234 30.21916735 # ot-ot
pair_coeff 4 6 buck/coul/long 11782.43392 0.234 30.21916735 # ot-ob
pair_coeff 4 8 buck/coul/long 11782.43392 0.234 30.21916735 # ot-os

pair_coeff 6 6 buck/coul/long 11782.43392 0.234 30.21916735 # ob-ob
pair_coeff 6 8 buck/coul/long 11782.43392 0.234 30.21916735 # ob-os

pair_coeff 8 8 buck/coul/long 11782.43392 0.234 30.21916735 # os-os

pair_coeff 1 9 buck/coul/long 1239.879126 0.265 6.417724 # ti-ow
pair_coeff 2 9 buck/coul/long 1239.879126 0.265 6.417724 # ts-ow

pair_coeff 9 9 lj/cut/coul/long 0.00673835 3.166 # ow-ow
pair_coeff 2 9 lj/cut/coul/long 0.00673835 3.166 # o-ow
pair_coeff 4 9 lj/cut/coul/long 0.00673835 3.166 # ot-ow
pair_coeff 6 9 lj/cut/coul/long 0.00673835 3.166 # ob-ow

kspace_style     pppm 1e-6
kspace_modify      slab  3.0

kspace_modify order 7

the kspace order parameter tweaks the PPPM algorithm. a higher order means more computation per particle but a coarser grid with fewer grid points; a smaller order means less work per atom but a larger grid. allowed values are between 2 and 7. for a small system with lots of vacuum running on few processors, a higher order is faster; for a large, dense system running on many processors, a smaller order is faster. so we go with 7 here. you can experiment on your machine and find empirically which setting is faster for you, for example with a short benchmark loop like the sketch below.
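
as an illustration of such an experiment (a minimal sketch added for clarity, not part of the original message; the variable and label names are arbitrary), the following loop could go at the very end of the input in place of the final run command. it reruns a short segment once per allowed order so the Kspace timings reported in the log can be compared:

variable        ord loop 2 7            # loop over the allowed PPPM order values
label           loop_order
kspace_modify   order ${ord}            # takes effect when the next run is set up
run             500                     # short benchmark run; compare the Kspace time in the log
next            ord
jump            SELF loop_order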

 

special_bonds   coul 0.0 0.0 1.0    

group   fixed type 1 3
group   free  subtract all fixed

neighbor        2.0 bin
neigh_modify    delay 0 every 1 check yes exclude group fixed fixed

timestep 0.0007

thermo_style   custom step time temp press pe ke etotal enthalpy evdwl ecoul lx ly lz
thermo 50
dump        coordinates all xyz 50 traj.xyz

comm_modify   cutoff 20.0

fix    1_shake free shake 0.0001 20 0 b 1 2 a 1 2

fix         wall    all wall/reflect zhi EDGE

fix         wall    all wall/reflect zhi 80.0

this has nothing to do with performance, but with how the slab configuration works in LAMMPS. since it changes your box boundaries internally, you must not use the internal value (via EDGE), but the original limit, i.e. 80.0 (the zhi after the box shrink).
 

#velocity   all create 50.0 1281937 

fix         fix_nvt free nvt temp 50.0 100.0 100.0
run 5000

Please find the data file attached. Thanks for your help.

for faster testing, i've reduced the number of steps to 500, and with those i get on my test machine with the original input:

Loop time of 166.931 on 4 procs for 500 steps with 6575 atoms

Performance: 0.181 ns/day, 132.485 hours/ns, 2.995 timesteps/s
71.3% CPU use with 4 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.003526   | 19.096     | 47.112     | 394.6 | 11.44
Bond    | 0.001023   | 0.009013   | 0.017905   |   8.1 |  0.01
Kspace  | 117.32     | 145.42     | 164.76     | 143.8 | 87.12
Neigh   | 1.1234     | 1.1267     | 1.1305     |   0.3 |  0.67
Comm    | 0.009849   | 0.3959     | 0.57072    |  35.8 |  0.24
Output  | 0.068404   | 0.068759   | 0.069661   |   0.2 |  0.04
Modify  | 0.61466    | 0.69036    | 0.86142    |  12.0 |  0.41
Other   |            | 0.1199     |            |       |  0.07

which changes to the following after applying the tweaks mentioned above:

Loop time of 38.7467 on 4 procs for 500 steps with 6575 atoms

Performance: 0.780 ns/day, 30.751 hours/ns, 12.904 timesteps/s
93.2% CPU use with 4 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 17.222     | 18.058     | 18.372     |  11.4 | 46.61
Bond    | 0.007493   | 0.007959   | 0.00869    |   0.5 |  0.02
Kspace  | 18.755     | 19.085     | 19.878     |  10.6 | 49.25
Neigh   | 0.57585    | 0.57733    | 0.57881    |   0.2 |  1.49
Comm    | 0.42194    | 0.45468    | 0.50449    |   4.8 |  1.17
Output  | 0.054895   | 0.055042   | 0.055469   |   0.1 |  0.14
Modify  | 0.43946    | 0.4433     | 0.44685    |   0.4 |  1.14
Other   |            | 0.06576    |            |       |  0.17

so an improvement of over 4x. you can also see from the change in the min/avg/max time for Pair that the work is now much better balanced between processors.
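
as a further, optional tweak (a hedged suggestion added here, not part of the original exchange; the parameters are example values): if some Pair imbalance remains, the balance command can shift the x/y domain boundaries so that each processor owns a similar number of atoms, e.g. placed once after read_data:

balance         1.0 shift xy 10 1.05    # shift domain cuts in x/y, up to 10 iterations, until the atom-count imbalance drops below 1.05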

HTH,
     axel. ​

 

Sincerely,
Azade

On Sat, Jun 24, 2017 at 6:50 PM, Axel Kohlmeyer <akohlmey@...24...> wrote:


On Sat, Jun 24, 2017 at 12:36 PM, Azade Yazdan Yar <azade.yazdanyar@...43...4...> wrote:
Hi, 

The cutoff for Coulomb is 12 A (I have Buckingham and LJ as the short-range interactions, with cutoffs of 12 and 10 A, respectively).
At the beginning, I used ewald with an accuracy of 1e-6. Then I switched to pppm (with the same accuracy) after I saw the -DFFT_SINGLE option.
I am using the Intel compiler and Intel MPI for compiling LAMMPS. If I understand correctly, are you suggesting to recompile it with OpenMPI?

no, i am not.
 

I have 8 atom types in the solid slab, of which 2 are Ti, 4 are O, and 2 are H. The two Ti types, for example, have the same pair interactions with the other atom types in my system. Is there a way I can tell LAMMPS to do such calculations once instead of twice? The charges of the two Ti types are different, which is why I define two atom types. These calculations, though, are not what is slowing down the simulation after all.

what you are suggesting makes no sense. the slowdown is obviously due to the kspace calculation. are you using the kspace_modify slab option, or just a large vacuum region?


I also defined a group containing two of the atom types in the solid slab. I excluded this group in 'neigh_modify' and applied the thermostat etc. to the rest of the atom types, as I thought this could also reduce the number of calculations.

no, it doesn't. the exclusion only applies to the Pair part of the calculation, and that is not a significant part of your total CPU time.

we need to understand why kspace takes so much time. this is difficult to diagnose remotely, so please provide your (complete!) input.

axel.

 
 

Thanks a lot.

Sincerely,
Azade

On Sat, Jun 24, 2017 at 4:27 PM, Axel Kohlmeyer <akohlmey@...24...> wrote:


On Sat, Jun 24, 2017 at 6:15 AM, Azade Yazdan Yar <azade.yazdanyar@...24...> wrote:
Hi,

I have a system which consists of a solid slab, water and a vacuum layer. I am using pppm and the 'slab' option.
I used to use DL_POLY for my system, but due to its slow performance and poor scalability for my specific system, I decided to see how much better LAMMPS can do. I reduced the total number of atoms in my system to one fourth (before 24,000, now 6,800), as I figured out that my system had been unnecessarily large before.
After running some tests with LAMMPS and 'ewald' as the kspace style, this was the breakdown of the LAMMPS performance:

what is your pair style (coulomb) cutoff?

have you tried pppm instead of ewald?

have you tried using a mix of MPI and OpenMP?

in general, there is a limit to how few atoms per processor you can have before there is no further speedup; often this is in the range of a few hundred atoms per processor. your system is very small, so you cannot expect spectacular scaling here.
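
as an illustration of the MPI+OpenMP mix mentioned above (a minimal sketch added for clarity, not part of the original message; it assumes LAMMPS was built with the USER-OMP package and uses 2 threads as an example), these lines near the top of the input script, before read_data and the style commands, enable two threads per MPI rank and the /omp style variants; the job would then be launched with OMP_NUM_THREADS=2 and correspondingly fewer MPI ranks:

package         omp 2               # 2 OpenMP threads per MPI rank (requires the USER-OMP package)
suffix          omp                 # use the /omp variants of pair, bond, kspace styles where available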

axel. 

 

Loop time of 1550.53 on 24 procs for 5000 steps with 6575 atoms

Performance: 0.195 ns/day, 123.058 hours/ns, 3.225 timesteps/s
99.6% CPU use with 24 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.0082662  | 34.619     | 86.531     | 610.4 |  2.23
Bond    | 0.0041995  | 0.014318   | 0.040231   |  10.9 |  0.00
Kspace  | 1433       | 1485.3     | 1520.9     |  94.2 | 95.80
Neigh   | 0.947      | 0.96571    | 1.0072     |   2.1 |  0.06
Comm    | 0.01577    | 1.6679     | 2.531      |  67.3 |  0.11
Output  | 25.249     | 25.249     | 25.261     |   0.0 |  1.63
Modify  | 1.8545     | 2.3485     | 3.1424     |  28.0 |  0.15
Other   |            | 0.3262     |            |       |  0.02

Nlocal:    273.958 ave 759 max 0 min
Histogram: 12 0 0 0 0 6 2 0 0 4
Nghost:    7101.79 ave 15694 max 0 min
Histogram: 4 4 4 0 0 0 8 0 0 4
Neighs:    141151 ave 404803 max 0 min
Histogram: 12 0 0 2 2 0 2 2 0 4

As you can see, kspace is the bottleneck, so I read that I can instead use pppm and -DFFT_SINGLE to accelerate the simulation.
I recompiled LAMMPS, and here you can see the breakdown again:

Loop time of 307.837 on 24 procs for 5000 steps with 6575 atoms

Performance: 0.982 ns/day, 24.432 hours/ns, 16.242 timesteps/s
97.2% CPU use with 24 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.012733   | 37.727     | 112.62     | 691.2 | 12.26
Bond    | 0.0044436  | 0.013352   | 0.03664    |   9.4 |  0.00
Kspace  | 188.99     | 264.28     | 303.19     | 263.7 | 85.85
Neigh   | 1.2895     | 1.3094     | 1.3538     |   1.9 |  0.43
Comm    | 0.014927   | 1.7641     | 2.6289     |  67.4 |  0.57
Output  | 0.10792    | 0.10805    | 0.10915    |   0.1 |  0.04
Modify  | 1.8796     | 2.3371     | 3.041      |  25.7 |  0.76
Other   |            | 0.2954     |            |       |  0.10

Nlocal:    273.958 ave 729 max 0 min
Histogram: 12 0 0 0 0 1 6 1 0 4
Nghost:    7108.12 ave 15831 max 0 min
Histogram: 4 4 4 0 0 0 8 0 0 4
Neighs:    77342.3 ave 238561 max 0 min
Histogram: 12 0 0 4 0 1 1 2 1 3

So the speed increased in general. I tried different numbers of nodes; using 4 nodes gives me a parallel efficiency of 50%, which I can afford. With 4 nodes I will be able to do 2 ns/day. Before, with DL_POLY and the larger system, I used to get 0.5 ns/day, and that was in serial.

So as I see it, even though my system is smaller now, the performance has not improved as much as I had hoped. Can anyone give me some ideas about extra things I could try? This speed is still too expensive for my goal. I will be happy to give more information on the system details, but at this point I am not sure which details are of interest.

Sincerely,
Azade





--
Dr. Axel Kohlmeyer  akohlmey@...24...  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.






--
Dr. Axel Kohlmeyer  akohlmey@...24...  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.





--
Dr. Axel Kohlmeyer  akohlmey@...24...  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.