
[lammps-users] Accelerate the simulation-kspace as the bottleneck


From: Azade Yazdan Yar <azade.yazdanyar@...24...>
Date: Sat, 24 Jun 2017 12:15:06 +0200

Hi,

I have a system which consists of a solid slab, water and a vacuum layer. I am using pppm and the 'slab' option.
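For reference, the kspace-related part of my input looks roughly like this (the accuracy target and slab factor here are placeholders rather than my actual values; the first test below used 'ewald' in place of 'pppm'):

  boundary        p p f           # non-periodic in z because of the vacuum layer
  kspace_style    pppm 1.0e-4     # placeholder accuracy target
  kspace_modify   slab 3.0        # placeholder slab factor; inserts empty volume along z
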
I used to use dlpoly for this system, but because of its slow performance and poor scalability for my particular setup, I decided to see how much better LAMMPS could do. I also reduced the total number of atoms to about a quarter of the original (from 24,000 to 6,800), since I realized the system had been unnecessarily large.
After running some tests with LAMMPS and 'ewald' as the kspace style, this was the breakdown of the performance:

Loop time of 1550.53 on 24 procs for 5000 steps with 6575 atoms

Performance: 0.195 ns/day, 123.058 hours/ns, 3.225 timesteps/s
99.6% CPU use with 24 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.0082662  | 34.619     | 86.531     | 610.4 |  2.23
Bond    | 0.0041995  | 0.014318   | 0.040231   |  10.9 |  0.00
Kspace  | 1433       | 1485.3     | 1520.9     |  94.2 | 95.80
Neigh   | 0.947      | 0.96571    | 1.0072     |   2.1 |  0.06
Comm    | 0.01577    | 1.6679     | 2.531      |  67.3 |  0.11
Output  | 25.249     | 25.249     | 25.261     |   0.0 |  1.63
Modify  | 1.8545     | 2.3485     | 3.1424     |  28.0 |  0.15
Other   |            | 0.3262     |            |       |  0.02

Nlocal:    273.958 ave 759 max 0 min
Histogram: 12 0 0 0 0 6 2 0 0 4
Nghost:    7101.79 ave 15694 max 0 min
Histogram: 4 4 4 0 0 0 8 0 0 4
Neighs:    141151 ave 404803 max 0 min
Histogram: 12 0 0 2 2 0 2 2 0 4

As you can see, kspace is the bottleneck, so I read that switching to pppm and compiling LAMMPS with -DFFT_SINGLE should accelerate the simulation.
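
In case the build change is useful to others, it is roughly the following edit in the machine makefile (the makefile name and any FFT library settings depend on your installation):

  # in src/MAKE/Makefile.<machine>
  FFT_INC = -DFFT_SINGLE

  # then rebuild from the src/ directory
  make clean-all
  make <machine>
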
With the rebuilt binary, here is the breakdown again:

Loop time of 307.837 on 24 procs for 5000 steps with 6575 atoms

Performance: 0.982 ns/day, 24.432 hours/ns, 16.242 timesteps/s
97.2% CPU use with 24 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.012733   | 37.727     | 112.62     | 691.2 | 12.26
Bond    | 0.0044436  | 0.013352   | 0.03664    |   9.4 |  0.00
Kspace  | 188.99     | 264.28     | 303.19     | 263.7 | 85.85
Neigh   | 1.2895     | 1.3094     | 1.3538     |   1.9 |  0.43
Comm    | 0.014927   | 1.7641     | 2.6289     |  67.4 |  0.57
Output  | 0.10792    | 0.10805    | 0.10915    |   0.1 |  0.04
Modify  | 1.8796     | 2.3371     | 3.041      |  25.7 |  0.76
Other   |            | 0.2954     |            |       |  0.10

Nlocal:    273.958 ave 729 max 0 min
Histogram: 12 0 0 0 0 1 6 1 0 4
Nghost:    7108.12 ave 15831 max 0 min
Histogram: 4 4 4 0 0 0 8 0 0 4
Neighs:    77342.3 ave 238561 max 0 min
Histogram: 12 0 0 4 0 1 1 2 1 3

So the speed increased overall. I tried different numbers of nodes; with 4 nodes I get about 50% parallel efficiency, which I can afford, and a rate of 2 ns/day. Before, with dlpoly and the larger system, I got 0.5 ns/day, and that was in serial.
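
(To make the arithmetic explicit, assuming the 24-task runs above were on a single node: 0.982 ns/day x 4 nodes = about 3.9 ns/day at ideal scaling, and the 2 ns/day I actually get is roughly 50% of that.)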

So as I see it, even though my system is smaller now, the performance has not improved as much as I had hoped. Can anyone suggest anything else I could try? This speed is still too expensive for my goal. I am happy to provide more details about the system, but at this point I am not sure which details are of interest.

Sincerely,
Azade