Re: [lammps-users] Fix Addforce Slow Down the Computational Efficiency


From: Axel Kohlmeyer <akohlmey@...24...>
Date: Wed, 18 Apr 2018 15:46:00 -0400

On Wed, Apr 18, 2018 at 3:33 PM, Wei Peng <pengwrpi2@...24...> wrote:
> Dear Axel,
>
> I do appreciate your patience in answering my questions. If I cannot find

this patience is limited, though, and you are reaching this limit right now.

> an ideal solution, I am thinking about updating all the forces every 10
> steps instead of every step, meaning we add the same forces from time step 1
> to time step 10, update the forces at time step 11, and keep using the
> updated forces until time step 20. In this way, the communication cost would
> be reduced by a factor of 10.

please let me state that i find your behavior very irritating. you
ask for help, but then you do not provide sufficient information to
test and debug the problem independently, and instead expect me to
trust your assessments. since those have already proved incorrect, i
refuse to speculate any further. there may be all kinds of things
wrong with your simulation input that have nothing to do with the fix
addforce operation. there are other details in the timing output that
look suspicious, too.

at this stage, it is also a bad idea to keep debugging the *whole*
system. instead, you should construct a much simplified test system
and build up the whole procedure step by step, to understand which
operation is the time-critical one.
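
for illustration only, a rough and untested sketch: a bare-bones LJ
box like the one below gives you a timing baseline, and then you add
your computes, variables, and fix addforce commands one piece at a
time and rerun, so you can see which addition is the expensive one.

units            lj
lattice          fcc 0.8442
region           box block 0 10 0 10 0 10
create_box       1 box
create_atoms     1 box
mass             1 1.0
pair_style       lj/cut 2.5
pair_coeff       1 1 1.0 1.0 2.5
velocity         all create 1.0 12345
fix              integrate all nvt temp 1.0 1.0 0.5
thermo           100
run              1000    # baseline; then add one compute/variable/fix at a time and rerun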

> However, I could not find a way of changing the frequency of the evaluation
> of a variable in LAMMPS. Can you tell me if there is any way of doing that?

variables are updated when they get accessed.
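
if you absolutely insist on decoupling the (expensive) evaluation of
the per-atom variables from the per-step force application, one
*possible* direction (untested, so check the fix store/state docs
before trusting it) is to cache the per-atom values every N steps and
let fix addforce read the cached copies:

# sketch only: assumes fix store/state accepts v_name inputs and refreshes them every N steps
fix              fcache1 half1 store/state 10 v_fx1 v_fy1 v_fz1
variable         fx1c atom f_fcache1[1]
variable         fy1c atom f_fcache1[2]
variable         fz1c atom f_fcache1[3]
fix              addfnp1 half1 addforce v_fx1c v_fy1c v_fz1c

note that this only caches the per-atom part; the equal-style
variables for the liquid group would still be evaluated every step.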

overall, i consider this a very bad idea, since you are simply
avoiding understanding what the real problem is.
on top of that, you *are* doing these tasks in a way that is
suboptimal compared to writing C++ code, so i think your whole
approach needs to be revised.

if you want to continue like this, you are free to do so, but don't
expect much help.
if you do want further help, you have to make it *much* easier to help you.

axel.

> Thanks,
> Wei
>
> Wei Peng
> Graduate Student at Rensselaer Polytechnic Institute
> Department of Materials Science and Engineering
> 110 8th Street, Troy, NY 12180
>
> On Tue, Apr 17, 2018 at 4:52 PM, Axel Kohlmeyer <akohlmey@...24...> wrote:
>>
>> On Tue, Apr 17, 2018 at 4:47 PM, Wei Peng <pengwrpi2@...24...> wrote:
>> > Dear Axel,
>> >
>> > Thank you for your follow-up! I tried the method you suggested, but the
>> > performance got worse. I am attaching the input file here; can you tell me
>> > which part was wrong? Thank you!
>>
>> there is no usable input here anywhere, so i cannot say anything.
>> perhaps you are chasing a red herring and your real performance
>> issue is somewhere else.
>>
>> axel.
>>
>> >
>> > group            np1 id 1:429:1
>> >
>> > #group atoms into nanoparticle 1
>> >
>> > compute          coord1 np1 property/atom xu yu zu
>> > compute          c1 np1 com
>> > variable         cond1 atom "(c_coord1[1] - c_c1[1]) * 1 > 0.0"
>> > group            half1 dynamic np1 var cond1 every 1
>> > run              1
>> > group            half1 static
>> >
>> > # find the half of the 1st nanoparticle that the force will be applied to
>> >
>> > variable         famp equal "0.1"
>> >
>> > variable         dirx1 atom "c_coord1[1]-c_c1[1]"
>> > variable         diry1 atom "c_coord1[2]-c_c1[2]"
>> > variable         dirz1 atom "c_coord1[3]-c_c1[3]"
>> > variable         diramp1 atom "sqrt(v_dirx1^2 + v_diry1^2 + v_dirz1^2)"
>> > variable         fx1 atom "gmask(half1)*v_famp*v_dirx1/v_diramp1"
>> > variable         fy1 atom "gmask(half1)*v_famp*v_diry1/v_diramp1"
>> > variable         fz1 atom "gmask(half1)*v_famp*v_dirz1/v_diramp1"
>> > fix              addfnp1 half1 addforce v_fx1 v_fy1 v_fz1
>> >
>> > # calculate and apply the force onto the half of the 1st nanoparticle.
>> > # half1 is the group ID of the half of the nanoparticle that the force was applied to
>> >
>> > variable         nliquid equal "count(liquid)"
>> >
>> > compute          ftotal all reduce sum v_fx1 v_fy1 v_fz1 v_fx2 v_fy2 v_fz2 &
>> >                  v_fx3 v_fy3 v_fz3 v_fx4 v_fy4 v_fz4 v_fx5 v_fy5 v_fz5 &
>> >                  v_fx6 v_fy6 v_fz6 v_fx7 v_fy7 v_fz7 v_fx8 v_fy8 v_fz8
>> > variable         fxliquid equal "-1.0 * (c_ftotal[1] + c_ftotal[4] + c_ftotal[7] + c_ftotal[10] + c_ftotal[13] + c_ftotal[16] + c_ftotal[19] + c_ftotal[22]) / v_nliquid"
>> > variable         fyliquid equal "-1.0 * (c_ftotal[2] + c_ftotal[5] + c_ftotal[8] + c_ftotal[11] + c_ftotal[14] + c_ftotal[17] + c_ftotal[20] + c_ftotal[23]) / v_nliquid"
>> > variable         fzliquid equal "-1.0 * (c_ftotal[3] + c_ftotal[6] + c_ftotal[9] + c_ftotal[12] + c_ftotal[15] + c_ftotal[18] + c_ftotal[21] + c_ftotal[24]) / v_nliquid"
>> > fix              addliquid liquid addforce v_fxliquid v_fyliquid v_fzliquid
>> >
>> > # here is how I sum up all the applied forces and apply the opposite to the liquid atoms
>> >
>> > Loop time of 356.273 on 1024 procs for 10000 steps with 53432 atoms
>> >
>> > Performance: 24251.090 tau/day, 28.068 timesteps/s
>> > 100.0% CPU use with 1024 MPI tasks x 1 OpenMP threads
>> >
>> > MPI task timing breakdown:
>> > Section |  min time  |  avg time  |  max time  |%varavg| %total
>> > ---------------------------------------------------------------
>> > Pair    | 0.12908    | 0.34635    | 0.37527    |   2.9 |  0.10
>> > Bond    | 0.013135   | 0.072627   | 3.3069     |  93.9 |  0.02
>> > Neigh   | 7.3525     | 7.7132     | 7.8345     |   3.1 |  2.16
>> > Comm    | 3.8194     | 4.5665     | 9.9215     |  53.9 |  1.28
>> > Output  | 0.0743     | 0.07437    | 0.079351   |   0.1 |  0.02
>> > Modify  | 337.3      | 342.59     | 343.31     |   6.4 | 96.16
>> > Other   |            | 0.9084     |            |       |  0.25
>> >
>> > Nlocal:    52.1797 ave 267 max 42 min
>> > Histogram: 988 15 10 3 4 2 0 0 1 1
>> > Nghost:    236.644 ave 752 max 208 min
>> > Histogram: 955 36 13 7 5 4 2 0 1 1
>> > Neighs:    263.897 ave 347 max 47 min
>> > Histogram: 5 4 1 6 4 22 305 500 171 6
>> >
>> > Total # of neighbors = 270231
>> > Ave neighs/atom = 5.05747
>> > Ave special neighs/atom = 0.63243
>> > Neighbor list builds = 2220
>> > Dangerous builds = 0
>> > Total wall time: 0:05:58
>> >
>> > # This is the performance report. With the new method, the average speed is
>> > # 28 timesteps/second. Previously, it was 45 timesteps/second.
>> >
>> > Wei Peng
>> > Graduate Student at Rensselaer Polytechnic Institute
>> > Department of Materials Science and Engineering
>> > 110 8th Street, Troy, NY 12180
>> >
>> > On Tue, Apr 17, 2018 at 9:27 AM, Axel Kohlmeyer <akohlmey@...24...>
>> > wrote:
>> >>
>> >> On Mon, Apr 16, 2018 at 8:28 PM, Wei Peng <pengwrpi2@...24...> wrote:
>> >> > Dear Axel,
>> >> >
>> >> > Thank you so much for your prompt response!
>> >> >
>> >> > Can you tell me how to correctly merge all the reduction?
>> >> >
>> >> > I tried this:
>> >> >
>> >> > compute ftotal all reduce sum v_fx1 v_fy1 v_fz1 v_fx2 v_fy2 v_fz2 &
>> >> >         v_fx3 v_fy3 v_fz3 v_fx4 v_fy4 v_fz4 v_fx5 v_fy5 v_fz5 &
>> >> >         v_fx6 v_fy6 v_fz6 v_fx7 v_fy7 v_fz7 v_fx8 v_fy8 v_fz8
>> >> >
>> >> > And I found that c_ftotal[1] is wrong, i.e. not the same as c_fxsum1 (which
>> >> > comes from "compute fxsum1 half1 reduce sum v_fx1"). Ideally, I would like
>> >> > v_fx1 to be defined only for group half1, but I don't know how to limit the
>> >> > scope of a per-atom variable to a fraction of the atoms in the simulation box.
>> >> >
>> >> > I am thinking about explicitly setting the fx1 value of all other atoms to
>> >> > zero (but I still don't know how) and then merging the reductions as I posted
>> >> > above. But this is clearly not an elegant solution. What advice do you have
>> >> > for either defining v_fx1 exclusively on the atoms of group half1 or doing
>> >> > the merge of the reductions differently?
>> >>
>> >> you need to use the gmask(groupID) function in the individual
>> >> atom-style variables that you want to sum over, to select the atoms
>> >> by group. gmask() is 1 for atoms in the group and 0 for atoms outside
>> >> it, but as a per-atom function it runs perfectly in parallel.
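>> >>
>> >> for example, a minimal (untested) sketch reusing your variable names:
>> >>
>> >> variable fx1 atom "gmask(half1)*v_famp*v_dirx1/v_diramp1"
>> >> compute  fxsum1 all reduce sum v_fx1
>> >>
>> >> with the gmask() factor, atoms outside half1 contribute exactly zero,
>> >> so the reduction can safely run over the "all" group.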
>> >>
>> >> axel.
>> >>
>> >>
>> >> >
>> >> > Thanks again,
>> >> > Wei
>> >> >
>> >> >
>> >> > Wei Peng
>> >> > Graduate Student at Rensselaer Polytechnic Institute
>> >> > Department of Materials Science and Engineering
>> >> > 110 8th Street, Troy, NY 12180
>> >> >
>> >> > On Mon, Apr 16, 2018 at 5:59 PM, Axel Kohlmeyer <akohlmey@...24...>
>> >> > wrote:
>> >> >>
>> >> >> On Mon, Apr 16, 2018 at 4:58 PM, Wei Peng <pengwrpi2@...24...>
>> >> >> wrote:
>> >> >> > Dear LAMMPS administrators or users,
>> >> >> >
>> >> >> > I am simulating a system with 8 nanoparticles made of Lennard-Jones
>> >> >> > atoms connected by FENE bonds. Every step, I apply a force to each atom
>> >> >> > contained in one half of each nanoparticle. The force applied to each
>> >> >> > atom points toward the center of mass of the whole nanoparticle. To
>> >> >> > conserve momentum, I add an opposite force to the rest of the system.
>> >> >> > To take the additional energy out of the system, I use a Nose-Hoover
>> >> >> > thermostat.
>> >> >> >
>> >> >> > I use the fix addforce command to apply the extra force. It works very
>> >> >> > well and gives the expected result. However, the computation is slowed
>> >> >> > down significantly by this additional force (~20 times slower). I guessed
>> >> >> > that the low efficiency was mainly caused by communication cost, but that
>> >> >> > turns out not to be the case (according to the output file produced by
>> >> >> > LAMMPS).
>> >> >>
>> >> >> that output only includes the cost of communication for the regular,
>> >> >> known communication points. it does not include time spent in
>> >> >> communication that happens inside variable updates or computes.
>> >> >>
>> >> >> the obvious cost is compute reduce. you should be able to cut this
>> >> >> cost by a significant margin by doing one reduction over three values
>> >> >> instead of three reductions over one value each.
>> >> >> ...and since you have seven more nanoparticles, you can reduce that
>> >> >> cost even further by combining all reductions into one.
>> >> >> so right now you seem to be doing 24 reductions, which can be handled
>> >> >> by just one compute reduce.
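>> >> >>
>> >> >> e.g. something along these lines (sketch only, shown for the first
>> >> >> two nanoparticles; extend the value list to all eight):
>> >> >>
>> >> >> # assumes each v_fx*/v_fy*/v_fz* is zero for atoms outside its half group
>> >> >> compute  ftotal all reduce sum v_fx1 v_fy1 v_fz1 v_fx2 v_fy2 v_fz2
>> >> >> variable fxliquid equal "-(c_ftotal[1]+c_ftotal[4])/count(liquid)"
>> >> >> variable fyliquid equal "-(c_ftotal[2]+c_ftotal[5])/count(liquid)"
>> >> >> variable fzliquid equal "-(c_ftotal[3]+c_ftotal[6])/count(liquid)"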
>> >> >>
>> >> >> another "hidden" reduction operation is in the count(groupID)
>> >> >> function. if the number doesn't change over the course of a run, you
>> >> >> can cache it with something like this:
>> >> >>
>> >> >> variable nliquid equal $(count(liquid))
>> >> >>
>> >> >> so by avoiding redundant reductions you should be able to
>> >> >> significantly reduce the computational effort.
>> >> >>
>> >> >> axel.
>> >> >>
>> >> >> >
>> >> >> > I have attached the input file and the output file here. The version of
>> >> >> > LAMMPS is Aug. 2017.
>> >> >> >
>> >> >> > Can you give me any advice on accelerating the computation? Thank you for
>> >> >> > your help!
>> >> >> >
>> >> >> > Below is the input file:
>> >> >> >
>> >> >> >
>> >> >> > compute          coord1 np1 property/atom xu yu zu
>> >> >> > compute          c1 np1 com
>> >> >> >
>> >> >> > # np1 is the group id of nanoparticle 1.
>> >> >> >
>> >> >> > variable         famp equal "0.1"
>> >> >> >
>> >> >> > variable         dirx1 atom "c_coord1[1]-c_c1[1]"
>> >> >> > variable         diry1 atom "c_coord1[2]-c_c1[2]"
>> >> >> > variable         dirz1 atom "c_coord1[3]-c_c1[3]"
>> >> >> > variable         diramp1 atom "sqrt(v_dirx1^2 + v_diry1^2 + v_dirz1^2)"
>> >> >> > variable         fx1 atom "v_famp*v_dirx1/v_diramp1"
>> >> >> > variable         fy1 atom "v_famp*v_diry1/v_diramp1"
>> >> >> > variable         fz1 atom "v_famp*v_dirz1/v_diramp1"
>> >> >> > compute          fxsum1 half1 reduce sum v_fx1
>> >> >> > compute          fysum1 half1 reduce sum v_fy1
>> >> >> > compute          fzsum1 half1 reduce sum v_fz1
>> >> >> > fix              addfnp1 half1 addforce v_fx1 v_fy1 v_fz1
>> >> >> >
>> >> >> > # "half1" is the group id of the half of nanoparticle 1 that the force
>> >> >> > # was applied to
>> >> >> > # I did the same thing for the other 7 nanoparticles, which is not posted
>> >> >> > # here.
>> >> >> >
>> >> >> > variable         fxliquid equal "-1.0 * (c_fxsum1+c_fxsum2+c_fxsum3+c_fxsum4+c_fxsum5+c_fxsum6+c_fxsum7+c_fxsum8)/count(liquid)"
>> >> >> > variable         fyliquid equal "-1.0 * (c_fysum1+c_fysum2+c_fysum3+c_fysum4+c_fysum5+c_fysum6+c_fysum7+c_fysum8)/count(liquid)"
>> >> >> > variable         fzliquid equal "-1.0 * (c_fzsum1+c_fzsum2+c_fzsum3+c_fzsum4+c_fzsum5+c_fzsum6+c_fzsum7+c_fzsum8)/count(liquid)"
>> >> >> >
>> >> >> > fix              addliquid liquid addforce v_fxliquid v_fyliquid v_fzliquid
>> >> >> >
>> >> >> > # I summed up the forces applied to the 8 nanoparticles, took the opposite,
>> >> >> > # and added it to every atom in group "liquid".
>> >> >> >
>> >> >> > Here is the performance report in the output file:
>> >> >> >
>> >> >> > Loop time of 2228.15 on 1024 procs for 100000 steps with 53432 atoms
>> >> >> >
>> >> >> > Performance: 38776.610 tau/day, 44.880 timesteps/s
>> >> >> > 100.0% CPU use with 1024 MPI tasks x 1 OpenMP threads
>> >> >> >
>> >> >> > MPI task timing breakdown:
>> >> >> > Section |  min time  |  avg time  |  max time  |%varavg| %total
>> >> >> > ---------------------------------------------------------------
>> >> >> > Pair    | 3.1147     | 3.4368     | 3.6571     |   4.7 |  0.15
>> >> >> > Bond    | 0.14084    | 0.74264    | 5.9983     | 113.4 |  0.03
>> >> >> > Neigh   | 152.64     | 153.88     | 154.96     |   4.0 |  6.91
>> >> >> > Comm    | 39.216     | 47.364     | 63.729     |  80.0 |  2.13
>> >> >> > Output  | 8.1437     | 8.8403     | 9.5221     |  13.4 |  0.40
>> >> >> > Modify  | 1982.8     | 1999.6     | 2007.8     |  12.5 | 89.74
>> >> >> > Other   |            | 14.33      |            |       |  0.64
>> >> >> >
>> >> >> > Nlocal:    52.1797 ave 478 max 40 min
>> >> >> > Histogram: 1010 6 2 2 0 2 0 1 0 1
>> >> >> > Nghost:    236.358 ave 1235 max 204 min
>> >> >> > Histogram: 993 16 4 2 3 3 1 1 0 1
>> >> >> > Neighs:    261.92 ave 343 max 63 min
>> >> >> > Histogram: 2 3 4 4 5 66 355 429 145 11
>> >> >> >
>> >> >> > Total # of neighbors = 268206
>> >> >> > Ave neighs/atom = 5.01958
>> >> >> > Ave special neighs/atom = 0.63243
>> >> >> > Neighbor list builds = 22088
>> >> >> > Dangerous builds = 0
>> >> >> > Total wall time: 0:37:10
>> >> >> >
>> >> >> >
>> >> >> > Sincerely,
>> >> >> > Wei
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > ------------------------------------------------------------------------------
>> >> >> > Check out the vibrant tech community on one of the world's most
>> >> >> > engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> >> >> > _______________________________________________
>> >> >> > lammps-users mailing list
>> >> >> > lammps-users@lists.sourceforge.net
>> >> >> > https://lists.sourceforge.net/lists/listinfo/lammps-users
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Dr. Axel Kohlmeyer  akohlmey@...24...  http://goo.gl/1wk0
>> >> >> College of Science & Technology, Temple University, Philadelphia PA,
>> >> >> USA
>> >> >> International Centre for Theoretical Physics, Trieste. Italy.
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Dr. Axel Kohlmeyer  akohlmey@...24...  http://goo.gl/1wk0
>> >> College of Science & Technology, Temple University, Philadelphia PA,
>> >> USA
>> >> International Centre for Theoretical Physics, Trieste. Italy.
>> >
>> >
>>
>>
>>
>> --
>> Dr. Axel Kohlmeyer  akohlmey@...24...  http://goo.gl/1wk0
>> College of Science & Technology, Temple University, Philadelphia PA, USA
>> International Centre for Theoretical Physics, Trieste. Italy.
>
>



-- 
Dr. Axel Kohlmeyer  akohlmey@...24...  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.