Re: [lammps-users] Inconsistent error with using read_restart to run simulation post-equilibrium


From: Axel Kohlmeyer <akohlmey@...24...>
Date: Tue, 17 Apr 2018 13:57:21 -0400

On Tue, Apr 17, 2018 at 11:16 AM, Quang Ha <quang.t.ha.20@...24...> wrote:
> Hi all,
>
> I am trying to figure out restarting a simulation with read_restart. There
> seems to be something wrong with restarting the simulation, even though at
> the beginning everything seems to be restarted just fine. But the errors
> I am getting are different with each run (!), so I am confused about
> how to start debugging this. Here are some of the results: the previous
> simulation ended at step 3485. The time step where the error occurs during the
> restarted run is not consistent, even though I use the exact same script and
> read from the same restart file (restart.equil.mpiio).

have you tried without mpiio?
mpiio is only required when running with an extremely large number of
MPI ranks, and it is not as widely used, and thus not as well maintained,
as the regular i/o restart facility.
if that also crashes, does it happen only with multiple MPI ranks,
or also with a serial executable?
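to rule out the mpiio code path, you can write and read the same restart through the regular restart facility. a minimal sketch (filenames here are placeholders, not from your script; in LAMMPS the ".mpiio" filename suffix is what selects the MPI-IO writer, so dropping it falls back to regular i/o):

```
# write a plain (non-mpiio) restart at the end of equilibration;
# no ".mpiio" suffix means the regular restart writer is used
write_restart restart.equil

# in the follow-up run, read it back the same way
read_restart  restart.equil
```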

>
> At one time, it fails pretty late into the post-equilibrium simulation:
> Step v_time
> [...]
>     3843    11027.007
> lmp_mpi: malloc.c:3551: _int_malloc: Assertion `(bck->bk->size &
> NON_MAIN_ARENA) == 0' failed.
> [hyperion:17979] *** Process received signal ***
> [hyperion:17979] Signal: Aborted (6)
> [hyperion:17979] Signal code:  (-6)

this looks like some memory corruption.

>
> Some other time it crashed earlier:
> Step v_time
> [...]
>     3487    10005.509
> [hyperion:18092] *** Process received signal ***
> [hyperion:18092] Signal: Segmentation fault (11)
> [hyperion:18092] Signal code:  (128)
> [hyperion:18092] Failing at address: (nil)
>
> or even showing some terrifying lines of output:
> https://pastebin.com/iyxA5EHH

those are all messages that could be related to memory corruption.
a single corruption can open a cascade of components reporting issues.

> How should I go about debugging this behaviour? Is this where I have to use
> MPI debugging tools such as Totalview/DDT/VTune?

i would first check your binaries with valgrind's memcheck tool.
you are using OpenMPI, but for checking with valgrind i would recommend
using a serial executable or an MPICH based executable. i generally
prefer OpenMPI, but for valgrind it has too many tricks that cause false
positives.
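a typical memcheck invocation of a serial LAMMPS binary would look something like this (the binary and input file names are placeholders, not from your setup):

```shell
# run the serial LAMMPS binary under valgrind's memcheck tool;
# --track-origins helps trace where an uninitialized value came from
valgrind --tool=memcheck --track-origins=yes --leak-check=full \
    ./lmp_serial -in in.restart
```

watch the output for "Invalid read/write" reports; those usually point directly at the memory corruption.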

on the other hand, with OpenMPI you can launch multiple parallel
tasks inside a debugger when running on a local machine with a simple
trick:

mpirun -np 2 xterm -e gdb --args lmp_mpi -in ....

this will spawn two (local) xterms; you give focus to each of them and
then start each executable in the debugger with "run". i use this a lot to
debug parallel calculations on my desktop.

axel.

>
> Thanks,
> Quang
>



-- 
Dr. Axel Kohlmeyer  akohlmey@...24...  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.