
Re: [lammps-users] error: segmentation fault_reax/c_KOKKOS


From: Mohammad Izadi <izadi0511@...24...>
Date: Thu, 7 Sep 2017 10:11:02 +0430

Dear Axel,

Thank you for your thorough and helpful response. I am working through all of your suggestions now.

 

Thanks a lot.

 

Best regards



=====================

Mohammad Ebrahim izadi,

Department of Chemistry,

Tehran University,

Islamic Republic of Iran,

Phone : +98 – 21 – 61113358

Fax :  +98 – 21 – 66409348


On Thu, Sep 7, 2017 at 1:51 AM, Axel Kohlmeyer <akohlmey@...24...> wrote:
p.s.: another option, which seems to work in your case, is to compile the kokkos_omp version; then the load balancing commands are not needed, and neither is tweaking toward a more evenly divisible processor grid, if you only parallelize over threads. try running like this:

lmp_kokkos_omp -in in.input -k on t 25 -pk kokkos newton on neigh half  -sf kk

you'll find some more details about the flags in the manual, but don't take the manual too literally. kokkos is under active development and things can change all the time. as a matter of fact, the manual sections for kokkos are currently being revised for the latest changes.
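
for reference, a minimal build sketch for that binary with the legacy make system (the package selection below just mirrors the one used earlier in this thread and is an assumption, not a requirement):

# from the src/ directory (sketch only)
make yes-all            # or only the packages you actually need
make no-lib             # drop packages that require external libraries
make yes-KOKKOS
make kokkos_omp         # builds ./lmp_kokkos_omp with OpenMP threading via KOKKOS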

axel.

On Wed, Sep 6, 2017 at 4:55 PM, Axel Kohlmeyer <akohlmey@...24...> wrote:
please do not attach .rar archives, since not a lot of people have the tools at hand to deal with them, especially on Linux. please use .zip or .tar.gz instead. thanks.
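
for example, assuming the input files sit in a directory named inputs/ (the directory name is just a placeholder):

tar czf inputs.tar.gz inputs/     # pack the directory into a gzipped tarball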

On Wed, Sep 6, 2017 at 4:30 AM, Mohammad Izadi <izadi0511@...24...> wrote:

Dear Axel,

Thank you for your help. I have updated my LAMMPS to lammps-11Aug17 and first built lmp_mpi with the commands below:

make yes-all

make no-lib

make mpi

With the command "nohup mpirun -np 25 ./lmp_mpi -sf intel < in.input &" I ran my simulation, but it stopped with the error "Segmentation fault".


using -sf intel here makes *no* sense. there is no reax support in the USER-INTEL package.

i can run your input, but it will eventually crash with a segmentation fault. it will even segfault eventually with just one processor.

I installed lmp_kokkos_mpi_only with the commands below:

make yes-all

make no-lib

make yes-KOKKOS

make kokkos_mpi_only

I ran with "nohup mpirun -np 25 ./lmp_kokkos_mpi_only -k on -sf kk < in.input &"; this run also stopped with a similar error, so this new version has the same problem.


this command line cannot work. what is needed is the following:

mpirun -np 25 lmp_kokkos_mpi_only -in in.input -k on -sf kk -pk kokkos newton on

BTW: using 25 processors is not a good choice, as you get a 1x5x5 domain decomposition; using 24 should be faster with a 2x4x3 split into subdomains.
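
just as an illustration (this line is not part of the original suggestion), such an even grid could also be requested explicitly in the input, before the box is defined by read_data:

processors 2 4 3     # sketch: ask for a 2x4x3 decomposition when running on 24 MPI ranks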

however, even with a proper command line and processor count, the run will crash immediately, and upon closer inspection the reason is your initial system configuration: you have a very "unbalanced" distribution of atoms, which are mostly located in one corner of your box and will only spread across the system as time progresses.
this poses two challenges to the reax implementations in LAMMPS:

1) the implementation in USER-REAXC (and its USER-OMP extension) "does not like it" when the local environment changes too much, which explains the occasional segfaults; they happen earlier when more processors are used and when the distribution of particles across sub-domains is more unbalanced. the impact of this (load) imbalance can be avoided by using thread parallelization instead of MPI parallelization, e.g. by compiling with OpenMP enabled (i.e. with -fopenmp added to the compiler and linker flags; see the makefile sketch after point 2) or by compiling the kokkos_omp target.

2) the kokkos implementation of reax does not seem to be "hardened" against having "empty" subdomains, and thus crashes immediately. having subdomains without atoms is very inefficient, so people who know how to deal with them avoid them; for example, here the empty subdomains can be avoided by using the balance command (and eventually fix balance) to adjust the subdomain dividing planes.
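
as a rough illustration of the OpenMP route from point 1 (a sketch for a GCC-based machine makefile such as src/MAKE/Makefile.mpi; the exact flag lists on your machine may differ):

CCFLAGS =    -g -O3 -fopenmp      # compiler flags with OpenMP enabled
LINKFLAGS =  -g -O3 -fopenmp      # linker flags with OpenMP enabled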


thus what i suggest you try is the following:

- stick with using the latest LAMMPS version with KOKKOS

- insert right before the "run" command the following two lines:

balance 1.1 shift xyz 5 1.01
fix 0 all balance 500 1.1 shift xyz 5 1.01

- run LAMMPS as follows: mpirun -np 24 ./lmp_kokkos_mpi_only -in in.input -k on -sf kk -pk kokkos newton on

- consider breaking your run into two sections: a) equilibration and b) production. i.e. do a shorter run that lasts just as long as the system needs to adjust to the (large) volume and become reasonably well equilibrated, and write out the final configuration as a data file with write_data; then do a second, longer run, with your analysis commands included, starting from the equilibration data file (a short sketch of this layout follows below). this should lower the risk of the reax code running into problems when the system changes too much. in general, the kokkos version is safer, but you have to make certain that there are no empty subdomains.
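
a minimal sketch of that two-stage layout (file names and step counts are placeholders, not values from your input):

# stage a) equilibration input: balance, equilibrate, save the state
balance         1.1 shift xyz 5 1.01
fix             0 all balance 500 1.1 shift xyz 5 1.01
run             100000                    # placeholder length, long enough for the gas to spread out
write_data      equilibrated.data         # placeholder file name

# stage b) production input: read the equilibrated data file, then add the
# analysis fixes (reax/c/species, reax/c/bonds, dump, ...) and the long run
# read_data     equilibrated.data
# run           4000000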

with these adjustments, i have been able to run your input deck for over 200000 MD steps on my 8-core desktop without any indication of a problem.

axel.

 

Please help me with this problem. All of my input files are attached to this email. I have some much bigger input files with very similar compositions that I want to run after solving this problem.


I sincerely thank you

Best regards



On Wed, Sep 6, 2017 at 12:45 AM, Axel Kohlmeyer <akohlmey@...24...> wrote:


On Mon, Sep 4, 2017 at 11:49 AM, Mohammad Izadi <izadi0511@...24...> wrote:

Dear lammps users,

I installed the LAMMPS kokkos_mpi_only build (lammps-31Mar17) on a server computer and I run the reax/c pair style with the KOKKOS package from the command line:


please first update to the latest LAMMPS version and test if your issue persists.
if yes, please provide a full input deck, so people elsewhere can reproduce this issue and debug it.

axel.

 

nohup mpirun -np 2 ./lmp_kokkos_mpi_only -k on -sf kk < in.input &

My system has 2073 atoms and it is in the gas phase. When I have a smaller system (e.g. a system with 200 atoms), it works without any error. My input file is below:

=================

echo            both
units           real
newton          on
atom_style      charge
dimension       3
boundary        p p p
#read_restart   restart22
restart         500 restart11 restart22
read_data       Silica2073.data
pair_style      reax/c NULL
pair_coeff      * * ffield.reax.input C H O N S Si Na Ar
neighbor        2 bin
neigh_modify    every 5 delay 0 check no
velocity        all create 2100 235485 mom yes rot yes
fix             1 all nvt temp 2100.0 2100.0 100.0
fix             2 all qeq/reax 1 0.0 10.0 1e-6 reax/c
fix             4 all reax/c/species 10 10 250 species.txt
fix             6 all efield 0.0001 0.0 0.0
fix_modify      6 energy yes
fix             7 all reax/c/bonds 250 bonds.reaxc
compute         reax all pair reax/c
variable        eb equal c_reax[1]
variable        ea equal c_reax[2]
variable        elp equal c_reax[3]
variable        emol equal c_reax[4]
variable        ev equal c_reax[5]
variable        epen equal c_reax[6]
variable        ecoa equal c_reax[7]
variable        ehb equal c_reax[8]
variable        et equal c_reax[9]
variable        eco equal c_reax[10]
variable        ew equal c_reax[11]
variable        ep equal c_reax[12]
variable        efi equal c_reax[13]
variable        eqeq equal c_reax[14]
thermo_style    custom step temp atoms etotal ke pe v_eb v_ea v_elp v_emol v_ev v_epen v_ecoa v_ehb v_et v_eco v_ew v_ep v_efi v_eqeq density vol press
thermo          250
timestep        0.1
dump            1 all xyz 250 dumpnvt.xyz
run             4000000

========================

Also, when I use a single-core run, it does not stop, but with multi-core runs and large systems (2073 atoms) it stops instantly with the error below:

=========================================

WARNING: Fixes cannot send data in Kokkos communication, switching to classic communication (../comm_kokkos.cpp:382)

[cschpc:169783] *** Process received signal ***

[cschpc:169783] Signal: Segmentation fault (11)

[cschpc:169783] Signal code: Address not mapped (1)

[cschpc:169783] Failing at address: (nil)

[cschpc:169783] [ 0] /lib64/libpthread.so.0() [0x3f6940f710]

[cschpc:169783] [ 1] ./lmp_kokkos_mpi_only(_ZN6Kokkos12parallel_forINS_11RangePolicyIJNS_6SerialEN9LAMMPS_NS27PairReaxFindBondSpeciesZeroEEEENS3_15PairReaxCKokkosIS2_EEEEvRKT_RKT0_RKSsPNS_4Impl9enable_ifIXntsrNSG_11is_integralIS8_EE5valueEvE4typeE+0x268) [0x17a1bc8]

[cschpc:169783] [ 2] ./lmp_kokkos_mpi_only(_ZN9LAMMPS_NS15PairReaxCKokkosIN6Kokkos6SerialEE15FindBondSpeciesEv+0xb0) [0x17aa0d0]

[cschpc:169783] [ 3] ./lmp_kokkos_mpi_only(_ZN9LAMMPS_NS15PairReaxCKokkosIN6Kokkos6SerialEE7computeEii+0x34a4) [0x17e26e4]

[cschpc:169783] [ 4] ./lmp_kokkos_mpi_only(_ZN9LAMMPS_NS12VerletKokkos5setupEv+0x6aa) [0x1a6b43a]

[cschpc:169783] [ 5] ./lmp_kokkos_mpi_only(_ZN9LAMMPS_NS3Run7commandEiPPc+0x65e) [0x1a2271e]

[cschpc:169783] [ 6] ./lmp_kokkos_mpi_only(_ZN9LAMMPS_NS5Input15command_creatorINS_3RunEEEvPNS_6LAMMPSEiPPc+0x26) [0xcfcc66]

[cschpc:169783] [ 7] ./lmp_kokkos_mpi_only(_ZN9LAMMPS_NS5Input15execute_commandEv+0x7e7) [0xcfb0f7]

[cschpc:169783] [ 8] ./lmp_kokkos_mpi_only(_ZN9LAMMPS_NS5Input4fileEv+0x317) [0xcfbc57]

[cschpc:169783] [ 9] ./lmp_kokkos_mpi_only(main+0x46) [0xd136c6]

[cschpc:169783] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f6881ed5d]

[cschpc:169783] [11] ./lmp_kokkos_mpi_only() [0x6adfd1]

[cschpc:169783] *** End of error message ***

--------------------------------------------------------------------------

mpirun noticed that process rank 0 with PID 169783 on node cschpc.ut.ac.ir exited on signal 11 (Segmentation fault).

--------------------------------------------------------------------------

Is it due to a shortage of RAM on the computer?

I cannot make sense of it. If you have any suggestions about this problem, I would be grateful.

 

Thanks in advance for your help

 

Best regards

 







--
Dr. Axel Kohlmeyer  akohlmey@...24...  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.



