Re: [lammps-users] GPU compiled but binary sleeps...



From: "Meij, Henk" <hmeij@...1881...>
Date: Thu, 3 Aug 2017 17:42:07 +0000

Oh goodness. My bad. Your comment led me to check my PATH env while I had been focused on my LIB env. It's all working now. My CUDA toolkit is on an NFS mount, /share/cuda, while CUDA itself is in /usr/local on each node. I was missing a bin/ location in my job submission script. ldd showed no missing libs, which puzzled me (too much).
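
For anyone hitting the same hang, here is a minimal sketch of the kind of job-script fragment that addresses a missing bin/ entry. The helper name prepend_dir and the CUDA_TOOLKIT variable are made up for illustration, and /share/cuda/bin is an assumption: the post only says the toolkit sits under /share/cuda and that "a bin/ location" was missing.

```shell
# Prepend a directory to a colon-separated path list unless it is already there.
# Usage: prepend_dir DIR LIST   (prints the resulting list)
prepend_dir() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;               # already present: unchanged
        *)        printf '%s\n' "${1}${2:+:$2}" ;;    # otherwise put DIR first
    esac
}

# Assumed location; the thread does not spell out the exact bin/ directory.
CUDA_TOOLKIT=${CUDA_TOOLKIT:-/share/cuda}

PATH=$(prepend_dir "$CUDA_TOOLKIT/bin" "$PATH")
export PATH
```

Note that the mpirun line quoted further down forwards only LD_LIBRARY_PATH with -x. If ranks are started on other nodes, forwarding PATH the same way (-x PATH) may also be needed; that is a guess about the setup, not something confirmed in this thread.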


And yes, other programs are running. I could observe my LAMMPS GPU job starting on the GPU, then hanging.


-Henk


[hmeij@...7038... colloid-gpu]$ nvidia-smi                                           
Thu Aug  3 09:04:33 2017                                                      
+------------------------------------------------------+                      
| NVIDIA-SMI 4.304.54   Driver Version: 304.54         |                      
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m               | 0000:02:00.0     Off |                    0 |
| N/A   29C    P0    53W / 225W |   2%   84MB / 4799MB |     50%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m               | 0000:03:00.0     Off |                    0 |
| N/A   27C    P8    15W / 225W |   0%   13MB / 4799MB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m               | 0000:83:00.0     Off |                    0 |
| N/A   39C    P0   123W / 225W |  23% 1101MB / 4799MB |     91%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m               | 0000:84:00.0     Off |                    0 |
| N/A   25C    P8    15W / 225W |   0%   13MB / 4799MB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     32206  /share/apps/CENTOS6/lammps/31Mar17/lmp_gpu           69MB   |
|    2     14429  pmemd.cuda.MPI                                     1086MB   |
+-----------------------------------------------------------------------------+



From: Axel Kohlmeyer <akohlmey@...24...>
Sent: Wednesday, August 2, 2017 7:40:10 PM
To: Meij, Henk
Cc: lammps-users@lists.sourceforge.net
Subject: Re: [lammps-users] GPU compiled but binary sleeps...
 


On Wed, Aug 2, 2017 at 9:24 AM, Meij, Henk <hmeij@...1881...> wrote:

Should mention I compiled for K20 using cuda v5.0 ... maybe too old?

hard to say. what is the output of nvc_get_devices?
can you run any other GPU code on this machine?

axel.
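
A quick pre-flight check along these lines can be sketched as follows. The function name missing_tools is made up for illustration. The two tool names come from this thread; whether nvc_get_devices is installed on PATH at a given site is an assumption.

```shell
# Print each named command that cannot be found on the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names taken from this thread; adjust the list for your own site.
missing=$(missing_tools nvidia-smi nvc_get_devices)
if [ -n "$missing" ]; then
    # Unquoted on purpose so several names print on one line.
    echo "not on PATH:" $missing >&2
fi
```

Running this at the top of the job script, in the same environment the scheduler gives the job, shows whether the batch environment differs from the login shell.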

 


I tried this recipe and did it all without errors to end up with lmp_gnu and lmp_gpu.

The CPU run works fine, but the GPU run disappears into a nanosleep loop again. Weird.


#http://comsics.usm.my/tlyoon/configrepo/howto/customise_centos/inst_lammps_31Mar17_gnu.txt


Any ideas?

-Henk



From: Meij, Henk
Sent: Monday, July 31, 2017 11:09:27 AM
To: lammps-users@...396...sourceforge.net
Subject: GPU compiled but binary sleeps...
 

Hi all, I compiled 31Mar17 with g++ with some packages and libjpeg, sequence below.


lmp_serial and lmp_mpi (openmpi) compile and execute the colloid example successfully.


Then I compile lmp_auto (for some reason editing lib/gpu/Makefile[auto|linux] has no effect) with CUDA_HOME etc. all set. There are no compilation errors; the compilation finishes and lmp_auto is created (which I rename to lmp_gpu_double).


cd /tmp/lammps/lammps-31Mar17/src
make yes-gpu; make yes-colloid; make yes-class2; make yes-kspace; make yes-misc; make yes-molecule

make clean
./Make.py -v -j 2 -p colloid class2 kspace misc molecule gpu -gpu mode=double arch=35 -o gpu_double -a lib-gpu file clean mpi


But when I run the GPU colloid example via the scheduler, LAMMPS starts on the allocated GPU and hangs. Here is the scheduler invocation (GPUIDX does nothing right now; it is a toggle flag for CPU-only or GPU-only, but I'm running vanilla in.colloid):


executing /share/apps/CENTOS6/openmpi/1.8.4/bin/mpirun -x LD_LIBRARY_PATH -machinefile /home/hmeij/.lsbatch/mpi_machines.855573 -np 1 /share/apps/CENTOS6/lammps/31Mar17/lmp_gpu_double -suffix gpu -var GPUIDX 1 -in in.colloid -l out.colloid
LAMMPS (31 Mar 2017)   <---- output from job

strace reveals it looping in nanosleep calls (and what's with the 284g VIRT footprint? The GPU process launched uses 61MB on the GPU):
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20054 hmeij     20   0 23752 2340 1148 S  0.0  0.0   0:00.02 res
20058 hmeij     20   0  103m 1248 1044 S  0.0  0.0   0:00.00 1501267348.8555
20061 hmeij     20   0  103m 1316 1100 S  0.0  0.0   0:00.00 1501267348.8555
20178 hmeij     20   0  104m 1320 1104 S  0.0  0.0   0:00.00 openmpi-mpirun-
20276 hmeij     20   0  139m 3544 2428 S  0.0  0.0   0:00.08 mpirun
20278 hmeij     20   0  284g  43m  38m S  0.0  0.0   0:13.35 lmp_gpu_double

[root@...7028... ~]# strace -p 20278
Process 20278 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
ioctl(16, 0xc020462a, 0x7fffffffa7b0)   = 0
nanosleep({10, 0}, NULL)                = 0
ioctl(16, 0xc020462a, 0x7fffffffa7b0)   = 0
nanosleep({10, 0}, ^C <unfinished ...>
Process 20278 detached

Any pointers as to what may be the cause? Thanks,

-Henk








--
Dr. Axel Kohlmeyer  akohlmey@...24...  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.