Date: Thu, 24 Jun 1999 19:41:18 +0800
From: David Yeung
Organization: HKUST
To: John Michalakes
Subject: Re: MM5 on PC cluster fails after modifying value in XENNES

John,

Your solution works! I have run the same case twice, and both runs were
successful.

Thanks

david

> David,
>
> I believe this is the same problem as before, having to do with MPI and
> p4. I have one idea to try.
>
> In the file MPP/RSL/RSL/makefile.linux add -DRSL_SYNCIO to the CFLAGS
> line. While still in that directory, type 'make clean' and 'make
> linux', then cd back up to top level and make mpp.
>
> This will cause node zero to send a message to each processor allowing
> the processor to send its data to be output. Ordinarily the nodes just
> send their data whether node 0 is ready or not. The change takes some
> of the stress off MPI to be buffering up these sends before node0 can
> pull off the data.
>
> Please try this and let me know if it helps.
>
> Thanks,
>
> John
>
>>
>> Dear John
>>
>> We have recently modified the XENNES values in mm5.deck from:
>>
>> XENNES = 4320., 2880., 0., 0., 0., 0., 0., 0., 0., 0.
>>
>> to
>>
>> XENNES = 4320., 4320., 0., 0., 0., 0., 0., 0., 0., 0.
>>
>>
>> The execution time has become much longer, and the run now fails much
>> more frequently with a p4_error. The error message is slightly different
>> from the one I reported to you last time. This time the message displayed
>> by mm5.deck on the screen is:
>>
>> p4_error: net_recv read: probable EOF on socket: 1
>>
>> I am not sure whether it is the same problem as last time. However, even
>> rebooting all the cluster PCs does not help this time, and the error
>> always occurs at the end of the run. I have run the case about 5 times,
>> and only one run was successful. The successful run took almost 4 hours
>> to finish, and the other runs always failed at around 3:57 or later.
>>
>> Here are the error messages (from mm5.deck) for my last two runs:
>>
>> -------------------------
>> running /home/dyeung/mm5/Run/mm5.mpp on 8 LINUX ch_p4 processors
>> Created /home/dyeung/mm5/Run/PI1111
>> hqlxcl01 -- rsl_nproc_all 8, rsl_myproc 0
>> mpi02.clhq -- rsl_nproc_all 8, rsl_myproc 1
>> mpi04.clhq -- rsl_nproc_all 8, rsl_myproc 3
>> mpi03.clhq -- rsl_nproc_all 8, rsl_myproc 2
>> mpi05.clhq -- rsl_nproc_all 8, rsl_myproc 4
>> mpi06.clhq -- rsl_nproc_all 8, rsl_myproc 5
>> mpi07.clhq -- rsl_nproc_all 8, rsl_myproc 6
>> mpi08.clhq -- rsl_nproc_all 8, rsl_myproc 7
>> rm_l_4_752: p4_error: interrupt SIGINT: 2
>> rm_l_7_746: p4_error: interrupt SIGINT: 2
>> bm_list_1302: p4_error: interrupt SIGINT: 2
>> rm_l_5_746: p4_error: interrupt SIGINT: 2
>> Command exited with non-zero status 1
>> 10566.13user 1033.14system 3:57:26elapsed 81%CPU (0avgtext+0avgdata 0maxresident)k
>> 0inputs+0outputs (16602major+326718minor)pagefaults 115swaps
>> --------------------------
>> running /home/dyeung/mm5/Run/mm5.mpp on 8 LINUX ch_p4 processors
>> Created /home/dyeung/mm5/Run/PI876
>> hqlxcl01 -- rsl_nproc_all 8, rsl_myproc 0
>> mpi04.clhq -- rsl_nproc_all 8, rsl_myproc 3
>> mpi02.clhq -- rsl_nproc_all 8, rsl_myproc 1
>> mpi03.clhq -- rsl_nproc_all 8, rsl_myproc 2
>> mpi05.clhq -- rsl_nproc_all 8, rsl_myproc 4
>> mpi06.clhq -- rsl_nproc_all 8, rsl_myproc 5
>> mpi07.clhq -- rsl_nproc_all 8, rsl_myproc 6
>> mpi08.clhq -- rsl_nproc_all 8, rsl_myproc 7
>> Broken pipe
>> rm_l_5_1230: p4_error: net_recv read: probable EOF on socket: 1
>> rm_l_7_1228: p4_error: interrupt SIGINT: 2
>> rm_l_3_1316: p4_error: interrupt SIGINT: 2
>> rm_l_4_1242: p4_error: interrupt SIGINT: 2
>> rm_l_1_1319: p4_error: interrupt SIGINT: 2
>> bm_list_1067: p4_error: interrupt SIGINT: 2
>> Command exited with non-zero status 1
>> 10556.35user 994.07system 3:57:36elapsed 81%CPU (0avgtext+0avgdata 0maxresident)k
>> 0inputs+0outputs (16653major+258186minor)pagefaults 651swaps
>> --------------------------
>>
>> Thanks
>>
>> david
>>
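
The handshake John describes, where node zero tells each processor when it may send its output data, can be sketched roughly as follows. This is only a small Python simulation using threads and queues, not RSL or MPI code; the function name run_synced_output and the "go" message are hypothetical and chosen for illustration. It assumes node zero releases one processor at a time, which is one plausible reading of the description above.

```python
import queue
import threading


def run_synced_output(nproc):
    """Simulate node 0 collecting output from nproc - 1 workers, one at a time."""
    # One go-ahead channel per worker (node 0 -> worker) and one shared
    # data channel back to node 0. Both are stand-ins for MPI messages.
    go_ahead = {rank: queue.Queue() for rank in range(1, nproc)}
    to_node0 = queue.Queue()

    def worker(rank):
        go_ahead[rank].get()  # block until node 0 says it is ready
        to_node0.put((rank, f"output from proc {rank}"))

    threads = [
        threading.Thread(target=worker, args=(rank,)) for rank in range(1, nproc)
    ]
    for t in threads:
        t.start()

    received = []
    for rank in range(1, nproc):
        go_ahead[rank].put("go")         # release exactly one worker
        received.append(to_node0.get())  # drain its data before the next release
        assert to_node0.empty()          # no unsolicited data is waiting

    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    for rank, data in run_synced_output(8):
        print(rank, data)
```

Because a worker sends only after it receives the go-ahead, at most one message is in flight at any time in this sketch. Without the handshake, all workers could send at once and the messages would have to be buffered until node zero got around to reading them, which is the load John's change is meant to relieve.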