Message boards : Number crunching : Compute error..checkpoints ?
Author | Message |
---|---|
Send message Joined: 27 Dec 05 Posts: 20 Credit: 236,510 RAC: 0 |
Hi, one question: why does a WU stop completely after a compute error? Why doesn't it go back to the last checkpoint? Isn't that the purpose of a checkpoint? |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
That is done. If you check the stderr messages for the crashed task, you'll see five instances of the error, then the message "too many errors. . ." "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 27 Dec 05 Posts: 20 Credit: 236,510 RAC: 0 |
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0 Model crash detected, will try to restart... Signal 11 received, exiting... Called boinc_finish Hmm ??? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Checkpoints only work if all of the necessary data was saved correctly at the checkpoint. If it wasn't, then there's no way that the model can be automatically restarted. So you'll have to fix whatever is wrong with your computer that's a) causing the models to crash in the first place, and b) causing the checkpointing to be faulty. It's something to do with all of the error messages that start "Controller:: CPDN process is not running, ...". Possibly it is running out of memory, but there are lots of possibilities. If you Google for "Signal 11 received", you'll get lots of pages, some of which may apply to your case. |
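To make the point about checkpoints concrete, here is a minimal sketch in Python of a restart loop that only resumes when the checkpoint file is readable. It is not the CPDN or BOINC implementation: the function names, the JSON checkpoint format, and the limit of five restarts (borrowed from the "five instances of the error" mentioned earlier in the thread) are all assumptions made for illustration.

```python
import json
import os
import tempfile

MAX_RESTARTS = 5  # assumption, echoing the "five instances" mentioned in the thread


def save_checkpoint(path, state):
    """Write to a temporary file first, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # the old checkpoint stays intact until this succeeds


def load_checkpoint(path):
    """Return the saved state, or None if there is no checkpoint yet.

    Raises ValueError when the file exists but cannot be used.
    """
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        state = json.load(handle)  # json.JSONDecodeError is a ValueError
    if not isinstance(state, dict) or "step" not in state:
        raise ValueError("checkpoint is incomplete")
    return state


def run_with_restarts(step_fn, total_steps, path):
    """Run step_fn for every step, resuming from the checkpoint after a crash."""
    restarts = 0
    while restarts < MAX_RESTARTS:
        try:
            state = load_checkpoint(path) or {"step": 0}
        except ValueError:
            return "checkpoint unusable"  # nothing valid to restart from
        try:
            for step in range(state["step"], total_steps):
                step_fn(step)
                save_checkpoint(path, {"step": step + 1})
            return "finished"
        except RuntimeError:
            restarts += 1  # hypothetical stand-in for a model crash
    return "too many errors"
```

In this sketch a damaged checkpoint ends the run immediately, while a crash that repeats at the same step uses up the restart budget, which is roughly the pattern described in the posts above.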
Send message Joined: 27 Dec 05 Posts: 20 Credit: 236,510 RAC: 0 |
"Checkpoints only work if all of the necessary data was saved correctly at the checkpoint." I don't understand that... I returned 11 trickles from this unit, so there must have been many good checkpoints... http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=13681388 Searching with Google gives many different results: some say this is a Linux message (I use Windows), some say it can also be an error in the WU... Whatever it is, I have had this problem a few times before, and last time all units finished without problems, so I'll try a new one :) |
Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
Chris, HadCM3N models are prone to crashing at the 25%, 50%, 75%, and 100% points. At these times a zip file is created containing the state of the model at that time. The crash happens during creation of the zip file. HadCM3N models also can crash if they are running when Boinc shuts down. A work-around for this is to suspend these models before shutting down the computer. HadCM3N models issued in September, October, and November are more likely to crash than the earlier models issued in June and July. As far as 'fixing' the problem goes, one thing to check is that your virus scanning program excludes the Boinc data folder. The name and location of this folder varies depending on version of Windows, and how Boinc was installed, so you will have to search for it. It contains two sub-folders called "projects" and "slots", and a file called "client_state.xml" -- as well as other things, of course. Once your virus scanner is configured to ignore the Boinc data folder, there is nothing else to do about this problem. The project team will just have to make sure the next release of HadCM3N is more robust. |
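Since the location of the BOINC data folder varies, one way to confirm that a candidate folder is the right one is to check for the three items named above. The following Python sketch does only that; the function names are hypothetical, and you have to supply the candidate paths yourself because no default location is assumed here.

```python
from pathlib import Path

# Items the post above says the BOINC data folder contains.
MARKER_DIRS = ("projects", "slots")
MARKER_FILE = "client_state.xml"


def looks_like_boinc_data_dir(folder):
    """True if the folder holds 'projects', 'slots' and 'client_state.xml'."""
    folder = Path(folder)
    has_dirs = all((folder / name).is_dir() for name in MARKER_DIRS)
    return has_dirs and (folder / MARKER_FILE).is_file()


def find_boinc_data_dirs(candidates):
    """Filter a list of candidate paths down to those that match."""
    return [Path(c) for c in candidates if looks_like_boinc_data_dir(c)]
```

Whatever folder this reports is the one to add to the virus scanner's exclusion list; how to add the exclusion depends on the scanner.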
Send message Joined: 27 Dec 05 Posts: 20 Credit: 236,510 RAC: 0 |
OK, thx Greg, i'll try this :) |
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
This happened to me after almost 600 hours: 29-Dec-2011 04:48:31 [climateprediction.net] Not reporting or requesting tasks *** glibc detected *** ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu: double free or corruption (out): 0x08190dc0 *** ======= Backtrace: ========= /lib/libc.so.6[0xb75967a4] /lib/libc.so.6(cfree+0x9c)[0xb759808c] /usr/lib/libstdc++.so.6(_ZdlPv+0x1f)[0xb777f4df] /usr/lib/libstdc++.so.6(_ZdaPv+0x1b)[0xb777f53b] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8053e8e] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8057bc4] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x804f232] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8050491] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805112c] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805137a] /lib/libc.so.6(__libc_start_main+0xe5)[0xb7540705] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu(__gxx_personality_v0+0x169)[0x804cb51] ======= Memory map: ======== 08048000-080e3000 r-xp 00000000 08:07 1327113 /home/tchersi/BOINC/projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu 080e3000-080e4000 rwxp 0009b000 08:07 1327113 /home/tchersi/BOINC/projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu 080e4000-081a1000 rwxp 080e4000 00:00 0 [heap] b6f00000-b6f21000 rwxp b6f00000 00:00 0 b6f21000-b7000000 ---p b6f21000 00:00 0 b70ae000-b7529000 rwxs 00000000 08:07 4718618 /home/tchersi/BOINC/slots/3/138085 b7529000-b752a000 rwxp b7529000 00:00 0 b752a000-b767f000 r-xp 00000000 08:06 459873 /lib/libc-2.9.so b767f000-b7680000 ---p 00155000 08:06 459873 /lib/libc-2.9.so b7680000-b7682000 r-xp 00155000 08:06 459873 /lib/libc-2.9.so b7682000-b7683000 rwxp 00157000 08:06 459873 /lib/libc-2.9.so b7683000-b7686000 rwxp b7683000 00:00 0 b7686000-b76a2000 r-xp 00000000 08:06 460268 /lib/libgcc_s.so.1 b76a2000-b76a3000 r-xp 
0001b000 08:06 460268 /lib/libgcc_s.so.1 b76a3000-b76a4000 rwxp 0001c000 08:06 460268 /lib/libgcc_s.so.1 b76a4000-b76a5000 rwxp b76a4000 00:00 0 b76a5000-b76cc000 r-xp 00000000 08:06 460756 /lib/libm-2.9.so b76cc000-b76cd000 r-xp 00026000 08:06 460756 /lib/libm-2.9.so b76cd000-b76ce000 rwxp 00027000 08:06 460756 /lib/libm-2.9.so b76ce000-b77b1000 r-xp 00000000 08:06 632676 /usr/lib/libstdc++.so.6.0.15 b77b1000-b77b2000 ---p 000e3000 08:06 632676 /usr/lib/libstdc++.so.6.0.15 b77b2000-b77b6000 r-xp 000e3000 08:06 632676 /usr/lib/libstdc++.so.6.0.15 b77b6000-b77b7000 rwxp 000e7000 08:06 632676 /usr/lib/libstdc++.so.6.0.15 b77b7000-b77be000 rwxp b77b7000 00:00 0 b77be000-b77c1000 r-xp 00000000 08:06 460748 /lib/libdl-2.9.so b77c1000-b77c2000 r-xp 00002000 08:06 460748 /lib/libdl-2.9.so b77c2000-b77c3000 rwxp 00003000 08:06 460748 /lib/libdl-2.9.so b77c3000-b77d9000 r-xp 00000000 08:06 459887 /lib/libpthread-2.9.so b77d9000-b77da000 r-xp 00015000 08:06 459887 /lib/libpthread-2.9.so b77da000-b77db000 rwxp 00016000 08:06 459887 /lib/libpthread-2.9.so b77db000-b77dd000 rwxp b77db000 00:00 0 b77ed000-b77ee000 rwxp b77ed000 00:00 0 b77ee000-b77ef000 ---p b77ee000 00:00 0 b77ef000-b77f2000 rwxp b77ef000 00:00 0 b77f2000-b77f4000 rwxs 00000000 08:07 4718615 /home/tchersi/BOINC/slots/3/boinc_mmap_file b77f4000-b77f5000 rwxp b77f4000 00:00 0 b77f5000-b7813000 r-xp 00000000 08:06 460136 /lib/ld-2.9.so b7813000-b7814000 r-xp 0001d000 08:06 460136 /lib/ld-2.9.so b7814000-b7815000 rwxp 0001e000 08:06 460136 /lib/ld-2.9.so bf982000-bf9f2000 rw-p bff8f000 00:00 0 [stack] ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] 29-Dec-2011 04:48:36 [climateprediction.net] Computation for task hadcm3n_t4za_1980_40_007412797_1 finished 29-Dec-2011 04:48:36 [climateprediction.net] Output file hadcm3n_t4za_1980_40_007412797_1_2.zip for task hadcm3n_t4za_1980_40_007412797_1 absent 29-Dec-2011 04:48:36 [climateprediction.net] Output file hadcm3n_t4za_1980_40_007412797_1_3.zip for task 
hadcm3n_t4za_1980_40_007412797_1 absent 29-Dec-2011 04:48:36 [climateprediction.net] Output file hadcm3n_t4za_1980_40_007412797_1_4.zip for task hadcm3n_t4za_1980_40_007412797_1 absent |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
This is the page for the model in question. Two things there: 1) It reached time step 518,400, which is the 50% point. 2) The stderr list shows it failed with error 193. This is the BOINC FAQ page for that error number. As has been posted many times, these models should NOT be interrupted in any way at the points at which they pause to zip up the data to return it to the server. |
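The time steps at which the model pauses to zip its data can be worked out from the figure quoted above. This Python sketch assumes that progress is linear in time steps, so that 518,400 being the 50% point implies 1,036,800 steps in total; the function names are made up for illustration.

```python
HALFWAY_STEP = 518_400           # from the post above: the 50% point
TOTAL_STEPS = 2 * HALFWAY_STEP   # assumption: progress is linear in time steps


def zip_points(total_steps=TOTAL_STEPS):
    """Time steps of the 25%, 50%, 75% and 100% points."""
    return [total_steps * quarter // 4 for quarter in (1, 2, 3, 4)]


def steps_to_next_zip(current_step, total_steps=TOTAL_STEPS):
    """Steps remaining until the next zip point, or None past the last one."""
    for point in zip_points(total_steps):
        if current_step <= point:
            return point - current_step
    return None


print(zip_points())  # [259200, 518400, 777600, 1036800]
```

A small value from `steps_to_next_zip` is the signal to leave the model alone until it has passed the point.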
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
I did not interrupt it in any way. It was running high priority. Tullio |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The stderr list contains lots of "CPDN Monitor - Quit request from BOINC..." lines, which are most likely caused by your setting for "Suspend work if CPU usage is above" being left at the default value of 25%. So BOINC and the model were getting interrupted every time some other computer use pushed the processor over 25% usage. Then there are all of the other background processes that run, and sometimes update silently. Running high priority all the time usually means that your BOINC can't get enough time to run them in normal mode. There are many possible reasons for this, including the resource share that you've given CPDN. These models run smoothly on my Windows machines, so I know that they can get past these tricky points. Still, there are no more coupled ocean models at the moment, and you can always deselect them in your preferences. |
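If you save a task's stderr text to a file, a quick count of the messages quoted in this thread shows how often the model was interrupted. This is a minimal Python sketch: the labels and the function name are invented, and the search strings are only the fragments that appear in the posts above, so the real output may contain other wording.

```python
# Message fragments quoted in this thread; the labels are made up.
PATTERNS = {
    "quit_request": "CPDN Monitor - Quit request from BOINC",
    "process_not_running": "Controller:: CPDN process is not running",
    "model_crash": "Model crash detected",
    "signal_11": "Signal 11 received",
}


def summarize_stderr(text):
    """Count how many times each known message fragment occurs in the text."""
    return {label: text.count(needle) for label, needle in PATTERNS.items()}
```

A high `quit_request` count alongside few crashes would fit the suspend-on-CPU-usage explanation, although the counts alone do not prove it.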
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
No, I am running 7 BOINC projects and I cannot leave any of them running in high priority for a long time. So I had to manually suspend hadcm3n in the daytime and leave it running only at night; I suspended it every morning. I resumed it in the afternoon of December 28 and found it crashed the morning after. I have 0, that is, no restrictions, in my computing preferences. Tullio I also have a Virtual Machine running Solaris and SETI@home, but it is now suspended since the SETI@home deadlines are very long. When it runs, it runs at nice 19, just like the BOINC_VM Virtual Machine running in Test4Theory@home. This is to complete my picture. |
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
Still another failure at 50% done. I was not suspending it. 29-Feb-2012 01:00:37 [climateprediction.net] Computation for task hadcm3n_ycwv_1900_40_007519059_1 finished 29-Feb-2012 01:00:37 [climateprediction.net] Output file hadcm3n_ycwv_1900_40_007519059_1_2.zip for task hadcm3n_ycwv_1900_40_007519059_1 absent 29-Feb-2012 01:00:37 [climateprediction.net] Output file hadcm3n_ycwv_1900_40_007519059_1_3.zip for task hadcm3n_ycwv_1900_40_007519059_1 absent 29-Feb-2012 01:00:37 [climateprediction.net] Output file hadcm3n_ycwv_1900_40_007519059_1_4.zip for task hadcm3n_ycwv_1900_40_007519059_1 absent Tullio |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
I have no explanation for all the failures on your machine -- two successes of fifteen attempts suggests a serious problem -- and it's been a long time since I subjected myself to Linux to address that OS -- but, have you run Memtest86+ and Prime95 Torture Test on that machine recently? "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
A while back I had similar problems on my Linux box. Doubling the RAM to 4GB cured it. I have only had regional models to run since then, so I can't be 100% sure it is fixed, but I did have problems with regional models as well, though some would complete. It is also worth backing up the BOINC data directory when the 25%, 50%, etc. points are approaching with the full resolution ocean models. Be sure to suspend the models and exit BOINC first. Of course I have never had a model crash when I have a recent backup! |
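The backup itself is just a copy of the whole data directory. Here is a minimal Python sketch of that step; the function name and the timestamped folder naming are assumptions, and the code does not suspend the models or stop BOINC for you, so do that by hand before running it.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir, backup_root):
    """Copy the BOINC data directory into a new timestamped folder.

    Assumes the models are suspended and BOINC has already been exited.
    """
    data_dir = Path(data_dir)
    # client_state.xml is one of the files the data folder should contain.
    if not (data_dir / "client_state.xml").is_file():
        raise FileNotFoundError(f"{data_dir} does not look like a BOINC data directory")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{data_dir.name}-{stamp}"
    shutil.copytree(data_dir, target)  # fails if the target already exists
    return target
```

Keep the backup folder outside the data directory itself, and restore only while BOINC is not running.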
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
I am running Einstein@home, Seti@home, QMC@home, Test4Theory@home which starts a Virtual Machine BOINC_VM running CERN jobs in Scientific Linux plus a Solaris Virtual Machine with VirtualBox 4.1.8 running now Astropulse by Dotsch, all without errors. I have 8 GB RAM and my SuSE Linux 11.1 is pae so it can use all the 8 GB on my Opteron 1210 which is not very fast but it is very reliable. I have no graphic boards, only on board graphics and I am using both Firefox and SeaMonkey as browsers. Tullio |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
OK, another good theory gone out the window. 4GB/core should be plenty. As you are running so many projects, I wonder if it is something to do with the switching? Clearly there is a problem with the hadcm3 code, or there would not be so many people having problems with it not exiting cleanly. My only suggestion would be to stick to the regional models for the time being, until the code is sorted out, which may be quite a while. Dave |
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
The only problem I had with hadcm3 is that it would run in high priority mode, so Test4Theory@home would not start, since it is multithreaded. So I had to suspend hadcm3 in order to let Test4Theory start its 60-minute session, then I resumed it. Other BOINC projects, and also Solaris, could coexist with it running on one core. When I saw that all projects were ready to start except for the running hadcm3, I understood that the time had arrived for Test4Theory to start, and I suspended hadcm3. But I was careful not to suspend hadcm3 when it was nearing 25% and also 50%. But it never went beyond that point. Tullio |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Tullio, it would seem that the combination of projects that you're running is not conducive to the safe completion of the hadcm3n Coupled Ocean models. Please change your project preferences so that you don't get these in future, and only run the regional models. And only run one of the regional models at a time. |
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
Tullio OK |