climateprediction.net (CPDN) home page
Thread 'Compute error..checkpoints ?'

Chris Skull
Joined: 27 Dec 05
Posts: 20
Credit: 236,510
RAC: 0
Message 43521 - Posted: 6 Dec 2011, 19:12:17 UTC

Hi,

One question:
Why does a WU stop completely after a compute error? Why doesn't it go back to the last checkpoint? Isn't that the purpose of a checkpoint?

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 43522 - Posted: 6 Dec 2011, 19:15:31 UTC

It does go back to the last checkpoint. If you check the stderr output for the crashed task, you'll see five instances of the error, then the message "too many errors. . ."
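
As a rough illustration of the behaviour described here (restart from the last checkpoint, then give up after repeated failures), the logic might look something like the sketch below. This is not the CPDN controller's code: `ModelCrash`, `run_with_restarts`, and the limit of five are assumptions, with the limit taken from the five error instances mentioned above.

```python
from typing import Callable

MAX_RESTARTS = 5  # assumption: matches the five error instances seen in stderr


class ModelCrash(Exception):
    """Hypothetical exception standing in for 'the model process died'."""


def run_with_restarts(run_from_checkpoint: Callable[[], None],
                      max_restarts: int = MAX_RESTARTS) -> bool:
    """Run the model, restarting from the last checkpoint after each crash.

    Returns True if the model finished, False once it has crashed
    max_restarts times.
    """
    crashes = 0
    while crashes < max_restarts:
        try:
            run_from_checkpoint()
            return True
        except ModelCrash:
            crashes += 1
            print(f"Model crash detected ({crashes}/{max_restarts}), "
                  "will try to restart...")
    print("too many errors, giving up")
    return False
```

The point of the cap is that a crash caused by a damaged checkpoint will simply repeat, so retrying forever would only waste CPU time.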

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

Chris Skull
Joined: 27 Dec 05
Posts: 20
Credit: 236,510
RAC: 0
Message 43523 - Posted: 6 Dec 2011, 19:39:01 UTC

Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Signal 11 received, exiting...
Called boinc_finish


Hmm ???

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 43525 - Posted: 6 Dec 2011, 20:48:44 UTC - in response to Message 43521.  

Checkpoints only work if all of the necessary data was saved correctly at the checkpoint.
If it wasn't, then there's no way that the model can be automatically restarted.
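
To illustrate what "saved correctly" can mean in practice, here is a minimal Python sketch of one common way to make a checkpoint hard to corrupt: write to a temporary file, flush it to disk, rename it into place, and store a checksum that is verified on load. This is a generic technique, not how the HadCM3N model actually writes its checkpoints; the function names and the JSON format are assumptions for illustration only.

```python
import hashlib
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file, then atomically rename it into place."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    record = {"sha256": hashlib.sha256(payload).hexdigest(), "state": state}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(record, fh)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes reach the disk
    os.replace(tmp, path)      # atomic when both paths are on one filesystem


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the saved state, or None if the file is missing or damaged."""
    try:
        with open(path, encoding="utf-8") as fh:
            record = json.load(fh)
        payload = json.dumps(record["state"], sort_keys=True).encode("utf-8")
        if hashlib.sha256(payload).hexdigest() != record["sha256"]:
            return None
        return record["state"]
    except (OSError, ValueError, KeyError, TypeError):
        return None
```

If a write is interrupted, the previous checkpoint file is still intact, because the rename only happens after the new file is complete.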

So you'll have to fix whatever on your computer is
a) causing the models to crash in the first place, and
b) causing the checkpointing to be faulty.

It has something to do with all of the error messages that start with "Controller:: CPDN process is not running, ..."
Possibly the machine is running out of memory, but there are lots of possibilities.

If you Google for "Signal 11 received", you'll get lots of pages, some of which may apply to your case.
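
For reference, signal number 11 is the segmentation-fault signal (SIGSEGV), which means the process touched memory it was not allowed to access. Python's standard `signal` module can map the number to its name; `describe_signal` is just an illustrative helper, not part of BOINC or CPDN.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number, if this platform knows it."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


# 11 is the number reported in the stderr output quoted in this thread.
print(describe_signal(11))
```

The name tells you what kind of failure it was (a bad memory access), not why it happened; faulty RAM, a bug in the model, or a corrupted file can all end this way.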



Chris Skull
Joined: 27 Dec 05
Posts: 20
Credit: 236,510
RAC: 0
Message 43527 - Posted: 6 Dec 2011, 21:25:14 UTC - in response to Message 43525.  

Checkpoints only work if all of the necessary data was saved correctly at the checkpoint.
If it wasn't, then there's no way that the model can be automatically restarted.

So you'll have to fix whatever is wrong with your computer that's
a) Causing the models to crash in the first place, and
b) Fix whatever is causing the check pointing to be faulty.

I don't understand. I returned 11 trickles from this unit, so there must be many good checkpoints.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=13681388

Searching with Google gives many different results. Some say this is a Linux message (I use Windows); some say it can also be an error in the WU.

Anyway, I have had this problem a few times before. Last time all units finished without problems, so I'll try a new one :)

Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 43530 - Posted: 7 Dec 2011, 8:58:40 UTC - in response to Message 43527.  

Chris,

HadCM3N models are prone to crashing at the 25%, 50%, 75%, and 100% points. At these times a zip file is created containing the state of the model at that time. The crash happens during creation of the zip file.

HadCM3N models also can crash if they are running when Boinc shuts down. A work-around for this is to suspend these models before shutting down the computer.

HadCM3N models issued in September, October, and November are more likely to crash than the earlier models issued in June and July.

As far as 'fixing' the problem goes, one thing to check is that your virus scanning program excludes the Boinc data folder. The name and location of this folder vary depending on the version of Windows and how Boinc was installed, so you will have to search for it. It contains two sub-folders called "projects" and "slots", and a file called "client_state.xml" -- as well as other things, of course.

Once your virus scanner is configured to ignore the Boinc data folder, there is nothing else to do about this problem. The project team will just have to make sure the next release of HadCM3N is more robust.
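
If searching by hand is tedious, the description of the data folder above (sub-folders "projects" and "slots" plus a file "client_state.xml") is enough to recognise it from a script. The following Python sketch checks a list of candidate locations; the function names are illustrative, and the candidate paths are only commonly seen defaults that may not match your installation.

```python
from pathlib import Path
from typing import Iterable, List

# Assumption: typical default locations; yours may differ.
CANDIDATES = [
    r"C:\ProgramData\BOINC",
    r"C:\Documents and Settings\All Users\Application Data\BOINC",
]


def looks_like_boinc_data_dir(path: Path) -> bool:
    """True if path holds 'projects', 'slots' and 'client_state.xml'."""
    return ((path / "projects").is_dir()
            and (path / "slots").is_dir()
            and (path / "client_state.xml").is_file())


def find_boinc_data_dirs(candidates: Iterable) -> List[Path]:
    """Return the candidates that look like a BOINC data folder."""
    return [Path(c) for c in candidates if looks_like_boinc_data_dir(Path(c))]
```

Calling `find_boinc_data_dirs(CANDIDATES)` returns the matching folders, which are the ones to add to the virus scanner's exclusion list.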

Chris Skull
Joined: 27 Dec 05
Posts: 20
Credit: 236,510
RAC: 0
Message 43531 - Posted: 7 Dec 2011, 15:55:45 UTC - in response to Message 43530.  

OK, thx Greg, i'll try this :)

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 43619 - Posted: 29 Dec 2011, 12:26:13 UTC

This happened to me after almost 600 hours:
29-Dec-2011 04:48:31 [climateprediction.net] Not reporting or requesting tasks
*** glibc detected *** ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu: double free or corruption (out): 0x08190dc0 ***
======= Backtrace: =========
/lib/libc.so.6[0xb75967a4]
/lib/libc.so.6(cfree+0x9c)[0xb759808c]
/usr/lib/libstdc++.so.6(_ZdlPv+0x1f)[0xb777f4df]
/usr/lib/libstdc++.so.6(_ZdaPv+0x1b)[0xb777f53b]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8053e8e]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8057bc4]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x804f232]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8050491]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805112c]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805137a]
/lib/libc.so.6(__libc_start_main+0xe5)[0xb7540705]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu(__gxx_personality_v0+0x169)[0x804cb51]
======= Memory map: ========
08048000-080e3000 r-xp 00000000 08:07 1327113 /home/tchersi/BOINC/projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu
080e3000-080e4000 rwxp 0009b000 08:07 1327113 /home/tchersi/BOINC/projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu
080e4000-081a1000 rwxp 080e4000 00:00 0 [heap]
b6f00000-b6f21000 rwxp b6f00000 00:00 0
b6f21000-b7000000 ---p b6f21000 00:00 0
b70ae000-b7529000 rwxs 00000000 08:07 4718618 /home/tchersi/BOINC/slots/3/138085
b7529000-b752a000 rwxp b7529000 00:00 0
b752a000-b767f000 r-xp 00000000 08:06 459873 /lib/libc-2.9.so
b767f000-b7680000 ---p 00155000 08:06 459873 /lib/libc-2.9.so
b7680000-b7682000 r-xp 00155000 08:06 459873 /lib/libc-2.9.so
b7682000-b7683000 rwxp 00157000 08:06 459873 /lib/libc-2.9.so
b7683000-b7686000 rwxp b7683000 00:00 0
b7686000-b76a2000 r-xp 00000000 08:06 460268 /lib/libgcc_s.so.1
b76a2000-b76a3000 r-xp 0001b000 08:06 460268 /lib/libgcc_s.so.1
b76a3000-b76a4000 rwxp 0001c000 08:06 460268 /lib/libgcc_s.so.1
b76a4000-b76a5000 rwxp b76a4000 00:00 0
b76a5000-b76cc000 r-xp 00000000 08:06 460756 /lib/libm-2.9.so
b76cc000-b76cd000 r-xp 00026000 08:06 460756 /lib/libm-2.9.so
b76cd000-b76ce000 rwxp 00027000 08:06 460756 /lib/libm-2.9.so
b76ce000-b77b1000 r-xp 00000000 08:06 632676 /usr/lib/libstdc++.so.6.0.15
b77b1000-b77b2000 ---p 000e3000 08:06 632676 /usr/lib/libstdc++.so.6.0.15
b77b2000-b77b6000 r-xp 000e3000 08:06 632676 /usr/lib/libstdc++.so.6.0.15
b77b6000-b77b7000 rwxp 000e7000 08:06 632676 /usr/lib/libstdc++.so.6.0.15
b77b7000-b77be000 rwxp b77b7000 00:00 0
b77be000-b77c1000 r-xp 00000000 08:06 460748 /lib/libdl-2.9.so
b77c1000-b77c2000 r-xp 00002000 08:06 460748 /lib/libdl-2.9.so
b77c2000-b77c3000 rwxp 00003000 08:06 460748 /lib/libdl-2.9.so
b77c3000-b77d9000 r-xp 00000000 08:06 459887 /lib/libpthread-2.9.so
b77d9000-b77da000 r-xp 00015000 08:06 459887 /lib/libpthread-2.9.so
b77da000-b77db000 rwxp 00016000 08:06 459887 /lib/libpthread-2.9.so
b77db000-b77dd000 rwxp b77db000 00:00 0
b77ed000-b77ee000 rwxp b77ed000 00:00 0
b77ee000-b77ef000 ---p b77ee000 00:00 0
b77ef000-b77f2000 rwxp b77ef000 00:00 0
b77f2000-b77f4000 rwxs 00000000 08:07 4718615 /home/tchersi/BOINC/slots/3/boinc_mmap_file
b77f4000-b77f5000 rwxp b77f4000 00:00 0
b77f5000-b7813000 r-xp 00000000 08:06 460136 /lib/ld-2.9.so
b7813000-b7814000 r-xp 0001d000 08:06 460136 /lib/ld-2.9.so
b7814000-b7815000 rwxp 0001e000 08:06 460136 /lib/ld-2.9.so
bf982000-bf9f2000 rw-p bff8f000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]
29-Dec-2011 04:48:36 [climateprediction.net] Computation for task hadcm3n_t4za_1980_40_007412797_1 finished
29-Dec-2011 04:48:36 [climateprediction.net] Output file hadcm3n_t4za_1980_40_007412797_1_2.zip for task hadcm3n_t4za_1980_40_007412797_1 absent
29-Dec-2011 04:48:36 [climateprediction.net] Output file hadcm3n_t4za_1980_40_007412797_1_3.zip for task hadcm3n_t4za_1980_40_007412797_1 absent
29-Dec-2011 04:48:36 [climateprediction.net] Output file hadcm3n_t4za_1980_40_007412797_1_4.zip for task hadcm3n_t4za_1980_40_007412797_1 absent
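
For anyone sifting through a long client log like this one, the "Output file ... absent" lines are the quickest sign that a task ended before producing its remaining zip files. A small Python sketch that collects them per task is shown below; the regular expression is based only on the line format visible in this post, so treat it as an assumption about other BOINC versions.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Pattern taken from the log lines quoted in this thread.
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")


def missing_outputs(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each task name to the output files the client reported absent."""
    missing = defaultdict(list)
    for line in lines:
        match = ABSENT_RE.search(line)
        if match:
            filename, task = match.groups()
            missing[task].append(filename)
    return dict(missing)
```

Fed the lines above, it would report the `_2.zip`, `_3.zip`, and `_4.zip` files as absent for the one task.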

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 43620 - Posted: 29 Dec 2011, 17:31:48 UTC - in response to Message 43619.  

This is the page for the model in question.

Two things there:
1) It reached time step 518,400, which is the 50% point.
2) The stderr list shows it failed with error 193.
This is the BOINC FAQ page for that error number.

As has been posted many times, these models should NOT be interrupted in any way at the points at which they pause to zip up the data to return it to the server.
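
The arithmetic behind those zip points can be worked out from the numbers in this thread: if time step 518,400 is the 50% point, the full run is 1,036,800 steps, and the 25%, 50%, 75%, and 100% points fall at multiples of 259,200. The Python sketch below computes them and checks whether a given step is close to one; `near_zip_point` and its `margin` parameter are illustrative assumptions, not something the model or BOINC provides.

```python
from typing import List

# 518,400 is stated above to be the 50% point, so the full run is twice that.
TOTAL_STEPS = 518_400 * 2


def zip_steps(total_steps: int = TOTAL_STEPS) -> List[int]:
    """Time steps of the 25%, 50%, 75% and 100% points."""
    return [total_steps * quarter // 4 for quarter in (1, 2, 3, 4)]


def near_zip_point(step: int, margin: int,
                   total_steps: int = TOTAL_STEPS) -> bool:
    """True if step is within margin steps of any zip point."""
    return any(abs(step - z) <= margin for z in zip_steps(total_steps))
```

A check like this is one way to decide when it is unsafe to suspend the model or shut down BOINC.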


tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 43621 - Posted: 29 Dec 2011, 18:47:04 UTC

I did not interrupt it in any way. It was running high priority.
Tullio

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 43622 - Posted: 29 Dec 2011, 19:35:17 UTC - in response to Message 43621.  

The stderr list contains lots of
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...

which is most likely caused by your setting for "Suspend work if CPU usage is above" being left at the default value of 25%.
So BOINC and the model were getting interrupted every time some other use of the computer pushed the processor over 25% usage.
Then there are all of the other background processes that run, and sometimes update silently.
Running at high priority all the time usually means that BOINC can't get enough time to run the models in normal mode. There are many possible reasons for this, including the resource share that you've given CPDN.
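
To get a feel for how often a model is being interrupted, you can count the quit requests in a task's stderr text. A minimal Python sketch follows; the marker string is copied from the lines quoted above, and `count_quit_requests` is a made-up helper name.

```python
# Marker copied from the stderr lines quoted in this post.
QUIT_MARKER = "CPDN Monitor - Quit request from BOINC"


def count_quit_requests(stderr_text: str) -> int:
    """Count the stderr lines that record a quit request from BOINC."""
    return sum(1 for line in stderr_text.splitlines() if QUIT_MARKER in line)
```

A large count over a long run suggests the model is being stopped and restarted frequently, whatever the cause.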

These models run smoothly on my Windows machines, so I know that they can get past these tricky points.

Still, no more coupled ocean models at the moment, and you can always deselect them in your preferences.


tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 43623 - Posted: 29 Dec 2011, 22:35:53 UTC
Last modified: 29 Dec 2011, 22:54:40 UTC

No, I am running 7 BOINC projects and I cannot leave any of them running in high priority for a long time. So I had to manually suspend hadcm3n in the daytime and leave it running only at night; I suspended it every morning. I resumed it on the afternoon of December 28 and found it crashed the next morning. I have 0, that is, no restrictions, in my computing preferences.
Tullio
I also have a Virtual Machine running Solaris and SETI@home, but it is now suspended since the SETI@home deadlines are very long. When it runs, it runs at nice 19, just like the BOINC_VM Virtual Machine running in Test4Theory@home. This is to complete the picture.

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 43890 - Posted: 29 Feb 2012, 2:13:23 UTC

Still another failure at 50% done. I was not suspending it.
29-Feb-2012 01:00:37 [climateprediction.net] Computation for task hadcm3n_ycwv_1900_40_007519059_1 finished
29-Feb-2012 01:00:37 [climateprediction.net] Output file hadcm3n_ycwv_1900_40_007519059_1_2.zip for task hadcm3n_ycwv_1900_40_007519059_1 absent
29-Feb-2012 01:00:37 [climateprediction.net] Output file hadcm3n_ycwv_1900_40_007519059_1_3.zip for task hadcm3n_ycwv_1900_40_007519059_1 absent
29-Feb-2012 01:00:37 [climateprediction.net] Output file hadcm3n_ycwv_1900_40_007519059_1_4.zip for task hadcm3n_ycwv_1900_40_007519059_1 absent

Tullio

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 43891 - Posted: 29 Feb 2012, 4:43:35 UTC

I have no explanation for all the failures on your machine -- two successes of fifteen attempts suggests a serious problem -- and it's been a long time since I subjected myself to Linux to address that OS -- but, have you run Memtest86+ and Prime95 Torture Test on that machine recently?

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,011,472
RAC: 21,368
Message 43892 - Posted: 29 Feb 2012, 9:09:57 UTC - in response to Message 43891.  

A while back I had similar problems on my Linux box. Doubling the RAM to 4 GB cured it. I have only had regional models to run since then, so I can't be 100% sure it is fixed, but I did have problems with regional models as well, though some would complete. It is also worth backing up the BOINC data directory when the 25%, 50%, etc. points are approaching with the full-resolution ocean models. Be sure to suspend the models and exit BOINC first. Of course, I have never had a model crash when I have a recent backup!
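
A backup of the data directory can be scripted. The Python sketch below zips the whole folder into a timestamped archive using the standard library; `backup_boinc_data` and both path arguments are illustrative, and the script does not suspend the models or stop BOINC for you, so do that by hand first as described above.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir, backup_root) -> Path:
    """Zip the BOINC data directory into backup_root and return the archive.

    Assumes BOINC has already been shut down. backup_root must be outside
    data_dir, otherwise the archive would end up inside its own source.
    """
    data_dir = Path(data_dir)
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_base = backup_root / f"boinc-data-{stamp}"
    archive = shutil.make_archive(str(archive_base), "zip",
                                  root_dir=str(data_dir))
    return Path(archive)
```

To restore, exit BOINC, empty the data directory, and unpack the archive back into it before restarting the client.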

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 43893 - Posted: 29 Feb 2012, 9:25:58 UTC - in response to Message 43891.  
Last modified: 29 Feb 2012, 9:28:57 UTC

I am running Einstein@home, Seti@home, QMC@home, and Test4Theory@home, which starts a Virtual Machine (BOINC_VM) running CERN jobs in Scientific Linux, plus a Solaris Virtual Machine with VirtualBox 4.1.8 that is now running Astropulse by Dotsch, all without errors. I have 8 GB RAM, and my SuSE Linux 11.1 has a PAE kernel, so it can use all 8 GB on my Opteron 1210, which is not very fast but is very reliable. I have no graphics card, only on-board graphics, and I am using both Firefox and SeaMonkey as browsers.
Tullio

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,011,472
RAC: 21,368
Message 43894 - Posted: 29 Feb 2012, 10:42:27 UTC - in response to Message 43893.  

OK, another good theory gone out the window. 4 GB/core should be plenty. As you are running so many projects, I wonder if it has something to do with the switching? Clearly there is a problem with the hadcm3 code, or there would not be so many people having problems with it not exiting cleanly. My only suggestion would be to stick to the regional models for the time being, until the code is sorted out, which may be quite a while.

Dave

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 43895 - Posted: 29 Feb 2012, 10:56:46 UTC - in response to Message 43894.  

The only problem I had with hadcm3 is that it would run in high priority mode, so Test4Theory@home would not start, since it is multithreaded. So I had to suspend hadcm3 in order to let Test4Theory start its 60-minute session, then I resumed it. Other BOINC projects and also Solaris could coexist with it running on one core. When I saw that all projects were ready to start except the running hadcm3, I understood that the time had arrived for Test4Theory to start, and I suspended hadcm3. But I was careful not to suspend hadcm3 when it was nearing 25% and 50%. But it never got beyond that point.
Tullio

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 43896 - Posted: 29 Feb 2012, 13:28:04 UTC

Tullio

It would seem that the combination of projects that you're running is not conducive to the safe completion of the hadcm3n Coupled Ocean models.

Please change your project preferences so that you don't get these in future, and only run the regional models. And only run one of the regional models at a time.


tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 43897 - Posted: 29 Feb 2012, 16:08:24 UTC - in response to Message 43896.  

Tullio

It would seem that the combination of projects that you're running are not conducive to the safe completion of the hadcm3n Coupled Ocean models.

Please change your project preferences so that you don't get these in future, and only run the regional models. And only run one of the regional models at a time.


OK