climateprediction.net (CPDN) home page
Thread 'Crashed WU'

Message boards : Number crunching : Crashed WU

Cartoonman

Joined: 8 Oct 08
Posts: 2
Credit: 932,088
RAC: 0
Message 44286 - Posted: 2 Jun 2012, 19:02:33 UTC
Last modified: 2 Jun 2012, 19:02:53 UTC

I have a WU that crashed just short of the 50% mark (about 49.8%). Fortunately, I made a backup just 10 minutes before the crash. However, every time I restore that backup of the BOINC folder, the WU appears to run fine (making progress and writing data), but about 10 minutes in it crashes again, apparently at the same spot.

I also noticed that the error code on this one was significantly different from the usual ones I had:
 <core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
 - exit code 193 (0xc1)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
03:11:14 (1500): No heartbeat from core client for 30 sec - exiting
Suspended CPDN Monitor - No 'heartbeat' from BOINC...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=5396, iMonCtr=1
Model crash detected, will try to restart...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=2148, iMonCtr=1
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3060, iMonCtr=1
Model crash detected, will try to restart...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=1668, iMonCtr=1
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=4060, iMonCtr=1
Model crash detected, will try to restart...
Suspended CPDN Monitor - Suspend request from BOINC...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3492, iMonCtr=1
Model crash detected, will try to restart...
CPDN Monitor - Quit request from BOINC...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=2116, iMonCtr=1
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3988, iMonCtr=1
Model crash detected, will try to restart...
CController:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3412, iMonCtr=1
Model crash detected, will try to restart...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3956, iMonCtr=1
Model crash detected, will try to restart...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x778E5EAB read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...

Cannot serialize file C:\ProgramData\BOINC/projects/climateprediction.net/hadcm3n_o7n8_2060_40_007998308/dataout/shmem_restart.day
Signal 11 received, exiting...
Called boinc_finish

</stderr_txt>
]]>


I have an older backup, but it's quite old, and it will take a rather long time to get back to its previous state. Is it worth doing, or is this WU actually broken?
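For anyone comparing a dump like the one above against their usual errors, here is a minimal Python sketch that pulls out the exit code, the number of restart attempts, and the exception reason. The function name `summarize_stderr` is made up for illustration, and the patterns are assumptions based only on the lines quoted in this post, not on any documented BOINC or CPDN log format.

```python
import re

def summarize_stderr(text):
    """Summarize a pasted stderr dump (hypothetical helper, not part of BOINC)."""
    # Matches lines such as " - exit code 193 (0xc1)"
    exit_match = re.search(r"exit code (\d+) \((0x[0-9a-fA-F]+)\)", text)
    # Matches lines such as "Reason: Access Violation (0xc0000005) at address ..."
    exc_match = re.search(r"Reason: (.+?) \((0x[0-9a-fA-F]+)\)", text)
    return {
        "exit_code": int(exit_match.group(1)) if exit_match else None,
        # Each restart attempt logs one "Model crash detected" line
        "restart_attempts": text.count("Model crash detected"),
        "exception": exc_match.group(1) if exc_match else None,
        "exception_code": exc_match.group(2) if exc_match else None,
    }

# A shortened sample built from lines quoted in the post above
sample = """ - exit code 193 (0xc1)
Model crash detected, will try to restart...
Model crash detected, will try to restart...
Reason: Access Violation (0xc0000005) at address 0x778E5EAB read attempt to address 0x00000000
"""
summary = summarize_stderr(sample)
print(summary)
```

Running it over the full dump rather than the shortened sample would give the real restart count for this WU.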
ID: 44286
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44287 - Posted: 2 Jun 2012, 19:40:32 UTC

It's gone. Just let it go.
The researchers want to know when a set of starting values produces results that aren't viable.
But these coupled ocean models also appear to have problems at certain points in the processing run.

The current work seems to be the much shorter EU models.


Backups: Here
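On the subject of backups, a rough Python sketch of a timestamped copy of the BOINC data folder might look like the following. It assumes the BOINC client has been fully shut down before copying, and the function name `backup_boinc_data` and the backup folder naming are invented for illustration; the data directory in the usage comment is the one shown in the log earlier in this thread, while the destination drive is purely hypothetical.

```python
import shutil
import time
from pathlib import Path

def backup_boinc_data(data_dir, backup_root):
    """Copy data_dir into a new timestamped folder under backup_root.

    Hypothetical helper: assumes BOINC is not running while the copy is made.
    Returns the path of the new backup folder.
    """
    src = Path(data_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"boinc-backup-{stamp}"
    # copytree creates dest (and missing parents) and raises if dest already exists
    shutil.copytree(src, dest)
    return dest

# Example call (destination is an assumption, adjust to your own setup):
# backup_boinc_data(r"C:\ProgramData\BOINC", r"D:\boinc-backups")
```

Keeping each copy in its own timestamped folder means an older backup is never overwritten by a newer one that may already contain a failing state.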
ID: 44287


©2024 cpdn.org