Message boards : Number crunching : Crashed WU
Message board moderation
Author | Message |
---|---|
Send message Joined: 8 Oct 08 Posts: 2 Credit: 932,088 RAC: 0 |
I have a WU that crashed just short of the 50% mark (about 49.8). Forunately, I made a backup merely 10 minutes before the crash. However, everytime I load the backup of the BOINC folder, the WU appears to run fine (gaining progress and making data), but then 10 minutes into the WU it crashes, apparently at the same spot. I also noticed that the error code on this one was significantly different from the usual ones I had: <core_client_version>6.12.34</core_client_version> <![CDATA[ <message> - exit code 193 (0xc1) </message> <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... 03:11:14 (1500): No heartbeat from core client for 30 sec - exiting Suspended CPDN Monitor - No 'heartbeat' from BOINC... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=5396, iMonCtr=1 Model crash detected, will try to restart... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=2148, iMonCtr=1 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3060, iMonCtr=1 Model crash detected, will try to restart... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=1668, iMonCtr=1 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=4060, iMonCtr=1 Model crash detected, will try to restart... Suspended CPDN Monitor - Suspend request from BOINC... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3492, iMonCtr=1 Model crash detected, will try to restart... CPDN Monitor - Quit request from BOINC... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=2116, iMonCtr=1 Model crash detected, will try to restart... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3988, iMonCtr=1 Model crash detected, will try to restart... CController:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3412, iMonCtr=1 Model crash detected, will try to restart... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3956, iMonCtr=1 Model crash detected, will try to restart... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x778E5EAB read attempt to address 0x00000000 Engaging BOINC Windows Runtime Debugger... Cannot serialize file C:\ProgramData\BOINC/projects/climateprediction.net/hadcm3n_o7n8_2060_40_007998308/dataout/shmem_restart.day Signal 11 received, exiting... Called boinc_finish </stderr_txt> ]]> I have an older backup, but it's quite old, and will take a rather long time to reach it's previous state. Is it worth doing it, or is this WU actually broken? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
It's gone. just let it go. The researchers want to know when a set of starting values produces results that aren't viable. But these coupled ocean models also appear to have problems at certain points in the processing run. The current work seems to be the much shorter EU models. Backups: Here |
©2024 cpdn.org