Thread 'HADSM3-MH Crash and Re-Set?'

Joined: 27 Feb 08
Posts: 41
Credit: 1,402,356
RAC: 0
Message 36654 - Posted: 9 Apr 2009, 2:18:15 UTC

My computer froze and I had to re-boot. I was 20% into my HADSM3-MH model, but then when I re-booted the model went back to less than 1% completed.

My HadCM3 and Hadam3p running on other cores seemed to come through fine, taking up where they had left off before the computer crash.

Frustrating about the HADSM3-MH; I don't know what will happen when it tries to trickle, since those results have already been uploaded!
Bob P.
ID: 36654
Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,486,251
RAC: 3,883
Message 36662 - Posted: 9 Apr 2009, 13:15:41 UTC - in response to Message 36654.  
Last modified: 9 Apr 2009, 13:22:43 UTC

My computer froze and I had to re-boot. I was 20% into my HADSM3-MH model, but then when I re-booted the model went back to less than 1% completed.

My HadCM3 and Hadam3p running on other cores seemed to come through fine, taking up where they had left off before the computer crash.

Frustrating about the HADSM3-MH; I don't know what will happen when it tries to trickle, since those results have already been uploaded!

Nothing will happen when it trickles up: the server will accept the trickle, but it won't show as a new result because the server already has it. Trickles will start showing again once the model reaches the trickle point following the last one already sent. Just check the running HadSM3-MH graphics and timings, though, to make sure it's running with normal graphics (not a blue 'iceworld') and that the timesteps are the same or nearly the same as before. If all this is OK, then the model should be OK and will eventually get past the point it had previously reached.

It's always worth taking an occasional backup of the BOINC directory; it can then be used to restart from that point should one or more WUs be lost for any reason.
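
As a rough illustration of that kind of occasional backup, here is a minimal Python sketch that copies the whole BOINC directory into a new timestamped folder. It assumes the BOINC client has been shut down first so the model files are not changing during the copy. The function name `backup_boinc_dir` and both paths are hypothetical, not part of BOINC or CPDN.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_dir(boinc_dir, backup_root, stamp=None):
    """Copy the whole BOINC directory into a new timestamped folder.

    boinc_dir and backup_root are hypothetical paths supplied by the caller.
    Assumption: the BOINC client is not running while the copy is made.
    Returns the path of the new backup folder.
    """
    src = Path(boinc_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"BOINC directory not found: {src}")

    # Use the supplied label, or fall back to the current local time.
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"boinc-backup-{stamp}"

    # copytree raises FileExistsError if dest already exists,
    # so an earlier backup is never overwritten.
    shutil.copytree(src, dest)
    return dest
```

Calling something like `backup_boinc_dir("C:/BOINC", "D:/backups")` (example paths only) would leave one dated copy per run. Restoring would mean stopping BOINC and copying a saved folder back over the live directory.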

EDIT: Looking at your results, it appears you've aborted them now anyway?
ID: 36662

Joined: 28 Oct 04
Posts: 64
Credit: 34,444,555
RAC: 0
Message 36663 - Posted: 9 Apr 2009, 17:19:37 UTC

I have a pair of HADSM3fub_kc?? running on an AMD X2 4600. In the past, it has peaked at approx. 700 RAC.

With these runs, it has peaked at about 350, half the usual. I checked, both copies are running, both cores active.

What's happening? Any suggestions?

ID: 36663
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 36664 - Posted: 9 Apr 2009, 17:49:25 UTC


Looks like one model stopped trickling on April 4th after the end of the first phase; however, it didn't get reported as an error until April 7th, when a new model was downloaded. So that's three days without credit on one core. The other model, downloaded on March 31st, and the new model appear to be moving along fine.
ID: 36664

Joined: 27 Feb 08
Posts: 41
Credit: 1,402,356
RAC: 0
Message 36665 - Posted: 9 Apr 2009, 18:50:16 UTC - in response to Message 36662.  

It's always worth taking an occasional backup of the BOINC directory; it can then be used to restart from that point should one or more WUs be lost for any reason.

EDIT: Looking at your results, it appears you've aborted them now anyway?

Thanks for the suggestion to make a backup. And yes, I'm afraid I aborted the work unit just out of frustration. I may replace it with something else.

Thanks again.
Bob P.
ID: 36665

Joined: 28 Oct 04
Posts: 64
Credit: 34,444,555
RAC: 0
Message 36666 - Posted: 9 Apr 2009, 19:21:38 UTC


Looks like one model stopped trickling on April 4th after the end of the first phase; however, it didn't get reported as an error until April 7th, when a new model was downloaded. So that's three days without credit on one core. The other model, downloaded on March 31st, and the new model appear to be moving along fine.

Thanks. The new numbers show it up at 421. But it was very mysterious, atypical for my systems.

ID: 36666
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36676 - Posted: 10 Apr 2009, 20:06:21 UTC

Bob, in the README collection about backups there's an item by Pete B describing how to restore a single model from a multimodel backup, which is very useful if just one model fails on a multicore machine. It's a rather long procedure, but it's clearly explained and has been tried and tested. It avoids unnecessarily restoring and recrunching the good models on the other cores.
ID: 36676

