Thread 'HADSM3-MH Crash and Re-Set?'

Author	Message
rbpeake Joined: 27 Feb 08 Posts: 41 Credit: 1,402,356 RAC: 0	Message 36654 - Posted: 9 Apr 2009, 2:18:15 UTC My computer froze and I had to re-boot. I was 20% into my HADSM3-MH model, but when I re-booted the model went back to less than 1% completed. My HadCM3 and Hadam3p running on other cores seemed to come through fine, taking up where they had left off before the computer crash. It's frustrating about the HADSM3-MH. I don't know what will happen when it tries to trickle, since that result has already been uploaded! Regards, Bob P.

Pete B Joined: 26 Aug 04 Posts: 67 Credit: 10,296,370 RAC: 10,502	Message 36662 - Posted: 9 Apr 2009, 13:15:41 UTC - in response to Message 36654. Last modified: 9 Apr 2009, 13:22:43 UTC "My computer froze and I had to re-boot. I was 20% into my HADSM3-MH model, but when I re-booted the model went back to less than 1% completed. My HadCM3 and Hadam3p running on other cores seemed to come through fine, taking up where they had left off before the computer crash. It's frustrating about the HADSM3-MH. I don't know what will happen when it tries to trickle, since that result has already been uploaded!" Nothing will happen when it trickles up. The server will accept the trickle, but it won't show as a result because the server already has it. The trickles will start showing again once the model reaches the trickle point following the last one already sent. Just check the running HadSM3-MH graphics and timings, though: make sure it's running with normal graphics (not a blue 'iceworld') and that the timesteps are the same or nearly the same as previously. If all this is OK, then the model should be OK and will eventually get past the point it had previously got to. It's always worth taking an occasional backup of the BOINC directory, which can then be used to restart from that point should one or more WUs be lost for any reason. EDIT: Looking at your results, it appears you've aborted them now anyway?
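
Editor's note: the occasional backup Pete B suggests could be sketched roughly as below. This is only a minimal illustration, not the procedure from the project's README. The function name `backup_boinc_dir` and both paths are hypothetical, and it assumes BOINC has been shut down first so that no model files change while they are being copied.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_dir(data_dir, backup_root):
    """Copy data_dir into a new timestamped folder under backup_root.

    Assumes the BOINC client is not running while the copy is made.
    Returns the path of the newly created backup folder.
    """
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"BOINC data directory not found: {src}")

    # A timestamp in the folder name keeps successive backups side by side.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"boinc-backup-{stamp}"

    # copytree raises an error if dest already exists,
    # so an earlier backup is never silently overwritten.
    shutil.copytree(src, dest)
    return dest
```

It would be called with the real locations on your own machine, for example `backup_boinc_dir(r"C:\Program Files\BOINC", r"D:\backups")`, where both paths are placeholders. Restoring is the reverse copy, again with BOINC stopped.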

old_user27607 Joined: 28 Oct 04 Posts: 64 Credit: 34,444,555 RAC: 0	Message 36663 - Posted: 9 Apr 2009, 17:19:37 UTC I have a pair of HADSM3fub_kc?? running on an AMD X2 4600. In past use, it peaked at approx. 700 RAC. With these runs, it has peaked at about 350, half the usual. I checked: both copies are running and both cores are active. What's happening? Any suggestions? BillN

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 36664 - Posted: 9 Apr 2009, 17:49:25 UTC Bill, it looks like http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7782677 stopped trickling on April 4th after the end of the first phase. However, it didn't get reported as an error until April 7th, when a new model was downloaded. So that's three days without credit on one core. The other model, downloaded on March 31st, and the new model appear to be moving along fine.

rbpeake Joined: 27 Feb 08 Posts: 41 Credit: 1,402,356 RAC: 0	Message 36665 - Posted: 9 Apr 2009, 18:50:16 UTC - in response to Message 36662. "It's always worth taking an occasional backup of the BOINC directory, this then can be used to restart from that point should there be a loss of one or more WU's for any reason. EDIT: Looking at your results, it appears you've aborted them now anyway?" Thanks for the suggestion to make a backup. And yes, I aborted the work unit just out of frustration, I'm afraid. I may replace it with something else. Thanks again. Regards, Bob P.

old_user27607 Joined: 28 Oct 04 Posts: 64 Credit: 34,444,555 RAC: 0	Message 36666 - Posted: 9 Apr 2009, 19:21:38 UTC "Bill, Looks like http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7782677 stopped trickling on April 4th after the end of the first phase, however, it didn't get reported as an error until April 7th, when a new model was downloaded. So three days without credit on one core there. The other model downloaded on March 31st and the new model appear to be moving along fine." Thanks. The new numbers show it up at 421. But it was very mysterious and atypical for my systems. BillN

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 36676 - Posted: 10 Apr 2009, 20:06:21 UTC Bob, in the README collection about backups there's an item by Pete B describing how to restore a single model from a multimodel backup, which is very useful if just one model fails on a multicore machine. It's a rather long procedure, but it's clearly explained and has been tried and tested. It avoids unnecessarily restoring and recrunching the good models on the other cores.