HADSM3-MH Crash and Re-Set?
Author | Message |
---|---|
Joined: 27 Feb 08 Posts: 41 Credit: 1,402,356 RAC: 0 |
My computer froze and I had to re-boot. I was 20% into my HADSM3-MH model, but then when I re-booted the model went back to less than 1% completed. My HadCM3 and Hadam3p models running on other cores seemed to come through fine, taking up where they had left off before the computer crash. Frustrating about the HADSM3-MH. I don't know what will happen when it tries to trickle, since that result had already been uploaded before! Regards, Bob P. |
Joined: 26 Aug 04 Posts: 67 Credit: 10,296,370 RAC: 10,502 |
"My computer froze and I had to re-boot. I was 20% into my HADSM3-MH model, but then when I re-booted the model went back to less than 1% completed." Nothing will happen when it trickles up: the server will accept the trickle, but it won't show as a new result because the server has already had it. The trickles will start showing again once the model reaches the trickle point following the last one already sent. Just check the running HadSM3-MH graphics and timings, though, to make sure it's running with normal graphics (not a blue 'iceworld') and that the timesteps are the same or nearly the same as previously. If all this is OK, then the model should be OK and eventually get past the point it had previously got to. It's always worth taking an occasional backup of the BOINC directory; this can then be used to restart from that point should there be a loss of one or more WUs for any reason. EDIT: Looking at your results, it appears you've aborted them now anyway? |
Joined: 28 Oct 04 Posts: 64 Credit: 34,444,555 RAC: 0 |
I have a pair of HADSM3fub_kc?? models running on an AMD X2 4600. In past use, it peaked at approx. 700 RAC. With these runs, it has peaked at about 350, half the usual. I checked: both copies are running and both cores are active. What's happening? Any suggestions? BillN |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Bill, Looks like http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7782677 stopped trickling on April 4th after the end of the first phase; however, it didn't get reported as an error until April 7th, when a new model was downloaded. So three days without credit on one core there. The other model downloaded on March 31st and the new model appear to be moving along fine. |
Joined: 27 Feb 08 Posts: 41 Credit: 1,402,356 RAC: 0 |
"It's always worth taking an occasional backup of the BOINC directory; this can then be used to restart from that point should there be a loss of one or more WUs for any reason." Thanks for the suggestion to make a backup. And yes, I aborted the work unit just out of frustration, I'm afraid. I may replace it with something else. Thanks again. Regards, Bob P. |
Joined: 28 Oct 04 Posts: 64 Credit: 34,444,555 RAC: 0 |
"Bill, Looks like http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7782677 stopped trickling on April 4th after the end of the first phase; however, it didn't get reported as an error until April 7th, when a new model was downloaded. So three days without credit on one core there. The other model downloaded on March 31st and the new model appear to be moving along fine." Thanks. The new numbers show it up at 421. But it was very mysterious and atypical for my systems. BillN |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Bob, in the README collection about backups there's an item by Pete B describing how to restore a single model from a multimodel backup, which is very useful if just one model fails on a multicore machine. It's a rather long procedure, but it's clearly explained and has been tried and tested. It avoids unnecessarily restoring and recrunching good models on the other cores. |
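
For anyone wanting to script the occasional backup suggested in this thread, here is a minimal sketch in Python. It simply copies the whole BOINC directory into a new timestamped folder. The function name `backup_boinc_dir`, the folder naming and the example paths are made up for illustration and are not part of BOINC or CPDN; the sketch also assumes you have shut BOINC down first so that no model files are being written during the copy.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_dir(data_dir: Path, backup_root: Path) -> Path:
    """Copy the whole BOINC directory into a new timestamped folder.

    Hypothetical helper, not part of BOINC. Assumes the BOINC client
    has been stopped so files are not modified while being copied.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"BOINC directory not found: {data_dir}")

    # One folder per backup, e.g. boinc-backup-20080410-213000
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"

    # copytree creates the target (and missing parents) and raises
    # FileExistsError rather than overwriting an existing backup.
    shutil.copytree(data_dir, target)
    return target


# Example call (both paths are assumptions; use your own locations):
# backup_boinc_dir(Path("C:/Program Files/BOINC"), Path("D:/backups"))
```

This takes a full copy every time, so it can use a lot of disk space if run often. Restoring a single model out of such a backup is a separate manual procedure, as described in the README item mentioned above.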