Message boards : Number crunching : hadcm3n restart from backup
Author | Message |
---|---|
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I'm totally amazed. I restarted a failed hadcm3n from backup, put it on a virtual machine, and it got past the fail point and is still running. That has never happened before: every one I ever restarted from backup croaked at the fail point, however far back I restored from. Backups finally did some good. This one, http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=14098560, is actually working: restored from backup and past the fail point. Are we still supposed to keep these restored losers running? It's already reported as failed. Whee! |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,841,902 RAC: 5,047 |
It may not be good news. One of the failure modes is a 'zombie' model, in which the model keeps running until it is restarted but produces no trickle files or Zip uploads. So have a look for a trickle upload once it has run long enough to produce one. If it hasn't produced a file then it's useless. If there are trickles and Zip uploads then, as far as the project is concerned, the model is as good as any other. [Edit: The model has produced four trickles since the crash, so it looks good.] Having said that, I have had one success trying to evade a decade crash (at least so far; it may still crash at another decade). That involved restarting from the beginning, before the model had even unzipped. My long-standing habit has been to run batches of models in parallel (i.e. starting at the same time, not interleaved), then to let a new batch download and suspend the new models. When the old batch has finished, a backup is made of the new batch, which will not even have unzipped. The original reason for doing that was that the backups are much smaller, but perhaps it may also be a work-around on this occasion (though massively inefficient for a multi-processor machine). |
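The 'zombie' check described above can be roughed out in a few lines of Python. This is only a minimal sketch: the directory to scan and the file-name patterns (`*trickle*`, `*.zip`) are assumptions for illustration, not the actual names CPDN uses, and the helper functions are hypothetical. It simply lists matching files modified after the time of the restart.

```python
from fnmatch import fnmatch
from pathlib import Path


def files_written_since(directory, since_epoch, patterns=("*trickle*", "*.zip")):
    """Return matching files under `directory` modified after `since_epoch`.

    `patterns` are placeholder guesses at trickle/zip file names; adjust
    them to whatever the model really writes on your machine.
    """
    hits = []
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        name = path.name.lower()
        matches = any(fnmatch(name, pat) for pat in patterns)
        if matches and path.stat().st_mtime > since_epoch:
            hits.append(path)
    return sorted(hits)


def looks_like_zombie(directory, since_epoch, patterns=("*trickle*", "*.zip")):
    """True when no trickle or zip file has appeared since the restart."""
    return not files_written_since(directory, since_epoch, patterns)
```

Call `looks_like_zombie(project_dir, restart_time)` with the restart time as a Unix timestamp, after the model has run long enough to be due a trickle. Files may be removed from disk once they have been uploaded, so an empty result is a hint rather than proof; the task's result page on the project site remains the more reliable check.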
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
So far so good. It computed past the next decade OK. The restore from backup is good so far, at +75%. Happy here. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Completed OK and uploaded the last big zip file. The task still shows a comp error on the web page, but the data got uploaded. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Good job! "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
©2024 cpdn.org