Questions and Answers : Unix/Linux : Model ran fine up to 53% then ran into trouble
Author | Message |
---|---|
Joined: 22 Oct 06 Posts: 2 Credit: 1,107,647 RAC: 0 |
Hi all, I've run several models successfully in the past. However, hadcm3iozn_cpzl_2000_80_135899450_3 started off like all the others and ran to 53%. After that it would continue to run, but it would not progress beyond a certain point in model time. My stats for CPDN went from a 50 degree slope to a flatline and stayed there for about 2 months. I had to abort this work unit. My questions: - Who else has had this kind of problem? - How should I have handled this (e.g. notify the BOINC team with the work-in-progress files) so the root cause could be addressed? - Would it have been better to figure out how to restart the unit from the beginning? - Do I just abort and let it get another unit? |
Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
Paul, could you post the result ID or unhide your PC? We'll then be able to have a look at the model in more detail. Iain |
Joined: 22 Oct 06 Posts: 2 Credit: 1,107,647 RAC: 0 |
Paul, Here's the work unit (unit ID is 6106824): http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6106824 And here's the host ID (493017): http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=493017 Is there an easy way to restart a work unit from the start (delete some key status file and restart the app)? I had a power failure at the wrong time last Dec and it might have gotten into a state it couldn't recover from. If I recall correctly, it doesn't (or rather didn't) compute past 10:30 am somewhere in the year 2043. Good luck. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
"Is there an easy way to restart a work unit from the start" Yes, from a backup of the BOINC folder that was made BEFORE the problem. Backups: Here |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
That task last trickled on 27 Dec and you aborted it yesterday, Paul. The best way to avoid this sort of thing happening in the future is to choose a method for regularly backing up the contents of your BOINC folder. Then, if a model goes wrong, you restore a backup and crunch from that point. Choose a method from the README about backups (link in my signature). I use Les's easy manual method, which only takes a few minutes. Nobody wants to have to go back to the start of a problem model, but with regular backups you can restore and crunch again from the last backup point, i.e. not much computer time is wasted. Cpdn news |
Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
"... I had a power failure at the wrong time last Dec and it might have gotten into a state it couldn't recover from." Actually, Paul, it may be a duff model. The next model to yours in the work unit (6106824) has stuck at the same point. It may be a coincidence, of course. So it may be that you couldn't have done anything differently. These later 5.44 models aren't supposed to loop, so people don't look for it any more ... |
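
For anyone wanting to script the regular backups recommended above, here is a minimal sketch in Python. It is not the method from the backups README and not Les's manual method; it only copies the whole BOINC folder to a timestamped directory and keeps the newest few copies. It assumes the BOINC client has been fully shut down before the copy starts, so that no files change mid-copy. The function name `backup_boinc_folder`, the `boinc-` folder prefix, and the `keep` parameter are all made up for illustration.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_folder(boinc_dir, backup_root, keep=3):
    """Copy boinc_dir into a new timestamped folder under backup_root.

    Hypothetical helper: assumes BOINC is NOT running while it is called.
    Returns the path of the new backup and deletes all but the `keep`
    newest backups (keep must be at least 1).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    boinc_dir = Path(boinc_dir)
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)

    # Build a sortable UTC name such as boinc-20240131-101500-123456789.
    # The nanosecond part keeps names unique and in creation order.
    while True:
        secs, frac = divmod(time.time_ns(), 10**9)
        stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(secs))
        target = backup_root / f"boinc-{stamp}-{frac:09d}"
        if not target.exists():
            break

    # copytree refuses to overwrite an existing folder, which is what we want.
    shutil.copytree(boinc_dir, target)

    # Names sort chronologically, so everything before the last `keep` is old.
    backups = sorted(
        p for p in backup_root.iterdir()
        if p.is_dir() and p.name.startswith("boinc-")
    )
    for old in backups[:-keep]:
        shutil.rmtree(old)
    return target
```

A call might look like `backup_boinc_folder("/path/to/BOINC", "/path/to/backups", keep=3)`, with both paths replaced by wherever your BOINC data folder and backup disk actually live. Restoring is the reverse: with BOINC stopped, replace the contents of the BOINC folder with the chosen backup and start BOINC again.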