Message boards : Number crunching : can't you start from earlier point?
Author | Message |
---|---|
Joined: 1 Oct 04 Posts: 10 Credit: 684,381 RAC: 0 |
I had a couple of full-resolution WUs die because the runtime environment fubarred. Too bad you can't roll them back a little instead of throwing out the 500 hours of computer time. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The models restart from the last checkpoint, unless that checkpoint was scrambled, which can happen in several ways, e.g. the power being turned off halfway through writing a checkpoint. And there are many ways of that happening. One is automatic updates that trigger an automatic reboot. |
Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368 |
I have only had one full-resolution model complete from a backup. There is another thread, http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7571, that, if you read it all the way through, explains some of the whys and wherefores of it. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
We used to advise members to make regular backups (which must be of the complete contents of the BOINC Data directory/folder), but that was when a lot of us were running 160-year models (current Hadcm models are 40 years), computers were slower than now, and it was usual for computers to have just one core or at most two. If, like ksnash, one has one computer with 4 cores and another with 8, it isn't necessarily a good idea to spend time finding out whether a crashed model will succeed second time around when this means rolling back all the other models to the same backup point. Sometimes the model will crash again at exactly the same point. There's a way to restore a single model from a multi-model backup, but when I read it I thought it looked fiendishly complicated. With multicore computers the best strategy is probably to reduce the likelihood of computer-related crashes while accepting that some model-based crashes are inevitable. * Always exit fully from BOINC before turning off the computer. * Exclude the whole of BOINC from AV scans. * If you overclock, make sure you test thoroughly for stability. * Check occasionally on the tasks you've run in your account, and if you see trends you don't like or understand, particularly in models' stderr messages, post on the forum. * Keep an eye on the News thread at the top of Number Crunching; you can subscribe to it to receive email notices of new posts there. * Don't run computers for too long without rebooting. |
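For anyone who does still want to take backups, here is a minimal sketch of the "complete contents of the BOINC Data directory" copy mentioned above. The function name and the folder locations are my own assumptions, not anything provided by BOINC or CPDN, and it assumes you have exited fully from BOINC first so that nothing is being written while the copy runs.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir: Path, backup_root: Path) -> Path:
    """Copy the complete contents of data_dir into a new timestamped folder.

    Hypothetical helper: data_dir is wherever your BOINC Data directory
    lives, backup_root is any folder with enough free space.
    Assumes BOINC has been fully exited before this is called.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"BOINC data directory not found: {data_dir}")

    # One folder per backup, named by local date and time.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-data-{stamp}"

    # copytree copies the whole tree, creates missing parent folders,
    # and raises FileExistsError if the target already exists.
    shutil.copytree(data_dir, target)
    return target
```

You would call it with your own paths, e.g. `backup_boinc_data(Path("/path/to/BOINC/data"), Path("/path/to/backups"))`. Restoring means copying the whole folder back with BOINC closed, which, as noted above, rolls back every model on that machine to the same point.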
©2024 cpdn.org