climateprediction.net (CPDN) home page
Thread 'can't you start from earlier point?'

Message boards : Number crunching : can't you start from earlier point?

old_user22286

Joined: 1 Oct 04
Posts: 10
Credit: 684,381
RAC: 0
Message 45972 - Posted: 19 Apr 2013, 6:30:47 UTC

I had a couple of full-resolution WUs die because the runtime environment got fubarred. Too bad you can't roll them back a little instead of throwing out the 500 hours of computer time.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45974 - Posted: 19 Apr 2013, 7:44:56 UTC - in response to Message 45972.  

The models restart from the last checkpoint, unless that checkpoint was scrambled. That can happen in several ways, e.g. the power being turned off halfway through writing a checkpoint. There are many ways for that to happen; one is automatic updates that trigger an automatic reboot.


Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 45975 - Posted: 19 Apr 2013, 8:57:40 UTC - in response to Message 45974.  

I have only had one full-resolution model complete from a backup. Another thread, http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7571, explains some of the whys and wherefores if you read it all the way through.
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 45982 - Posted: 19 Apr 2013, 19:59:16 UTC

We used to advise members to make regular backups (which must be of the complete contents of the BOINC Data directory/folder). But that was when a lot of us were running 160-year models (current Hadcm models are 40 years), computers were slower than now, and it was usual for computers to have just one core or at most two.
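For anyone who still wants to take such a backup, here is a minimal sketch of the idea, not an official CPDN or BOINC tool. It copies the complete contents of the data directory into a new timestamped folder. The function name `backup_boinc_data`, the folder naming scheme, and any paths you pass in are assumptions for illustration; the location of the BOINC Data directory differs between installations. It also assumes BOINC has been exited fully first, so that no files change while they are being copied.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir: Path, backup_root: Path) -> Path:
    """Copy the complete contents of data_dir into a new timestamped folder.

    Hypothetical helper: both paths are supplied by the caller, and BOINC
    is assumed to be fully shut down before this runs.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"BOINC data directory not found: {data_dir}")

    # One folder per backup, named by local date and time (assumed scheme).
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-data-{stamp}"

    # copytree copies the whole tree and raises an error if target already
    # exists, so an earlier backup is never overwritten.
    shutil.copytree(data_dir, target)
    return target
```

Restoring means copying the whole folder back, which is why every model on the machine returns to the same backup point.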

If one has, like ksnash, one computer with 4 cores and another with 8, it isn't necessarily a good idea to spend time finding out whether a crashed model will succeed the second time around, when this means rolling back all the other models to the same backup point. Sometimes the model will crash again at exactly the same point. There is a way to restore a single model from a multi-model backup, but when I read it I thought it looked fiendishly complicated.

With multicore computers the best strategy is probably to reduce the likelihood of computer-related crashes while accepting the inevitability of some model-based crashes occurring.

* Always exit fully from BOINC before turning off the computer.
* Exclude the whole of BOINC from AV scans.
* If you overclock make sure you test thoroughly for stability.
* Occasionally check the tasks you've run in your account, and if you see trends you don't like or understand, particularly in models' stderr messages, post on the forum.
* Keep an eye on the News thread at the top of Number Crunching - you can subscribe to it to receive email notices of new posts there.
* Don't run computers for too long without rebooting.

