climateprediction.net (CPDN) home page
Thread 'Lost Work Units'

old_user97788
Joined: 11 Sep 05
Posts: 4
Credit: 272,832
RAC: 0
Message 37878 - Posted: 23 Aug 2009, 12:10:48 UTC

I had a hard drive failure last spring and 2 WUs were lost. I thought if I reported them here we could get them back onto the server for someone to run instead of waiting for them to time out. They are:
Task ID 7891348, WU ID 6206148
Task ID 7872606, WU ID 6203052
Hope this helps.
Bruce ;-p
ID: 37878
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 37879 - Posted: 23 Aug 2009, 13:02:20 UTC - in response to Message 37878.  

Thanks for thinking of that, Bruce.

The process by which the project re-issues work units is rather opaque (to me, at least), and the very long models, such as the ones you had, may never be re-issued at all. For those work units they really want to finish, I suppose from time to time they check which work units have completions, or are likely to have completions, and re-issue the rest. I have certainly come across repeat batches of some of the shorter models. In any event, the project is a statistical exercise which doesn't expect or require every work unit to be completed. And some information is sent back to the server during the course of the model run, so the project can do quite a lot of what they need even if a model doesn't finish.

Bad luck with the disk failure.
ID: 37879
old_user97788
Joined: 11 Sep 05
Posts: 4
Credit: 272,832
RAC: 0
Message 37898 - Posted: 24 Aug 2009, 14:09:37 UTC - in response to Message 37879.  

Thanks for the quick reply, Iain. I noticed a few others that I've lost over the years, usually due to hard drive failures; the tasks are so big that they still haven't timed out. I guess I shouldn't worry too much about them then, but if you would like a complete list, that's not a problem, just ask.
Bruce
ID: 37898
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 37904 - Posted: 24 Aug 2009, 16:03:26 UTC

Lost models are not a problem. The server knows about them, and the people running that project will know too, IF they run a script to look through returned/running models.

When this project started, it was set up so that an uncompleted model which hadn't sent a trickle for about 60 days was flagged by the server as a 'lost model'. This allowed the project people to scan through the database and re-issue these uncompleted models.
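As an illustration only, the original 60-day check could be sketched as below. The `ModelRecord` fields and the `find_lost_models` name are hypothetical; the thread does not show the real server schema or script, and the 60-day threshold is the approximate figure quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ModelRecord:
    """Hypothetical record; the real CPDN database schema is not shown in this thread."""
    workunit_id: int
    last_trickle: datetime  # when the server last received a trickle for this model
    completed: bool


def find_lost_models(records, now, threshold_days=60):
    """Return unfinished models that have sent no trickle for more than threshold_days."""
    cutoff = now - timedelta(days=threshold_days)
    return [r for r in records if not r.completed and r.last_trickle < cutoff]


# Made-up example data, not real work units.
now = datetime(2009, 8, 24)
records = [
    ModelRecord(1, datetime(2009, 3, 1), completed=False),   # silent for months: lost
    ModelRecord(2, datetime(2009, 8, 10), completed=False),  # trickled recently: still running
    ModelRecord(3, datetime(2009, 1, 5), completed=True),    # finished: never counted as lost
]
print([r.workunit_id for r in find_lost_models(records, now)])  # prints [1]
```

The flagged list is what a re-issue script would have worked from under the old scheme.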

About 2 years ago, this changed.
Now, a batch of models is created from each basic data set, and that's it. Models that fail are no longer re-issued. If not enough models are being completed, another batch is generated, which may be in the same parameter space as the previous lot, or in a different area.

ID: 37904
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 37905 - Posted: 24 Aug 2009, 16:07:54 UTC

Anyone crunching long tasks like the currently available Mid-Holocene models would do well to back up the complete contents of the BOINC Data folder regularly. If the model crashes, the backup can be restored and the same model continued; this is the only way to continue a crashed model to completion.

There's a selection of methods in the README collection (see my signature link). I use Les's manual backup and restore methods, which only take minutes and in my experience are fail-safe as long as you completely exit from BOINC beforehand.
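For anyone who would rather script the copy, here is a minimal sketch of a whole-folder backup and restore. It is not one of the README methods; the function names are made up for illustration, the location of the data folder varies by system so it is passed in as a parameter, and the code assumes BOINC has already been completely exited (it does not check).

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_boinc_data(data_dir, backup_root):
    """Copy the whole data folder into a new timestamped folder under backup_root.

    Assumes BOINC has been completely exited first; this function does not check that.
    """
    data_dir = Path(data_dir)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{data_dir.name}-{stamp}"
    shutil.copytree(data_dir, target)  # raises FileExistsError if target already exists
    return target


def restore_boinc_data(backup_dir, data_dir):
    """Replace the data folder with a backup. This deletes the current folder contents."""
    data_dir = Path(data_dir)
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup_dir, data_dir)
```

Call `backup_boinc_data` with the path of your own data folder and a backup location on a different drive, so that a disk failure does not take the backup with it.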
Cpdn news
ID: 37905
