Thread 'Lost Work Units'

Author	Message
old_user97788 Joined: 11 Sep 05 Posts: 4 Credit: 272,832 RAC: 0	Message 37878 - Posted: 23 Aug 2009, 12:10:48 UTC I had a hard drive failure last spring and two WUs were lost. I thought that if I reported them here, they could be put back on the server for someone else to run instead of waiting for them to time out. They are: task ID 7891348 (WU ID 6206148) and task ID 7872606 (WU ID 6203052). Hope this helps. Bruce ;-p

Iain Inglis Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317	Message 37879 - Posted: 23 Aug 2009, 13:02:20 UTC - in response to Message 37878. Thanks for thinking of that, Bruce. The process by which the project re-issues work units is rather opaque (to me, at least), and the very long models, such as the ones you had, may never be re-issued at all. For the work units they really want to finish, I suppose they check from time to time which ones have completed or are likely to complete, and re-issue the rest. I have certainly come across repeat batches of some of the shorter models. In any event, the project is a statistical exercise that doesn't expect or require every work unit to complete. Some information is also sent back to the server during the course of a model run, so the project can do quite a lot of what it needs even if a model doesn't finish. Bad luck with the disk failure.

old_user97788 Joined: 11 Sep 05 Posts: 4 Credit: 272,832 RAC: 0	Message 37898 - Posted: 24 Aug 2009, 14:09:37 UTC - in response to Message 37879. Thanks for the quick reply, Iain. I've noticed a few other tasks that I've lost over the years, usually to hard drive failures, and because the tasks are so big they still haven't timed out. I guess I shouldn't worry too much about them then, but if you would like a complete list, that's not a problem, just ask. Bruce

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 37904 - Posted: 24 Aug 2009, 16:03:26 UTC Lost models are not a problem. The server knows about them, and the people running that project will know too, IF they run a script to look through returned/running models. When this project started, it was set up so that an uncompleted model that had not sent a trickle for about 60 days was flagged by the server as a 'lost model'. This allowed the project people to scan through the database and re-issue these uncompleted models. About 2 years ago, this changed. Now, a batch of models is created from each basic data set, and that's it. If models fail now, they aren't re-issued. If not enough models are being completed, another batch is generated, which may be in the same parameter space as the previous lot or in a different area.
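For readers who want to picture the original 60-day rule described above, here is a minimal sketch in Python. It is not the project's server code: the `Model` record, its field names, and the `find_lost_models` function are hypothetical, and the only detail taken from the post is the idea that an uncompleted model with no trickle for about 60 days is treated as lost.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Model:
    # Hypothetical record; the real server schema is not described in this thread.
    task_id: int
    last_trickle: datetime  # time the last trickle was received
    completed: bool


def find_lost_models(models, now, max_silence_days=60):
    """Return uncompleted models whose last trickle is older than the cutoff."""
    cutoff = now - timedelta(days=max_silence_days)
    return [m for m in models if not m.completed and m.last_trickle < cutoff]
```

Under this sketch, a model that finished is never flagged, however old its last trickle is, and a model that trickled within the last 60 days is still counted as running.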

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 37905 - Posted: 24 Aug 2009, 16:07:54 UTC Anyone crunching long tasks like the currently available Mid-Holocene models would do well to back up the complete contents of the BOINC Data folder regularly. If the model crashes, the backup can be restored and the same model continued; this is the only way to continue a crashed model to completion. There's a selection of methods in the README collection (see my signature link). I use Les's manual backup and restore methods, which only take minutes and in my experience are fail-safe as long as you completely exit from BOINC beforehand. Cpdn news
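As a rough illustration of the backup-and-restore idea above, here is a small Python sketch that copies a whole folder to a timestamped backup and copies it back later. It is not one of the README methods and not Les's procedure. The function names and the folder paths you pass in are assumptions, and it only makes sense to run it after BOINC has been completely exited, as the post says.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_data_folder(data_dir, backup_root, stamp=None):
    """Copy the whole data folder to backup_root/<name>-<stamp> and return that path.

    Assumes BOINC is fully shut down, so no files change during the copy.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"data folder not found: {data_dir}")
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{data_dir.name}-{stamp}"
    shutil.copytree(data_dir, target)  # fails if target already exists
    return target


def restore_data_folder(backup_dir, data_dir):
    """Replace the data folder with the contents of a previous backup."""
    data_dir = Path(data_dir)
    if data_dir.exists():
        shutil.rmtree(data_dir)  # discard the crashed state entirely
    shutil.copytree(backup_dir, data_dir)
```

The restore step deletes the current folder before copying, which matches the "complete contents" advice: mixing files from a crashed run with files from a backup is not something this sketch attempts.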