Message boards : Number crunching : Lost Work Units
Author | Message |
---|---|
Send message Joined: 11 Sep 05 Posts: 4 Credit: 272,832 RAC: 0 |
I had a hard drive failure last spring and 2 WUs were lost. I thought that if I reported it here, we could get them back onto the server for someone to run instead of waiting for them to time out. They are: task ID 7891348 (WU ID 6206148) and task ID 7872606 (WU ID 6203052). Hope this helps. Bruce ;-p |
Send message Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
Thanks for thinking of that, Bruce. The process by which the project re-issues work units is rather opaque (to me, at least), and the very long models, such as the ones you had, may never be re-issued at all. For those work units they really want to finish, I suppose from time to time they check to see which work units have completions or are likely to have completions and re-issue the rest. I have certainly come across repeat batches of some of the shorter models. In any event, the project is a statistical exercise which doesn't expect or require every work unit to be complete. And some information is sent back to the server during the course of the model run, so the project can do quite a lot of what they need even if a model doesn't finish. Bad luck with the disk failure. |
Send message Joined: 11 Sep 05 Posts: 4 Credit: 272,832 RAC: 0 |
Thanks for the quick reply, Iain. I've noticed a few others that I've lost over the years, usually due to hard drive failures; with the tasks being so big, they still haven't timed out yet. I guess I shouldn't worry too much about them then, but if you would like a complete list, that's not a problem, just ask. Bruce |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Lost models are not a problem. The server knows about them, and the people running that project will know too, IF they run a script to look through returned/running models. When this project started, it was set up so that a model that hadn't completed and hadn't sent a trickle for about 60 days was flagged by the server as a 'lost model'. This allowed the project people to scan through the database and re-issue these uncompleted models. About 2 years ago, this changed. Now, a batch of models is created from each basic data set, and that's it. If models fail now, they aren't re-issued. If not enough models are being completed, then another batch is generated, which may be in the same parameter space as the previous lot, or in a different area. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Anyone crunching long tasks like the currently available Mid-Holocene models would do well to back up the complete contents of the BOINC Data folder regularly. If the model crashes, the backup can be restored and the same model continued; this is the only way to continue a crashed model to completion. There's a selection of methods in the README collection (see my signature link). I use Les's manual backup and restore methods, which only take minutes and in my experience are fail-safe as long as you completely exit from BOINC beforehand. |
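
For anyone who would rather script the backup than do it by hand, here is a rough Python sketch of the idea in the last post: copy the whole BOINC Data folder to a timestamped folder, and copy it back after a crash. This is not Les's method from the README collection, just an illustration. The function names and the folder paths are made up for the example, so substitute your own, and exit BOINC completely before running either step.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir: Path, backup_root: Path) -> Path:
    """Copy the whole data folder into a new timestamped folder and return its path."""
    if not data_dir.is_dir():
        raise FileNotFoundError(f"BOINC data folder not found: {data_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-data-{stamp}"
    # copytree creates the target (and any missing parents) and refuses to
    # overwrite an existing folder, so an older backup is never clobbered.
    shutil.copytree(data_dir, target)
    return target


def restore_boinc_data(backup_dir: Path, data_dir: Path) -> None:
    """Replace the current data folder with the contents of a backup."""
    if not backup_dir.is_dir():
        raise FileNotFoundError(f"Backup folder not found: {backup_dir}")
    if data_dir.exists():
        # Remove the crashed state first so no stale files are left behind.
        shutil.rmtree(data_dir)
    shutil.copytree(backup_dir, data_dir)


if __name__ == "__main__":
    # Hypothetical paths: adjust to wherever your BOINC data folder lives.
    saved = backup_boinc_data(Path("/var/lib/boinc-client"), Path("/mnt/backups"))
    print(f"Backed up to {saved}")
```

Copying the complete folder rather than picking out individual files follows the advice above to back up the complete contents; a partial copy may not restore cleanly.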
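
The old 'lost model' rule described earlier in the thread (no trickle for about 60 days on a model that hadn't completed) can be sketched in a few lines of Python. This is only a guess at the logic, not the project's actual server script: the `Model` record, its field names, and the exact 60-day cutoff are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# "About 60 days" per the post above; the exact threshold is an assumption.
LOST_AFTER = timedelta(days=60)


@dataclass
class Model:
    """Hypothetical record of one issued model (not the real server schema)."""
    name: str
    last_trickle: datetime
    completed: bool


def find_lost_models(models: List[Model], now: datetime) -> List[Model]:
    """Return uncompleted models whose last trickle is older than the cutoff."""
    return [
        m for m in models
        if not m.completed and now - m.last_trickle > LOST_AFTER
    ]


if __name__ == "__main__":
    now = datetime(2024, 6, 1)
    models = [
        Model("model_a", datetime(2024, 5, 20), completed=False),  # recent trickle
        Model("model_b", datetime(2024, 1, 10), completed=False),  # silent too long
        Model("model_c", datetime(2024, 1, 10), completed=True),   # finished
    ]
    for lost in find_lost_models(models, now):
        print(f"{lost.name} would be flagged as lost")
```

Under the newer scheme described in the same post, nothing would be re-issued from such a list; the project would generate a fresh batch instead.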
©2024 cpdn.org