Message boards : Number crunching : No Trickles in Task Details
Author | Message |
---|---|
Send message Joined: 15 Mar 06 Posts: 41 Credit: 3,581,078 RAC: 0 |
Stumbled across this "hole". It appears that models (some, anyway) downloaded early on the 12th have lost their workunit data. Link to my affected machine in the "Tasks for Computer" page here. The 4 AM3P models affected downloaded shortly before 1:00 UTC on the 12th. Two are finished, two still running. Go to each details page and it says "no trickles". This appears to be the normal page format before the first trickle is received. Click on the workunit and you get a "can't find workunit" message. The models are trickling up just fine, so no problem at this end. Checked all other team members' records and yes, one other has 6 AM3P with missing trickles, all downloaded early on the 12th. Apologies if this is a known problem, but I could not find anything about it in this forum. Edit: Forgot to add that my 4 affected models had their first 5 trickles (20%) all successfully received before disappearing, probably after about one day. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Ian, thanks for your report. I've seen your 4 affected models. Carl did say, I think on Sunday evening or Monday morning, that a group of recent workunits had failed to copy from the old database into the new one. He abandoned attempts to make them copy over and sent affected models a killer trickle, which I believe produces code 99. If, as I suspect, your models are part of the affected group, it looks as if the killer trickle hasn't worked, because it should kill running models the next time they contact the server. I'll report your post to Carl. In the meantime, could you please see whether you can download a couple of new models and then suspend your remaining two from 12 Aug while we find out what you should do with them. (If you suspend them first, BOINC won't let you fetch new work.) Cpdn news |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Ian, you said 'Two are finished, two still running.' Have the remaining two trickled since Sunday 6 pm UK time? Cpdn news |
Send message Joined: 15 Mar 06 Posts: 41 Credit: 3,581,078 RAC: 0 |
[quote] Ian, you said 'Two are finished, two still running.' Have the remaining two trickled since Sunday 6 pm UK time? [/quote] Sorry, Mo! I've been busy elsewhere since my post and did not get back here until now. The other 2 models that downloaded on 12 Aug both finished early today (19th). All 4 have been reported as successfully completed with full credits. Just no trickle data. In the BOINC client, trickles were being sent as normal all along. No strange messages. Meantime, all 4 models have been replaced with another 4 AM3Ps, including a pair which replaced the 2 that completed early today. All 4 have trickles correctly recorded. Sorry again that I did not get back sooner to try what you suggested. |
Send message Joined: 20 Feb 06 Posts: 158 Credit: 1,251,176 RAC: 0 |
Ian, not quite "full credit". They got 2079.00 for 72,000 time steps, not the full 2081.77 for 72,096 time steps. Your previous tasks back in July got their full credits. Keith. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Tolu posted yesterday in a thread about HadAM3P to say that he'll look into the final missing trickles, which are affecting all of us. As long as the three zip files upload, the model data's gone home. Ian, your 4 models downloaded on 12 Aug came from a second, smaller black hole that happened when Carl upgraded the CPDN database. The big hole was from April this year. It's been fully restored. The smaller hole consists (if I've understood properly) of models downloaded around 12 Aug after the old database had been upgraded but before newly downloaded models were fed into the new database. Last night when we reported your problem, Carl did a quick restore of these missing models into the new database but, for speed, minus their trickle records and apparently still minus their WU pages. He did it this quick way to avoid disabling the server data program again for long. The trickles don't transfer scientific data, which is all in the zip files. Anyway, your final zip files all transferred properly, so these models' data is safely home. Thanks for reporting this problem, as your post got another post-upgrade problem sorted out. Cpdn news |
Send message Joined: 15 Mar 06 Posts: 41 Credit: 3,581,078 RAC: 0 |
Thank you, Mo. :-) [quote] ... Anyway, your final zip files all transferred properly so these models' data is safely home. ... [/quote] Good, that's the main thing! Back there, Keith correctly pointed out that my 4 models all "completed" without the final small "post processing" trickle, i.e. 2,079 credits (72,000 steps) instead of 2,081.77 (72,096 steps). There was an explanation in one of the 2 boards a few weeks ago - just can't locate the post (probably one of yours, Mo?). As far as I'm concerned (and for our team's models-completed stats), a HadAM3P finishes (with all research data) when it reaches step 72,000 - yes? Just checked my own records for 104 completed HadAM3P models. Only 12, including (allegedly) the recent 4, did not send the final trickle. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
|
Send message Joined: 15 Mar 06 Posts: 41 Credit: 3,581,078 RAC: 0 |
[quote] I don't think I've ever attempted to explain the missing final trickle. But see the posts dated 18 Aug by Carl and Tolu here. [/quote] Okay, thanks again. That explains a lot. This Windows/Intel user never went into the long Mac thread - in my defence! The following one-liner in the Mac thread is good enough for me - what Carl said. Meantime, I will keep a closer eye on how every HadAM3P model finishes (next up tomorrow). End of thread :-) |
Send message Joined: 9 Jan 05 Posts: 30 Credit: 434,469 RAC: 0 |
For what it's worth, this task seems similar. It's trickled just twice and got credit, but neither trickle has appeared. I've had the task suspended since, thinking this was a temporary part of the server issues. I believe the last trickle was on August 16th. I also see a trickle_down_0 file in the slot dir with that date, containing <abort>cleanup</abort>. Is that a killer trickle? FWIW, I resumed the task for a bit and it shows no sign of stopping. I've been wondering if it was safe to go back in the CPDN water with respect to the server and database issues. What do you advise here: should I let this short task go to completion, or abort it and try another? Thanks [B^S] sTrey |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
A "killer trickle" will give an Error 99. It should abort the task. [quote] I resumed the task for a bit [/quote] From a backup? |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Thanks for your report, sTrey. You downloaded this task on 12 August (same date as iansm's tasks). The task's workunit #6509003 can't be found on the new database. I'm afraid this WU is part of the smaller 'black hole'. Carl said about this smaller hole: "They're the ones that are 'killer trickled' as I think workunits were being created as I was archiving some, so the id's are not valid." The <abort>cleanup</abort> in your slot directory was probably intended to be a killer trickle, but it appears to have been ineffective. When Carl released a killer trickle a few years ago to eliminate a batch of defective BBC models, as far as I know all the models crashed immediately the next time they contacted the server; they all had a -99 error code. But the BBC models all had valid WU IDs. You will need to abort this task, sorry. If you continue crunching it, I don't think you will receive credit for the trickles, and I don't think the data will be usable because it won't go into the new database. Cpdn news |
Send message Joined: 9 Jan 05 Posts: 30 Credit: 434,469 RAC: 0 |
Thanks Mo, that saves wasted effort. (Les, I had no backup yet; I had just suspended the task for the last week or more, and resumed it for a while to verify it wouldn't terminate itself.) I've aborted the task and will get another. |
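
For anyone checking the credit figures Keith quoted earlier in the thread (2079.00 for 72,000 time steps versus 2081.77 for 72,096), here is a minimal sketch of the arithmetic. It assumes credit is granted in direct proportion to time steps, which the two quoted figures are consistent with but which nobody in the thread states outright. The function names are made up for illustration and are not part of BOINC or CPDN.

```python
# Figures quoted in this thread for a HadAM3P task.
SHORT_STEPS = 72_000     # steps credited when the final trickle is missing
SHORT_CREDIT = 2079.00
FULL_STEPS = 72_096      # steps credited when the final trickle arrives
FULL_CREDIT = 2081.77


def credit_per_step(credit: float, steps: int) -> float:
    """Credit per time step, ASSUMING credit scales linearly with steps."""
    return credit / steps


def credit_for_steps(steps: int, rate: float) -> float:
    """Credit for a number of steps at the given rate, rounded to 2 places."""
    return round(steps * rate, 2)


# Rate derived from the short (72,000-step) figure.
rate = credit_per_step(SHORT_CREDIT, SHORT_STEPS)

# Credit predicted for the full run, and the shortfall from the missing
# final trickle.
predicted_full = credit_for_steps(FULL_STEPS, rate)
missing = round(FULL_CREDIT - SHORT_CREDIT, 2)

print(f"rate per step:        {rate:.6f}")
print(f"predicted full credit: {predicted_full}")
print(f"missing credit:        {missing}")
```

Under that linear assumption, the rate derived from the 72,000-step figure reproduces the quoted full credit for 72,096 steps, so the shortfall corresponds to the last 96 steps only.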