climateprediction.net (CPDN) home page
Thread 'No Trickles in Task Details'

Message boards : Number crunching : No Trickles in Task Details

old_user353238
Joined: 15 Mar 06
Posts: 41
Credit: 3,581,078
RAC: 0
Message 37820 - Posted: 18 Aug 2009, 11:39:21 UTC
Last modified: 18 Aug 2009, 12:26:31 UTC

Stumbled across this "hole". It appears that models (some anyway) downloaded early on the 12th have lost their workunit data?

Link to my affected machine in the "Tasks for Computer" page here.

The 4 AM3P models affected downloaded shortly before 1:00 UTC on the 12th. Two are finished, two still running.
Go to each details page - and "no trickles". This appears to be the normal page format before the first trickle is received.

Edit: Forgot to add that my 4 affected models had their first 5 trickles (20%) all successfully received - before disappearing, probably after about one day.

Click on the workunit, you get "can't find workunit" message.

The models are trickling up just fine, so no problem at this end.

Checked all other team members' records and yes, one other has 6 AM3P with missing trickles all downloaded early on the 12th.

Apologies if this is a known problem but could not find anything about it in this forum.

EDIT: Forgot to add that the 4 models had their first 5 trickles (20%) received before disappearing after about one day.
ID: 37820
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 37825 - Posted: 18 Aug 2009, 18:10:11 UTC
Last modified: 18 Aug 2009, 18:12:04 UTC

Hi Ian

Thanks for your report. I've seen your 4 affected models. Carl did say, I think on Sunday evening or Monday morning, that a group of recent workunits had failed to copy from the old database into the new one. He abandoned attempts to make them copy over and sent affected models a killer trickle which I believe produces code 99.

If as I suspect your models are part of the affected group, it looks as if the killer trickle hasn't worked because it should kill running models the next time they contact the server.

I'll report your post to Carl. In the meantime could you please see whether you can download a couple of new models and then suspend your remaining two from 12 Aug while we find out what you should do with them. (If you suspend them first, BOINC won't let you fetch new work.)
Cpdn news
ID: 37825
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 37826 - Posted: 18 Aug 2009, 18:24:07 UTC

Ian, you said 'Two are finished, two still running.' Have the remaining two trickled since Sunday 6 pm UK time?

Cpdn news
ID: 37826
old_user353238
Joined: 15 Mar 06
Posts: 41
Credit: 3,581,078
RAC: 0
Message 37840 - Posted: 19 Aug 2009, 18:02:43 UTC - in response to Message 37826.  

Ian, you said 'Two are finished, two still running.' Have the remaining two trickled since Sunday 6 pm UK time?


Sorry, Mo! I've been busy elsewhere since my post and did not get back here until now.

The other 2 models that downloaded on 12 Aug both finished early today (19th).
All 4 have been reported as successfully completed with full credits.
Just no trickle data. In BOINC client, trickles were being sent as normal all along. No strange messages.

Meantime all 4 models have been replaced with another 4 AM3Ps, including a pair which replaced the 2 that completed early today.
All 4 have trickles correctly recorded.

Sorry again I did not get back sooner to try what you suggested.
ID: 37840
old_user294426
Joined: 20 Feb 06
Posts: 158
Credit: 1,251,176
RAC: 0
Message 37842 - Posted: 19 Aug 2009, 19:04:15 UTC - in response to Message 37840.  


......... All 4 have been reported as successfully completed with full credits. ...........


Ian
Not quite "full credit".
They got 2079.00 for 72,000 time steps.
Not the full 2081.77 for 72,096 time steps.
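
For anyone who wants to check the arithmetic: the two figures are consistent with a flat credit rate per time step. Here is a quick Python sketch, assuming credit scales linearly with the number of steps (that is an assumption, not something confirmed in this thread); the variable names are made up for illustration.

```python
# Assumption: credit is awarded at a flat rate per model time step.
STEPS_WITHOUT_FINAL_TRICKLE = 72_000   # steps credited on the affected tasks
STEPS_FULL_RUN = 72_096                # steps credited on a fully reported task
CREDIT_WITHOUT_FINAL_TRICKLE = 2079.00 # credit shown for the affected tasks

# Derive the per-step rate from the affected tasks, then scale to a full run.
credit_per_step = CREDIT_WITHOUT_FINAL_TRICKLE / STEPS_WITHOUT_FINAL_TRICKLE
full_credit = credit_per_step * STEPS_FULL_RUN
missing_credit = full_credit - CREDIT_WITHOUT_FINAL_TRICKLE

print(f"credit per step: {credit_per_step:.6f}")
print(f"full-run credit: {full_credit:.2f}")
print(f"credit missing:  {missing_credit:.2f}")
```

On that assumption the full-run figure comes out at 2081.77, so the shortfall is just the last 96 steps.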

Your previous tasks back in July got their full credits.

Keith.
ID: 37842
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 37845 - Posted: 19 Aug 2009, 20:30:19 UTC
Last modified: 19 Aug 2009, 20:40:40 UTC

Tolu posted yesterday in a thread about HadAM3P to say that he'll look into the final missing trickles which are affecting all of us. As long as the three zip files upload, the model data's gone home.

Ian, your 4 models downloaded on 12 Aug came from a second smaller black hole that happened when Carl upgraded the CPDN database. The big hole was from April this year. It's been fully restored.

The smaller hole consists (if I've understood properly) of models downloaded around 12 Aug after the old database had been upgraded but before newly downloaded models were fed into the new database.

Last night when we reported your problem Carl did a quick restore of these missing models into the new database, but, for speed, minus their trickle records and apparently still minus their WU pages. He did it this quick way to avoid disabling the server data program again for long. The trickles don't transfer scientific data, which is all in the zip files.

Anyway, your final zip files all transferred properly so these models' data is safely home.

Thanks for reporting this problem as your post got another post-upgrade problem sorted out.
Cpdn news
ID: 37845
old_user353238
Joined: 15 Mar 06
Posts: 41
Credit: 3,581,078
RAC: 0
Message 37851 - Posted: 20 Aug 2009, 9:13:31 UTC - in response to Message 37845.  

Thank you, Mo. :-)

[quote]
...
Anyway, your final zip files all transferred properly so these models' data is safely home.
...
[/quote]

Good, that's the main thing!

Back there, Keith correctly pointed out that my 4 models all "completed" without the final small "post processing" trickle
- i.e. 2,079 credits (72,000 steps) instead of 2,081.77 (72,096 steps).

There was an explanation in one of the 2 boards a few weeks ago - just can't locate the post (probably one of yours, Mo?).

As far as I'm concerned (and for our team's completed-model stats), a HadAM3P finishes (with all research data) when it reaches step 72,000 - yes?
Just checked my own records for 104 completed HadAM3P models. Only 12, including (allegedly) the recent 4, did not send the final trickle.

ID: 37851
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 37852 - Posted: 20 Aug 2009, 9:24:51 UTC

I don't think I've ever attempted to explain the missing final trickle. But see the posts dated 18 Aug by Carl and Tolu here.
Cpdn news
ID: 37852
old_user353238
Joined: 15 Mar 06
Posts: 41
Credit: 3,581,078
RAC: 0
Message 37853 - Posted: 20 Aug 2009, 10:51:19 UTC - in response to Message 37852.  

I don't think I've ever attempted to explain the missing final trickle. But see the posts dated 18 Aug by Carl and Tolu here.

Okay, thanks again. That explains a lot. This Windows/Intel user never went into the long MAC thread - in my defence!

The following one liner in the MAC thread is good enough for me - what Carl said.

Meantime, will keep a closer eye on how every HadAM3P model finishes (next up tomorrow).

End of thread :-)
ID: 37853
[B^S] sTrey
Joined: 9 Jan 05
Posts: 30
Credit: 434,469
RAC: 0
Message 37883 - Posted: 24 Aug 2009, 1:53:55 UTC

For what it's worth, this task seems similar. It's trickled just twice, got credit but neither trickle has appeared. I've had the task suspended since, thinking this was a temporary part of the server issues.

I believe the last trickle was on August 16th. I also see a trickle_down_0 file in the slot dir with that date, containing <abort>cleanup</abort>. Is that a killer trickle? fwiw I resumed the task for a bit and it shows no sign of stopping.

I've been wondering if it was safe to go back in the CPDN water wrt server and database issues. What do you advise here, should I let this short task go to completion or abort it and try another?

Thanks
[B^S] sTrey
ID: 37883
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 37884 - Posted: 24 Aug 2009, 2:20:32 UTC

A "killer trickle" will give an Error 99. It should Abort the task.

I resumed the task for a bit

From a backup?
ID: 37884
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 37885 - Posted: 24 Aug 2009, 2:36:58 UTC
Last modified: 24 Aug 2009, 2:37:18 UTC

Thanks for your report, sTrey.

You downloaded this task on 12 August (same date as iansm's tasks). The task's workunit #6509003 can't be found on the new database. I'm afraid this WU is part of the smaller 'black hole'. Carl said about this smaller hole: "They're the ones that are 'killer trickled' as I think workunits were being created as I was archiving some, so the id's are not valid."

<abort>cleanup</abort> in your slot directory was probably intended to be a killer trickle, but it appears to have been ineffective. When Carl released a killer trickle a few years ago to eliminate a batch of defective BBC models, as far as I know all the models crashed immediately the next time they contacted the server; they all had a -99 error code. But the BBC models all had valid WU IDs.
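
If anyone else wants to check whether a task has been sent the same marker, a rough Python sketch follows. It only looks for the <abort>...</abort> text quoted above in files named like trickle_down_0; the rest of the file layout, the function names and the example slot path are assumptions for illustration, not documented BOINC or CPDN behaviour.

```python
import re
from pathlib import Path

# Matches the marker reported in this thread, e.g. <abort>cleanup</abort>.
ABORT_TAG = re.compile(r"<abort>\s*(.*?)\s*</abort>", re.DOTALL)


def find_abort_reason(text):
    """Return the text inside the first <abort> tag, or None if absent."""
    match = ABORT_TAG.search(text)
    return match.group(1) if match else None


def scan_slot_dir(slot_dir):
    """Map each trickle_down_* file in slot_dir that carries an abort marker to its reason."""
    results = {}
    for path in sorted(Path(slot_dir).glob("trickle_down_*")):
        reason = find_abort_reason(path.read_text(errors="replace"))
        if reason is not None:
            results[path.name] = reason
    return results


if __name__ == "__main__":
    # Hypothetical location; point this at your own BOINC slot directory.
    print(scan_slot_dir("slots/0"))
```

An empty result only means no such marker was found in that directory; it says nothing about whether the workunit exists in the new database.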

You will need to abort this task, sorry. If you continue crunching it I don't think you will receive credit for the trickles and I don't think the data will be usable because it won't go into the new database.
Cpdn news
ID: 37885
[B^S] sTrey
Posts: 30
Credit: 434,469
RAC: 0
Message 37888 - Posted: 24 Aug 2009, 5:58:37 UTC

Thanks Mo, that saves wasted effort. (Les, I had no backup yet; I had just suspended the task for the last week or more, and resumed it for a while to verify it wouldn't terminate itself.) I've aborted the task and will get another.
ID: 37888