climateprediction.net (CPDN) home page
Thread 'Announcement: Database residual problem - misallocated WUs'

old_user5322

Joined: 31 Aug 04
Posts: 4
Credit: 52,004
RAC: 0
Message 12959 - Posted: 30 May 2005, 23:10:16 UTC
Last modified: 30 May 2005, 23:10:40 UTC

Hi there. My computer (host 7970) had just finished result 1wgn_200109632_1 (result ID 622983), but the result doesn't appear to have been received. It is now just over 5% of the way through result 42oc_100211998_0 (result ID 886824), with no credit accruing. Clicking on the result ID reveals that the host is 71110. Also:

stderr out

4.25
There are no child processes to wait for. (0x80) - exit code 128 (0x80)



Is that significant?

Moose.
old_user2467

Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 12968 - Posted: 31 May 2005, 6:21:44 UTC

Have a look at <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/trickle.php?resultid=857093">result 857093</a>: it is listed under host 86867, but until yesterday host 5957 was trickling on it, and now it is host 103138. This is very strange and shows that things are going more and more wrong.

Ciao
old_user26115
Joined: 21 Oct 04
Posts: 24
Credit: 207,633
RAC: 0
Message 12984 - Posted: 31 May 2005, 19:37:45 UTC - in response to Message 12968.  
Last modified: 31 May 2005, 19:39:07 UTC

> Have a look at <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/trickle.php?resultid=857093">result 857093</a>: it is listed under host 86867, but until yesterday host 5957 was trickling on it, and now it is host 103138. This is very strange and shows that things are going more and more wrong.
>
> Ciao

IMO: Host 103138 is the 'owner' of this result, and somehow the trickling changed back.

greetz littleBouncer
old_user2467

Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 12987 - Posted: 31 May 2005, 20:32:00 UTC - in response to Message 12984.  

> > Have a look at <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/trickle.php?resultid=857093">result 857093</a>: it is listed under host 86867, but until yesterday host 5957 was trickling on it, and now it is host 103138. This is very strange and shows that things are going more and more wrong.
> >
> > Ciao
>
> IMO: Host 103138 is the 'owner' of this result, and somehow the trickling changed back.
>
> greetz littleBouncer

The "owner" as shown on the <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=857093">result page</a> is my host 86867. But I'm not working on it.
crandles
Volunteer moderator

Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 12992 - Posted: 31 May 2005, 21:21:18 UTC

Result 857093 is the first I have seen allocated three times. It is allocated to host 86867, which is not running it. Hosts 103138 and probably 5957 are running it. Ideally host 5957 (Damitch) would abort this model. I know of no way to try to make contact. (Had there been a team, I could have tried the team web site.)
dajashby

Joined: 1 Sep 04
Posts: 55
Credit: 17,223,688
RAC: 967
Message 12994 - Posted: 31 May 2005, 22:34:28 UTC

I have a similar problem with my host 71544, which seems to have hijacked result 865168 from host 92060. It has contributed 9 trickles since 26th May but received no credit for them, and the result is not listed in its result list. Should I abort the result by resetting the project?
Derrick Ashby
talister

Joined: 31 Aug 04
Posts: 2
Credit: 21,646,385
RAC: 10,617
Message 13005 - Posted: 1 Jun 2005, 15:26:26 UTC

I have a workunit, 3woq_100204159, which has been downloaded by my host 85752 and has 041034 timesteps completed, but it hasn't trickled and isn't listed in the results section. I was watching it when it came up to the trickle at 32406: the timestep jumped from 32401 to 32407, and so it never trickled.
crandles
Volunteer moderator

Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 13008 - Posted: 1 Jun 2005, 16:40:18 UTC

Talister, that does sound strange. If I have got it correct, that is a dual Opteron and one processor is trickling but the other isn't. My guess is that it tried to trickle, but that trickle had already been completed by someone else, so it was rejected. I would abort that WU.
crandles
Volunteer moderator

Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 13010 - Posted: 1 Jun 2005, 16:53:38 UTC

Dajashby & Josgre

It seems a pity to waste models that are half complete if they can be saved. I just don't know if they can be saved from the point of view of the science or of the credits.

If you think someone else is running them, aborting may well be the preferable option. If there is no sign of trickles from other machines, then I am a bit stumped as to what to say. I don't want to give false hope; OTOH I also do NOT want to say, or give the impression, that you may as well give up on such units.
old_user13614

Joined: 6 Sep 04
Posts: 6
Credit: 195,123
RAC: 0
Message 13013 - Posted: 1 Jun 2005, 17:12:48 UTC

Just to check a bit further - the misallocated WU I reported above (WU 565036) has now progressed into phase 2. On the results page, however, the graphs for temperature and precipitation in phase 1 are not available as links.

In addition, on the results page, the "stderr out" does have a value indicating that the unit failed with exit code 5.

My machine continues to crunch, and the trickles keep piling up. My suspicion is that the important data is all here on my machine, but the bogus exit code (perhaps submitted by another user on this unit, perhaps just some other weird glitch) is preventing the graphs from being available.

I am happy to do whatever is most appropriate (abandon the model / stop and backup the data until the problem can be addressed / just keep crunching / ...) and realize that no one may really have the best answer at this point.

If someone wants to contact me directly, my email is djd at isd dot net (slightly disguised from the spammers' spiders).

Thanks,
talister

Joined: 31 Aug 04
Posts: 2
Credit: 21,646,385
RAC: 10,617
Message 13016 - Posted: 1 Jun 2005, 17:42:24 UTC - in response to Message 13008.  

> Talister, that does sound strange. If I have got it correct, that is a dual
> opteron and one processor is trickling but the other isn't. My guess is that
> it tried to trickle but that trickle had been completed by someone else first
> so it was rejected. I would abort that WU.

I was a bit surprised that it had downloaded another model as the trickling one had just come to the end of phase 2. Unless BOINC thinks end of phase=100%=need a new model. I've aborted the non-trickling one, it had only got a few trickles into phase 1 and Opterons are fast ;-)
old_user304

Joined: 6 Aug 04
Posts: 5
Credit: 29,588
RAC: 0
Message 13017 - Posted: 1 Jun 2005, 17:52:17 UTC - in response to Message 12691.  

None of my trickles on this unit are giving me credit: 2m04_300143060_1

host ID: 169066
result ID: 855609


crandles
Volunteer moderator

Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 13116 - Posted: 4 Jun 2005, 13:26:38 UTC
Last modified: 4 Jun 2005, 13:27:26 UTC

I have changed the first post again, hopefully for extra clarity. I have also removed the requests for information on affected results, unless a report shows some new aspect of the problem or a different affected time period. So please still report if you have an affected resultid less than 719000 or more than 890000.

Hopefully there shouldn't be any new cases.
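
A minimal sketch of the reporting rule above, for anyone checking a list of result IDs. The function name and constant names are made up for illustration; only the two thresholds (719000 and 890000) come from this post, and the "affected" judgement itself still has to be made by looking at the result and trickle pages.

```python
# Thresholds quoted in the post; the names are hypothetical.
KNOWN_RANGE_LOW = 719000
KNOWN_RANGE_HIGH = 890000


def worth_reporting(result_id: int) -> bool:
    """Return True if an affected result ID falls outside the already-known range."""
    return result_id < KNOWN_RANGE_LOW or result_id > KNOWN_RANGE_HIGH


if __name__ == "__main__":
    # Result IDs mentioned elsewhere in this thread, used only as sample input.
    for rid in (622983, 857093, 890190):
        print(rid, "report" if worth_reporting(rid) else "already covered")
```

This only tells you whether an affected result is outside the range already known about; it does not detect misallocation.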
Friedrich S.

Joined: 22 Jan 05
Posts: 41
Credit: 4,606,346
RAC: 1,409
Message 13136 - Posted: 5 Jun 2005, 22:32:50 UTC - in response to Message 12691.  
Last modified: 5 Jun 2005, 22:33:10 UTC

I am not sure if I should report this, but the following applies:

> 2. If there is a WU in your list of results <b>AND</b> there is work done by
> another computer.

> <b>Edits 4th June</b>
> Removed instructions to report resultids, host numbers and WU names. We would
> still be grateful for any reports that indicate a new aspect or different
> affected time period, e.g. affected resultids less than 719000 or greater than
> 890000.

I don't compute it but I get credits for:
Result ID: 890190
WU ID: 592420
Host ID: 88435

The trickles are listed for Host ID: 173084
(Who probably does the computing...)

Friedrich
old_user8702

Joined: 2 Sep 04
Posts: 2
Credit: 14,688
RAC: 0
Message 13138 - Posted: 5 Jun 2005, 23:36:43 UTC

BOINC is trying to get a new CPDN WU, but no joy.

climateprediction.net - 2005-06-05 21:06:58 - Sending request to scheduler: http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
climateprediction.net - 2005-06-05 21:07:01 - Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
climateprediction.net - 2005-06-05 21:07:01 - No work from project
climateprediction.net - 2005-06-05 21:07:01 - Deferring communication with project for 15 minutes and 12 seconds

Is there really no work?

old_user5994

Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 13148 - Posted: 6 Jun 2005, 5:29:10 UTC

I've got the same problem ... I ran one computer dry and have been waiting for it to get the debts right, and now I can't get work :(

sigh ...
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 13150 - Posted: 6 Jun 2005, 5:36:26 UTC

Others have asked the same elsewhere. It looks as though all this "mis-allocated - kill it" business has dried up the pool. (Apologies to Aussies for putting it this way.)
Oxford will be open in a few hours, so as soon as Neil has had a good strong cup of coffee .....

Les


old_user2467

Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 13153 - Posted: 6 Jun 2005, 8:15:14 UTC

In addition to the misallocated WU problem, I have observed over the last few days that credits are not being calculated correctly. This new problem started with the rebuild of the server software. For all hosts, all trickles that are sent to CPDN are shown, but for some hosts the granted credit doesn't match the number of trickles times ~94.52. As a result, the total credit isn't correct either.

Ciao
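
The check described in this post can be sketched roughly as follows. This is only an illustration: the ~94.52 credits per trickle is the approximate figure quoted above, while the function names and the tolerance of one credit per trickle are assumptions made for the example, not anything taken from the CPDN server code.

```python
# Approximate credit per trickle, as quoted in the post above.
CREDIT_PER_TRICKLE = 94.52


def expected_credit(trickle_count: int) -> float:
    """Credit a host would be expected to have for the given number of trickles."""
    return trickle_count * CREDIT_PER_TRICKLE


def credit_mismatch(trickle_count: int, granted_credit: float,
                    tolerance_per_trickle: float = 1.0) -> bool:
    """Return True if granted credit is further from the expected value than
    the allowed tolerance (hypothetical: one credit per trickle, because the
    per-trickle figure is itself only approximate)."""
    allowed = tolerance_per_trickle * max(trickle_count, 1)
    return abs(granted_credit - expected_credit(trickle_count)) > allowed


if __name__ == "__main__":
    # Sample input: 9 trickles shown but no credit granted.
    print(credit_mismatch(9, 0.0))
```

The trickle count and granted credit would have to be read by hand from the trickle and result pages for each host.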
Thyme Lawn
Volunteer moderator

Send message
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 13154 - Posted: 6 Jun 2005, 8:26:38 UTC - in response to Message 13136.  

> > <b>Edits 4th June</b>
> > Removed instructions to report resultids, host numbers and WU names. We would
> > still be grateful for any reports that indicate a new aspect or different
> > affected time period, e.g. affected resultids less than 719000 or greater than
> > 890000.
>
> I don't compute it but I get credits for:
> Result ID: 890190
> WU ID: 592420
> Host ID: 88435
>
> The trickles are listed for Host ID: 173084
> (Who probably does the computing...)

That one was sent out before the upgrade was applied, Friedrich.

It looks like the first post-upgrade result is 904515, so that's the actual cut-off point.