Message boards : Number crunching : Announcement: Database residual problem - misallocated WUs
Author | Message |
---|---|
Send message Joined: 31 Aug 04 Posts: 4 Credit: 52,004 RAC: 0 |
Hi there. My computer (Host 7970) has just finished Result name 1wgn_200109632_1, Result ID 622983. The result doesn't appear to have been received. It is now just over 5% of the way through Result name 42oc_100211998_0, Result ID 886824, with no credit accruing. Clicking on the Result ID reveals that the host is 71110. Also, stderr out shows: 4.25 Il n'y a pas de processus enfant à attendre. (0x80) - exit code 128 (0x80) [French: "There are no child processes to wait for."] Is that significant? Moose. |
Send message Joined: 28 Aug 04 Posts: 90 Credit: 2,736,552 RAC: 0 |
Have a look at this result 857093 (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/trickle.php?resultid=857093). It is listed to host 86867, but until yesterday host 5957 was trickling on it, and now it is host 103138. Very strange, and it shows that things are going more and more wrong. Ciao |
Send message Joined: 21 Oct 04 Posts: 24 Credit: 207,633 RAC: 0 |
> "Have a look at this result 857093 (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/trickle.php?resultid=857093) it is listed to host 86867 but 'till yesterday there was host 5957 trickling on it but now it is host 103138. Very strange it is and showing that things go more and more wrong. Ciao" IMO: Host 103138 is the 'owner' of this result, and somehow the trickling changed back. greetz littleBouncer |
Send message Joined: 28 Aug 04 Posts: 90 Credit: 2,736,552 RAC: 0 |
> > "Have a look at this result 857093 (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/trickle.php?resultid=857093) it is listed to host 86867 but 'till yesterday there was host 5957 trickling on it but now it is host 103138. Very strange it is and showing that things go more and more wrong. Ciao" > "IMO: Host 103138 is the 'owner' of this result, and somehow the trickling changed back. greetz littleBouncer" The "owner" as shown on the result page (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=857093) is my host 86867. But I'm not working on it. |
Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
That 857093 is the first I have seen allocated 3 times. It is allocated to host 86867, which is not running it. Hosts 103138 and probably 5957 are running it. Ideally host 5957 (Damitch) would abort this model. I know of no way to make contact. (Had there been a team, I could have tried the team web site.) |
Send message Joined: 1 Sep 04 Posts: 55 Credit: 17,223,688 RAC: 967 |
I have a similar problem with my host 71544, which seems to have hijacked result 865168 from host 92060. It has contributed 9 trickles since 26th May but received no credit for them, and the result is not listed in its result list. Should I abort the result by resetting the project? Derrick Ashby |
Send message Joined: 31 Aug 04 Posts: 2 Credit: 21,646,385 RAC: 10,617 |
I have a workunit 3woq_100204159 which has been downloaded by my host 85752 with 041034 timesteps completed, but it hasn't trickled and isn't listed in the results section. I was watching it when it came up to the trickle at 32406, and the timestep jumped from 32401 to 32407, so it never trickled. |
Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
Talister, that does sound strange. If I have got it correct, that is a dual opteron and one processor is trickling but the other isn't. My guess is that it tried to trickle but that trickle had been completed by someone else first so it was rejected. I would abort that WU. |
Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
Dajashby & Josgre It seems a pity to waste models that are half complete if they can be saved. I just don't know if they can be saved from the point of view of the science or of the credits. If you think someone else is running them, abort may well be a preferable option. If there is no sign of trickles from other machines then I am a bit stumped as to what to say. I don't want to give false hope OTOH I also do NOT want to say/give impression that you may as well give up on such units. |
Send message Joined: 6 Sep 04 Posts: 6 Credit: 195,123 RAC: 0 |
Just to check a bit further - the misallocated WU I reported above (WU 565036) has now progressed into phase 2. On the results page, however, the graphs for temperature and precipitation in phase 1 are not available as links. In addition, on the results page, the "stderr out" does have a value indicating that the unit failed with exit code 5. My machine continues to crunch, and the trickles keep piling up. My suspicion is that the important data is all here on my machine - but the bogus exit code (perhaps submitted by another user on this unit - perhaps just some other weird glitch) is preventing the graphs from being available. I am happy to do whatever is most appropriate (abandon the model / stop and back up the data until the problem can be addressed / just keep crunching / ...) and realize that no one may really have the best answer at this point. If someone wants to contact me directly, my email is djd at isd dot net (slightly disguised from the spammers' spiders.) Thanks, |
Send message Joined: 31 Aug 04 Posts: 2 Credit: 21,646,385 RAC: 10,617 |
> "Talister, that does sound strange. If I have got it correct, that is a dual opteron and one processor is trickling but the other isn't. My guess is that it tried to trickle but that trickle had been completed by someone else first so it was rejected. I would abort that WU." I was a bit surprised that it had downloaded another model as the trickling one had just come to the end of phase 2. Unless BOINC thinks end of phase=100%=need a new model. I've aborted the non-trickling one, it had only got a few trickles into phase 1 and Opterons are fast ;-) |
Send message Joined: 6 Aug 04 Posts: 5 Credit: 29,588 RAC: 0 |
None of my trickles are giving me credit on this unit: 2m04_300143060_1, host ID: 169066, result ID: 855609 |
Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
I have changed the first post again. Extra clarity hopefully and I have removed requests for information on affected results, unless it shows some new aspect of the problem or it shows a different affected time period. So still report if you have a resultid less than 719000 or more than 890000 that is affected. Hopefully there shouldn't be any new cases. |
Send message Joined: 22 Jan 05 Posts: 41 Credit: 4,606,346 RAC: 1,409 |
I am not sure if I should report this, but the following applies: > "2. If there is a WU in your list of results AND there is work done by another computer." > "Edits 4th June: Removed instructions to report resultids, host numbers and WU names. We would still be grateful for any reports that indicate a new aspect or different affected time period eg affected resultids less than 719000 or greater than 890,000." I don't compute it but I get credits for: Result ID: 890190, WU ID: 592420, Host ID: 88435. The trickles are listed for Host ID: 173084 (who probably does the computing...) Friedrich |
Send message Joined: 2 Sep 04 Posts: 2 Credit: 14,688 RAC: 0 |
Boinc is trying to get new cp wu but no joy. climateprediction.net - 2005-06-05 21:06:58 - Sending request to scheduler: http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi climateprediction.net - 2005-06-05 21:07:01 - Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded climateprediction.net - 2005-06-05 21:07:01 - No work from project climateprediction.net - 2005-06-05 21:07:01 - Deferring communication with project for 15 minutes and 12 seconds Is there really no work? |
Send message Joined: 31 Aug 04 Posts: 239 Credit: 2,933,299 RAC: 0 |
I've got the same problem ... I ran one computer dry and have been waiting for it to get the debts right, and now I can't get work :( sigh ... |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Others have asked the same elsewhere. It looks as though all this "mis-allocated - kill it" business has dried up the pool. (Apologies to Aussies for putting it this way.) Oxford will be open in a few hours, so as soon as Neil has had a good strong cup of coffee ..... Les |
Send message Joined: 28 Aug 04 Posts: 90 Credit: 2,736,552 RAC: 0 |
In addition to the misallocated WU problem, I have observed over the last few days that credits are not calculated correctly. This new problem started with the rebuild of the server software. For all hosts, all trickles that are sent to cpdn are shown, but for some hosts the granted credit doesn't match the number of trickles times ~94.52. As a result, the total credit isn't correct either. Ciao |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
> > "Edits 4th June: Removed intructions to report resultids, host numbers and WU names. We would still be grateful for any reports that indicate a new aspect or different affected time period eg affected resultids less than 719000 or greater than 890,000." > "I don't compute it but I get credits for: Result ID: 890190 WU ID: 592420 Host ID: 88435. The trickles are listed for Host ID: 173084 (Who probably does the computing...)" That one was sent out before the upgrade was applied, Friedrich. It looks like the first post-upgrade result is 904515, so that's the actual cut-off point. |
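
For anyone wanting to check their own hosts against the credit rule mentioned in this thread (granted credit should be roughly the number of trickles times ~94.52), here is a minimal Python sketch. The 94.52 figure is only the approximate value quoted in the thread; the function names and the 1% tolerance are assumptions made up for illustration and are not part of BOINC or the CPDN server code.

```python
# Approximate credit per trickle as quoted in the thread (an assumption,
# not an official constant).
CREDIT_PER_TRICKLE = 94.52


def expected_credit(trickles, per_trickle=CREDIT_PER_TRICKLE):
    """Credit a result should have earned for a given number of trickles."""
    return trickles * per_trickle


def credit_shortfall(trickles, granted, rel_tolerance=0.01):
    """Return the missing credit, or 0.0 if the granted credit is within
    rel_tolerance (hypothetical 1% default) of the expected value."""
    expected = expected_credit(trickles)
    gap = expected - granted
    if abs(gap) <= rel_tolerance * expected:
        return 0.0
    return gap


if __name__ == "__main__":
    # Example: 9 trickles reported, no credit granted.
    print(round(credit_shortfall(9, 0.0), 2))
```

Read the trickle count and granted credit off the result page by hand and feed them in; a non-zero return value suggests the result is one of those with missing credit.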
©2024 cpdn.org