Incorrect credit allocation
Author | Message |
---|---|
Joined: 6 Aug 04 Posts: 58 Credit: 1,286,603 RAC: 0 |
Looks like the recent outage has caused another problem with misallocated credits. I wondered why the trickles from one of my machines weren't showing up in its "received trickles" list (I know the rest of the stats are broken, but trickle acknowledgement seems to be OK), and the reason is that the trickles are being allocated to another machine. Run ID: 872375. My box (321) is shown as trickling up to 19th June; computer ID 5336 is shown as trickling since then. What's the best thing to do? I'm tempted to abort the run. Ian |
Joined: 26 Aug 04 Posts: 67 Credit: 10,235,815 RAC: 9,398 |
Hi there. If you are doing a run that appears to be allocating credit to another machine, it looks as if you possibly have a new variant of the problem in this <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2577">thread</a>. I see from the results table what you mean: up to Ph2, T/S 10802 on 19/06 it's allocated to your machine 321; thereafter, from Ph2, T/S 21604 on 20/06, the same result ID is being allocated to <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/trickle.php?resultid=872375">machine ID 5336</a> belonging to <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_user.php?userid=2790">BakerB</a>. The correct interpretation and solution can only come from the Project IT guys. Pete |
Joined: 6 Aug 04 Posts: 58 Credit: 1,286,603 RAC: 0 |
Thanks for confirming that there's something strange with that result ID. > "If you are doing a run that appears to be allocating credit to another machine, it looks as if you possibly have a new variant of the problem in this <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2577">thread</a>" I've previously read that thread a number of times and I'm still not sure exactly what problems it is describing or whether my problem is related. I thought it worth starting a new thread, though, as the one you quote was related to the first outage (end of May) and not the current one. Thanks, Ian <img src='http://www.boincsynergy.com/images/stats/comb-942.jpg'> |
Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
This is a misallocated WU problem. I see nothing different from other cases. The resultid has been sent to 2 computers (5336 and 321). It is allocated to your computer, 321. 5336 started later but is computing faster and has overtaken you. We don't want 2 computers completing the same resultid (to avoid losing the science). Therefore I would suggest at least suspending the work unit. When they start computing the credit again you may get credit for the work 5336 has done, but at some stage it is possible that they will 'fix' the problem and give the credit for this resultid to whoever got furthest, i.e. 5336. (That would be a bad-news fix for you but generally much fairer for most people.) So continuing or not will likely not affect the credits you get. So, whether for the science or the credit, there is little point in continuing with that resultid. If not continuing, is it best to suspend, abort or reset? With BOINC v4.19, reset is by far the easiest but should be avoided on a hyperthreaded computer. Suspend has an advantage: if 5336 happens to crash the model, you could then usefully complete it. The problem with suspend: if you just suspend rather than abort, can you get another WU? (You might try setting 'connect at most every x days' high, but try to avoid connecting to other projects with that high setting.) HTH |
Joined: 6 Aug 04 Posts: 58 Credit: 1,286,603 RAC: 0 |
> "This problem is a misallocated WU problem. I see nothing different from other cases." OK, thanks. > "The resultid has been sent to 2 computers (5336 and 321). It is allocated to your computer 321. 5336 started later but is computing faster and has overtaken you." That makes sense. It would also explain why the WU doesn't show up in 5336's work list: I got it first :-) I suspended when I saw the problem, and the machine is happily crunching another WU (it has a problem with 4.45 and an ever-increasing short-term debt; in trying to sort that out I had already downloaded another two WUs). I'll leave it in that state until 5336 finishes, but might that machine not have a problem when it tries to upload the result if the machine ID doesn't match the one the WU was allocated to? Ian <img src='http://www.boincsynergy.com/images/stats/comb-942.jpg'> |
Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
>> "This problem is a misallocated WU problem. I see nothing different from other cases." > "OK, thanks." >> "The resultid has been sent to 2 computers (5336 and 321). It is allocated to your computer 321. 5336 started later but is computing faster and has overtaken you." > "That makes sense. It would also explain why the WU doesn't show up in 5336's work list - I got it first :-)" (Not that it matters, but I suspect it was sent to 5336 first but failed to get allocated to any host, probably because of a database timeout or lack of hard disk space; BOINC has since been fixed to stop this happening. This left it available to be sent and allocated to you. It just stayed waiting to be started on 5336's computer longer than it was waiting on yours. It doesn't show up on 5336's list because it isn't allocated to 5336.) > "I suspended when I saw the problem and the machine is happily crunching another WU (it has a problem with 4.45 and an ever increasing short term debt, in trying to sort that out I had already downloaded another two WUs). I'll leave it in that state until 5336 finishes - but might not that machine have a problem when it tries to upload the result if the machine id doesn't match the one the WU was allocated to?" Indications are that it uploads without problems. |
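
For anyone who wants to check their own trickle history for the same symptom, here is a minimal sketch of the idea discussed in this thread: flag any result ID that has trickles credited to more than one host, and see which host has got furthest. The record layout `(result_id, host_id, phase, timestep)`, the function names, and the third record are assumptions made for illustration; this is not the project's actual database schema or credit logic. The first two records use the figures quoted above. Comparing `(phase, timestep)` pairs to decide who is "furthest" is also an assumption.

```python
from collections import defaultdict

# Hypothetical record layout: (result_id, host_id, phase, timestep).
TRICKLES = [
    (872375, 321, 2, 10802),   # from the thread: last trickle credited to host 321
    (872375, 5336, 2, 21604),  # from the thread: first trickle credited to host 5336
    (900001, 777, 1, 10802),   # made-up record: a normal result with a single host
]


def find_shared_results(trickles):
    """Map each result ID trickled by more than one host to
    {host_id: furthest (phase, timestep) reported by that host}."""
    progress = defaultdict(dict)
    for result_id, host_id, phase, timestep in trickles:
        point = (phase, timestep)
        best = progress[result_id].get(host_id)
        if best is None or point > best:
            progress[result_id][host_id] = point
    # Keep only results that have trickles from two or more hosts.
    return {rid: hosts for rid, hosts in progress.items() if len(hosts) > 1}


def furthest_host(hosts):
    """Return the host ID with the largest (phase, timestep) pair."""
    return max(hosts, key=hosts.get)


for rid, hosts in find_shared_results(TRICKLES).items():
    print(rid, hosts, "furthest host:", furthest_host(hosts))
```

With the figures from this thread, result 872375 is flagged because both 321 and 5336 have trickled against it, and 5336 comes out as the host that has got furthest, which matches the suggestion above to suspend the run on 321.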