climateprediction.net (CPDN) home page
Thread 'Incorrect credit allocation'

old_user272

Joined: 6 Aug 04
Posts: 58
Credit: 1,286,603
RAC: 0
Message 13860 - Posted: 26 Jun 2005, 7:57:38 UTC

Looks like the recent outage has caused another problem with misallocated credits.

I wondered why the trickles from one of my machines weren't showing up in its "received trickles" list (I know the rest of the stats are broken, but trickle acknowledgement seems to be OK). The reason is that the trickles are being allocated to another machine.

Run ID - 872375
My box (321) shown as trickling up to 19th June
Computer Id 5336 shown as trickling since then.

What's the best thing to do? I'm tempted to abort the run.

Ian
ID: 13860
Pete B

Joined: 26 Aug 04
Posts: 67
Credit: 10,282,288
RAC: 10,545
Message 13862 - Posted: 26 Jun 2005, 8:35:34 UTC
Last modified: 26 Jun 2005, 8:36:58 UTC

Hi there

If you are doing a run that appears to be allocating credit to another machine, it looks as if you may have a new variant of the problem in this <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2577">thread</a>.

I see from the results table what you mean: up to Ph2, T/S 10802 on 19/06 it's allocated to your machine 321; thereafter, from Ph2, T/S 21604 on 20/06, the same result ID is being allocated to <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/trickle.php?resultid=872375">machine ID 5336</a>, belonging to <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_user.php?userid=2790">BakerB</a>.

The correct interpretation and solution can only come from the Project IT guys

Pete
ID: 13862
old_user272

Joined: 6 Aug 04
Posts: 58
Credit: 1,286,603
RAC: 0
Message 13863 - Posted: 26 Jun 2005, 8:56:10 UTC - in response to Message 13862.  

Thanks for confirming that there's something strange with that result ID.

> If you are doing a run that appears to be allocating credit to another
> machine, it looks as if you possibly have a new variant of the problem in this
> <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2577">thread</a>

I've previously read that thread a number of times and I'm still not sure exactly what problems it is describing and whether my problem is related.

I thought it worth starting a new thread though as the one you quote was related to the first outage (end of May) and not the current one.

Thanks

Ian
ID: 13863
crandles
Volunteer moderator

Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 13866 - Posted: 26 Jun 2005, 10:29:21 UTC
Last modified: 26 Jun 2005, 10:33:30 UTC

This problem is a misallocated WU problem. I see nothing different from other cases. The resultid has been sent to 2 computers (5336 and 321). It is allocated to your computer 321. 5336 started later but is computing faster and has overtaken you.
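As a rough illustration of that diagnosis (not project code), here is a minimal Python sketch that spots a result ID whose trickles arrive from more than one host and reports which host has got furthest. The `Trickle` record and both function names are hypothetical, and the real CPDN trickle table almost certainly looks different; the two sample rows are the figures quoted earlier in this thread for result 872375.

```python
from collections import defaultdict
from typing import NamedTuple


class Trickle(NamedTuple):
    # Hypothetical record layout, assumed for illustration only.
    result_id: int
    host_id: int
    phase: int
    timestep: int


def progress_by_host(trickles):
    """Map result ID -> {host ID: furthest (phase, timestep) reported}."""
    progress = defaultdict(dict)
    for t in trickles:
        best = progress[t.result_id].get(t.host_id, (0, 0))
        progress[t.result_id][t.host_id] = max(best, (t.phase, t.timestep))
    return progress


def misallocated(trickles):
    """Return {result ID: host that got furthest} for results trickled by 2+ hosts."""
    found = {}
    for result_id, hosts in progress_by_host(trickles).items():
        if len(hosts) > 1:
            # Tuples compare phase first, then timestep.
            found[result_id] = max(hosts, key=hosts.get)
    return found


# Figures quoted in this thread: host 321 up to Ph2 T/S 10802, host 5336 from Ph2 T/S 21604.
trickles = [
    Trickle(872375, 321, 2, 10802),
    Trickle(872375, 5336, 2, 21604),
]
print(misallocated(trickles))
```

A result that only one host has trickled is left out of the returned dictionary, so an empty result means nothing looks misallocated in the data supplied.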

We don't want 2 computers completing the same resultid (to avoid losing the science). Therefore I would suggest at least suspending the work unit.

When they start computing the credit again, you may get credit for the work 5336 has done, but at some stage it is possible that they will 'fix' the problem and give the credit for this resultid to whoever got furthest, i.e. 5336. (That would be a bad-news fix for you but generally much fairer for most people.)

So continuing or not will likely not affect the credits you get.

So whether for the science or the credit, there is little point in continuing with that resultid.

If not continuing, is it best to suspend or abort or reset?

With BOINC v4.19 reset is by far the easiest but should be avoided for a hyperthreaded computer.

Suspend has advantages: if 5336 happens to crash the model, you could then usefully complete it.

Problem with suspend: if you just suspend as opposed to abort, can you get another WU? (You might try setting "connect at most every x days" to a high value, but try to avoid connecting to other projects with that high setting.)

HTH
ID: 13866
old_user272

Joined: 6 Aug 04
Posts: 58
Credit: 1,286,603
RAC: 0
Message 13868 - Posted: 26 Jun 2005, 11:29:02 UTC - in response to Message 13866.  

> This problem is a misallocated WU problem. I see nothing different from other
> cases.

OK, thanks.

> The resultid has been sent to 2 computers (5336 and 321). It is
> allocated to your computer 321. 5336 started later but is computing faster and
> has overtaken you.

That makes sense. It would also explain why the WU doesn't show up in 5336's work list - I got it first :-)

I suspended when I saw the problem and the machine is happily crunching another WU (it has a problem with 4.45 and an ever-increasing short-term debt; in trying to sort that out I had already downloaded another two WUs). I'll leave it in that state until 5336 finishes - but might not that machine have a problem when it tries to upload the result if the machine ID doesn't match the one the WU was allocated to?

Ian
ID: 13868
crandles
Volunteer moderator

Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 13869 - Posted: 26 Jun 2005, 12:52:41 UTC - in response to Message 13868.  
Last modified: 26 Jun 2005, 12:56:51 UTC

> > This problem is a misallocated WU problem. I see nothing different from other
> > cases.
>
> OK, thanks.
>
> > The resultid has been sent to 2 computers (5336 and 321). It is
> > allocated to your computer 321. 5336 started later but is computing faster and
> > has overtaken you.
>
> That makes sense. It would also explain why the WU doesn't show up in 5336's
> work list - I got it first :-)


(Not that it matters, but I suspect it was sent to 5336 first but failed to get allocated to any host, probably because of a database timeout or a lack of hard disk space; BOINC has now been fixed to stop this happening. This left it available to be sent and allocated to you. It just stayed waiting to be started on 5336 longer than it was waiting on your computer. It doesn't show up on 5336's list because it isn't allocated to 5336.)


> I suspended when I saw the problem and the machine is happily crunching
> another WU (it has a problem with 4.45 and an ever increasing short term debt,
> in trying to sort that out I had already downloaded another two WUs). I'll
> leave it in that state until 5336 finishes - but might not that machine have a
> problem when it tries to upload the result if the machine id doesn't match the
> one the WU was allocated to?.


Indications are that it uploads without problems.


ID: 13869