Thread 'Announcement: Database residual problem - misallocated WUs'

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 13155 - Posted: 6 Jun 2005, 8:38:15 UTC - in response to Message 13153.  

> In addition to the misallocated WU problem, I have observed over the last few
> days that credits are not calculated correctly. This new problem started with
> the rebuild of the server software. For all hosts, all trickles sent to CPDN
> are shown, but for some hosts the granted credit doesn't match the number of
> trickles times ~94.52. Consequently, the total credit isn't correct either.

Can you point out any examples of this behaviour? I don't see this on <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=859271">result id 859271</a>, which has generated 6 trickles since starting on Friday evening and has 567.11 credits.
ID: 13155
old_user2467
Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 13158 - Posted: 6 Jun 2005, 9:37:47 UTC - in response to Message 13155.  
Last modified: 6 Jun 2005, 10:53:26 UTC

> Can you point out any examples of this behaviour? I don't see this on
> <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=859271">result id 859271</a>, which has generated 6 trickles since
> starting on Friday evening and has 567.11 credits.

Hi,

Currently these are the affected results:
<a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=872328">Result 872328</a>: 28 trickles and 2551.97 of ~2646.56 credits

<a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=780535">Result 780535</a>: 34 trickles and 3119.08 of ~3213.68 credits

<a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=698345">Result 698345</a>: 60 trickles and 5576.53 of ~5671.20 credits

<a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=677468">Result 677468</a>: 72 trickles and 6710.74 of ~6805.26 credits

and some other results...

Normally the credits are corrected when an additional trickle comes in.

Ciao

ID: 13158
crandles
Volunteer moderator
Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 13160 - Posted: 6 Jun 2005, 10:30:20 UTC

The credits are calculated every 4 hours so I think it is quite normal for there to be an extra trickle that hasn't been counted yet. Some of those differences are not quite 94.52 but that is because it is really 94.5175.

780535 trickled last 5 June 21:37 hmm - time difference? doubt it.

Don't understand 677468 either. Wonder if Carl's recalc is off.
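
To make the arithmetic concrete, here is a minimal sketch of the comparison being discussed. It assumes a flat 94.5175 credits per trickle (the figure quoted above) and uses the trickle counts and granted credits reported earlier in this thread. The function names are illustrative only and are not part of BOINC or the CPDN server code.

```python
# Assumption: every trickle is worth a flat 94.5175 credits, as quoted above.
CREDIT_PER_TRICKLE = 94.5175


def expected_credit(trickles: int) -> float:
    """Credit one would expect if every reported trickle had been counted."""
    return round(trickles * CREDIT_PER_TRICKLE, 2)


def uncredited_trickles(trickles: int, granted: float) -> int:
    """Estimate how many reported trickles are missing from the granted credit."""
    credited = round(granted / CREDIT_PER_TRICKLE)
    return trickles - credited


# (result id, trickles shown, credit granted) as reported in this thread
reports = [
    (872328, 28, 2551.97),
    (780535, 34, 3119.08),
    (698345, 60, 5576.53),
    (677468, 72, 6710.74),
]

for result_id, trickles, granted in reports:
    missing = uncredited_trickles(trickles, granted)
    print(
        f"Result {result_id}: granted {granted:.2f}, "
        f"expected about {expected_credit(trickles):.2f}, "
        f"{missing} trickle(s) not yet credited"
    )
```

Under that assumption, each of the four reported results comes out exactly one trickle short, which fits the idea that only the most recent trickle has not been counted yet.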
ID: 13160
Honza
Volunteer moderator
Joined: 5 Aug 04
Posts: 390
Credit: 2,475,242
RAC: 0
Message 13161 - Posted: 6 Jun 2005, 10:32:45 UTC

Hmm, that's an interesting point: checking results and misallocated WUs by credit (or by trickle). It may be that 'only' part of the WU got misallocated.

Result 807562: 33 trickles and 3024.56 of ~3119.08 credits; the last trickle has not been credited.

smudodd - note that the examples you have given follow the same scenario: only the last trickle has not been credited.
ID: 13161
old_user2467
Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 13163 - Posted: 6 Jun 2005, 10:51:14 UTC - in response to Message 13160.  
Last modified: 6 Jun 2005, 10:52:18 UTC

> 780535 trickled last 5 June 21:37 hmm - time difference? doubt it.

No time difference (I am at UTC+2 hours), so after at least 4 hours it should have been fixed. It will (hopefully) be fixed when this host sends in another trickle. But this behaviour is not normal, in the sense that it did not happen before the server software was updated.

And an update to my last post:
Result 872328: 28 trickles and 2551.97 of ~2646.56 credits

Host <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=86867">86867</a> trickled this morning (in Europe), and the trickle that had not been credited until then has now been credited.

So I don't know. Is it possible that the validator or scheduler lags behind for some time, or that some reported trickles are lost, and that when the next incremental trickle is handed in the validator gives credit up to the highest known trickle?

Ciao
ID: 13163
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 13166 - Posted: 6 Jun 2005, 12:13:23 UTC - in response to Message 13158.  

With the exception of result 677468 (which I can't explain either) your credits now look to be right

Result 872328: 29 trickles, 2741.01 credits
Result 780535: 35 trickles, 3308.11 credits
Result 698345: 61 trickles, 5765.57 credits
ID: 13166
old_user2467
Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 13172 - Posted: 6 Jun 2005, 12:49:35 UTC - in response to Message 13166.  

> With the exception of result 677468 (which I can't explain either) your
> credits now look to be right
>
> Result 872328: 29 trickles, 2741.01 credits
> Result 780535: 35 trickles, 3308.11 credits
> Result 698345: 61 trickles, 5765.57 credits

As you can see, all hosts have trickled once since my first post. Feel free to have a look at these results this evening or tomorrow, when some hosts will have trickled again but the credit will not be granted until one further trickle comes in.

It is not much of a problem for me, as it seems that things are corrected when further trickles arrive, and the credits are worth less than the science behind them. But if a result were close to finishing, I would feel better if everything were running as it should. You may feel differently, but I have the strange feeling that something isn't running correctly and that we will hear more complaints.

Ciao
ID: 13172
old_user4187
Joined: 31 Aug 04
Posts: 3
Credit: 94,401
RAC: 0
Message 13174 - Posted: 6 Jun 2005, 14:39:28 UTC

I have been wondering why my total credits have not been updated for the past 4 days, and now that I have come across this thread I understand the problem.

When I had a look at my results I noted that the trickles I've been sending in for 2sns_400151773_0 are linked to Host ID 61619.

I sent the last trickle on 5 Jun 05: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=844975

If you follow it, you can see that the Host ID on that page shows 61619, whereas my host ID is 165388.

If someone is looking into these cross-linked DB problems, I hope this information helps.

Ray Doiron
ID: 13174
old_user28498
Joined: 4 Nov 04
Posts: 16
Credit: 11,577,003
RAC: 0
Message 13278 - Posted: 9 Jun 2005, 9:52:03 UTC

Do we still have problems with misallocated WUs?

My host 158490 has, allegedly, received result 923715, 001u_600025051_1 (created 7 Jun 2005 21:18:18 UTC, sent 8 Jun 2005 23:15:12 UTC). However, the host does not have any file named 001u_600025051.zip in its climateprediction.net directory where WUs are queued for crunching.

If I see credits coming I'll check the hostid reported by the trickles.

LS

ID: 13278
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 13282 - Posted: 9 Jun 2005, 11:41:35 UTC - in response to Message 13278.  

> Do we still have problems with misallocated WUs?

The recent server upgrade should prevent that from happening again.

> My host 158490 has, allegedly, received result 923715, 001u_600025051_1
> (created 7 Jun 2005 21:18:18 UTC, sent 8 Jun 2005 23:15:12 UTC). However, the
> host does not have any file named 001u_600025051.zip in its
> climateprediction.net directory where WUs are queued for crunching.

That will happen if the scheduler reply telling you to download that result went missing. If you look at your stdout.txt file you should see a request for work with no scheduler response followed by a request and reply that caused you to download 08rv_000016322.

There is a pending BOINC change to align the work downloaded by a host with what the server database has allocated to it.
ID: 13282
old_user28498
Joined: 4 Nov 04
Posts: 16
Credit: 11,577,003
RAC: 0
Message 13284 - Posted: 9 Jun 2005, 13:03:58 UTC

Many thanks Thyme Lawn. I was looking for something like stdout or stderr to check if there were messages clarifying this issue, but without any luck. I do not have a stdout file in either the BOINC directory or the climateprediction.net one (BOINC 4.19, hadsm 4.13, Linux). If this is just a scheduler matter, then all is well.

LS

ID: 13284
old_user66137
Joined: 22 Mar 05
Posts: 2
Credit: 371,502
RAC: 0
Message 13343 - Posted: 11 Jun 2005, 20:45:20 UTC

There is still a problem with misallocated WUs, even outside the range you've noted. I am HostID 138185. For the last 10 days, I was working on ResultID 895452, which relates to WU 582416. I have sent 22 trickles from this WU, with granted credit of 2079.39. However, none of this information shows up on my results page - there is no mention of the ResultID or WUID on the list, and it appears as if I have done nothing between 2nd June (when I finished my last unit) and today (when I have reset the project, thinking something had gone wrong somewhere.)

Now, if you look at page http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=895452, you will see the data about this WU. However, it is allocated to HostID 49231. If you click on the "Trickle # 22" link, you will see that the trickles have been submitted between 2nd and 11th June. However, if you follow the link to HostID 49231, you'll see that they haven't sent any trickles since 15th May. Now, when you go to my hostID page (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=138185) you'll see that the latest trickles show the HostID of this ResultID as "hidden". So something has gone wrong with this somewhere and it is hiding the HostID (and getting it wrong).
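
The check described here (compare your own host ID with the host ID shown on the result page) can be sketched as follows. This is only an illustration: the `ResultReport` type and `is_misallocated` function are hypothetical names, the values are the ones users reported in this thread, and nothing here queries the CPDN site.

```python
from dataclasses import dataclass


@dataclass
class ResultReport:
    """One user's observation, copied by hand from the result page."""
    result_id: int
    own_host_id: int        # the host that is actually crunching the result
    displayed_host_id: int  # the host ID shown on the result page


def is_misallocated(report: ResultReport) -> bool:
    """A result counts as misallocated when the page shows someone else's host."""
    return report.own_host_id != report.displayed_host_id


# Cases reported so far in this thread
reports = [
    ResultReport(result_id=844975, own_host_id=165388, displayed_host_id=61619),
    ResultReport(result_id=895452, own_host_id=138185, displayed_host_id=49231),
]

for report in reports:
    if is_misallocated(report):
        print(
            f"Result {report.result_id}: crunched by host {report.own_host_id} "
            f"but shown against host {report.displayed_host_id}"
        )
```

Running this kind of comparison before resetting a project would at least show whether the work is being recorded somewhere, even if against the wrong host.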

The unfortunate thing is, since I couldn't see anything appearing on my results page, I thought that the results weren't getting through to you, so reset the project. This means that the 209 hours of work that I have done has now been wasted, which is a pity for your project as it will have to be started again by someone else. Unfortunately I didn't think to check the forums before resetting the project, only later. It may be worth noting that SETI and Einstein, which I also use, have useful information on their front page that I look at when there are problems - with no similar page on here, I don't have anything obvious to check against. It may be worth creating something like this for the future?

I hope this is helpful to you in tracking down this problem.

Kind regards,

Laluki.
ID: 13343
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 13344 - Posted: 11 Jun 2005, 21:10:03 UTC - in response to Message 13343.  

> There is still a problem with misallocated WUs, even outside the range you've
> noted. I am HostID 138185. For the last 10 days, I was working on ResultID
> 895452, which relates to WU 582416. I have sent 22 trickles from this WU, with


Previously in this thread, Thyme posted this:

"It looks like the first post-upgrade result is 904515, so that's the actual cut-off point."
ID: 13344
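
The cut-off quoted in the post above can be applied mechanically. The sketch below assumes that result IDs are handed out in increasing order, so that a plain numeric comparison against 904515 separates pre-upgrade from post-upgrade results; that ordering is an assumption, not something confirmed in this thread, and the function name is illustrative.

```python
# First post-upgrade result ID, as quoted from Thyme Lawn above.
FIRST_POST_UPGRADE_RESULT = 904515


def predates_upgrade(result_id: int) -> bool:
    """True if the result was created before the server upgrade.

    Assumes result IDs increase monotonically over time.
    """
    return result_id < FIRST_POST_UPGRADE_RESULT


# Result IDs mentioned in this thread
for result_id in (895452, 923715):
    status = "pre-upgrade" if predates_upgrade(result_id) else "post-upgrade"
    print(f"Result {result_id}: {status}")
```

On that reading, result 895452 falls before the cut-off, so it can still be affected by the misallocation, while result 923715 was created after the upgrade.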
old_user3401
Joined: 30 Aug 04
Posts: 1
Credit: 61,134
RAC: 0
Message 13428 - Posted: 14 Jun 2005, 5:48:38 UTC

Thought I'd report another misallocated work unit in case you still need data. I've been working on 2uxz_300154762_0. My host ID is 54134 and I'm at timestep 194436, phase 2. The result ID is 847965. This work unit appears to be credited to host ID 147433, although it is showing a client error. My machine is still crunching away.
ID: 13428
old_user13614
Joined: 6 Sep 04
Posts: 6
Credit: 195,123
RAC: 0
Message 13471 - Posted: 15 Jun 2005, 17:27:35 UTC

My problem work unit has now completed. On the results page, you can get the graphs for phases 2 and 3, but not for phase 1, nor for the full run.

I did back up the work unit folder last night just before completion, so if anything needs to be redone, it should be possible to pick up most of the way through the unit.

All the messages in my client seem to indicate that the work unit uploaded successfully.

Hope all is working smoother now.
ID: 13471
old_user14735
Joined: 7 Sep 04
Posts: 14
Credit: 160,054
RAC: 0
Message 13547 - Posted: 18 Jun 2005, 12:17:13 UTC

I'm currently processing a misallocated work unit (the work unit name in BOINC is 2vvo_300155986_1) which I'm over 50% of the way through. I don't want to kill it if the science will be useful and I don't particularly care whether I get credits or not for this one. Reading the guidelines it seems like I'd still be better to kill it off if it's being processed by someone else though. However, I'm just a little bit confused about how to tell whether my WU is being processed elsewhere or not...

I've looked at the trickles that are recorded against my account and, from that list, clicked on the result ID (864094).

This shows the name I'd expect (2vvo_300155986_1). When I click the link for the work unit (568245) this shows me a screen where the work unit name is 2vvo_300155986 : exactly the same as mine except without the _1 suffix at the end of the name. Two hosts are marked as crunching it, neither of them me, but both have crashed with a computing error. So on the face of it I'm better to continue assuming 2vvo_300155986 and 2vvo_300155986_1 are really the same thing.
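
The relationship between the two names can be sketched like this. It assumes, based only on the names seen in this thread, that a work unit name has two underscore-separated parts and that a result name is the work unit name plus a numeric suffix such as `_0` or `_1`. The helper names are illustrative, not BOINC functions.

```python
def workunit_name(result_name: str) -> str:
    """Strip the trailing numeric suffix from a result name, if there is one.

    Assumes work unit names look like '2vvo_300155986' and result names
    look like '2vvo_300155986_1'.
    """
    parts = result_name.split("_")
    if len(parts) == 3 and parts[2].isdigit():
        return "_".join(parts[:2])
    return result_name


def same_workunit(name_a: str, name_b: str) -> bool:
    """True if both names refer to the same underlying work unit."""
    return workunit_name(name_a) == workunit_name(name_b)


print(workunit_name("2vvo_300155986_1"))
print(same_workunit("2vvo_300155986", "2vvo_300155986_1"))
```

If that naming assumption holds, the name shown on the work unit page and the name shown in BOINC describe the same work unit, with the suffix distinguishing the individual copies sent out.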

If I click on the host ID from the page for result 864094 I see a machine where the latest trickles are for an entirely different work unit. In fact it's one of the machines that crashed on 2vvo_300155986.

So it looks like I ought to carry on with this unit.

Is this analysis correct, so that I should continue with 2vvo_300155986_1, or should I kill it off?
ID: 13547
crandles
Volunteer moderator
Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 13552 - Posted: 18 Jun 2005, 15:25:30 UTC

Yes you should be fine continuing that WU if you don't mind about credits or are willing to continue and hope the credits will get sorted at some stage.

There is no need to worry about the same work unit being sent out under a different resultid. As you say the computer that did complete the first trickle is now processing a different WU (and it isn't a hyperthreaded processor either).
ID: 13552
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 13576 - Posted: 19 Jun 2005, 8:33:27 UTC

From the problems Heiner explained here :

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2746

it looks very much as if those misallocated results can cause more database trouble when merging hosts.
ID: 13576
crandles
Volunteer moderator
Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 13580 - Posted: 19 Jun 2005, 16:06:54 UTC

Sorry I haven't understood your German.

Are you sure it is misallocated WUs causing trouble rather than just database trouble causing problems with merging hosts at present?
ID: 13580
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 13581 - Posted: 19 Jun 2005, 19:26:42 UTC
Last modified: 19 Jun 2005, 19:29:12 UTC

Heiner tried to merge two hosts. This failed, probably because the WUs or the results are connected to a different host.

It is very likely that "Couldn't update results" is a problem with the SQL itself so there might be a connection to the misallocated results. One would have to check the SQL error text though but we don't have that.

He has credits but no trickles on one host so he is affected by the problem this thread refers to.
________

Most of my reply to Heiner was about a different problem: he also had a swap space error in one result, and that was easier to explain in German because of the German Windows settings.
ID: 13581