Announcement: Database residual problem - misallocated WUs
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
> "In addition to the misallocated WU problem, I have observed over the last few days that credits are not calculated correctly. This new problem started with the rebuild of the server software. For all hosts, all trickles that are sent to CPDN are shown, but for some hosts the granted credit doesn't match the number of trickles times ~94.52. Because of this, the total credit isn't correct either." Can you point out any examples of this behaviour? I don't see this on [result id 859271](http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=859271), which has generated 6 trickles since starting on Friday evening and has 567.11 credits. |
Send message Joined: 28 Aug 04 Posts: 90 Credit: 2,736,552 RAC: 0 |
> "Can you point out any examples of this behaviour? I don't see this on result id 859271, which has generated 6 trickles since starting on Friday evening and has 567.11 credits." Hi, currently these are the affected results: [Result 872328](http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=872328): 28 trickles and 2551.97 of ~2646.56 credits; [Result 780535](http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=780535): 34 trickles and 3119.08 of ~3213.68 credits; [Result 698345](http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=698345): 60 trickles and 5576.53 of ~5671.20 credits; [Result 677468](http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=677468): 72 trickles and 6710.74 of ~6805.26 credits; and some other results. Normally the credits are corrected when an additional trickle comes in. Ciao |
Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
The credits are calculated every 4 hours, so I think it is quite normal for there to be an extra trickle that hasn't been counted yet. Some of those differences are not quite 94.52, but that is because the per-trickle credit is really 94.5175. 780535 last trickled on 5 June at 21:37. Hmm, time difference? Doubt it. I don't understand 677468 either. I wonder if Carl's recalc is off. |
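The arithmetic in the posts above can be checked with a short sketch. It assumes the per-trickle credit of 94.5175 quoted in this thread; the function name and the data layout are illustrative only and are not part of any CPDN or BOINC tool.

```python
# Credit granted per trickle, as quoted in this thread (assumed to be constant).
CREDIT_PER_TRICKLE = 94.5175


def uncredited_trickles(trickles: int, granted_credit: float) -> int:
    """Return how many reported trickles are not yet reflected in the credit."""
    credited = round(granted_credit / CREDIT_PER_TRICKLE)
    return trickles - credited


# (result id, trickles reported, credit granted) as posted earlier in the thread.
reported = [
    (872328, 28, 2551.97),
    (780535, 34, 3119.08),
    (698345, 60, 5576.53),
    (677468, 72, 6710.74),
]

for result_id, trickles, credit in reported:
    missing = uncredited_trickles(trickles, credit)
    print(f"Result {result_id}: {missing} trickle(s) not yet credited")
```

On these figures each of the four results is exactly one trickle behind, which would be consistent with the most recent trickle not having been counted yet.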
Send message Joined: 5 Aug 04 Posts: 390 Credit: 2,475,242 RAC: 0 |
Hmm, that's an interesting idea: checking results and misallocated WUs by credit (or by trickle). It may be that 'only' part of the WU got misallocated. Result 807562: 33 trickles and 3024.56 of ~3119.08 credits; the last trickle has not been counted. smudodd, note that the examples you gave follow the same pattern: only the last trickle has not been counted. |
Send message Joined: 28 Aug 04 Posts: 90 Credit: 2,736,552 RAC: 0 |
> "780535 last trickled on 5 June at 21:37. Hmm, time difference? Doubt it." No time difference... UTC+2 hours... so after at least 4 hours it should have been fixed. But it will (hopefully) be fixed once this host sends in another trickle. This behaviour is not normal, in the sense that it did not happen before the server software was updated. And an update to my last post: Result 872328: 28 trickles and 2551.97 of ~2646.56 credits. Host [86867](http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=86867) trickled this morning (in Europe), and the trickle that had not been credited until then has now been credited. So I don't know. Is it possible that the validator or scheduler lags behind for a while, or that some reported trickles are lost, and that when the next incremental trickle is handed in the validator gives credit up to the highest known trickle? Ciao |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
With the exception of result 677468 (which I can't explain either), your credits now look to be right: Result 872328: 29 trickles, 2741.01 credits; Result 780535: 35 trickles, 3308.11 credits; Result 698345: 61 trickles, 5765.57 credits. |
Send message Joined: 28 Aug 04 Posts: 90 Credit: 2,736,552 RAC: 0 |
> "With the exception of result 677468 (which I can't explain either), your credits now look to be right: Result 872328: 29 trickles, 2741.01 credits; Result 780535: 35 trickles, 3308.11 credits; Result 698345: 61 trickles, 5765.57 credits." As you can see, all of these hosts have trickled once since my first post. Feel free to have a look at these results this evening or tomorrow, when some hosts will have trickled again but the credit is not granted until one further trickle gets in. It is not much of a problem for me, as things seem to be corrected when further trickles arrive, and the credits are worth less than the science behind them. But if a result were close to finishing, I would feel better if everything ran as it should. You may feel differently, but I have the strange feeling that something isn't running correctly and that we will hear more complaints. Ciao |
Send message Joined: 31 Aug 04 Posts: 3 Credit: 94,401 RAC: 0 |
I have been wondering why my total credits have not been updated for the past 4 days, and since I came across this thread I understand the problem. When I had a look at my results I noted that the trickles I've been sending in for 2sns_400151773_0 are linked to Host ID 61619. I sent the last trickle on 5 Jun 05: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=844975 and if you follow the link, you can see that the Host ID on that page is 61619, whereas my host ID is 165388. If someone is looking into these cross-linked DB problems, I hope this information helps. Ray Doiron |
Send message Joined: 4 Nov 04 Posts: 16 Credit: 11,577,003 RAC: 0 |
Do we still have problems with misallocated WUs? My host 158490 has, allegedly, received result 923715, 001u_600025051_1 (created 7 Jun 2005 21:18:18 UTC, sent 8 Jun 2005 23:15:12 UTC). However, the host does not have any file named 001u_600025051.zip in its climateprediction.net directory where WUs are queued for crunching. If I see credits coming I'll check the hostid reported by the trickles. LS |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
> "Do we still have problems with misallocated WUs?" The recent server upgrade should prevent that from happening again. > "My host 158490 has, allegedly, received result 923715, 001u_600025051_1 (created 7 Jun 2005 21:18:18 UTC, sent 8 Jun 2005 23:15:12 UTC). However, the host does not have any file named 001u_600025051.zip in its climateprediction.net directory where WUs are queued for crunching." That will happen if the scheduler reply telling you to download that result went missing. If you look at your stdout.txt file you should see a request for work with no scheduler response, followed by a request and reply that caused you to download 08rv_000016322. There is a pending BOINC change to align the work downloaded by a host with what the server database has allocated to it. |
Send message Joined: 4 Nov 04 Posts: 16 Credit: 11,577,003 RAC: 0 |
Many thanks, Thyme Lawn. I was looking for something like stdout or stderr to check whether there were messages clarifying this issue, but without any luck. I have no stdout file in either the BOINC directory or the climateprediction.net one (BOINC 4.19, hadsm 4.13, Linux). If this is just a scheduler matter, then all is well. LS |
Send message Joined: 22 Mar 05 Posts: 2 Credit: 371,502 RAC: 0 |
There is still a problem with misallocated WUs, even outside the range you've noted. I am HostID 138185. For the last 10 days, I was working on ResultID 895452, which relates to WU 582416. I have sent 22 trickles from this WU, with granted credit of 2079.39. However, none of this information shows up on my results page - there is no mention of the ResultID or WUID on the list, and it appears as if I have done nothing between 2nd June (when I finished my last unit) and today (when I have reset the project, thinking something had gone wrong somewhere.) Now, if you look at page http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=895452, you will see the data about this WU. However, it is allocated to HostID 49231. If you click on the "Trickle # 22" link, you will see that the trickles have been submitted between 2nd and 11th June. However, if you follow the link to HostID 49231, you'll see that they haven't sent any trickles since 15th May. Now, when you go to my hostID page (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=138185) you'll see that the latest trickles show the HostID of this ResultID as "hidden". So something has gone wrong with this somewhere and it is hiding the HostID (and getting it wrong). The unfortunate thing is, since I couldn't see anything appearing on my results page, I thought that the results weren't getting through to you, so reset the project. This means that the 209 hours of work that I have done has now been wasted, which is a pity for your project as it will have to be started again by someone else. Unfortunately I didn't think to check the forums before resetting the project, only later. It may be worth noting that SETI and Einstein, which I also use, have useful information on their front page that I look at when there are problems - with no similar page on here, I don't have anything obvious to check against. It may be worth creating something like this for the future? 
I hope this is helpful to you in tracking down this problem. Kind regards, Laluki. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
> "There is still a problem with misallocated WUs, even outside the range you've noted. I am HostID 138185. For the last 10 days, I was working on ResultID 895452, which relates to WU 582416. I have sent 22 trickles from this WU, with [...]" Previously in this thread, Thyme posted this: "It looks like the first post-upgrade result is 904515, so that's the actual cut-off point." |
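The cut-off quoted above can be applied as a simple comparison. This is only a sketch: it takes 904515 from the quoted post and assumes result IDs are handed out in increasing order, so that any lower ID predates the upgrade. The constant and function names are made up for illustration.

```python
# First result ID issued after the server upgrade, as quoted in this thread.
FIRST_POST_UPGRADE_RESULT = 904515


def predates_upgrade(result_id: int) -> bool:
    """True if the result was created before the upgrade and so may be affected."""
    return result_id < FIRST_POST_UPGRADE_RESULT


# Two result IDs mentioned in this thread.
for result_id in (895452, 923715):
    status = "may be misallocated" if predates_upgrade(result_id) else "post-upgrade"
    print(result_id, status)
```

Under that assumption, result 895452 falls inside the affected range, while result 923715 was created after the upgrade.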
Send message Joined: 30 Aug 04 Posts: 1 Credit: 61,134 RAC: 0 |
Thought I'd report another misallocated work unit in case you are still needing data. I've been working on 2uxz_300154762_0. My host id is 54134 and I'm at timestep 194436, phase 2. Result ID is 847965. This work unit appears to be credited to host id 147433, although it is showing a client error. My machine is still crunching away. |
Send message Joined: 6 Sep 04 Posts: 6 Credit: 195,123 RAC: 0 |
My problem work unit has now completed. On the results page, you can get the graphs for phases 2 and 3, but not for phase 1, nor for the full run. I did back up the work unit folder last night just before completion, so if anything needs to be redone, it should be possible to pick up most of the way through the unit. All the messages in my client seem to indicate that the work unit uploaded successfully. Hope all is working smoother now. |
Send message Joined: 7 Sep 04 Posts: 14 Credit: 160,054 RAC: 0 |
I'm currently processing a misallocated work unit (the work unit name in BOINC is 2vvo_300155986_1), which I'm over 50% of the way through. I don't want to kill it if the science will be useful, and I don't particularly care whether I get credits for this one. Reading the guidelines, it seems I'd still be better off killing it if it's being processed by someone else. However, I'm a little confused about how to tell whether my WU is being processed elsewhere or not... I've looked at the trickles that are recorded against my account and, from that list, clicked on the result ID (864094). This shows the name I'd expect (2vvo_300155986_1). When I click the link for the work unit (568245), I get a screen where the work unit name is 2vvo_300155986: exactly the same as mine except without the _1 suffix at the end of the name. Two hosts are marked as crunching it, neither of them me, but both have crashed with a computing error. So on the face of it I'm better off continuing, assuming 2vvo_300155986 and 2vvo_300155986_1 are really the same thing. If I click on the host ID from the page for result 864094, I see a machine where the latest trickles are for an entirely different work unit. In fact it's one of the machines that crashed on 2vvo_300155986. So it looks like I ought to carry on with this unit. Is this analysis correct and should I continue with 2vvo_300155986_1, or should I kill it off? |
Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
Yes you should be fine continuing that WU if you don't mind about credits or are willing to continue and hope the credits will get sorted at some stage. There is no need to worry about the same work unit being sent out under a different resultid. As you say the computer that did complete the first trickle is now processing a different WU (and it isn't a hyperthreaded processor either). |
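The relationship between the two names discussed above can be sketched in a few lines. This assumes the usual BOINC naming pattern in which a result name is the work unit name followed by an underscore and a number; the helper name is hypothetical.

```python
from typing import Tuple


def split_result_name(result_name: str) -> Tuple[str, int]:
    """Split a result name into (work unit name, numeric suffix)."""
    workunit, _, suffix = result_name.rpartition("_")
    if not workunit or not suffix.isdigit():
        raise ValueError(f"unexpected result name: {result_name!r}")
    return workunit, int(suffix)


# Result name taken from the post above.
print(split_result_name("2vvo_300155986_1"))
```

Read this way, 2vvo_300155986_1 is one result of work unit 2vvo_300155986, which matches what the work unit page shows.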
Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
From the problems Heiner explained here : http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2746 it looks very much as if those misallocated results can cause more database trouble when merging hosts. |
Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
Sorry I haven't understood your German. Are you sure it is misallocated WUs causing trouble rather than just database trouble causing problems with merging hosts at present? |
Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
Heiner tried to merge two hosts. This failed, probably because the WUs or the results are connected to a different host. It is very likely that "Couldn't update results" is a problem with the SQL itself, so there might be a connection to the misallocated results. One would have to check the SQL error text, though, and we don't have that. He has credits but no trickles on one host, so he is affected by the problem this thread refers to. ________ Most of my reply to Heiner was about a different problem: he also had a swap space error in one result, and that was easier to explain in German because of the German Windows settings. |