Message boards : Number crunching : Connection and Download issues Oct24
Author | Message |
---|---|
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,456,138 RAC: 90,728 |
All systems are GO ! |
Send message Joined: 11 Dec 05 Posts: 14 Credit: 2,182,884 RAC: 6,875 |
WUs are in progress here, thank you. |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,456,138 RAC: 90,728 |
Server status page doesn't seem to be updating... Task data as of 1 Nov 2024, 15:25:53 UTC |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,456,138 RAC: 90,728 |
Server page updating, thanks. Free-DC has picked up yesterday's stats files. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,004,017 RAC: 21,574 |
Now boinc in VM is working as well as in WINE. Still would like to understand why they behaved differently though. |
Send message Joined: 9 May 24 Posts: 1 Credit: 1,515,930 RAC: 11,940 |
Are all new work units already gone? |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,004,017 RAC: 21,574 |
"Are all new work units already gone?" According to the server status page, they are. Most of mine are eas tasks that have timed out on other machines. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,819,420 RAC: 19,777 |
"Most of mine are eas tasks that have timed out on other machines." I got a chunk of these too, but it looks like almost all of them will be finished by the original users well before I can finish them. I'm going to suspend them instead of spending time on them for likely no benefit. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,004,017 RAC: 21,574 |
"Most of mine are eas tasks that have timed out on other machines." I looked at three of the machines that had been running these tasks. I am pretty sure most, if not all, of those I have will finish first on my machine. All three of the machines I looked at have well over a 50% error rate as well, so there is some doubt whether the tasks would ever finish on the original machines. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
Well, that was a nice quiet weekend. From what I can see, once the logjam was released late on Friday afternoon, everything has been running as it should. Uploads and trickle reports have been sent to their respective destinations, task pages show that credit awards have been made in real time, and the external aggregation sites have been able to collect their data packages as normal. Of course, the relatively few remaining tasks in this batch were scooped up very quickly, so we can't confirm just yet that every host that requests work can be serviced. But it's looking good. The Friday restart was the completion of the recovery process, with DNS and SSL returned to their status quo ante. But that leaves some space to consider the initial cause of the problems - the one which made it impossible to download fresh copies of the application files where needed. After looking through the logs, that seemed to me to be an attempt to deploy 'cloudflare' - a transparent caching program. This would actually be very useful to the project - it can save a huge amount of (paid-for) bandwidth when new applications are to be deployed. According to Glenn, "The next project to go out will be using the HadAM4 N216 application, linux only." - once final development tweaks to the application have been added and tested. So that's exactly the situation where cloudflare would be helpful. I would hope that the team will use this quiet break between batches to double-check the cloudflare manual and try again (and if they weren't planning to, I would suggest it!). But this time, please test it while things are resting, not in the heat of a batch release! |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,456,138 RAC: 90,728 |
Maybe here! Still have 1 task that can't upload the final _out.zip; it gets as far as 1.31/4.75 MB and the log says transient HTTP error. 5 trickles reported on Friday afternoon at the same time, and it has all its credit. 3 other tasks downloaded before the trickles went up, and another task came down this afternoon without a problem. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
Uploads go direct to the climate researchers who commissioned the batch - in this case, in New Zealand. They don't follow the administrative route to Oxford. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,819,420 RAC: 19,777 |
"Most of mine are eas tasks that have timed out on other machines." "I looked at three of the machines that had been running these tasks. I am pretty sure most if not all of those I have will finish first on my machine. All three of the machines I looked at have well over 50% error rate as well so there is some doubt whether they would ever finish on the original machines." As expected, the other user has finished the tasks I got due to a time-out. However, I still have the tasks and they show as In Progress. I'd have expected the server to cancel them, like Rosetta does, in situations like this. How does CPDN handle such cases? Do I have to abort them myself? |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,004,017 RAC: 21,574 |
Of mine, 4 have completed. On the rest, I have overtaken the original machine or am very close to doing so. I am going to suspend the ones that have completed, but I suspect Glenn will suggest deleting them. The only reason I can think of for letting them complete would be if someone wanted to compare results on different architecture machines. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,004,017 RAC: 21,574 |
"Still have 1 task that can't upload the final _out.zip, gets as far as 1.31/4.75 MB, log says transient HTTP error." Has that last out.zip cleared? As your computers are hidden I can't check anything. (Not a request to unhide them, just an explanation.) |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,456,138 RAC: 90,728 |
No, not yet. It is an eas25 batch 1021 task but I have more of those running in the same client and a different client on the same machine that are having no problems uploading their zip files. It would appear something at the far end doesn't want to talk about that task yet. All the ones I've had like this do eventually sort themselves out. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,004,017 RAC: 21,574 |
"All the ones I've had like this do eventually sort themselves out." That is my experience too. If you enable http debug, do you get something like "locked by file upload handler"? That happens when something has interrupted the upload of the file. I don't know what the backoff time on the server is before it allows you to resume the upload, but I have had a number of occasions when it has been several hours. |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,456,138 RAC: 90,728 |
It finds upload7.cpdn.org in the DNS cache and connects to upload7.cpdn.org. Usual C3PO/R2D2 gibberish, and then it gets an "Info: Recv failure: Connection was reset" and then "HTTP error: Failure when receiving data from the peer". Have tried flushing the local DNS cache but still get the same error. Others have taken many days too; maybe a week is enough. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"As expected, the other user has finished the tasks I got due to a time-out. However, I still have the tasks and they show as In Progress. I'd have expected the server to cancel them, like Rosetta does, in situations like this. How does CPDN handle such cases? Do I have to abort them myself?" I checked with Andy about this. CPDN doesn't issue a 'not needed' response if an earlier task in the workunit finishes. Experience has taught them that users get annoyed by tasks being killed. So, yes, you'll need to abort it yourself. --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,004,017 RAC: 21,574 |
"I checked with Andy about this. CPDN doesn't issue a 'not needed' response if an earlier task in the workunit finishes. Experience has taught them that users get annoyed by tasks being killed. So, yes, you'll need to abort it yourself." If only BOINC had an option to say you were more interested in the science than in credit, allowing unwanted tasks to be killed by the project for those people. On checking through the tasks, it was just three on my box that had completed by today. At least two hadn't even started, so unless the person (not) running them has a very fast computer, there isn't much doubt my Ryzen9 will get in first. Edit: If I had a vote, it would be for the tasks to be deleted. It might cut down on the numbers crunching for CPDN, but over time might weed out some habitual very slow returners. But I get that such decisions are way above my pay grade. I am not intending to make waves by expressing my opinion! |
©2024 cpdn.org