Thread 'Connection and Download issues Oct24'

Author	Message
PDW Joined: 29 Nov 17 Posts: 82 Credit: 14,442,885 RAC: 90,675	Message 71792 - Posted: 1 Nov 2024, 16:06:27 UTC - in response to Message 71791. All systems are GO ! ID: 71792

TLD Joined: 11 Dec 05 Posts: 14 Credit: 2,182,055 RAC: 6,821	Message 71793 - Posted: 1 Nov 2024, 16:29:39 UTC - in response to Message 71792. WUs are in progress here, thank you. ID: 71793

PDW Joined: 29 Nov 17 Posts: 82 Credit: 14,442,885 RAC: 90,675	Message 71795 - Posted: 1 Nov 2024, 17:14:11 UTC Server status page doesn't seem to be updating... Task data as of 1 Nov 2024, 15:25:53 UTC ID: 71795

PDW Joined: 29 Nov 17 Posts: 82 Credit: 14,442,885 RAC: 90,675	Message 71796 - Posted: 1 Nov 2024, 17:45:29 UTC Server page updating, thanks. Free-DC has picked up yesterday's stats files. ID: 71796

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726	Message 71798 - Posted: 1 Nov 2024, 19:24:45 UTC Now boinc in VM is working as well as in WINE. Still would like to understand why they behaved differently though. ID: 71798

makracz Joined: 9 May 24 Posts: 1 Credit: 1,514,273 RAC: 12,136	Message 71799 - Posted: 1 Nov 2024, 19:25:25 UTC Are all new work units already gone? ID: 71799

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726	Message 71800 - Posted: 1 Nov 2024, 19:51:46 UTC - in response to Message 71799. Last modified: 1 Nov 2024, 20:59:37 UTC "Are all new work units already gone?" According to the server status page they have. Most of mine are eas tasks that have timed out on other machines. ID: 71800

AndreyOR Joined: 12 Apr 21 Posts: 317 Credit: 14,816,935 RAC: 19,934	Message 71801 - Posted: 2 Nov 2024, 5:31:38 UTC - in response to Message 71800. Last modified: 2 Nov 2024, 5:32:17 UTC "Most of mine are eas tasks that have timed out on other machines." I got a chunk of these too but it looks like almost all of them will be finished by the original users way before I can finish them. I'm going to suspend them instead of spending time on them for likely no benefit. ID: 71801

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726	Message 71802 - Posted: 2 Nov 2024, 6:29:33 UTC - in response to Message 71801. "'Most of mine are eas tasks that have timed out on other machines.' I got a chunk of these too but it looks like almost all of them will be finished by the original users way before I can finish them. I'm going to suspend them instead of spending time on them for likely no benefit." I looked at three of the machines that had been running these tasks. I am pretty sure most if not all of those I have will finish first on my machine. All three of the machines I looked at have well over 50% error rate as well, so there is some doubt whether they would ever finish on the original machines. ID: 71802

Richard Haselgrove Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977	Message 71806 - Posted: 3 Nov 2024, 17:45:29 UTC Well, that was a nice quiet weekend. From what I can see, once the logjam was released late on Friday afternoon, everything has been running as it should. Uploads and trickle reports have been sent to their respective destinations, task pages show that credit awards have been made in real time, and the external aggregation sites have been able to collect their data packages as normal. Of course, the relatively few remaining tasks in this batch were scooped up very quickly, so we can't confirm just yet that every host that requests work can be serviced. But it's looking good. The Friday restart was the completion of the recovery process, with DNS and SSL returned to their status quo ante. But that leaves some space to consider the initial cause of the problems - the one which made it impossible to download fresh copies of the application files where needed. After looking through the logs, that seemed to me to be an attempt to deploy Cloudflare - a transparent caching service. This would actually be very useful to the project - it can save a huge amount of (paid-for) bandwidth when new applications are to be deployed. According to Glenn, "The next project to go out will be using the HadAM4 N216 application, linux only." - once final development tweaks to the application have been added and tested. So that's exactly the situation where Cloudflare would be helpful. I would hope that the team will use this quiet break between batches to double-check the Cloudflare manual and try again (and if they weren't planning to, I would suggest it!). But this time, please test it while things are resting, not in the heat of a batch release! ID: 71806

PDW Joined: 29 Nov 17 Posts: 82 Credit: 14,442,885 RAC: 90,675	Message 71807 - Posted: 3 Nov 2024, 19:46:58 UTC - in response to Message 71806. Maybe here ! Still have 1 task that can't upload the final _out.zip, gets as far as 1.31/4.75 MB, log says transient HTTP error. 5 trickles reported on Friday afternoon at the same time, has all its credit. 3 other tasks downloaded before the trickles went up and another task came down this afternoon without a problem. ID: 71807

Richard Haselgrove Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977	Message 71808 - Posted: 3 Nov 2024, 21:17:03 UTC - in response to Message 71807. Uploads go direct to the climate researchers who commissioned the batch - in this case, in New Zealand. They don't follow the administrative route to Oxford. ID: 71808

AndreyOR Joined: 12 Apr 21 Posts: 317 Credit: 14,816,935 RAC: 19,934	Message 71809 - Posted: 4 Nov 2024, 1:36:44 UTC - in response to Message 71802. "'"Most of mine are eas tasks that have timed out on other machines." I got a chunk of these too but it looks like almost all of them will be finished by the original users way before I can finish them. I'm going to suspend them instead of spending time on them for likely no benefit.' I looked at three of the machines that had been running these tasks. I am pretty sure most if not all of those I have will finish first on my machine. All three of the machines I looked at have well over 50% error rate as well so there is some doubt whether they would ever finish on the original machines." As expected, the other user has finished the tasks I got due to a time-out. However, I still have the tasks and they show as In Progress. I'd have expected the server to cancel them, like Rosetta does, in situations like this. How does CPDN handle such cases? Do I have to abort them myself? ID: 71809

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726	Message 71810 - Posted: 4 Nov 2024, 8:14:09 UTC - in response to Message 71809. Of mine, 4 have completed. For the rest, I have overtaken the original machine or am very close to having done so. I am going to suspend the ones that have completed but suspect Glenn will suggest deleting them. The only reason I can think of for letting them complete would be if someone wanted to compare results on different architecture machines. ID: 71810

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726	Message 71811 - Posted: 4 Nov 2024, 12:04:22 UTC - in response to Message 71807. "Still have 1 task that can't upload the final _out.zip, gets as far as 1.31/4.75 MB, log says transient HTTP error." Has that last _out.zip cleared? As your computers are hidden I can't check anything. (Not a request to unhide them, just an explanation.) ID: 71811

PDW Joined: 29 Nov 17 Posts: 82 Credit: 14,442,885 RAC: 90,675	Message 71812 - Posted: 4 Nov 2024, 12:53:37 UTC - in response to Message 71811. No, not yet. It is an eas25 batch 1021 task but I have more of those running in the same client and a different client on the same machine that are having no problems uploading their zip files. It would appear something at the far end doesn't want to talk about that task yet. All the ones I've had like this do eventually sort themselves out. ID: 71812

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726	Message 71813 - Posted: 4 Nov 2024, 13:10:12 UTC "All the ones I've had like this do eventually sort themselves out." That is my experience too. If you enable http debug, do you get something like "locked by file upload handler"? That happens when something has interrupted the upload of the file. I don't know what the backoff time on the server is before it allows you to resume the upload, but I have had a number of occasions when it has been several hours. ID: 71813

PDW Joined: 29 Nov 17 Posts: 82 Credit: 14,442,885 RAC: 90,675	Message 71817 - Posted: 4 Nov 2024, 14:18:43 UTC - in response to Message 71813. It finds upload7.cpdn.org in the DNS cache and connects to upload7.cpdn.org. Usual C3PO/R2D2 gibberish, and then it gets "Info: Recv failure: Connection was reset" and then "HTTP error: Failure when receiving data from the peer". Have tried flushing local DNS cache but still the same error. Others have taken many days too, maybe a week is enough. ID: 71817

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 71819 - Posted: 4 Nov 2024, 15:05:54 UTC - in response to Message 71809. "As expected, the other user has finished the tasks I got due to a time-out. However, I still have the tasks and they show as In Progress. I'd have expected the server to cancel them, like Rosetta does, in situations like this. How does CPDN handle such cases? Do I have to abort them myself?" I checked with Andy about this. CPDN doesn't issue a 'not needed' response if an earlier task in the workunit finishes. Experience has taught them users get annoyed by tasks being killed. So, yes, you'll need to abort it yourself. --- CPDN Visiting Scientist ID: 71819

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726	Message 71820 - Posted: 4 Nov 2024, 15:43:22 UTC - in response to Message 71819. Last modified: 4 Nov 2024, 15:54:40 UTC "I checked with Andy about this. CPDN doesn't issue a 'not needed' response if an earlier task in the workunit finishes. Experience has taught them users get annoyed by tasks being killed. So, yes, you'll need to abort it yourself." If only BOINC had an option to say you were more interested in the science than in credit, allowing unwanted tasks to be killed by the project for those people. On checking through the tasks, it was just three on my box that had completed by today. At least two hadn't even started, so unless the person (not) running them has a very fast computer, there isn't much doubt my Ryzen 9 will get in first. Edit: If I had a vote, it would be for the tasks to be deleted. It might cut down on the numbers crunching for CPDN but over time might weed out some habitual very slow returners. But I get that such decisions are way above my pay grade. I am not intending to make waves by expressing my opinion! ID: 71820