climateprediction.net (CPDN) home page
Thread 'Connection and Download issues Oct24'

PDW
Joined: 29 Nov 17
Posts: 82
Credit: 14,741,514
RAC: 87,063
Message 71792 - Posted: 1 Nov 2024, 16:06:27 UTC - in response to Message 71791.  

All systems are GO !
ID: 71792
TLD
Joined: 11 Dec 05
Posts: 14
Credit: 2,216,060
RAC: 7,493
Message 71793 - Posted: 1 Nov 2024, 16:29:39 UTC - in response to Message 71792.  

WUs are in progress here, thank you.
ID: 71793
PDW
Joined: 29 Nov 17
Posts: 82
Credit: 14,741,514
RAC: 87,063
Message 71795 - Posted: 1 Nov 2024, 17:14:11 UTC

Server status page doesn't seem to be updating...

Task data as of 1 Nov 2024, 15:25:53 UTC
ID: 71795
PDW
Joined: 29 Nov 17
Posts: 82
Credit: 14,741,514
RAC: 87,063
Message 71796 - Posted: 1 Nov 2024, 17:45:29 UTC

Server page updating, thanks.
Free-DC has picked up yesterday's stats files.
ID: 71796
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71798 - Posted: 1 Nov 2024, 19:24:45 UTC

BOINC in the VM is now working, as well as in WINE. I would still like to understand why they behaved differently, though.
ID: 71798
makracz
Joined: 9 May 24
Posts: 1
Credit: 1,562,359
RAC: 12,234
Message 71799 - Posted: 1 Nov 2024, 19:25:25 UTC

Are all new work units already gone?
ID: 71799
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71800 - Posted: 1 Nov 2024, 19:51:46 UTC - in response to Message 71799.  
Last modified: 1 Nov 2024, 20:59:37 UTC

Are all new work units already gone?

According to the server status page they have. Most of mine are eas tasks that have timed out on other machines.
ID: 71800
AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,884,880
RAC: 19,188
Message 71801 - Posted: 2 Nov 2024, 5:31:38 UTC - in response to Message 71800.  
Last modified: 2 Nov 2024, 5:32:17 UTC

Most of mine are eas tasks that have timed out on other machines.

I got a chunk of these too but it looks like almost all of them will be finished by the original users way before I can finish them. I'm going to suspend them instead of spending time on them for likely no benefit.
ID: 71801
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71802 - Posted: 2 Nov 2024, 6:29:33 UTC - in response to Message 71801.  

Most of mine are eas tasks that have timed out on other machines.

I got a chunk of these too but it looks like almost all of them will be finished by the original users way before I can finish them. I'm going to suspend them instead of spending time on them for likely no benefit.

I looked at three of the machines that had been running these tasks. I am pretty sure most, if not all, of those I have will finish first on my machine. All three of the machines I looked at also have error rates well over 50%, so there is some doubt whether the tasks would ever finish on the original machines.
ID: 71802
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,716,561
RAC: 8,355
Message 71806 - Posted: 3 Nov 2024, 17:45:29 UTC

Well, that was a nice quiet weekend. From what I can see, once the logjam was released late on Friday afternoon, everything has been running as it should. Uploads and trickle reports have been sent to their respective destinations, task pages show that credit awards have been made in real time, and the external aggregation sites have been able to collect their data packages as normal. Of course, the relatively few remaining tasks in this batch were scooped up very quickly, so we can't confirm just yet that every host that requests work can be serviced. But it's looking good.

The Friday restart was the completion of the recovery process, with DNS and SSL returned to their status quo ante. But that leaves some space to consider the initial cause of the problems - the one which made it impossible to download fresh copies of the application files where needed.

After looking through the logs, that seemed to me to be an attempt to deploy Cloudflare - a transparent caching service. This would actually be very useful to the project - it can save a huge amount of (paid-for) bandwidth when new applications are to be deployed. According to Glenn, "The next project to go out will be using the HadAM4 N216 application, linux only." - once final development tweaks to the application have been added and tested. So that's exactly the situation where Cloudflare would be helpful.

I would hope that the team will use this quiet break between batches to double-check the Cloudflare manual and try again (and if they weren't planning to, I would suggest it!). But this time, please test it while things are resting, not in the heat of a batch release!
ID: 71806
PDW
Joined: 29 Nov 17
Posts: 82
Credit: 14,741,514
RAC: 87,063
Message 71807 - Posted: 3 Nov 2024, 19:46:58 UTC - in response to Message 71806.  

Maybe here !

Still have 1 task that can't upload the final _out.zip, gets as far as 1.31/4.75 MB, log says transient HTTP error.
5 trickles reported on Friday afternoon at the same time, has all its credit.
3 other tasks downloaded before the trickles went up and another task came down this afternoon without a problem.
ID: 71807
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,716,561
RAC: 8,355
Message 71808 - Posted: 3 Nov 2024, 21:17:03 UTC - in response to Message 71807.  

Uploads go direct to the climate researchers who commissioned the batch - in this case, in New Zealand. They don't follow the administrative route to Oxford.
ID: 71808
AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,884,880
RAC: 19,188
Message 71809 - Posted: 4 Nov 2024, 1:36:44 UTC - in response to Message 71802.  

Most of mine are eas tasks that have timed out on other machines.

I got a chunk of these too but it looks like almost all of them will be finished by the original users way before I can finish them. I'm going to suspend them instead of spending time on them for likely no benefit.
I looked at three of the machines that had been running these tasks. I am pretty sure most, if not all, of those I have will finish first on my machine. All three of the machines I looked at also have error rates well over 50%, so there is some doubt whether the tasks would ever finish on the original machines.

As expected, the other user has finished the tasks I got due to a time-out. However, I still have the tasks and they show as In Progress. I'd have expected the server to cancel them, like Rosetta does, in situations like this. How does CPDN handle such cases? Do I have to abort them myself?
ID: 71809
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71810 - Posted: 4 Nov 2024, 8:14:09 UTC - in response to Message 71809.  

Of mine, 4 have completed. For the rest, I have overtaken the original machine or am very close to doing so. I am going to suspend the ones that have completed but suspect Glenn will suggest deleting them. The only reason I can think of for letting them complete would be if someone wanted to compare results on different architecture machines.
ID: 71810
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71811 - Posted: 4 Nov 2024, 12:04:22 UTC - in response to Message 71807.  

Still have 1 task that can't upload the final _out.zip, gets as far as 1.31/4.75 MB, log says transient HTTP error.

Has that last out.zip cleared? As your computers are hidden I can't check anything. (Not a request to unhide them, just an explanation.)
ID: 71811
PDW
Joined: 29 Nov 17
Posts: 82
Credit: 14,741,514
RAC: 87,063
Message 71812 - Posted: 4 Nov 2024, 12:53:37 UTC - in response to Message 71811.  

No, not yet.

It is an eas25 batch 1021 task but I have more of those running in the same client and a different client on the same machine that are having no problems uploading their zip files. It would appear something at the far end doesn't want to talk about that task yet. All the ones I've had like this do eventually sort themselves out.
ID: 71812
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71813 - Posted: 4 Nov 2024, 13:10:12 UTC

All the ones I've had like this do eventually sort themselves out.

That is my experience too.
If you enable http debug, do you get something like "locked by file upload handler"? That happens when something has interrupted the upload of the file. I don't know what the backoff time on the server is before it allows you to resume the upload, but I have had a number of occasions when it has been several hours.
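
For anyone who hasn't switched that logging on before: the BOINC client reads its log flags from cc_config.xml in the BOINC data directory, and http_debug is the flag in question. Below is a minimal sketch that just prints a config with that flag set. The helper name build_cc_config is made up for illustration, and it assumes you have no existing cc_config.xml (if you do, merge the flag into it by hand rather than overwriting your other settings).

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    ET.indent(root)  # pretty-print the tree; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # http_debug is the BOINC client log flag for HTTP transfer details.
    print(build_cc_config(["http_debug"]))
```

Save the output as cc_config.xml, then tell the client to re-read its config files (from the Manager, or with boinccmd --read_cc_config) and retry the upload. Remember to turn the flag off again afterwards, as it makes the event log very noisy.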
ID: 71813
PDW
Joined: 29 Nov 17
Posts: 82
Credit: 14,741,514
RAC: 87,063
Message 71817 - Posted: 4 Nov 2024, 14:18:43 UTC - in response to Message 71813.  

It finds upload7.cpdn.org in the DNS cache and connects to upload7.cpdn.org.
Usual C3PO/R2D2 gibberish, and then it gets "Info: Recv failure: Connection was reset"
and then "HTTP error: Failure when receiving data from the peer".

Have tried flushing local DNS cache but still the same error.

Others have taken many days too; maybe a week is enough.
ID: 71817
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 71819 - Posted: 4 Nov 2024, 15:05:54 UTC - in response to Message 71809.  

As expected, the other user has finished the tasks I got due to a time-out. However, I still have the tasks and they show as In Progress. I'd have expected the server to cancel them, like Rosetta does, in situations like this. How does CPDN handle such cases? Do I have to abort them myself?

I checked with Andy about this. CPDN doesn't issue a 'not needed' response if an earlier task in the workunit finishes. Experience has taught them that users get annoyed by tasks being killed. So, yes, you'll need to abort it yourself.
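
If there are a lot of tasks to abort, it can be done from the command line instead of clicking through the Manager. A rough sketch, assuming boinccmd is on your PATH and can talk to the local client. The function names are made up for illustration, and PROJECT_URL and TASK_NAME are placeholders to be replaced with the values that boinccmd --get_tasks reports.

```python
import subprocess


def abort_command(project_url, task_name):
    """Build the boinccmd argument list that aborts one task."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


def abort_task(project_url, task_name):
    """Run the abort; raises CalledProcessError if boinccmd reports failure."""
    return subprocess.run(abort_command(project_url, task_name), check=True)


if __name__ == "__main__":
    # Placeholders only: copy the real project URL and task name from
    # the output of `boinccmd --get_tasks` before calling abort_task().
    print(" ".join(abort_command("PROJECT_URL", "TASK_NAME")))
```

Running the script as it stands only prints the command, so you can check it before letting abort_task() loose on anything.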
---
CPDN Visiting Scientist
ID: 71819
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71820 - Posted: 4 Nov 2024, 15:43:22 UTC - in response to Message 71819.  
Last modified: 4 Nov 2024, 15:54:40 UTC

I checked with Andy about this. CPDN doesn't issue a 'not needed' response if an earlier task in the workunit finishes. Experience has taught them that users get annoyed by tasks being killed. So, yes, you'll need to abort it yourself.

If only BOINC had an option to say you were more interested in the science than in credit, allowing unwanted tasks to be killed by the project for those people. On checking through the tasks, it was just three on my box that had completed by today. At least two hadn't even started so unless the person (not) running them has a very fast computer, there isn't much doubt my Ryzen 9 will get in first.

Edit: If I had a vote, it would be for the tasks to be deleted. It might cut down on the numbers crunching for CPDN but over time might weed out some habitual very slow returners. But I get that such decisions are way above my pay grade. I am not intending to make waves by expressing my opinion!
ID: 71820

©2024 cpdn.org