Thread 'Batch 996 Weather@Home2 East Asia25'

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69911 - Posted: 17 Oct 2023, 9:54:54 UTC - in response to Message 69909.  

I'm waiting on confirmation that they've increased the maximum allowed httpd connections, and I've sent them details from a couple of users who have very large uploads waiting so they can investigate whether any DDoS protection is blocking them. But otherwise the server is working fine.
ID: 69911
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69912 - Posted: 17 Oct 2023, 11:27:38 UTC - in response to Message 69911.  
Last modified: 17 Oct 2023, 11:27:52 UTC

The max number of connections to the Korean upload server has been increased from 256 to 1000. At the time Andy@CPDN made the change there were 116 active connections. IT in Korea are investigating further and I'll report back if they find anything.
ID: 69912
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,736,027
RAC: 4,083
Message 69915 - Posted: 17 Oct 2023, 11:51:01 UTC - in response to Message 69912.  
Last modified: 17 Oct 2023, 12:49:35 UTC

Thanks Glenn.
Sadly, either this change is going to take some time to actually improve the situation, or it wasn't the complete solution. I've still got three zip files failing at every retry:

wah2_eas25_a0uz_199012_24_996_012224663_2_r735015961_1.zip - 79.36% after 14:26 transfer time
wah2_eas25_a4ml_20142_24_996_012229545_2_r1812486379_8.zip - 47.67% after 15:20 transfer time
wah2_eas25_a4ml_20142_24_996_012229545_2_r1812486379_3.zip - 1.40% after 35:02 transfer time

Both tasks are still running. The first one, wah2_eas25_a0uz_199012_24_996, is due to finish in about 4.5 days, and wah2_eas25_a4ml_20142_24_996 in just under 2 days.
Both my other tasks are uploading their zips in a timely manner, but even these can take a couple of retries (or, should that be a couple of retries?).
{edit to add}
The situation has gone backwards - all new uploads are descending rapidly into the re-try cycle, so the situation is certainly no better than it was before.
ID: 69915
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69917 - Posted: 17 Oct 2023, 11:58:35 UTC - in response to Message 69915.  

I don't know if it's the limit; it could also just be network congestion causing the dropouts, but I'm not in close touch with the Korean side. I don't know what kind of bandwidth there is going to the server.

I've got the same issue myself. Uploads are currently stalled around 90% on their 60th retry. But equally, I come back to my PC in the morning and previously stalled transfers with high retry counts have gone. Maybe the congestion eases during the night?

I know the Korean IT guys are keen to investigate and I've passed on details of IP addresses to look at. Hopefully they will find out more.
ID: 69917
zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 69918 - Posted: 17 Oct 2023, 12:37:26 UTC - in response to Message 69912.  

The max number of connections to the Korean upload server has been increased from 256 to 1000. At the time Andy@CPDN made the change there were 116 active connections. IT in Korea are investigating further and I'll report back if they find anything.


FWIW, no change from my end. Pending uploads is now up to 36 for me.
ID: 69918
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69920 - Posted: 17 Oct 2023, 15:53:48 UTC - in response to Message 69918.  
Last modified: 17 Oct 2023, 15:53:56 UTC

Talking with Andy, the feeling is that it's a bandwidth issue to S. Korea.
ID: 69920
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,703,308
RAC: 9,860
Message 69921 - Posted: 17 Oct 2023, 16:16:17 UTC - in response to Message 69920.  

Some time ago, I said I'd try to analyse my logs to see if upload times varied with time of day. Not in any recognisable way, seems to be the answer:


(except for that one outlier, of course)

10-Oct-2023 12:16:32 [climateprediction.net] Started upload of wah2_eas25_a02o_198512_24_996_012223644_0_r1306186109_15.zip
10-Oct-2023 13:27:39 [climateprediction.net] Finished upload of wah2_eas25_a02o_198512_24_996_012223644_0_r1306186109_15.zip (99616522 bytes)
Over an hour, but no sign of why.
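
For anyone who wants to repeat this kind of check on their own event log, a minimal sketch in Python follows. It assumes the log lines look like the two quoted above; the function name and the regular expression are illustrative and are not part of BOINC.

```python
import re
from datetime import datetime

# Matches event-log lines such as:
# 10-Oct-2023 12:16:32 [climateprediction.net] Started upload of <file>
# 10-Oct-2023 13:27:39 [climateprediction.net] Finished upload of <file> (N bytes)
LINE_RE = re.compile(
    r"^(\d{1,2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[[^\]]+\] "
    r"(Started|Finished) upload of (\S+)(?: \((\d+) bytes\))?"
)

def upload_stats(lines):
    """Return {filename: (elapsed_seconds, bytes_per_second)} for finished uploads."""
    started = {}
    stats = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%d-%b-%Y %H:%M:%S")
        event, name, size = m.group(2), m.group(3), m.group(4)
        if event == "Started":
            # Keep the first attempt so retries and back-offs count as elapsed time.
            started.setdefault(name, stamp)
        elif name in started and size is not None:
            seconds = (stamp - started[name]).total_seconds()
            rate = int(size) / seconds if seconds else 0.0
            stats[name] = (seconds, rate)
    return stats
```

Fed the two lines above, this gives 4267 seconds (about 71 minutes) for 99616522 bytes, which is an average of roughly 23 kB/s.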
ID: 69921
ChelseaOilman
Joined: 24 Dec 19
Posts: 32
Credit: 40,990,926
RAC: 77,387
Message 69922 - Posted: 17 Oct 2023, 17:01:43 UTC

Things seem to be improving ever so slightly. I now don't have to scroll down the page to see all my tasks waiting to upload.

Can someone tell me what's going on with the file at the top of the list? Progress shows 100%. It's actually more like 180%. How is that even possible?

ID: 69922
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69924 - Posted: 17 Oct 2023, 17:12:45 UTC - in response to Message 69922.  

Can someone tell me what's going on with the file at the top of the list? Progress shows 100%. It's actually more like 180%. How is that even possible?
I'm guessing it's packet loss & resends. The client's counting the total packets sent?
ID: 69924
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,736,027
RAC: 4,083
Message 69925 - Posted: 17 Oct 2023, 17:26:28 UTC - in response to Message 69924.  

I had one like that a short time ago. After digging through the log file, it was fairly obvious that the BOINC client does get "somewhat confused" periodically and counts packets sent (but not acknowledged) as having arrived safely, so they are added to the total transmitted. In my case a subsequent re-try reset the figure to zero, then to a more accurate value.
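
To illustrate how a figure above 100% could arise under that explanation, here is a toy sketch in Python. It is not the BOINC client's actual code; it only assumes, hypothetically, that the displayed progress is bytes sent divided by file size, with resent bytes counted again. The function names and the numbers are made up for illustration.

```python
def naive_progress(file_size, sent_chunks):
    """Percent 'progress' if every sent byte is counted, including resends."""
    return 100.0 * sum(sent_chunks) / file_size

def acked_progress(file_size, acked_bytes):
    """Percent progress based only on bytes the server has acknowledged."""
    return 100.0 * min(acked_bytes, file_size) / file_size

# Hypothetical numbers: a 100 MB file where 80 MB had to be sent twice.
size = 100_000_000
sent = [100_000_000, 80_000_000]
print(naive_progress(size, sent))          # 180.0
print(acked_progress(size, 100_000_000))   # 100.0
```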
ID: 69925
Tomcat
Joined: 29 May 15
Posts: 17
Credit: 717,192
RAC: 12,206
Message 69928 - Posted: 17 Oct 2023, 21:22:17 UTC - in response to Message 69898.  

While increasing the number of connections should help by reducing the number of "first time stalls", there appears to be an issue that's leading to tasks with high re-try counts. The symptom is that once a task reaches a certain number of re-tries it becomes increasingly probable that it will fail on its next attempt. So, thinking out loud here, is the time-out before declaring a failure too short for the "find this zip" time?


Yeah. I've noticed this too. I have 5 files that will not upload no matter what. The percentage never changes on those. Three of them are extremely old, trickle numbers 2, 3, and 9 out of over 20. The rest have all uploaded successfully. I cannot tell if this is coincidence, like how limiting my upload speeds seemed to finally get uploads going.

BTW, how does one view the retry count?
ID: 69928
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69931 - Posted: 18 Oct 2023, 11:32:25 UTC

Upload news
We've had confirmation that the security policy on the HTTP port at the S. Korea site is blocking some connections to the upload server due to the high number of attempts. Not surprisingly, the site does not want to open up the port, so CPDN is going to switch the upload address to the UK JASMIN site (the upload URL is just an alias and can be pointed to other machines). This should happen later today, and then it'll take a day or so for the change to propagate through the nameservers.

In case anyone is wondering, the JASMIN upload server sits outside the main firewall at the site and doesn't have the same problem. The problem with S. Korea wasn't seen with earlier batches because the total number of uploads was much lower.

So please don't abort any outstanding transfers.
---
CPDN Visiting Scientist
ID: 69931
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4539
Credit: 19,008,987
RAC: 21,524
Message 69933 - Posted: 18 Oct 2023, 11:49:27 UTC - in response to Message 69931.  

Great to have an explanation.
Thanks Glenn.
ID: 69933
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,703,308
RAC: 9,860
Message 69934 - Posted: 18 Oct 2023, 12:31:16 UTC

One of the problems is that neither the user, nor the project, has any control over the retry interval for a file upload. One of my hosts caught that for this current eas run:

11-Oct-2023 00:29:44 [climateprediction.net] Started upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 00:32:49 [climateprediction.net] Temporarily failed upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip: transient HTTP error
11-Oct-2023 00:32:49 [climateprediction.net] Backing off 00:02:11 on upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 00:35:01 [climateprediction.net] Started upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 00:35:02 [climateprediction.net] Error reported by file upload server: [wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip] locked by file_upload_handler PID=3911898
11-Oct-2023 00:35:02 [climateprediction.net] Temporarily failed upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip: transient upload error
11-Oct-2023 00:35:02 [climateprediction.net] Backing off 00:04:15 on upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 00:39:18 [climateprediction.net] Started upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 00:39:19 [climateprediction.net] Error reported by file upload server: [wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip] locked by file_upload_handler PID=3911898
11-Oct-2023 00:39:19 [climateprediction.net] Temporarily failed upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip: transient upload error
11-Oct-2023 00:39:19 [climateprediction.net] Backing off 00:12:24 on upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 00:52:33 [climateprediction.net] Started upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 00:52:35 [climateprediction.net] Error reported by file upload server: [wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip] locked by file_upload_handler PID=3911898
11-Oct-2023 00:52:35 [climateprediction.net] Temporarily failed upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip: transient upload error
11-Oct-2023 00:52:35 [climateprediction.net] Backing off 00:25:59 on upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 01:24:01 [climateprediction.net] Started upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 01:24:03 [climateprediction.net] Error reported by file upload server: [wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip] locked by file_upload_handler PID=3911898
11-Oct-2023 01:24:03 [climateprediction.net] Temporarily failed upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip: transient upload error
11-Oct-2023 01:24:03 [climateprediction.net] Backing off 00:51:04 on upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 02:22:09 [climateprediction.net] Started upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip
11-Oct-2023 02:23:23 [climateprediction.net] Finished upload of wah2_eas25_a02v_198512_24_996_012223651_0_r1226154871_14.zip (99189131 bytes)
The delay starts at somewhere around 2 minutes (clearly too short for Korea), and roughly doubles with each attempt: the exact values are randomised, so that different files don't end up retrying in lockstep.
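
The "roughly doubles" pattern can be checked directly from the "Backing off" lines. A small sketch in Python, assuming the log format shown above (the helper names are illustrative, not part of BOINC):

```python
import re

BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of")

def backoff_seconds(lines):
    """Extract each 'Backing off HH:MM:SS' interval, in seconds, in log order."""
    out = []
    for line in lines:
        m = BACKOFF_RE.search(line)
        if m:
            hours, minutes, seconds = (int(g) for g in m.groups())
            out.append(hours * 3600 + minutes * 60 + seconds)
    return out

def growth_ratios(delays):
    """Ratio of each delay to the one before it, rounded to 2 decimals."""
    return [round(b / a, 2) for a, b in zip(delays, delays[1:])]

# The five back-off intervals from the log above, in seconds:
# 00:02:11, 00:04:15, 00:12:24, 00:25:59, 00:51:04
delays = [131, 255, 744, 1559, 3064]
print(growth_ratios(delays))  # [1.95, 2.92, 2.1, 1.97]
```

The ratios sit between about 1.95 and 2.92, which fits a doubling with some random spread on top.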

It might be better if projects could set a 'minimum delay' figure for uploads, as they already can for scheduler contacts. But that would be a difficult change, and I can't see BOINC picking up on it, in its current state. Rollout would also be slow.
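
As a rough picture of what such a 'minimum delay' might look like, here is a Python sketch of a randomised exponential back-off with a project-supplied floor. This is not BOINC's actual algorithm or any existing setting; `min_delay`, `base`, `cap` and the jitter range are all assumptions chosen for illustration.

```python
import random

def retry_delay(attempt, min_delay=0.0, base=120.0, cap=4 * 3600.0, rng=random):
    """Seconds to wait before retry number `attempt` (1-based).

    Doubles the nominal delay per attempt, applies +/-50% jitter so that
    files do not retry in lockstep, then enforces the cap and the floor.
    """
    nominal = base * 2 ** (attempt - 1)
    jittered = nominal * rng.uniform(0.5, 1.5)
    return max(min_delay, min(cap, jittered))

# With a hypothetical one-hour floor, early retries wait at least 3600 s.
for n in range(1, 6):
    print(n, round(retry_delay(n, min_delay=3600.0)))
```

With `min_delay=0.0` the first retry lands somewhere between 60 and 180 seconds, similar to the roughly 2 minute start seen in the log; a floor simply stops the early retries from firing sooner than the project wants.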

Incidentally, this log section gives a sort-of answer for the PID lockout question. 1 hour wasn't enough - try 2 hours.
ID: 69934
ChelseaOilman
Joined: 24 Dec 19
Posts: 32
Credit: 40,990,926
RAC: 77,387
Message 69935 - Posted: 18 Oct 2023, 14:06:57 UTC

Thanks for fixing the problem or finding a workaround! All my tasks have uploaded! :)
ID: 69935
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4539
Credit: 19,008,987
RAC: 21,524
Message 69937 - Posted: 18 Oct 2023, 15:44:30 UTC

The big jump in the number of users reporting tasks is, I think, evidence that switching to JASMIN has worked.

CM3 short 0 1093 --- 0
Weather At Home 2 (wah2) 0 22932 241.2 (87.72 - 576.95) 45
ID: 69937
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,736,027
RAC: 4,083
Message 69938 - Posted: 18 Oct 2023, 16:09:56 UTC - in response to Message 69931.  

Should our computers "automagically" connect to JASMIN now, or will that take some time?

The reason I ask is that mine is still looking at what I believe to be the Korean server, "upload7.cpnd.org", on IP address 141.223.16.156, port 80.
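
One way to see which machine the upload alias currently points at from your own host is to resolve the name and compare it with the address quoted above. A minimal sketch in Python: the function name is illustrative, the hostname should be whatever upload host your own event log shows, and a cached DNS answer on your machine or router may lag behind the real change.

```python
import socket

KOREAN_IP = "141.223.16.156"  # address quoted in the post above

def upload_target(hostname, resolve=socket.gethostbyname):
    """Return (ip, still_korea) for the given upload hostname.

    `resolve` is injectable so the check can be exercised without a network.
    """
    ip = resolve(hostname)
    return ip, ip == KOREAN_IP
```

If the second value comes back `False`, the alias has moved away from the Korean address as far as your resolver is concerned.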
ID: 69938
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4539
Credit: 19,008,987
RAC: 21,524
Message 69939 - Posted: 18 Oct 2023, 16:13:54 UTC - in response to Message 69938.  

You are right, looks like it's not been changed over yet. But something has changed judging by the numbers.
ID: 69939
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 69940 - Posted: 18 Oct 2023, 18:18:05 UTC - in response to Message 69938.  

Should our computers "automagically" connect to JASMIN now, or will that take some time?

The reason I ask is that mine is still looking at what I believe to be the Korean server, "upload7.cpnd.org", on IP address 141.223.16.156, port 80.

Wouldn't this quote from Glenn explain why some may see the changeover faster than others? (Italics mine.)

"We've had confirmation that the security policy on the HTTP port at the S. Korea site is blocking some connections to the upload server due to the high number of attempts. Not surprisingly, the site does not want to open up the port, so CPDN is going to switch the upload address to the UK JASMIN site (the upload URL is just an alias and can be pointed to other machines). This should happen later today, and then it'll take a day or so for the change to propagate through the nameservers."
ID: 69940
Tomcat
Joined: 29 May 15
Posts: 17
Credit: 717,192
RAC: 12,206
Message 69941 - Posted: 19 Oct 2023, 7:30:05 UTC

Uploads complete!

Let's hope the issue has been fixed.
ID: 69941