climateprediction.net (CPDN) home page
Thread 'The uploads are stuck'

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67260 - Posted: 3 Jan 2023, 19:13:35 UTC - in response to Message 67258.  

Bunch of files uploaded then it started getting http error at about 15:30. 15:50, started working again then more errors from 17:10 onwards. May just be the server being overloaded though.


For me, I got two batches. The second batch had a few skips, but usually a minute or less.

10:13:32 to 10:22:37 EST so add 5 hours for UTC
11:40:17 to 12:36:02 EST so add 5 hours for UTC
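
A minimal sketch of that conversion, assuming both windows fall on 3 Jan 2023 (the date of this post) and a fixed EST offset of UTC-5. The helper name `est_to_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zone: EST is UTC-5 (no daylight saving in January).
EST = timezone(timedelta(hours=-5), "EST")


def est_to_utc(clock, day="2023-01-03"):
    """Convert an HH:MM:SS wall-clock time in EST to HH:MM:SS in UTC.

    The default day is an assumption taken from the posting date.
    """
    local = datetime.strptime(f"{day} {clock}", "%Y-%m-%d %H:%M:%S")
    local = local.replace(tzinfo=EST)
    return local.astimezone(timezone.utc).strftime("%H:%M:%S")


# The two upload windows quoted above.
windows = [("10:13:32", "10:22:37"), ("11:40:17", "12:36:02")]
for start, end in windows:
    print(f"{est_to_utc(start)} to {est_to_utc(end)} UTC")
```

That puts the two windows at 15:13:32 to 15:22:37 UTC and 16:40:17 to 17:36:02 UTC.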

Since then, many tries but no successes.
ID: 67260
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67261 - Posted: 3 Jan 2023, 20:18:48 UTC
Last modified: 3 Jan 2023, 20:52:38 UTC

Bunch of files uploaded then it started getting http error at about 15:30. 15:50, started working again then more errors from 17:10 onwards. May just be the server being overloaded though.
This was posted on the project's message board, with the following reply from Andy:


I am afraid you are right. It's lost both its SSH and HTTP ports again. I have sent another email to Matt in JASMIN Support to ask him to investigate this again tomorrow morning.
ID: 67261
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 67264 - Posted: 3 Jan 2023, 22:05:58 UTC

Open file count hitting limits and it can't create new sockets?

If he could get serial console access to it, that should be independent of SSH connections, to investigate and see what's going on. Or remote syslog to something he's got access to. Troubleshooting when it "just locks up" is brutal. :( I've also seen that sort of behavior from running a machine very badly out of RAM. I've no idea how much processing is happening on the upload box vs later on (zips being extracted, etc), but with the amount of uploads flowing in, any sort of "post-processing" tasks could easily thrash the machine into oblivion - the fun of recovering from non-steady-state operation is that you now have to be able to handle all the traffic queued up.

A bunch of my machines refuse to download new tasks as there are too many uploads in progress - not sure if there's a way around it.
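
For anyone wanting to test the open-file-count idea on a Linux box of their own, here is a minimal sketch. It only inspects the Python process itself, and the function name `fd_usage` is made up. For a real server daemon one would look at `/proc/<pid>/fd` and `/proc/<pid>/limits` instead. Nothing here is known about how the CPDN upload server is actually set up.

```python
import os
import resource  # Unix-only module


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd holds one entry per open descriptor (Linux-specific).
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft, hard


if __name__ == "__main__":
    used, soft, hard = fd_usage()
    print(f"open descriptors: {used}, soft limit: {soft}, hard limit: {hard}")
```

If the open count sits close to the soft limit, new sockets will start failing, which would look a lot like ports "disappearing" from the outside.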
ID: 67264
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67272 - Posted: 4 Jan 2023, 2:47:20 UTC - in response to Message 67245.  

traceroute eventually makes its way to the proper destination.

It doesn't for me.
Actually, I think it does, because it seems to fail the same way it is doing for others. I just wonder why traceroute fails like this. Or is DNS failing somewhere along the line?
localhost:jeandavid8[~]$ date
Tue Jan  3 21:17:56 EST 2023
localhost:jeandavid8[~]$ traceroute upload11.cpdn.org
traceroute to upload11.cpdn.org (192.171.169.187), 30 hops max, 60 byte packets
 1  Fios_Quantum_Gateway.fios-router.home (192.168.0.1)  0.317 ms  0.454 ms  0.590 ms
 2  lo0-100.NWRKNJ-VFTTP-309.verizon-gni.net (71.127.205.1)  5.251 ms  8.144 ms  8.196 ms
 3  at-0-0-0-1716.ALT2-CORE-RTR1.verizon-gni.net (100.41.5.68)  8.258 ms at-0-0-0-1717.ALT2-CORE-RTR2.verizon-gni.net (100.41.5.70)  10.218 ms  8.314 ms
 4  0.csi1.NBWKNJNB-MSE01-BB-SU1.ALTER.NET (140.222.4.106)  10.335 ms 0.csi1.NWRKNJ02-MSE01-BB-SU1.ALTER.NET (140.222.4.104)  13.201 ms 0.csi1.NBWKNJNB-MSE01-BB-SU1.ALTER.NET (140.222.4.106)  12.735 ms
 5  * * *
 6  * * *
 7  * * *
 8  nyk-bb2-link.ip.twelve99.net (62.115.135.162)  8.699 ms * *
 9  ldn-bb1-link.ip.twelve99.net (62.115.113.21)  77.718 ms  77.820 ms  77.458 ms
10  ldn-b2-link.ip.twelve99.net (62.115.122.189)  79.894 ms ldn-b2-link.ip.twelve99.net (62.115.120.239)  77.523 ms  74.854 ms
11  jisc-ic345131-ldn-b2.ip.twelve99-cust.net (62.115.175.131)  77.402 ms  77.977 ms  75.352 ms
12  ae24.londhx-sbr1.ja.net (146.97.35.197)  77.833 ms  77.979 ms  77.380 ms
13  ae29.londpg-sbr2.ja.net (146.97.33.2)  97.289 ms  95.016 ms  89.015 ms
14  ae31.erdiss-sbr2.ja.net (146.97.33.22)  80.207 ms  80.350 ms  82.997 ms
15  * * *
16  ral-r26.ja.net (146.97.41.34)  79.900 ms  81.510 ms  84.112 ms
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
localhost:jeandavid8[~]$ 
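
On the DNS question: the first line of the output shows the name resolving to 192.171.169.187, so the lookup itself worked. Rows of stars after the last named hop do not prove the host is down, because many routers and firewalls simply drop traceroute probes. A small sketch for picking out the last hop that answered, using a few lines copied from the trace above (the function name is made up, and it only looks at the first responder on each line):

```python
import re

# Hop number, host name, then a dotted IPv4 address in parentheses.
HOP = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\d{1,3}(?:\.\d{1,3}){3})\)")


def last_responding_hop(output):
    """Return (hop, host, ip) for the last line that named a responder."""
    last = None
    for line in output.splitlines():
        match = HOP.match(line)
        if match:
            last = (int(match.group(1)), match.group(2), match.group(3))
    return last


sample = """\
14  ae31.erdiss-sbr2.ja.net (146.97.33.22)  80.207 ms  80.350 ms  82.997 ms
15  * * *
16  ral-r26.ja.net (146.97.41.34)  79.900 ms  81.510 ms  84.112 ms
17  * * *
18  * * *
"""
print(last_responding_hop(sample))
```

For this trace the last responder is hop 16, ral-r26.ja.net.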

ID: 67272
Saenger
Joined: 1 Nov 04
Posts: 185
Credit: 4,166,063
RAC: 857
Message 67274 - Posted: 4 Jan 2023, 5:36:47 UTC

My traceroute:
saenger@Sanger-sein-Mint:~$ traceroute upload11.cpdn.org
traceroute to upload11.cpdn.org (192.171.169.187), 30 hops max, 60 byte packets
 1  fritz.box (192.168.178.1)  0.546 ms  0.803 ms  1.021 ms
 2  85.16.121.109 (85.16.121.109)  15.733 ms  15.779 ms  15.824 ms
 3  mprt-hb-70740110-xe-2-2-0.ewe-ip-backbone.de (85.16.250.59)  15.886 ms  15.922 ms  15.946 ms
 4  mprt-hb-70740101-xe-1-2-1.ewe-ip-backbone.de (85.16.250.61)  15.960 ms  15.986 ms  16.010 ms
 5  bbrt-hb-1-70730203-ae15.ewe-ip-backbone.de (85.16.249.103)  25.920 ms  25.982 ms  26.031 ms
 6  bbrt-hb-2-70730201-ae24.ewe-ip-backbone.de (80.228.90.30)  19.972 ms  5.301 ms  14.076 ms
 7  et-0-0-53.edge1.Hamburg1.Level3.net (62.67.25.41)  14.100 ms  14.124 ms  14.147 ms
 8  ae2.3201.ear1.London1.level3.net (4.69.141.66)  24.178 ms  25.188 ms  25.958 ms
 9  JANET.ear1.London1.Level3.net (212.187.216.254)  26.599 ms  27.495 ms  28.196 ms
10  ae24.londhx-sbr1.ja.net (146.97.35.197)  29.310 ms  30.612 ms  20.051 ms
11  ae29.londpg-sbr2.ja.net (146.97.33.2)  28.368 ms  28.394 ms  28.418 ms
12  ae31.erdiss-sbr2.ja.net (146.97.33.22)  24.889 ms  26.494 ms  26.519 ms
13  * * *
14  ral-r26.ja.net (146.97.41.34)  30.549 ms  31.887 ms  32.385 ms
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
saenger@Sanger-sein-Mint:~$ 


And some messages from BOINC:
Mi 04 Jan 2023 06:30:41 CET | climateprediction.net | [file_xfer] http op done; retval -184 (transient HTTP error)
Mi 04 Jan 2023 06:30:41 CET | climateprediction.net | [file_xfer] file transfer status -184 (transient HTTP error)
Mi 04 Jan 2023 06:30:41 CET | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0219_1999050100_123_968_12184863_0_r1453880002_3.zip: transient HTTP error
Mi 04 Jan 2023 06:30:41 CET | climateprediction.net | Backing off 00:02:59 on upload of oifs_43r3_ps_0219_1999050100_123_968_12184863_0_r1453880002_3.zip
Mi 04 Jan 2023 06:30:41 CET |  | [http_xfer] [ID#0] HTTP: wrote 2147 bytes
Mi 04 Jan 2023 06:30:41 CET |  | [http_xfer] [ID#0] HTTP: wrote 3223 bytes
Mi 04 Jan 2023 06:30:41 CET |  | [http_xfer] [ID#0] HTTP: wrote 3456 bytes
Mi 04 Jan 2023 06:30:41 CET |  | [http_xfer] [ID#0] HTTP: wrote 3401 bytes
Mi 04 Jan 2023 06:30:41 CET |  | [http_xfer] [ID#0] HTTP: wrote 589 bytes
Mi 04 Jan 2023 06:30:41 CET |  | [http_xfer] [ID#0] HTTP: wrote 1031 bytes
Mi 04 Jan 2023 06:30:42 CET |  | Internet access OK - project servers may be temporarily down.


Some make it through, but I fail to see any pattern in which ones get away.
Greetings from Sänger
ID: 67274
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 67277 - Posted: 4 Jan 2023, 8:25:07 UTC - in response to Message 67274.  

And some messages from BOINC:
I would recommend that you set http_debug, rather than http_xfer_debug. The output gives more detail about the reasons for errors.
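
A minimal sketch of what that looks like in `cc_config.xml`, generated with Python so the XML is guaranteed well-formed. The helper name `build_cc_config` is made up. As far as I know the file lives in the BOINC data directory and the client has to re-read its config files before the flags take effect.

```python
import xml.etree.ElementTree as ET


def build_cc_config(log_flags):
    """Build a cc_config.xml document with the given log flags set to 1 or 0."""
    root = ET.Element("cc_config")
    flags = ET.SubElement(root, "log_flags")
    for name, enabled in log_flags.items():
        ET.SubElement(flags, name).text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


# Turn http_debug on and http_xfer_debug off, as recommended above.
print(build_cc_config({"http_debug": True, "http_xfer_debug": False}))
```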
ID: 67277
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67291 - Posted: 4 Jan 2023, 11:05:25 UTC

From Andy
Update received from JASMIN Support this morning:

We believe that there is still a problem with the block storage device. We are migrating VMs off that device while we investigate the problem. I'll let you know when we have migrated the VMs.

ID: 67291
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67307 - Posted: 4 Jan 2023, 14:13:24 UTC - in response to Message 67291.  
Last modified: 4 Jan 2023, 14:20:16 UTC

I'll let you know when we have migrated the VMs.
This involves transferring several TB of data, so it may be a while. It is also why we have gone back from "transient HTTP error" to "connect failed".
ID: 67307
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67310 - Posted: 4 Jan 2023, 15:22:13 UTC
Last modified: 4 Jan 2023, 15:24:05 UTC

Joy! 2 uploads are now going. It doesn't really matter that more aren't, as my upload bandwidth is saturated.
Edit: my configured maximum of four are now going.
ID: 67310
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67313 - Posted: 4 Jan 2023, 15:51:11 UTC

And now the messages flying about are all about how to keep it running!
ID: 67313
Saenger
Joined: 1 Nov 04
Posts: 185
Credit: 4,166,063
RAC: 857
Message 67314 - Posted: 4 Jan 2023, 16:13:09 UTC - in response to Message 67277.  

And some messages from BOINC:
I would recommend that you set http_debug, rather than http_xfer_debug. The output gives more detail about the reasons for errors.

I did so, but I can't post any errors any more, because everything has been running fine since I hit "Try again" once more after returning from work.
Whatever you did: Thanks!
Greetings from Sänger
ID: 67314
nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 67315 - Posted: 4 Jan 2023, 16:13:37 UTC

It's looking hopeful... one just uploaded. 400+ to go.
ID: 67315
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67316 - Posted: 4 Jan 2023, 16:21:13 UTC - in response to Message 67315.  
Last modified: 4 Jan 2023, 17:08:34 UTC

They sure have enough bandwidth. I am watching the transfers with my system monitor and I estimate I am uploading at a rate of 5 Megabytes/second and have hit 11 Megabytes per second from time-to-time. I hope it keeps up.

Edit 1: I spoke too soon. Transfers stopped, retry in about 10 minutes. It tried again and failed to transfer any more. It is now resting for about another half hour.

Edit 2: It tried again, but failed. Backing off another hour.
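
The growing waits (about 10 minutes, then half an hour, then an hour) are what a retry backoff looks like from the client side. The sketch below is a generic exponential backoff with jitter, not BOINC's actual algorithm; the base delay, the cap and the function name are all assumptions chosen for illustration.

```python
import random


def backoff_delay(failures, base=60.0, cap=4 * 3600.0, rng=random):
    """Seconds to wait after `failures` consecutive failed attempts."""
    # The window doubles with each failure, up to a fixed cap.
    ceiling = min(cap, base * 2 ** failures)
    # Jitter: pick a point in the upper half of the window so that
    # many clients do not all retry at the same instant.
    return rng.uniform(ceiling / 2, ceiling)


for n in range(1, 8):
    print(f"after {n} failures: wait about {backoff_delay(n) / 60:.0f} min")
```

The randomness matters during a recovery like this one: without it, every client that failed at the same moment would come back at the same moment too.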
ID: 67316
zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 67318 - Posted: 4 Jan 2023, 16:35:34 UTC

I am still getting transient HTTP error. Nothing has uploaded yet.
ID: 67318
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67319 - Posted: 4 Jan 2023, 17:07:51 UTC - in response to Message 67318.  

I am still getting transient HTTP error. Nothing has uploaded yet.
I have six files uploading currently, 4 on the host machine and 2 on a VM, the maximum the respective installations of BOINC are configured for. I suspect the transient errors at the moment are just down to the server getting hammered as several TB worth of files try to upload. I don't know the maximum number of connections the server can cope with at once.
ID: 67319
wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 42,006,146
RAC: 68,974
Message 67320 - Posted: 4 Jan 2023, 17:16:52 UTC - in response to Message 67319.  

Seems like each upload opens up a new connection, which isn't ideal... Then usually the folks with long RTT get screwed more until some bandwidth frees up on the server. I just hope the server stays up so that once our European friends have mostly caught up, we on other continents can have our uploads too...
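
To put rough numbers on why a fresh connection hurts more at long RTT, here is an idealized sketch of TCP slow start: one round trip for the handshake, then a congestion window that doubles every round trip, with no loss, no TLS and no bandwidth cap. The 10 MB file size, the initial window of 10 segments and the 1460-byte segment size are illustrative assumptions, not measurements. The 30 ms and 80 ms round trips are roughly what the two traceroutes earlier in the thread show at the last responding hop.

```python
def slow_start_seconds(size_bytes, rtt_s, mss=1460, init_cwnd=10):
    """Idealized time to deliver size_bytes over a brand-new TCP connection."""
    rounds = 1        # one round trip for the handshake
    cwnd = init_cwnd  # congestion window, in segments
    sent = 0
    while sent < size_bytes:
        sent += cwnd * mss  # one window's worth per round trip
        cwnd *= 2           # slow start doubles the window each round
        rounds += 1
    return rounds * rtt_s


size = 10_000_000  # hypothetical 10 MB upload file
for label, rtt in [("~30 ms RTT", 0.030), ("~80 ms RTT", 0.080), ("150 ms RTT", 0.150)]:
    print(f"{label}: about {slow_start_seconds(size, rtt):.2f} s before the pipe is full")
```

In this model the number of round trips is the same for everyone, so the ramp-up cost scales directly with RTT, and it is paid again for every new connection.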
ID: 67320
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 67321 - Posted: 4 Jan 2023, 17:27:16 UTC

I've got at least 4 connections worth streaming up at the moment. I've noticed that it takes a while for them to get going, but once they're going, things seem to continue flowing more or less smoothly. Or, at least, as smoothly as they can flow through space (all my stuff is going through Starlink right now). I've got enough sun to light up a few more boxes and they'll be uploading on a terrestrial link, hopefully...
ID: 67321
DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 67322 - Posted: 4 Jan 2023, 17:28:33 UTC

The admins should plan for enough infrastructure to handle:
* "Computers with recent credit" as per the server status page. Right now, that number is 968 computers.
* With project backoff near 1 hour, that means: 16 uploads per minute average
* A total file size of 224 MB per model means the server needs to handle about 3.5 GB per minute during peak time.
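
The arithmetic behind those three bullets, as a quick sketch. It assumes every active computer uploads one complete model per backoff period, which is a simplification, and it treats 1 GB as 1024 MB.

```python
computers = 968          # "Computers with recent credit" from the list above
backoff_minutes = 60     # project backoff of about one hour
model_mb = 224           # total upload size per model, in MB

uploads_per_minute = computers / backoff_minutes
mb_per_minute = uploads_per_minute * model_mb
gb_per_minute = mb_per_minute / 1024
mbit_per_second = mb_per_minute * 8 / 60

print(f"{uploads_per_minute:.1f} uploads per minute")
print(f"{gb_per_minute:.1f} GB per minute")
print(f"{mbit_per_second:.0f} Mbit/s sustained")
```

So the 3.5 GB per minute figure works out to a sustained rate of a little under 500 Mbit/s.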
ID: 67322
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 67323 - Posted: 4 Jan 2023, 17:39:58 UTC - in response to Message 67322.  

The admins should plan for enough infrastructure to handle:
* "Computers with recent credit" as per the server status page. Right now, that number is 968 computers.
* With project backoff near 1 hour, that means: 16 uploads per minute average
* A total file size of 224 MB per model means the server needs to handle about 3.5 GB per minute during peak time.
They did plan accordingly; they've been doing this for a long time. The upload server has about 25 TB of storage; completed tasks are processed and then moved on to another big 'transfer' server. It can easily handle the load. Yesterday, when it was up, they were processing 50,000 requests per hour. CPDN was let down by the infrastructure on JASMIN, or perhaps their support advice. I'll find out more at their next technical meeting.
ID: 67323
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 67332 - Posted: 4 Jan 2023, 20:12:22 UTC - in response to Message 67323.  

In readiness for that technical meeting, it might be worth having a peek at https://github.com/BOINC/boinc/issues/139. That's a conversation initiated by CPDN volunteers some 16 years ago, as the original source https://boinc.berkeley.edu/trac/ticket/139 makes clear.

Uploads, and servers filling up or failing, were a problem then, too. It became apparent then, and again this last fortnight, that volunteers have very little in the way of controls which could ease that recovery. Basically, uploads are either "off" or "on", for all projects together in that BOINC instance. About the only tunable parameter is

<max_file_xfers_per_project>N</max_file_xfers_per_project>
Maximum number of simultaneous file transfers per project (default 2).

And it's not enough.
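
For completeness, a minimal sketch of setting that one parameter in an existing `cc_config.xml` without disturbing whatever else is in the file. The helper name `set_option` and the value 4 are illustrative only. The option goes inside the `<options>` section, and the client needs to re-read its config files afterwards.

```python
import xml.etree.ElementTree as ET


def set_option(config_xml, name, value):
    """Return config_xml with <options><name>value</name></options> set."""
    root = ET.fromstring(config_xml)
    options = root.find("options")
    if options is None:
        # No <options> section yet: create one.
        options = ET.SubElement(root, "options")
    element = options.find(name)
    if element is None:
        element = ET.SubElement(options, name)
    element.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Example input: a config that so far only holds a log flag.
existing = "<cc_config><log_flags><http_debug>1</http_debug></log_flags></cc_config>"
print(set_option(existing, "max_file_xfers_per_project", 4))
```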
ID: 67332