Thread 'The uploads are stuck'

Author	Message
Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 67260 - Posted: 3 Jan 2023, 19:13:35 UTC - in response to Message 67258. Bunch of files uploaded then it started getting http error at about 15:30. 15:50, started working again then more errors from 17:10 onwards. May just be the server being overloaded though. For me, I got two batches. The second batch had a few skips, but usually a minute or less. 10:13:32 to 10:22:37 EST so add 5 hours for UTC 11:40:17 to 12:36:02 EST so add 5 hours for UTC Since then, many tries but no successes. ID: 67260 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 67261 - Posted: 3 Jan 2023, 20:18:48 UTC Last modified: 3 Jan 2023, 20:52:38 UTC Bunch of files uploaded then it started getting http error at about 15:30. 15:50, started working again then more errors from 17:10 onwards. May just be the server being overloaded though. This posted on projects message board with folowing reply from Andy: I am afraid you are right. It's lost both it's SSH and HTTP ports again. I have sent another email to Matt in JASMIN Support to ask him to investigate this again tomorrow morning. ID: 67261 · Reply Quote

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 67264 - Posted: 3 Jan 2023, 22:05:58 UTC Open file count hitting limits and it can't create new sockets? If he could get serial console access to it, that should be independent of SSH connections, to investigate and see what's going on. Or remote syslog to something he's got access to. Troubleshooting when it "just locks up" is brutal. :( I've also seen that sort of behavior from running a machine very badly out of RAM. I've no idea how much processing is happening on the upload box vs later on (zips being extracted, etc), but with the amount of uploads flowing in, any sort of "post-processing" tasks could easily thrash the machine into oblivion - the fun of recovering from non-steady-state operation is that you now have to be able to handle all the traffic queued up. A bunch of my machines refuse to download new tasks as there are too many uploads in progress - not sure if there's a way around it. ID: 67264 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 67272 - Posted: 4 Jan 2023, 2:47:20 UTC - in response to Message 67245. traceroute eventually makes its way to the proper destination. It doesn't for me. Actually, I think it does because it seems to fail the same way it is doing for others. I just wonder why traceroute fails like this Or is DNS failing somewhere along the line? localhost:jeandavid8[~]$ date Tue Jan 3 21:17:56 EST 2023 localhost:jeandavid8[~]$ traceroute upload11.cpdn.org traceroute to upload11.cpdn.org (192.171.169.187), 30 hops max, 60 byte packets 1 Fios_Quantum_Gateway.fios-router.home (192.168.0.1) 0.317 ms 0.454 ms 0.590 ms 2 lo0-100.NWRKNJ-VFTTP-309.verizon-gni.net (71.127.205.1) 5.251 ms 8.144 ms 8.196 ms 3 at-0-0-0-1716.ALT2-CORE-RTR1.verizon-gni.net (100.41.5.68) 8.258 ms at-0-0-0-1717.ALT2-CORE-RTR2.verizon-gni.net (100.41.5.70) 10.218 ms 8.314 ms 4 0.csi1.NBWKNJNB-MSE01-BB-SU1.ALTER.NET (140.222.4.106) 10.335 ms 0.csi1.NWRKNJ02-MSE01-BB-SU1.ALTER.NET (140.222.4.104) 13.201 ms 0.csi1.NBWKNJNB-MSE01-BB-SU1.ALTER.NET (140.222.4.106) 12.735 ms 5 * * * 6 * * * 7 * * * 8 nyk-bb2-link.ip.twelve99.net (62.115.135.162) 8.699 ms * * 9 ldn-bb1-link.ip.twelve99.net (62.115.113.21) 77.718 ms 77.820 ms 77.458 ms 10 ldn-b2-link.ip.twelve99.net (62.115.122.189) 79.894 ms ldn-b2-link.ip.twelve99.net (62.115.120.239) 77.523 ms 74.854 ms 11 jisc-ic345131-ldn-b2.ip.twelve99-cust.net (62.115.175.131) 77.402 ms 77.977 ms 75.352 ms 12 ae24.londhx-sbr1.ja.net (146.97.35.197) 77.833 ms 77.979 ms 77.380 ms 13 ae29.londpg-sbr2.ja.net (146.97.33.2) 97.289 ms 95.016 ms 89.015 ms 14 ae31.erdiss-sbr2.ja.net (146.97.33.22) 80.207 ms 80.350 ms 82.997 ms 15 * * * 16 ral-r26.ja.net (146.97.41.34) 79.900 ms 81.510 ms 84.112 ms 17 * * * 18 * * * 19 * * * 20 * * * 21 * * * 22 * * * 23 * * * 24 * * * 25 * * * 26 * * * 27 * * * 28 * * * 29 * * * 30 * * * localhost:jeandavid8[~]$ ID: 67272 · Reply Quote

Saenger Send message Joined: 1 Nov 04 Posts: 185 Credit: 4,166,063 RAC: 857	Message 67274 - Posted: 4 Jan 2023, 5:36:47 UTC My traceroute: saenger@Sanger-sein-Mint:~$ traceroute upload11.cpdn.org traceroute to upload11.cpdn.org (192.171.169.187), 30 hops max, 60 byte packets 1 fritz.box (192.168.178.1) 0.546 ms 0.803 ms 1.021 ms 2 85.16.121.109 (85.16.121.109) 15.733 ms 15.779 ms 15.824 ms 3 mprt-hb-70740110-xe-2-2-0.ewe-ip-backbone.de (85.16.250.59) 15.886 ms 15.922 ms 15.946 ms 4 mprt-hb-70740101-xe-1-2-1.ewe-ip-backbone.de (85.16.250.61) 15.960 ms 15.986 ms 16.010 ms 5 bbrt-hb-1-70730203-ae15.ewe-ip-backbone.de (85.16.249.103) 25.920 ms 25.982 ms 26.031 ms 6 bbrt-hb-2-70730201-ae24.ewe-ip-backbone.de (80.228.90.30) 19.972 ms 5.301 ms 14.076 ms 7 et-0-0-53.edge1.Hamburg1.Level3.net (62.67.25.41) 14.100 ms 14.124 ms 14.147 ms 8 ae2.3201.ear1.London1.level3.net (4.69.141.66) 24.178 ms 25.188 ms 25.958 ms 9 JANET.ear1.London1.Level3.net (212.187.216.254) 26.599 ms 27.495 ms 28.196 ms 10 ae24.londhx-sbr1.ja.net (146.97.35.197) 29.310 ms 30.612 ms 20.051 ms 11 ae29.londpg-sbr2.ja.net (146.97.33.2) 28.368 ms 28.394 ms 28.418 ms 12 ae31.erdiss-sbr2.ja.net (146.97.33.22) 24.889 ms 26.494 ms 26.519 ms 13 * * * 14 ral-r26.ja.net (146.97.41.34) 30.549 ms 31.887 ms 32.385 ms 15 * * * 16 * * * 17 * * * 18 * * * 19 * * * 20 * * * 21 * * * 22 * * * 23 * * * 24 * * * 25 * * * 26 * * * 27 * * * 28 * * * 29 * * * 30 * * * saenger@Sanger-sein-Mint:~$ And some messages from BOINC: Mi 04 Jan 2023 06:30:41 CET \| climateprediction.net \| [file_xfer] http op done; retval -184 (transient HTTP error) Mi 04 Jan 2023 06:30:41 CET \| climateprediction.net \| [file_xfer] file transfer status -184 (transient HTTP error) Mi 04 Jan 2023 06:30:41 CET \| climateprediction.net \| Temporarily failed upload of oifs_43r3_ps_0219_1999050100_123_968_12184863_0_r1453880002_3.zip: transient HTTP error Mi 04 Jan 2023 06:30:41 CET \| climateprediction.net \| Backing off 00:02:59 on upload of oifs_43r3_ps_0219_1999050100_123_968_12184863_0_r1453880002_3.zip Mi 04 Jan 2023 06:30:41 CET \| \| [http_xfer] [ID#0] HTTP: wrote 2147 bytes Mi 04 Jan 2023 06:30:41 CET \| \| [http_xfer] [ID#0] HTTP: wrote 3223 bytes Mi 04 Jan 2023 06:30:41 CET \| \| [http_xfer] [ID#0] HTTP: wrote 3456 bytes Mi 04 Jan 2023 06:30:41 CET \| \| [http_xfer] [ID#0] HTTP: wrote 3401 bytes Mi 04 Jan 2023 06:30:41 CET \| \| [http_xfer] [ID#0] HTTP: wrote 589 bytes Mi 04 Jan 2023 06:30:41 CET \| \| [http_xfer] [ID#0] HTTP: wrote 1031 bytes Mi 04 Jan 2023 06:30:42 CET \| \| Internet access OK - project servers may be temporarily down. Some make it, but I fail to see any pattern, why they manage to get away. Grüße vom Sänger ID: 67274 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,717,389 RAC: 8,111	Message 67277 - Posted: 4 Jan 2023, 8:25:07 UTC - in response to Message 67274. And some messages from BOINC: I would recommend that you set http_debug, rather than http_xfer_debug. The output gives more detail about the reasons for errors. ID: 67277 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 67291 - Posted: 4 Jan 2023, 11:05:25 UTC From Andy Update received from JASMIN Support this morning: We believe that there is still a problem with the block storage device. We are migrating VMs off that device while we investigate the problem. I'll let you know when we have migrated the VMs. ID: 67291 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 67307 - Posted: 4 Jan 2023, 14:13:24 UTC - in response to Message 67291. Last modified: 4 Jan 2023, 14:20:16 UTC I'll let you know when we have migrated the VMs. This involves transfer of several TB of data so may be a while. This is also why we have gone back to "connect failed" from transient http error. ID: 67307 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 67310 - Posted: 4 Jan 2023, 15:22:13 UTC Last modified: 4 Jan 2023, 15:24:05 UTC Joy! 2 uploads are now going. Doesn't really make any difference that more aren't as my upload bandwidth is saturated. Edit: my configured maximum of four are now going. ID: 67310 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 67313 - Posted: 4 Jan 2023, 15:51:11 UTC And now the messages flying about are how to keep it running! ID: 67313 · Reply Quote

Saenger Send message Joined: 1 Nov 04 Posts: 185 Credit: 4,166,063 RAC: 857	Message 67314 - Posted: 4 Jan 2023, 16:13:09 UTC - in response to Message 67277. And some messages from BOINC: I would recommend that you set http_debug, rather than http_xfer_debug. The output gives more detail about the reasons for errors. I did so, but I can't post any errors any more, because all is running fine once I hit the "Try again" once more after returning from work. Whatever you did: Thanks! Grüße vom Sänger ID: 67314 · Reply Quote

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 67315 - Posted: 4 Jan 2023, 16:13:37 UTC Its looking hopeful....... one just uploaded. 400+ to go. ID: 67315 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 67316 - Posted: 4 Jan 2023, 16:21:13 UTC - in response to Message 67315. Last modified: 4 Jan 2023, 17:08:34 UTC They sure have enough bandwidth. I am watching the transfers with my system monitor and I estimate I am uploading at a rate of 5 Megabytes/second and have hit 11 Megabytes per second from time-to-time. I hope it keeps up. Edit 1: I spoke too soon. Transfers stopped, retry in about 10 minutes. It tried again and failed to transfer any more. It is now resting for about another half hour. Edit 2: It tried again, but failed. Backing off another hour. ID: 67316 · Reply Quote

zombie67 [MM] Send message Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128	Message 67318 - Posted: 4 Jan 2023, 16:35:34 UTC I am still getting transient HTTP error. Noting has uploaded yet. ID: 67318 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 67319 - Posted: 4 Jan 2023, 17:07:51 UTC - in response to Message 67318. I am still getting transient HTTP error. Noting has uploaded yet. I have six files uploading currently, 4 on host machine and 2 on vm, the maximum the respective installations of BOINC are configured for. I susupect the transient errors at the moment are just down to the server getting hammered as several TB worth of files try and upload. I don't know what the maximum number of connections the server can cope with at once is? ID: 67319 · Reply Quote

wujj123456 Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,996,185 RAC: 68,842	Message 67320 - Posted: 4 Jan 2023, 17:16:52 UTC - in response to Message 67319. Seems like each upload opens up a new connection, which isn't ideal... Then usually the folks with long RTT gets screwed more until some bandwidth frees up on the server. I just hope the server stays up so once European friends catches up mostly, we on other continents can have our uploads too... ID: 67320 · Reply Quote

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 67321 - Posted: 4 Jan 2023, 17:27:16 UTC I've got at least 4 connections worth streaming up at the moment. I've noticed that it takes a while for them to get going, but once they're going, things seem to continue flowing more or less smoothly. Or, at least, as smoothly as they can flow through space (all my stuff is going through Starlink right now). I've got enough sun to light up a few more boxes and they'll be uploading on a terrestrial link, hopefully... ID: 67321 · Reply Quote

DJStarfox Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370	Message 67322 - Posted: 4 Jan 2023, 17:28:33 UTC The admins should plan for enough infrastructure to handle: * "Computers with recent credit" as per the server status page. Right now, that number is 968 computers. * With project backoff near 1 hour, that means: 16 uploads per minute average * Total file size 224MB each model, means server needs to handle 3.5G per minute during peak time. ID: 67322 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681	Message 67323 - Posted: 4 Jan 2023, 17:39:58 UTC - in response to Message 67322. The admins should plan for enough infrastructure to handle: * "Computers with recent credit" as per the server status page. Right now, that number is 968 computers. * With project backoff near 1 hour, that means: 16 uploads per minute average * Total file size 224MB each model, means server needs to handle 3.5G per minute during peak time. They did plan accordingly, they've been doing this for a long time. The upload server has about 25Tb of storage, completed tasks are processed and then move on to another big 'transfer' server. It can easily handle the load. Yesterday when it was up they were processing 50,000 requests per hour. CPDN was let down by the infrastructure on JASMIN, or perhaps their support advice. I'll find out more at their next technical meeting. ID: 67323 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,717,389 RAC: 8,111	Message 67332 - Posted: 4 Jan 2023, 20:12:22 UTC - in response to Message 67323. In readiness for that technical meeting, it might be worth having a peek at https://github.com/BOINC/boinc/issues/139. That's a conversation initiated by CPDN volunteers some 16 years ago, as the original source https://boinc.berkeley.edu/trac/ticket/139 makes clear. Uploads, and servers filling up or failing, were a problem then. too. It became apparent then, and again this last fortnight, that volunteers have very little in the way of controls which could ease that recovery. Basically, uploads are either "off" or "on", for all projects together in that boinc instance. About the only tunable parameter is <max_file_xfers_per_project>N</max_file_xfers_per_project> Maximum number of simultaneous file transfers per project (default 2). And it's not enough. ID: 67332 · Reply Quote