Message boards :
Number crunching :
NZ25 file upload server problems?
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4529 Credit: 18,663,251 RAC: 14,512 |
Three more failed at 100%. All file sizes matched the size in the event log message; two were from testing, one from the main site. |
Send message Joined: 15 May 09 Posts: 4529 Credit: 18,663,251 RAC: 14,512 |
All but one of the sixteen zips I had queued have finished uploading. The two testing tasks have reported. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There were power outages in the Hobart region, which may have impacted the data center. Flood warnings as heavy rains fall across Tasmania |
Send message Joined: 1 Jan 07 Posts: 1058 Credit: 36,584,771 RAC: 15,932 |
Here goes. I caught an intermediate upload from another model type, via 'network suspended'. I can see (in the project directory):

hadsm4_a0d7_201310_6_935_012147237_0_r883230677_5.zip - size 107532550 bytes
trickle_up_hadsm4_a0d7_201310_6_935_012147237_0_1660484361.xml - size 189 bytes

The first of those appears in BOINC's transfers tab, the second doesn't. So, set some flags and let 'em run...

Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | Started upload of hadsm4_a0d7_201310_6_935_012147237_0_r883230677_5.zip
Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | [http] [ID#16951] Info: Trying 192.171.139.103:80...
Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | [http] [ID#16951] Info: Connected to upload11.cpdn.org (192.171.139.103) port 80 (#14569)
Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | [http] [ID#16951] Sent header to server: Host: upload11.cpdn.org
Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | [http] [ID#16951] Received header from server: HTTP/1.1 200 OK
Sun 14 Aug 2022 14:52:06 BST | climateprediction.net | [http] [ID#16951] Sent header to server: Content-Length: 107533027
Sun 14 Aug 2022 14:52:06 BST | climateprediction.net | [http] [ID#16951] Received header from server: HTTP/1.1 100 Continue
Sun 14 Aug 2022 14:52:11 BST | climateprediction.net | [sched_op] Starting scheduler request
Sun 14 Aug 2022 14:52:12 BST | climateprediction.net | Sending scheduler request: To send trickle-up message.
Sun 14 Aug 2022 14:52:12 BST | climateprediction.net | [http] HTTP_OP::init_post(): https://www.cpdn.org/cpdnboinc_cgi/cgi
Sun 14 Aug 2022 14:52:12 BST | climateprediction.net | [http] [ID#1] Info: Trying 129.67.193.7:443...

So, two different operations (upload and report), to two different servers/IP addresses. And the upload does start by telling the server a "Content-Length", as I thought. That's a very selective extract from the event log, but I hope it gives you something to look for. |
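As a rough illustration of what to look for, here is a minimal Python sketch that pulls the Content-Length header out of event-log lines and compares it with the on-disk size of the zip. The function name `content_lengths` and the constants are made up for this example; the log lines and sizes are copied from the post above. The difference is presumably the request header sent ahead of the file data, but that is an assumption, not something the log states.

```python
import re

# Lines copied from the event log extract in the post above.
LOG = """\
Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | [http] [ID#16951] Info: Connected to upload11.cpdn.org (192.171.139.103) port 80 (#14569)
Sun 14 Aug 2022 14:52:06 BST | climateprediction.net | [http] [ID#16951] Sent header to server: Content-Length: 107533027
Sun 14 Aug 2022 14:52:12 BST | climateprediction.net | [http] [ID#1] Info: Trying 129.67.193.7:443...
"""

# Size of the .zip in the project directory, as quoted in the post.
FILE_SIZE = 107532550


def content_lengths(log_text):
    """Map each HTTP operation ID to the Content-Length it sent (hypothetical helper)."""
    pattern = re.compile(
        r"\[ID#(\d+)\] Sent header to server: Content-Length: (\d+)"
    )
    return {int(m.group(1)): int(m.group(2)) for m in pattern.finditer(log_text)}


lengths = content_lengths(LOG)
# Bytes announced to the server beyond the file itself.
overhead = lengths[16951] - FILE_SIZE
print(lengths, overhead)
```

The same pattern can be run over a full event log saved to a text file to list every upload request alongside its announced size.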
Send message Joined: 15 May 09 Posts: 4529 Credit: 18,663,251 RAC: 14,512 |
Anyone else still getting "Internet access OK, project servers may be temporarily down"? I am getting this on a 2.zip from the NZ batch. I will see what happens with the 6.zip; the others generated all went. |
Send message Joined: 12 Apr 21 Posts: 314 Credit: 14,559,045 RAC: 18,367 |
Yes, have 2 trickles from this last batch that are having issues uploading with those messages. |
Send message Joined: 22 Feb 06 Posts: 490 Credit: 30,767,772 RAC: 10,797 |
Yes as of 19:11 UTC today. |
Send message Joined: 22 Feb 06 Posts: 490 Credit: 30,767,772 RAC: 10,797 |
Still getting HTTP error at 14:50 this afternoon. |
Send message Joined: 15 May 09 Posts: 4529 Credit: 18,663,251 RAC: 14,512 |
Yep, one more queued, making it four for me now. Also, I have not seen this in the log before, at the end: 17/08/2022 15:43:30 | climateprediction.net | [http] [ID#16] Info: We are completely uploaded and fine |
Send message Joined: 1 Jan 07 Posts: 1058 Credit: 36,584,771 RAC: 15,932 |
I've noticed the phrase "We are completely uploaded and fine" before, but strangely, it can't be found anywhere in the BOINC codebase. The only place it's found is in event logs quoted in https://github.com/BOINC/boinc/issues/4572, an issue about 'Uploads Stopping for Projects with Large Files' from November last year. He could have been talking about us, but it was another project. I'm wondering if 'We are completely uploaded and fine' is a message being passed on from curl, BOINC's communications toolbox, which would make it much harder to track down. I've always assumed that the real meaning was that BOINC had passed everything into a buffer being handled by somebody else (curl?), but didn't necessarily imply that the whole file had been acknowledged by the end user on the other side of the world. In which case, it's a badly-written message. |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
I've noticed the phrase "We are completely uploaded and fine" before, but strangely, it can't be found anywhere in the BOINC codebase. The only place it's found is in event logs quoted in https://github.com/BOINC/boinc/issues/4572, an issue about 'Uploads Stopping for Projects with Large Files' from November last year. He could have been talking about us, but it was another project.

That message is generated by curl on the successful completion of a request, but it's generated for every individual HTTP message. The file transfer sequence for CPDN has the following sequence:

1. An initial negotiation to determine how much of the file the server has already received (file_xfer_debug outputs the line "[fxd] starting upload, upload_offset -1").
2. There's an "Info: We are completely uploaded and fine" http_debug message when that request has been sent.
3. The server normally responds with the number of bytes it has already received (normally a line "[fxd] starting upload, upload_offset 0"). In the messages below, the first attempt resulted in a gateway timeout.
4. The upload then starts with a request indicating the number of bytes to be sent (the line "18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 90454919" in the messages below).
5. This also generates an "Info: We are completely uploaded and fine" http_debug message, even when the update has failed (as was the case in the messages below).

18-Aug-2022 20:14:27 [climateprediction.net] [fxd] starting upload, upload_offset -1
18-Aug-2022 20:14:27 [climateprediction.net] Started upload of wah2_nz25_a1t1_200105_25_936_012152103_0_r1523319327_20.zip (86.28 MB)
18-Aug-2022 20:14:27 [climateprediction.net] [file_xfer] URL: http://upload4.cpdn.org/cgi-bin/file_upload_handler
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Info: Connected to upload4.cpdn.org (131.217.169.79) port 80 (#2295)
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Sent header to server: Content-Length: 312
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Info: We are completely uploaded and fine
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: HTTP/1.1 504 Gateway Timeout
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <html><head>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <title>504 Gateway Timeout</title>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: </head><body>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <h1>Gateway Timeout</h1>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <p>The gateway did not receive a timely response
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: from the upstream server or application.</p>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <hr>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <address>Apache/2.4.7 (Ubuntu) Server at upload4.cpdn.org Port 80</address>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: </body></html>
18-Aug-2022 20:24:28 [---] [http_xfer] [ID#26765] HTTP: wrote 328 bytes
18-Aug-2022 20:24:29 [climateprediction.net] [file_xfer] http op done; retval -184 (transient HTTP error)
18-Aug-2022 20:24:29 [climateprediction.net] [file_xfer] file transfer status -184 (transient HTTP error)
18-Aug-2022 20:24:29 [climateprediction.net] Temporarily failed upload of wah2_nz25_a1t1_200105_25_936_012152103_0_r1523319327_20.zip: transient HTTP error
18-Aug-2022 20:24:29 [climateprediction.net] Backing off 00:07:43 on upload of wah2_nz25_a1t1_200105_25_936_012152103_0_r1523319327_20.zip
18-Aug-2022 20:24:29 [climateprediction.net] [fxd] starting upload, upload_offset -1
18-Aug-2022 20:24:29 [climateprediction.net] Started upload of wah2_nz25_a07q_198705_25_936_012150040_1_r1004064792_24.zip (86.26 MB)
18-Aug-2022 20:24:29 [climateprediction.net] [file_xfer] URL: http://upload4.cpdn.org/cgi-bin/file_upload_handler
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Info: Connected to upload4.cpdn.org (131.217.169.79) port 80 (#2295)
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 312
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Info: We are completely uploaded and fine
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info: Connection died, retrying a fresh connect
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info: the ioctl callback returned 0
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info: Closing connection 2295
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info: Issue another request to this URL: 'http://upload4.cpdn.org/cgi-bin/file_upload_handler'
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info: Trying 131.217.169.79...
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info: Connected to upload4.cpdn.org (131.217.169.79) port 80 (#2302)
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 312
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info: We are completely uploaded and fine
18-Aug-2022 20:26:27 [climateprediction.net] [http] [ID#26786] Received header from server: HTTP/1.1 200 OK
18-Aug-2022 20:26:27 [climateprediction.net] [file_xfer] http op done; retval 0 (Success)
18-Aug-2022 20:26:27 [climateprediction.net] [file_xfer] parsing upload response: <data_server_reply> <status>0</status> <file_size>0</file_size> </data_server_reply>
18-Aug-2022 20:26:27 [climateprediction.net] [file_xfer] parsing status: 0
18-Aug-2022 20:26:27 [climateprediction.net] [fxd] starting upload, upload_offset 0
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 90454919
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: Expect: 100-continue
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Received header from server: HTTP/1.1 100 Continue
18-Aug-2022 20:35:58 [climateprediction.net] [http] [ID#26786] Info: We are completely uploaded and fine
18-Aug-2022 20:38:26 [climateprediction.net] [http] [ID#26786] Received header from server: HTTP/1.1 200 OK
18-Aug-2022 20:38:26 [climateprediction.net] [http] [ID#26786] Received header from server: Content-Length: 123
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] http op done; retval 0 (Success)
18-Aug-2022 20:38:27 [climateprediction.net] Error reported by file upload server: EOF on socket read : asked for 262144, got 133564
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] parsing upload response: <data_server_reply> <status>1</status> <message>EOF on socket read : asked for 262144, got 133564 </message> </data_server_reply>
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] parsing status: -127
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] file transfer status -127 (transient upload error)
18-Aug-2022 20:38:27 [climateprediction.net] Temporarily failed upload of wah2_nz25_a07q_198705_25_936_012150040_1_r1004064792_24.zip: transient upload error
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] project-wide xfer delay for 645.092929 sec
18-Aug-2022 20:38:27 [climateprediction.net] Backing off 00:02:09 on upload of wah2_nz25_a07q_198705_25_936_012150040_1_r1004064792_24.zip

"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
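The two server replies in the log can be read mechanically. The following is a minimal Python sketch, not BOINC's own code: `parse_reply` and `bytes_remaining` are hypothetical helper names, and the `total_bytes` argument is an illustrative value supplied by the caller. Only the two `<data_server_reply>` strings are taken from the log above. It shows how the offset reply (status 0 plus `file_size`) differs from the error reply (status 1 plus `message`).

```python
import xml.etree.ElementTree as ET

# Replies copied from the "parsing upload response" lines in the log above.
OFFSET_REPLY = (
    "<data_server_reply> <status>0</status> "
    "<file_size>0</file_size> </data_server_reply>"
)
ERROR_REPLY = (
    "<data_server_reply> <status>1</status> <message>EOF on socket read : "
    "asked for 262144, got 133564 </message> </data_server_reply>"
)


def parse_reply(xml_text):
    """Extract status, file_size and message from a reply (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    size_text = root.findtext("file_size")
    message = root.findtext("message")
    return {
        "status": int(root.findtext("status")),
        "file_size": int(size_text) if size_text is not None else None,
        "message": message.strip() if message else None,
    }


def bytes_remaining(reply, total_bytes):
    """Bytes still to send after a successful offset reply; None on an error reply."""
    if reply["status"] != 0 or reply["file_size"] is None:
        return None
    return total_bytes - reply["file_size"]


offset = parse_reply(OFFSET_REPLY)
error = parse_reply(ERROR_REPLY)
print(offset, bytes_remaining(offset, 1000))  # 1000 is an illustrative size only
print(error, bytes_remaining(error, 1000))
```

In the log, the client reports the server's status 1 as -127 (transient upload error) and then backs off, which is the branch the sketch represents by returning None.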
Send message Joined: 15 May 09 Posts: 4529 Credit: 18,663,251 RAC: 14,512 |
Thanks for that Thyme. Not sure why I never noticed it before though. |
Send message Joined: 1 Jan 07 Posts: 1058 Credit: 36,584,771 RAC: 15,932 |
Two more points from Thyme's post.

1) The log entries for HTTP/1.1 504 Gateway Timeout, followed by "The gateway did not receive a timely response from the upstream server or application."

My take on that is that it's an internal problem within the receiving institution or data centre. They have one ordinary server running the BOINC programs, but with limited local disk storage directly connected to that computer. And they have a second specialist high capacity disk storage array. The 'gateway' is the network connection between the two devices. Sometimes it fails, probably because of a problem with the disk storage array. There's absolutely nothing we can do about that - it's way outside our control. I would expect the local administration team to be fully aware of the situation, but it might just be worth Oxford mentioning this message to them, just to help ensure they concentrate on the right device.

2) Error reported by file upload server: EOF on socket read : asked for 262144, got 133564

I've been puzzling over that one. Most of the concern has been over the intermediate .zip data files, which are typically tens of megabytes or even a hundred megabytes in size. So why is the upload server quibbling over a mere hundred kilobytes in a file containing just a quarter of a megabyte? Thyme has drawn attention to the restart procedure for a file transfer that glitches on the long overseas leg of the journey. The server says how much it's received already, and the BOINC client restarts the transfer from that point, and tells the server how much is left. I'm wondering if there's a BOINC bug in calculating the size of that remaining fragment, when comparing two very large numbers? |
Send message Joined: 15 May 09 Posts: 4529 Credit: 18,663,251 RAC: 14,512 |
Thanks Richard, I have linked our post below on the Trello board for Andy to nudge the data centre or whoever in NZ deals with them. |
Send message Joined: 15 May 09 Posts: 4529 Credit: 18,663,251 RAC: 14,512 |
Somehow managed to sneak another two zips onto the server. Five queued at the moment. |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
2) Error reported by file upload server: EOF on socket read : asked for 262144, got 133564

Nothing to puzzle over, Richard. The transfer process transparently splits large uploads into smaller packets. The transfer uses 4 different communication layers, with the relevant ones being 2 to 4 (layer 1 is the physical wire):

2. The maximum packet size for the Ethernet layer is 1518 bytes but, in practice, the packet size is typically between 1476 and 1500 bytes.
3. The maximum packet size for the IPv4 TCP layer is 64 KB.
4. The maximum packet size for the BOINC data transfer layer is 256 KB, which is where the 262144 comes from (256 x 1024 bytes).

The socket read error is saying that the upload server received just over half of an expected 256 KB file segment before the upload failed (130 KB plus 444 bytes were in the received packets).

"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
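The arithmetic behind those figures can be checked in a few lines of Python. This is only a sketch: the variable names are invented for the example, and treating the 90454919-byte Content-Length from the earlier log as if it were read in 262144-byte segments is an assumption based on the error message, not a documented fact about the server.

```python
# Figures taken from the error message and the log quoted earlier in the thread.
SEGMENT = 256 * 1024          # 262144 bytes, the size the server asked for
asked, got = 262144, 133564   # "asked for 262144, got 133564"
content_length = 90454919     # Content-Length sent for the failed upload

# The amount received, expressed as whole KB (1024 bytes) plus a remainder.
kib, extra = divmod(got, 1024)

# Share of the segment that arrived before the connection dropped.
fraction = got / asked

# Assumption: the request body is consumed in SEGMENT-sized reads.
full_reads, tail = divmod(content_length, SEGMENT)

print(f"segment size matches: {asked == SEGMENT}")
print(f"received {kib} KB plus {extra} bytes ({fraction:.1%} of the segment)")
print(f"{full_reads} full segments plus a final {tail} bytes")
```

So the error concerns one segment out of several hundred, not a whole file of a quarter of a megabyte.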
Send message Joined: 15 May 09 Posts: 4529 Credit: 18,663,251 RAC: 14,512 |
So, the server hiccoughs and the upload falls over. I just got one more to go through, so maybe something is changing or maybe it is just luck? |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Somehow managed to sneak another two zips onto the server. five queued at the moment. I had 18 files queued up, with all but one uploaded in a 2-hour window starting at 1424 UTC (average upload time slightly under 4 minutes). The last file was backed off and successfully uploaded on a manual retry 15 minutes after the 17th upload was completed, but it took significantly longer than the others (10 minutes). "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
Send message Joined: 15 May 09 Posts: 4529 Credit: 18,663,251 RAC: 14,512 |
I had 18 files queued up, with all but one uploaded in a 2-hour window starting at 1424 UTC (average upload time slightly under 4 minutes). The last file was backed off and successfully uploaded on a manual retry 15 minutes after the 17th upload was completed, but it took significantly longer than the others (10 minutes). I suspect that having a slow connection my end gives more opportunities for things to go wrong. Edit: I also wonder if limiting uploads to one at a time might help? |
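For anyone wanting to try limiting simultaneous uploads, the BOINC client reads such limits from cc_config.xml. The Python sketch below only builds and prints a candidate fragment; it does not write any file. The option names `max_file_xfers` and `max_file_xfers_per_project` and the log flags `http_debug` and `file_xfer_debug` are given as I understand them from the BOINC client configuration documentation and should be checked there before use. The function name `build_cc_config` and the values chosen are purely illustrative.

```python
import xml.etree.ElementTree as ET


def build_cc_config(per_project=1, total=2):
    """Build a cc_config.xml fragment as a string (hypothetical helper).

    per_project: simultaneous file transfers allowed per project (assumed option).
    total: simultaneous file transfers allowed overall (assumed option).
    """
    root = ET.Element("cc_config")

    # Log flags mentioned in the thread for tracing uploads.
    flags = ET.SubElement(root, "log_flags")
    for name in ("http_debug", "file_xfer_debug"):
        ET.SubElement(flags, name).text = "1"

    # Transfer limits; the values are illustrative, not recommendations.
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "max_file_xfers").text = str(total)
    ET.SubElement(options, "max_file_xfers_per_project").text = str(per_project)

    return ET.tostring(root, encoding="unicode")


config_xml = build_cc_config()
print(config_xml)
```

If the fragment is merged into an existing cc_config.xml, the client has to re-read its config files (or be restarted) before the limits take effect. Whether one-at-a-time uploads actually help with this server is untested here.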
Send message Joined: 12 Apr 21 Posts: 314 Credit: 14,559,045 RAC: 18,367 |
Things seem to be uploading now. It took several minutes but I just got the last 2 files uploaded and thus was able to report a completed task. Hit retry upload while things are working. |
©2024 cpdn.org