climateprediction.net home page
NZ25 file upload server problems?

NZ25 file upload server problems?

Message boards : Number crunching : NZ25 file upload server problems?
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · Next

AuthorMessage
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4529
Credit: 18,663,251
RAC: 14,512
Message 65834 - Posted: 13 Aug 2022, 13:35:18 UTC

Three more failed at 100%. all file sizes matched size in event log message, two from testing, one main site.
ID: 65834 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4529
Credit: 18,663,251
RAC: 14,512
Message 65835 - Posted: 13 Aug 2022, 20:43:28 UTC

All but one of the sixteen zips I had queued have finished uploading. The two testing tasks have reported.
ID: 65835 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 65836 - Posted: 14 Aug 2022, 1:52:17 UTC

There were power outages in the Hobart region, which may have impacted the data center.
Flood warnings as heavy rains fall across Tasmania
ID: 65836 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1058
Credit: 36,584,771
RAC: 15,932
Message 65838 - Posted: 14 Aug 2022, 14:06:32 UTC

Here goes. I caught an intermediate upload from another model type, via 'network suspended'.

I can see (in the project directory):
hadsm4_a0d7_201310_6_935_012147237_0_r883230677_5.zip - size 107532550 bytes
trickle_up_hadsm4_a0d7_201310_6_935_012147237_0_1660484361.xml - size 189 bytes

The first of those appears in BOINC's transfers tab, the second doesn't.

So, set some flags and let 'em run...

Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | Started upload of hadsm4_a0d7_201310_6_935_012147237_0_r883230677_5.zip
Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | [http] [ID#16951] Info: Trying 192.171.139.103:80...
Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | [http] [ID#16951] Info: Connected to upload11.cpdn.org (192.171.139.103) port 80 (#14569)
Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | [http] [ID#16951] Sent header to server: Host: upload11.cpdn.org
Sun 14 Aug 2022 14:52:05 BST | climateprediction.net | [http] [ID#16951] Received header from server: HTTP/1.1 200 OK
Sun 14 Aug 2022 14:52:06 BST | climateprediction.net | [http] [ID#16951] Sent header to server: Content-Length: 107533027
Sun 14 Aug 2022 14:52:06 BST | climateprediction.net | [http] [ID#16951] Received header from server: HTTP/1.1 100 Continue

Sun 14 Aug 2022 14:52:11 BST | climateprediction.net | [sched_op] Starting scheduler request
Sun 14 Aug 2022 14:52:12 BST | climateprediction.net | Sending scheduler request: To send trickle-up message.
Sun 14 Aug 2022 14:52:12 BST | climateprediction.net | [http] HTTP_OP::init_post(): https://www.cpdn.org/cpdnboinc_cgi/cgi
Sun 14 Aug 2022 14:52:12 BST | climateprediction.net | [http] [ID#1] Info: Trying 129.67.193.7:443...

So, two different operations (upload and report), to two different servers/IP addresses. And the upload does start by telling the server a "Content-Length", as I thought.

That's a very selective extract from the event log, but I hope it gives you something to look for.
ID: 65838 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4529
Credit: 18,663,251
RAC: 14,512
Message 65843 - Posted: 16 Aug 2022, 10:28:37 UTC

Anyone else still getting nternet acess OK, Project servers may be temporarily down?Getting this on a 2.zip from the NZ batch. I will see what happens with the 6.zip, the others generated all went.
ID: 65843 · Report as offensive     Reply Quote
AndreyOR

Send message
Joined: 12 Apr 21
Posts: 314
Credit: 14,559,045
RAC: 18,367
Message 65845 - Posted: 16 Aug 2022, 20:19:43 UTC - in response to Message 65843.  

Yes, have 2 trickles from this last batch that are having issues uploading with those messages.
ID: 65845 · Report as offensive     Reply Quote
Profile Alan K

Send message
Joined: 22 Feb 06
Posts: 490
Credit: 30,767,772
RAC: 10,797
Message 65846 - Posted: 16 Aug 2022, 22:15:08 UTC - in response to Message 65843.  

Yes as of 19:11 UTC today.
ID: 65846 · Report as offensive     Reply Quote
Profile Alan K

Send message
Joined: 22 Feb 06
Posts: 490
Credit: 30,767,772
RAC: 10,797
Message 65848 - Posted: 17 Aug 2022, 13:51:21 UTC - in response to Message 65843.  

Still getting HTTP error at 14:50 this afternoon.
ID: 65848 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4529
Credit: 18,663,251
RAC: 14,512
Message 65849 - Posted: 17 Aug 2022, 14:47:18 UTC - in response to Message 65848.  

Yep, one more queued making it four for me now. also not seen this in log before at the end,

17/08/2022 15:43:30 | climateprediction.net | [http] [ID#16] Info:  We are completely uploaded and fine
ID: 65849 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1058
Credit: 36,584,771
RAC: 15,932
Message 65850 - Posted: 17 Aug 2022, 15:10:47 UTC - in response to Message 65849.  

I've noticed the phrase "We are completely uploaded and fine" before, but strangely, it can't be found anywhere in the BOINC codebase. The only place it's found is in event logs quoted in https://github.com/BOINC/boinc/issues/4572, an issue about 'Uploads Stopping for Projects with Large Files' from November last year. He could have been talking about us, but it was another project.

I'm wondering if 'We are completely uploaded and fine' is a message being passed on from curl, BOINC's communications toolbox, which would make it much harder to track down.

I've always assumed that the real meaning was that BOINC had passed everything into a buffer being handled by somebody else (curl?), but didn't necessarily imply that the whole file had been acknowledged by the end user on the other side of the world. In which case, it's a badly-written message.
ID: 65850 · Report as offensive     Reply Quote
Profile Thyme Lawn
Volunteer moderator

Send message
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 65860 - Posted: 18 Aug 2022, 20:55:09 UTC - in response to Message 65850.  

I've noticed the phrase "We are completely uploaded and fine" before, but strangely, it can't be found anywhere in the BOINC codebase. The only place it's found is in event logs quoted in https://github.com/BOINC/boinc/issues/4572, an issue about 'Uploads Stopping for Projects with Large Files' from November last year. He could have been talking about us, but it was another project.

I'm wondering if 'We are completely uploaded and fine' is a message being passed on from curl, BOINC's communications toolbox, which would make it much harder to track down.

I've always assumed that the real meaning was that BOINC had passed everything into a buffer being handled by somebody else (curl?), but didn't necessarily imply that the whole file had been acknowledged by the end user on the other side of the world. In which case, it's a badly-written message.

That message is generated by curl on the successful completion of a request, but it's generated for every individual HTTP message. The file transfer sequence for CPDN has the following sequence:

    1. An initial negotiation to determine how much of the file the server has already received (file_xfer_debug outputs the line "[fxd] starting upload, upload_offset -1").
    2. There's an "Info: We are completely uploaded and fine" http_debug message when that request has been sent.
    3. The server normally responds with the number of bytes it has already received (normally a line "[fxd] starting upload, upload_offset 0"). In the messages below, the first attempt resulted in a gateway timeout.
    4. The upload then starts with a request indicating the number of bytes to be sent (the line "18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 90454919" in the messages below).
    5. This also generates an "Info: We are completely uploaded and fine" http_debug message, even when the update has failed (as was the case in the messages below).


18-Aug-2022 20:14:27 [climateprediction.net] [fxd] starting upload, upload_offset -1
18-Aug-2022 20:14:27 [climateprediction.net] Started upload of wah2_nz25_a1t1_200105_25_936_012152103_0_r1523319327_20.zip (86.28 MB)
18-Aug-2022 20:14:27 [climateprediction.net] [file_xfer] URL: http://upload4.cpdn.org/cgi-bin/file_upload_handler
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Info:  Connected to upload4.cpdn.org (131.217.169.79) port 80 (#2295)
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Sent header to server: Content-Length: 312
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Info:  We are completely uploaded and fine

18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: HTTP/1.1 504 Gateway Timeout
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <html><head>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <title>504 Gateway Timeout</title>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: </head><body>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <h1>Gateway Timeout</h1>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <p>The gateway did not receive a timely response
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: from the upstream server or application.</p>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <hr>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <address>Apache/2.4.7 (Ubuntu) Server at upload4.cpdn.org Port 80</address>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: </body></html>
18-Aug-2022 20:24:28 [---] [http_xfer] [ID#26765] HTTP: wrote 328 bytes
18-Aug-2022 20:24:29 [climateprediction.net] [file_xfer] http op done; retval -184 (transient HTTP error)
18-Aug-2022 20:24:29 [climateprediction.net] [file_xfer] file transfer status -184 (transient HTTP error)
18-Aug-2022 20:24:29 [climateprediction.net] Temporarily failed upload of wah2_nz25_a1t1_200105_25_936_012152103_0_r1523319327_20.zip: transient HTTP error
18-Aug-2022 20:24:29 [climateprediction.net] Backing off 00:07:43 on upload of wah2_nz25_a1t1_200105_25_936_012152103_0_r1523319327_20.zip

18-Aug-2022 20:24:29 [climateprediction.net] [fxd] starting upload, upload_offset -1
18-Aug-2022 20:24:29 [climateprediction.net] Started upload of wah2_nz25_a07q_198705_25_936_012150040_1_r1004064792_24.zip (86.26 MB)
18-Aug-2022 20:24:29 [climateprediction.net] [file_xfer] URL: http://upload4.cpdn.org/cgi-bin/file_upload_handler
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Info:  Connected to upload4.cpdn.org (131.217.169.79) port 80 (#2295)
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 312
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Info:  We are completely uploaded and fine
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  Connection died, retrying a fresh connect
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  the ioctl callback returned 0
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  Closing connection 2295
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  Issue another request to this URL: 'http://upload4.cpdn.org/cgi-bin/file_upload_handler'
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:    Trying 131.217.169.79...
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  Connected to upload4.cpdn.org (131.217.169.79) port 80 (#2302)
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 312
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  We are completely uploaded and fine

18-Aug-2022 20:26:27 [climateprediction.net] [http] [ID#26786] Received header from server: HTTP/1.1 200 OK
18-Aug-2022 20:26:27 [climateprediction.net] [file_xfer] http op done; retval 0 (Success)
18-Aug-2022 20:26:27 [climateprediction.net] [file_xfer] parsing upload response: <data_server_reply>
    <status>0</status>
    <file_size>0</file_size>
</data_server_reply>

18-Aug-2022 20:26:27 [climateprediction.net] [file_xfer] parsing status: 0
18-Aug-2022 20:26:27 [climateprediction.net] [fxd] starting upload, upload_offset 0
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 90454919
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: Expect: 100-continue
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Received header from server: HTTP/1.1 100 Continue

18-Aug-2022 20:35:58 [climateprediction.net] [http] [ID#26786] Info:  We are completely uploaded and fine

18-Aug-2022 20:38:26 [climateprediction.net] [http] [ID#26786] Received header from server: HTTP/1.1 200 OK
18-Aug-2022 20:38:26 [climateprediction.net] [http] [ID#26786] Received header from server: Content-Length: 123
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] http op done; retval 0 (Success)
18-Aug-2022 20:38:27 [climateprediction.net] Error reported by file upload server: EOF on socket read : asked for 262144, got 133564
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] parsing upload response: <data_server_reply>
    <status>1</status>
    <message>EOF on socket read : asked for 262144, got 133564
</message>
</data_server_reply>
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] parsing status: -127
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] file transfer status -127 (transient upload error)
18-Aug-2022 20:38:27 [climateprediction.net] Temporarily failed upload of wah2_nz25_a07q_198705_25_936_012150040_1_r1004064792_24.zip: transient upload error
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] project-wide xfer delay for 645.092929 sec
18-Aug-2022 20:38:27 [climateprediction.net] Backing off 00:02:09 on upload of wah2_nz25_a07q_198705_25_936_012150040_1_r1004064792_24.zip

"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 65860 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4529
Credit: 18,663,251
RAC: 14,512
Message 65861 - Posted: 18 Aug 2022, 21:09:24 UTC - in response to Message 65860.  

Thanks for that Thyme. Not sure why I never noticed it before though.
ID: 65861 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1058
Credit: 36,584,771
RAC: 15,932
Message 65873 - Posted: 19 Aug 2022, 7:44:24 UTC - in response to Message 65860.  

Two more points from Thyme's post.

1) The log entries for HTTP/1.1 504 Gateway Timeout, followed by "The gateway did not receive a timely response from the upstream server or application."

My take on that is that it's an internal problem within the receiving institution or data centre. They have one ordinary server running the BOINC programs, but with limited local disk storage directly connected to that computer. And they have a second specialist high capacity disk storage array. The 'gateway' is the network connection between the two devices. Sometimes it fails, probably because of a problem with the disk storage array.

There's absolutely nothing we can do about that - it's way outside our control. I would expect the local administration team to be fully aware of the situation, but it might just be worth Oxford mentioning this message to them, just to help ensure they concentrate on the right device.

2) Error reported by file upload server: EOF on socket read : asked for 262144, got 133564

I've been puzzling over that one. Most of the concern has been over the intermediate .zip data files, which are typically tens of megabytes or even a hundred megabytes in size. So why is the upload server quibbling over a mere hundred kilobytes in a file containing just a quarter of a megabyte?

Thyme has drawn attention to the restart procedure for a file transfer that glitches on the long overseas leg of the journey. The server says how much it's received already, and the BOINC client restarts the transfer from that point, and tells the server how much is left. I'm wondering if there's a BOINC bug in calculating the size of that remaining fragment, when comparing two very large numbers?
ID: 65873 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4529
Credit: 18,663,251
RAC: 14,512
Message 65877 - Posted: 19 Aug 2022, 9:40:53 UTC

Thanks Richard, I have linked our post below on the Trello board for Andy to nudge the data centre or whoever in NZ deals with them.
ID: 65877 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4529
Credit: 18,663,251
RAC: 14,512
Message 65883 - Posted: 19 Aug 2022, 15:22:15 UTC

Somehow managed to sneak another two zips onto the server. five queued at the moment.
ID: 65883 · Report as offensive     Reply Quote
Profile Thyme Lawn
Volunteer moderator

Send message
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 65885 - Posted: 19 Aug 2022, 16:27:09 UTC - in response to Message 65873.  

2) Error reported by file upload server: EOF on socket read : asked for 262144, got 133564

I've been puzzling over that one. Most of the concern has been over the intermediate .zip data files, which are typically tens of megabytes or even a hundred megabytes in size. So why is the upload server quibbling over a mere hundred kilobytes in a file containing just a quarter of a megabyte?

Nothing to puzzle over Richard. The transfer process transparently splits large uploads into smaller packets. The transfer uses 4 different communication layers, with the relevant ones being 2 to 4 (layer 1 is the physical wire):

    2. The maximum packet size for the Ethernet layer is 1518 bytes but, in practice, the packet size is typically between 1476 and 1500 bytes.
    3. The maximum packet size for the IPv4 TCP layer is 64kb.
    4. The maximum packet size for the BOINC data transfer layer is 256kb, which is where the 262144 comes from.

The socket read error is saying that the upload server received just over half of an expected 256kb file segment before the upload failed (130kb (plus 444 bytes) were in the received packets).


"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 65885 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4529
Credit: 18,663,251
RAC: 14,512
Message 65886 - Posted: 19 Aug 2022, 16:43:25 UTC - in response to Message 65885.  

So, the server hiccoughs and the upload falls over. I just got one more to go through, so maybe something is changing or maybe it is just luck?
ID: 65886 · Report as offensive     Reply Quote
Profile Thyme Lawn
Volunteer moderator

Send message
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 65887 - Posted: 19 Aug 2022, 16:58:24 UTC - in response to Message 65883.  
Last modified: 19 Aug 2022, 17:01:44 UTC

Somehow managed to sneak another two zips onto the server. five queued at the moment.

I had 18 files queued up, with all but one uploaded in a 2-hour window starting at 1424 UTC (average upload time slightly under 4 minutes). The last file was backed off and successfully uploaded on a manual retry 15 minutes after the 17th upload was completed, but it took significantly longer than the others (10 minutes).
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 65887 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4529
Credit: 18,663,251
RAC: 14,512
Message 65888 - Posted: 19 Aug 2022, 17:31:04 UTC - in response to Message 65887.  
Last modified: 19 Aug 2022, 17:51:58 UTC

I had 18 files queued up, with all but one uploaded in a 2-hour window starting at 1424 UTC (average upload time slightly under 4 minutes). The last file was backed off and successfully uploaded on a manual retry 15 minutes after the 17th upload was completed, but it took significantly longer than the others (10 minutes).


I suspect that having a slow connection my end gives more opportunities for things to go wrong.

Edit: I also wonder if limiting uploads to one at a time might help?
ID: 65888 · Report as offensive     Reply Quote
AndreyOR

Send message
Joined: 12 Apr 21
Posts: 314
Credit: 14,559,045
RAC: 18,367
Message 65893 - Posted: 19 Aug 2022, 19:48:49 UTC

Things seem to be uploading now. It took several minutes but I just got the last 2 files uploaded and thus was able to report a completed task. Hit retry upload while things are working.
ID: 65893 · Report as offensive     Reply Quote
Previous · 1 · 2 · 3 · Next

Message boards : Number crunching : NZ25 file upload server problems?

©2024 cpdn.org