Message boards : Number crunching : The uploads are stuck
Author | Message |
---|---|
Send message Joined: 14 Sep 08 Posts: 127 Credit: 42,006,146 RAC: 68,974 |
In readiness for that technical meeting, it might be worth having a peek at https://github.com/BOINC/boinc/issues/139. That's a conversation initiated by CPDN volunteers some 16 years ago, as the original source https://boinc.berkeley.edu/trac/ticket/139 makes clear. I doubt this would help, honestly. Given there are only a thousand or so hosts with recent credit, and each has two upload slots by default, I can hardly imagine that a few thousand or even tens of thousands of partially uploaded trickle files would cause such big trouble. That's negligible compared to the total data the CPDN upload servers need to handle anyway. The recovery story has to be on the server side. I doubt most volunteers ever look at the forum, let alone are ready to tweak local configs all the time, even if the knobs are added. I am more curious about what actually went wrong. Other than the mention of a failure on the storage array, it's unclear to me whether the upload server ran out of other resources and why so many hosts (mine included) still couldn't connect after the VM migration. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
I am more curious about what actually went wrong. Other than the mention of a failure on the storage array, it's unclear to me whether the upload server ran out of other resources and why so many hosts (mine included) still couldn't connect after the VM migration. The VM was moved before Christmas onto a block device of 2x12TB. It either couldn't handle the transaction rate from the CPDN VMs or there is an issue with the underlying hardware - they are not sure which yet. So at some point in the near future they will probably transition to 4x6TB, but JASMIN are still investigating exactly what the issue(s) were. OS-wise, the ssh & httpd ports went down and no-one could log in (because there's no console access on this unmanaged cloud). That's all I understand at present. Richard - I'll have a read. I may not get time to bring up all the points I'd like at the next meeting. It's usually only an hour and I think there will be lots to discuss about the JASMIN setup. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
The upload server appears to be down again. I have informed CPDN. |
Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
Still going here. Took 5.5 hrs to upload all the zips for the first w/u. Could finish the lot by sometime tomorrow evening (late) if all keeps going. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
Still going here. Took 5.5 hrs to upload all the zips for the first w/u. Could finish the lot by sometime tomorrow evening (late) if all keeps going. Ok, I am getting the odd one or two through but most are failing with HTTP transient errors. It did this before it finally died yesterday. Let's see what happens... Edit: CPDN can still log into the machine so it appears to be up ok. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
The upload server appears to be down again. I have informed CPDN. It seems OK here. I have uploaded all those 'trickle' (.zip) files, so my BOINC client has downloaded a bunch of new tasks, and five of them are running as I type this. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
It seems OK here. I have uploaded all those 'trickle' (.zip) files, so my BOINC client has downloaded a bunch of new tasks, and five of them are running as I type this. I was uploading at 0300 UTC. By 0440 I was getting transient HTTP errors and the "servers may be down" message. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,718,239 RAC: 8,054 |
Somewhere about 05/01/2023 04:03:11 | climateprediction.net | Finished upload of oifs_43r3_ps_0796_2018050100_123_987_12204440_0_r206451077_11.zip (log times UTC) Edit - that upload (and three others across two machines) is showing as having reached 100% before pausing. In my experience, that can mean that an onward transfer to backing storage has failed, rather than the upload server itself. But I didn't have enough logging active to confirm that. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Yesterday I uploaded all the zips and downloaded new WUs. Now uploads are stuck again, and one of the finished WUs got a computation error due to a missing upload. I'm not sure whether zips were lost due to the failure of my own machine a week ago or to some upload problems. |
Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
Well I managed to get 2 w/u worth of zip files uploaded but it seems to have gone "pop" again. Nothing uploading. Good job those zip files are not 100+meg in size or we might never catch up. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I was uploading at 0300 UTC. By 0440 I was getting transient HTTP errors and the "servers may be down" message. Me too. And they do not seem to be transient. This is the first one in my Event Log, but it may have started a little earlier. (Note: times are EST.) Wed 04 Jan 2023 11:40:45 PM EST | | Project communication failed: attempting access to reference site Wed 04 Jan 2023 11:40:45 PM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0783_1988050100_123_957_12174427_0_r1903842791_34.zip: transient HTTP error Wed 04 Jan 2023 11:40:45 PM EST | climateprediction.net | Backing off 00:03:24 on upload of oifs_43r3_ps_0783_1988050100_123_957_12174427_0_r1903842791_34.zip Wed 04 Jan 2023 11:40:47 PM EST | | Internet access OK - project servers may be temporarily down. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
Just out of a meeting with CPDN. They understand what the problem is. The upload server will be restarted, but with reduced max HTTP connections to keep it as stable as possible. At some point very soon (today/tomorrow) they will move the upload server to a new RAID disk array (the current array is what's causing the problem). They may decide to do the move before restarting, depending on how quickly the JASMIN cloud provider can give them the temporary space they need to move the files whilst setting up the new server. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Looks like more data than last time needs moving before things start again? |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,718,239 RAC: 8,054 |
Well, I got a good load shifted yesterday. Provided they can get it started once more before the weekend - either tonight or reasonably early tomorrow - I should be able to survive. One thing I've noticed - sometimes when the upload hiccups on an individual file, BOINC comes back and tries it another time. And sometimes it doesn't. So I've got about 20 tasks which I could report if I could just get those loose files tidied up. I'll see what I can do when the next window opens up. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
One thing I've noticed - sometimes when the upload hiccups on an individual file, BOINC comes back and tries it another time. And sometimes it doesn't. So I've got about 20 tasks which I could report if I could just get those loose files tidied up. I'll see what I can do when the next window opens up. Exactly why I would like user control of the order files are uploaded in. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Exactly why I would like user control of the order files are uploaded in. When I was watching it upload my stuff yesterday, during the extended period of up-time, I noticed it picked tasks' .zip files seemingly at random, but once it had picked one, it then picked others from the same task-id. E.g., oifs_43r3_ps0123. It usually picked all of them eventually. If it choked on one, it just went on to the next. Near the very end, it slowly picked up all the rest. I do not think it lost any. But this is my impression; I was not exhaustively writing down everything as it happened. |
Send message Joined: 1 Nov 04 Posts: 185 Credit: 4,166,063 RAC: 857 |
And some messages from BOINC: I would recommend that you set http_debug, rather than http_xfer_debug. The output gives more detail about the reasons for errors. Fr 06 Jan 2023 06:09:48 CET | climateprediction.net | Started upload of oifs_43r3_ps_0261_1992050100_123_961_12177905_0_r539968243_52.zip Fr 06 Jan 2023 06:09:48 CET | climateprediction.net | [http] [ID#69107] Info: Connection 39563 seems to be dead! Fr 06 Jan 2023 06:09:48 CET | climateprediction.net | [http] [ID#69107] Info: Closing connection 39563 Fr 06 Jan 2023 06:09:48 CET | climateprediction.net | [http] [ID#69107] Info: TLSv1.2 (OUT), TLS alert, close notify (256): Fr 06 Jan 2023 06:09:48 CET | climateprediction.net | [http] [ID#69107] Info: Connection 39564 seems to be dead! Fr 06 Jan 2023 06:09:48 CET | climateprediction.net | [http] [ID#69107] Info: Closing connection 39564 Fr 06 Jan 2023 06:09:48 CET | climateprediction.net | [http] [ID#69107] Info: TLSv1.2 (OUT), TLS alert, close notify (256): Fr 06 Jan 2023 06:09:48 CET | climateprediction.net | [http] [ID#69107] Info: Trying 192.171.169.187:80... 
Fr 06 Jan 2023 06:09:48 CET | climateprediction.net | [http] [ID#69107] Info: TCP_NODELAY set Fr 06 Jan 2023 06:11:48 CET | climateprediction.net | [http] [ID#69107] Info: Connection timed out after 120042 milliseconds Fr 06 Jan 2023 06:11:48 CET | climateprediction.net | [http] [ID#69107] Info: Closing connection 39566 Fr 06 Jan 2023 06:11:48 CET | climateprediction.net | [http] HTTP error: Timeout was reached Fr 06 Jan 2023 06:11:48 CET | | Project communication failed: attempting access to reference site Fr 06 Jan 2023 06:11:48 CET | | [http] HTTP_OP::init_get(): https://www.google.com/ Fr 06 Jan 2023 06:11:48 CET | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0261_1992050100_123_961_12177905_0_r539968243_52.zip: transient HTTP error Fr 06 Jan 2023 06:11:48 CET | climateprediction.net | Backing off 00:03:54 on upload of oifs_43r3_ps_0261_1992050100_123_961_12177905_0_r539968243_52.zip Fr 06 Jan 2023 06:11:48 CET | | [http] [ID#0] Info: Found bundle for host www.google.com: 0x55f7abfd6b20 [can multiplex] Fr 06 Jan 2023 06:11:48 CET | | [http] [ID#0] Info: Trying 142.250.186.68:443... 
Fr 06 Jan 2023 06:11:48 CET | | [http] [ID#0] Info: TCP_NODELAY set Fr 06 Jan 2023 06:11:48 CET | | [http] [ID#0] Info: Connected to www.google.com (142.250.186.68) port 443 (#39567) Fr 06 Jan 2023 06:11:48 CET | | [http] [ID#0] Info: ALPN, offering h2 Fr 06 Jan 2023 06:11:48 CET | | [http] [ID#0] Info: ALPN, offering http/1.1 Fr 06 Jan 2023 06:11:48 CET | | [http] [ID#0] Info: successfully set certificate verify locations: Fr 06 Jan 2023 06:11:48 CET | | [http] [ID#0] Info: CAfile: ca-bundle.crt Fr 06 Jan 2023 06:11:48 CET | | [http] [ID#0] Info: CApath: /etc/ssl/certs Fr 06 Jan 2023 06:11:48 CET | | [http] [ID#0] Info: TLSv1.3 (OUT), TLS handshake, Client hello (1): Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: TLSv1.3 (IN), TLS handshake, Server hello (2): Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: TLSv1.3 (IN), TLS handshake, Certificate (11): Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: TLSv1.3 (IN), TLS handshake, CERT verify (15): Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: TLSv1.3 (IN), TLS handshake, Finished (20): Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: TLSv1.3 (OUT), TLS handshake, Finished (20): Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: ALPN, server accepted to use h2 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: Server certificate: Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: subject: CN=www.google.com Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: start date: Nov 28 08:19:01 2022 GMT Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: expire date: Feb 20 08:19:00 2023 GMT Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: subjectAltName: host "www.google.com" 
matched cert's "www.google.com" Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1C3 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: SSL certificate verify ok. Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: Using HTTP2, server supports multi-use Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: Connection state changed (HTTP/2 confirmed) Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: Using Stream ID: 1 (easy handle 0x55f7ab667610) Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Sent header to server: GET / HTTP/2 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Sent header to server: Host: www.google.com Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Sent header to server: user-agent: BOINC client (x86_64-pc-linux-gnu 7.20.5) Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Sent header to server: accept: */* Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Sent header to server: accept-encoding: deflate, gzip, br Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Sent header to server: accept-language: de_DE Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Sent header to server: Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: old SSL session ID is stale, removing Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: Connection state changed (MAX_CONCURRENT_STREAMS == 100)! 
Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: HTTP/2 200 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: date: Fri, 06 Jan 2023 05:11:49 GMT Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: expires: -1 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: cache-control: private, max-age=0 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: content-type: text/html; charset=ISO-8859-1 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: cross-origin-opener-policy-report-only: same-origin-allow-popups; report-to="gws" Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: report-to: {"group":"gws","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/gws/other"}]} Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: p3p: CP="This is not a P3P policy! See g.co/p3phelp for more info." 
Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: content-encoding: gzip Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: server: gws Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: content-length: 6232 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: x-xss-protection: 0 Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: x-frame-options: SAMEORIGIN Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: set-cookie: SOCS=CAAaBgiAyd2dBg; expires=Mon, 05-Feb-2024 05:11:49 GMT; path=/; domain=.google.com; Secure; SameSite=lax Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: set-cookie: AEC=AakniGM2E1EeHPeZSWfQ6062Jeir2Magj7ySQ1Snmd6NThW2sRD7ZFvN_A; expires=Wed, 05-Jul-2023 05:11:49 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: set-cookie: __Secure-ENID=9.SE=EF-LNAMBS6pD7OjuQrgqF--mZzeC0absCCjqrfNcsPaOpEGcA1p1b6Qa7YZnMOZUySVm_4wBQM72qLSpUtItO_fNA4ZOyUmj2clJ74aZxKRjuKDzMdv2-3lpJHBh6PO7Ci6C4qgsQg7x4oQ7a7zth9TJHoKr_Na1c2MWhyevzSY; expires=Mon, 05-Feb-2024 21:30:07 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: set-cookie: CONSENT=PENDING+350; expires=Sun, 05-Jan-2025 05:11:49 GMT; path=/; domain=.google.com; Secure Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43" Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Received header from server: Fr 06 Jan 2023 06:11:49 CET | | [http] [ID#0] Info: Connection #39567 to host www.google.com left intact Fr 06 Jan 2023 06:11:49 CET | | Internet access OK - project servers may be temporarily 
down. Better? Greetings from Sänger |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,718,239 RAC: 8,054 |
And some messages from BOINC: I would recommend that you set http_debug, rather than http_xfer_debug. The output gives more detail about the reasons for errors. Well, it gives us those three useful lines. They tell you what you're trying to do, and why it failed. And it's fairly obvious that there's nothing we, as end users, can do about it: the project has to sort that out. The guff about contacting the 'reference site' (google.com) might have been useful in the early days of BOINC, but I think we all have better and quicker ways of knowing if our internet connection is working these days. You can turn that off in configuration. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
Latest update from CPDN as of 6/Jan 10am: Upload server remains inaccessible to CPDN staff. The JASMIN cloud provider are creating a larger block store, big enough for the multi-TB volumes, so that the data can be recovered before the upload server is recreated. Unfortunately, this will take some time. My personal guess is the server will not be back today. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,718,239 RAC: 8,054 |
Well, my new 2 TB SSD arrived by post this morning - that's probably a faster data transmission rate than the internet, just at the moment. The question is - dare I attempt to install and mount it as BOINC's data drive, while I'm still hosting around 90 results waiting to upload and report? On balance, I think probably not. |
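For anyone wanting to try the two client-side settings mentioned in the thread (the http_debug log flag, and turning off the reference-site check), here is a minimal Python sketch that builds the corresponding config text. Treat the details as assumptions to check against the BOINC client configuration documentation: the file name cc_config.xml, the `<log_flags>` and `<options>` sections, and the option name `dont_contact_ref_site` are recalled from that documentation and are not stated anywhere in the posts above. The function name `build_cc_config` is purely illustrative.

```python
import xml.etree.ElementTree as ET


def build_cc_config(http_debug=True, skip_reference_site=True):
    """Return cc_config.xml text as a string (sketch only, nothing is written to disk).

    Assumed layout: <cc_config> containing <log_flags> and <options>.
    """
    root = ET.Element("cc_config")

    # Log flag recommended in the thread for seeing why an upload failed.
    log_flags = ET.SubElement(root, "log_flags")
    ET.SubElement(log_flags, "http_debug").text = "1" if http_debug else "0"

    # Assumed option name for skipping the reference-site (google.com) check.
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "dont_contact_ref_site").text = (
        "1" if skip_reference_site else "0"
    )

    # Pretty-print where available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config())
```

The printed text would go into the BOINC data directory and the client would need to re-read its config files before it takes effect. Neither setting makes uploads succeed; they only change how much the Event Log tells you.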
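The log excerpts in this thread are in three different time zones (EST, CET and UTC), which makes it hard to see whether people were hit at the same moment. A small sketch for putting them on one clock, assuming fixed winter offsets (EST = UTC-5, CET = UTC+1) and English month abbreviations; the helper name `to_utc` and the `OFFSETS` table are illustrative only, and the weekday prefix of each log line is left out because it is not needed for the conversion.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for the abbreviations seen in the posts (January, so no DST).
OFFSETS = {
    "EST": timezone(timedelta(hours=-5)),
    "CET": timezone(timedelta(hours=1)),
    "UTC": timezone.utc,
}


def to_utc(stamp, zone, fmt):
    """Parse a local Event Log timestamp and return it as an aware UTC datetime."""
    local = datetime.strptime(stamp, fmt).replace(tzinfo=OFFSETS[zone])
    return local.astimezone(timezone.utc)


# First failed upload quoted from the EST log.
est_failure = to_utc("04 Jan 2023 11:40:45 PM", "EST", "%d %b %Y %I:%M:%S %p")

# Upload timeout quoted from the CET log.
cet_failure = to_utc("06 Jan 2023 06:11:48", "CET", "%d %b %Y %H:%M:%S")

if __name__ == "__main__":
    print(est_failure.isoformat())  # 2023-01-05T04:40:45+00:00
    print(cet_failure.isoformat())  # 2023-01-06T05:11:48+00:00
```

On that basis the EST entry falls at 04:40:45 UTC on 5 January, which lines up with the report of errors by 0440 UTC from another volunteer.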
©2024 cpdn.org