Message boards : Number crunching : Download server down
Author | Message |
---|---|
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,730,664 RAC: 6,969 |
It's not currently possible to download the data files for new tasks allocated to your computer. Staff are investigating a possible hardware failure. More news later. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,083,753 RAC: 15,077 |
Got these messages on the event log if it helps: 06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::init_get(): http://cpdn-downloads.oerc.ox.ac.uk/download/hadam3p/hadam3p_eu/hadam3p_eu_0qzh_2013_0_009558496.zip 06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt' 06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::libcurl_exec(): ca-bundle set 06/03/2015 23:38:47 | climateprediction.net | Started download of hadam3p_eu_0qzh_2013_0_009558496.zip 06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::init_get(): http://cpdn-downloads.oerc.ox.ac.uk/download/hadam3p/ancil/mirror.php?file=ic19611001_12_N96.gz 06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt' 06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::libcurl_exec(): ca-bundle set 06/03/2015 23:38:47 | climateprediction.net | Started download of ic19611001_12_N96.gz 06/03/2015 23:38:47 | climateprediction.net | [http] [ID#121] Info: Connection 560 seems to be dead! 06/03/2015 23:38:47 | climateprediction.net | [http] [ID#121] Info: Closing connection 560 06/03/2015 23:38:47 | climateprediction.net | [http] [ID#121] Info: timeout on name lookup is not supported 06/03/2015 23:38:47 | climateprediction.net | [http] [ID#121] Info: Hostname was NOT found in DNS cache 06/03/2015 23:38:47 | climateprediction.net | [http] [ID#121] Info: Trying 129.67.195.71... 06/03/2015 23:38:47 | climateprediction.net | [http] [ID#122] Info: Found bundle for host cpdn-downloads.oerc.ox.ac.uk: 0x2c9cd20 06/03/2015 23:38:47 | climateprediction.net | [http] [ID#122] Info: timeout on name lookup is not supported 06/03/2015 23:38:47 | climateprediction.net | [http] [ID#122] Info: Hostname was found in DNS cache 06/03/2015 23:38:47 | climateprediction.net | [http] [ID#122] Info: Trying 129.67.195.71... 
06/03/2015 23:38:48 | climateprediction.net | [http] [ID#121] Info: connect to 129.67.195.71 port 80 failed: Connection refused 06/03/2015 23:38:48 | climateprediction.net | [http] [ID#121] Info: Failed to connect to cpdn-downloads.oerc.ox.ac.uk port 80: Connection refused 06/03/2015 23:38:48 | climateprediction.net | [http] [ID#121] Info: Closing connection 561 06/03/2015 23:38:48 | climateprediction.net | [http] [ID#122] Info: connect to 129.67.195.71 port 80 failed: Connection refused 06/03/2015 23:38:48 | climateprediction.net | [http] [ID#122] Info: Failed to connect to cpdn-downloads.oerc.ox.ac.uk port 80: Connection refused |
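For anyone wanting to sift a long event log like the one above, here is a minimal Python sketch that splits each line into timestamp, project and message, then picks out the "Connection refused" entries. It assumes the `dd/mm/yyyy hh:mm:ss | project | message` layout shown in that post; the names `LogEntry`, `parse_event_log_line` and `refused_connections` are made up for illustration and are not part of BOINC.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    when: datetime
    project: str
    message: str


def parse_event_log_line(line):
    """Split one 'timestamp | project | message' line into its three fields."""
    # Split on the first two separators only, so the message is kept whole.
    stamp, project, message = (part.strip() for part in line.split(" | ", 2))
    # Assumption: day/month/year order, as in the log quoted above.
    when = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
    return LogEntry(when, project, message)


def refused_connections(lines):
    """Return the entries whose message reports a refused connection."""
    entries = (parse_event_log_line(line) for line in lines)
    return [entry for entry in entries if "Connection refused" in entry.message]
```

Other BOINC client versions format the date differently (see the `07-Mar-2015` style later in this thread), so the `strptime` pattern would need adjusting for those.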
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,730,664 RAC: 6,969 |
That's progress, of a sort - yesterday it was "Failed to connect: Timed out". The server is also responding to pings today, which it wasn't yesterday. All of which suggests somebody is working on the problem. How long it'll take depends on what failed, how much of the operating system needs to be reloaded/migrated/configured, and whether any data needs to be copied. I haven't heard any news on those fronts. |
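The difference between "Connection refused" and "Timed out" can be checked by hand with a small probe. This is only a sketch, assuming Python's standard `socket` module; `probe` and `classify_connect_error` are hypothetical helper names. Roughly speaking, "refused" usually means the machine answered but nothing is listening on that port, while "timed out" means no answer came back at all, although a firewall can produce either result.

```python
import socket


def classify_connect_error(exc):
    """Map a connection exception to a short, human-readable label."""
    if isinstance(exc, ConnectionRefusedError):
        return "refused"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timed out"
    if isinstance(exc, socket.gaierror):
        return "dns failure"
    return "other error"


def probe(host, port, timeout=5.0):
    """Try one TCP connection and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        return classify_connect_error(exc)
```

For example, `probe("cpdn-downloads.oerc.ox.ac.uk", 80)` would have reported the state of the download server named in the logs above at the time it was run.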
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I am getting this even though the status page says everything is up. 07-Mar-2015 12:32:59 [climateprediction.net] Started download of RF4_91.astart.gz 07-Mar-2015 12:32:59 [climateprediction.net] Started download of HadISST_SI_N96_1991_12_2002_12g.gz 07-Mar-2015 12:33:00 [---] Project communication failed: attempting access to reference site 07-Mar-2015 12:33:00 [climateprediction.net] Temporarily failed download of RF4_91.astart.gz: connect() failed 07-Mar-2015 12:33:00 [climateprediction.net] Backing off 00:13:12 on download of RF4_91.astart.gz 07-Mar-2015 12:33:00 [climateprediction.net] Temporarily failed download of HadISST_SI_N96_1991_12_2002_12g.gz: connect() failed 07-Mar-2015 12:33:00 [climateprediction.net] Backing off 00:11:21 on download of HadISST_SI_N96_1991_12_2002_12g.gz 07-Mar-2015 12:33:00 [climateprediction.net] Started download of HadISST_SST_N96_1991_12_2002_12g.gz 07-Mar-2015 12:33:01 [---] Internet access OK - project servers may be temporarily down. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,730,664 RAC: 6,969 |
The trouble is the server status page only monitors climateapps2, which so far as I know is still working. The file you are trying to download comes from cpdn-downloads.oerc which failed early on Friday morning (and by your report, hasn't been repaired yet). I've suggested that both download servers should be monitored on the SSP to avoid exactly this confusion, but fixing the broken server comes first. |
Send message Joined: 27 Jul 12 Posts: 21 Credit: 269,602 RAC: 0 |
The download server is now running again. The reason for the lack of monitoring for the downloads server on the Server Status page is that we used a load-balancer to distribute work between climateapps2 and other servers (cpdn-downloads and uploader1.atm), but because of the way it was implemented, all requests had to come to climateapps2. This is yet another legacy issue that is due to be overhauled soon. Apologies for the inconvenience. Jonathan CPDN SysAdmin |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Thanks Jonathan, my downloads have just completed. :) |
Send message Joined: 19 Sep 04 Posts: 92 Credit: 2,014,122 RAC: 399 |
It seems like some files got moved or deleted from the servers. I have at least two WUs (for now) with an error like this: 1st 09/03/2015 14:07:15 | climateprediction.net | Started download of hadam3p_eu_0sj2_2013_0_009560497.zip 09/03/2015 14:07:16 | climateprediction.net | Giving up on download of hadam3p_eu_0sj2_2013_0_009560497.zip: permanent HTTP error 2nd 09/03/2015 15:00:45 | climateprediction.net | Requesting new tasks for CPU 09/03/2015 15:00:47 | climateprediction.net | Scheduler request completed: got 1 new tasks 09/03/2015 15:00:50 | climateprediction.net | Started download of hadam3p_eu_0t7x_2013_0_009561392.zip 09/03/2015 15:00:50 | climateprediction.net | Started download of ic19610507_10_N96.gz 09/03/2015 15:00:51 | climateprediction.net | Giving up on download of hadam3p_eu_0t7x_2013_0_009561392.zip: permanent HTTP error 09/03/2015 15:00:51 | climateprediction.net | Started download of atmos_n0h9.day.gz 09/03/2015 15:00:53 | climateprediction.net | Finished download of ic19610507_10_N96.gz 09/03/2015 15:00:53 | climateprediction.net | Started download of region_n0h9.day.gz 09/03/2015 15:01:16 | climateprediction.net | Finished download of region_n0h9.day.gz 09/03/2015 15:01:16 | climateprediction.net | Started download of delta_GloSea5_2013_mem027_GFDL-CM3_all.gz 09/03/2015 15:01:18 | climateprediction.net | Finished download of atmos_n0h9.day.gz 09/03/2015 15:01:18 | climateprediction.net | Finished download of delta_GloSea5_2013_mem027_GFDL-CM3_all.gz Professor Desty Nova Researching Karma the Hard Way |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I suspect it may just be pressure on the server as all the machines try and download at once. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,730,664 RAC: 6,969 |
"I suspect it may just be pressure on the server as all the machines try and download at once." Unlikely. I had four failures (out of 19 files) in the set of four tasks I was allocated on Friday, and which prompted me to open this thread. The files which failed - one per task - exactly match the template which Professor Desty Nova has illustrated. I reported that by email nearly two hours ago, but the mail server hasn't yet redistributed it to other subscribers. I suspect that it's more likely that one or more 'mount points' - where files stored on one physical server are made available to other servers in the collection - may not have re-established themselves correctly after the download server was re-activated. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Richard, I was just coming back to say I had changed my mind. Mon 09 Mar 2015 15:46:58 GMT | climateprediction.net | Started download of hadam3pm2_k2za_1959_10_009465089.zip Mon 09 Mar 2015 15:47:00 GMT | climateprediction.net | Giving up on download of hadam3pm2_k2za_1959_10_009465089.zip: permanent HTTP error Had I read the post more carefully to start with I would have seen it was a permanent HTTP error rather than a transient one. Which brings me to my next question. If BOINC has given up on this download, should I abort the task to allow it to be reissued? |
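As a rough illustration of the permanent versus transient distinction mentioned above, here is a Python sketch that sorts HTTP status codes into "retry later" and "give up" groups. This follows the general HTTP convention and is an assumption on my part; it is not taken from the BOINC client source, and `classify_http_status` is a hypothetical name.

```python
def classify_http_status(code):
    """Label an HTTP status code as 'ok', 'transient' or 'permanent'.

    Assumption: 5xx, 408 and 429 are worth retrying; any other 4xx
    (for example 404 Not Found) will not fix itself on a retry.
    """
    if 200 <= code < 300:
        return "ok"
    if code in (408, 429) or 500 <= code < 600:
        return "transient"
    if 400 <= code < 500:
        return "permanent"
    return "other"
```

Under this scheme a missing file (404) counts as permanent, which would fit a server that is up but cannot find the file it has been asked for.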
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,730,664 RAC: 6,969 |
"Richard, I was just coming back to say I had changed my mind." It'll be marked as an error already, and will report itself at the end of the 1-hour server RPC backoff. Edit - probably best to hang on to it for as long as your BOINC client will allow. If it's reissued before the missing file has been made accessible to the download server, it'll just waste someone's bandwidth and fail again. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I noticed that the download files that were stuck in my transfer tab all weekend are now gone, but the 2 tasks now show as "download failed." Is the download server fixed? |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,730,664 RAC: 6,969 |
"I noticed that the download files that were stuck in my transfer tab all weekend are now gone, but the 2 tasks now show as 'download failed'. Is the download server fixed?" The server is fine, but it looks as if some of the files it should be serving are inaccessible or otherwise AWOL. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
The one with the failed download has disappeared from my tasks list now. |
Send message Joined: 19 Sep 04 Posts: 92 Credit: 2,014,122 RAC: 399 |
I'm guessing, with this mass of errors, that they will have to reissue the batch of WUs from last week... (All the WUs that gave me "Error while downloading" have the same result for the other two wingmen.) Professor Desty Nova Researching Karma the Hard Way |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,730,664 RAC: 6,969 |
It turns out that the folder where the task files are stored for downloading got corrupted in the server crash, and couldn't be recovered. So the tasks awaiting allocation last week have been blocked and won't be reissued just now. A new batch of EU tasks has been released today, and downloaded to my machine without problems. No doubt the researchers will be looking into how far the last batch got before the crash, and working out how many, if any, they need to re-submit. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846 |
Do we need to report models that need to be reissued? I aborted two during the download failure, but they are still shown as in progress on the web. By the way, I've just got "Server error: feeder not running" when sending an update from my BOINC. Cheers |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,730,664 RAC: 6,969 |
I don't think you need to report which tasks failed and which succeeded - it will be easier for the staff to run bulk queries against the database and get the overall picture from that. After all, the vast majority of participants never read, let alone post on, these message boards, so they'd get a very sketchy picture. I'd wait and see how the feeder fault sorts itself out overnight. The staff are probably still mopping up the debris, and the server is busy catching up on delayed trickles. If there's still a problem in the morning (UK time - UTC), we can alert them then. |