climateprediction.net (CPDN) home page
Thread 'Download server down'

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 51537 - Posted: 6 Mar 2015, 9:16:52 UTC

It's not currently possible to download the data files for new tasks allocated to your computer.

Staff are investigating a possible hardware failure. More news later.
ID: 51537

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,035,559
RAC: 14,581
Message 51559 - Posted: 7 Mar 2015, 10:21:17 UTC - in response to Message 51537.  

Got these messages in the event log, if it helps:

06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::init_get(): http://cpdn-downloads.oerc.ox.ac.uk/download/hadam3p/hadam3p_eu/hadam3p_eu_0qzh_2013_0_009558496.zip
06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::libcurl_exec(): ca-bundle set
06/03/2015 23:38:47 | climateprediction.net | Started download of hadam3p_eu_0qzh_2013_0_009558496.zip
06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::init_get(): http://cpdn-downloads.oerc.ox.ac.uk/download/hadam3p/ancil/mirror.php?file=ic19611001_12_N96.gz
06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
06/03/2015 23:38:47 | climateprediction.net | [http] HTTP_OP::libcurl_exec(): ca-bundle set
06/03/2015 23:38:47 | climateprediction.net | Started download of ic19611001_12_N96.gz
06/03/2015 23:38:47 | climateprediction.net | [http] [ID#121] Info: Connection 560 seems to be dead!
06/03/2015 23:38:47 | climateprediction.net | [http] [ID#121] Info: Closing connection 560
06/03/2015 23:38:47 | climateprediction.net | [http] [ID#121] Info: timeout on name lookup is not supported
06/03/2015 23:38:47 | climateprediction.net | [http] [ID#121] Info: Hostname was NOT found in DNS cache
06/03/2015 23:38:47 | climateprediction.net | [http] [ID#121] Info: Trying 129.67.195.71...
06/03/2015 23:38:47 | climateprediction.net | [http] [ID#122] Info: Found bundle for host cpdn-downloads.oerc.ox.ac.uk: 0x2c9cd20
06/03/2015 23:38:47 | climateprediction.net | [http] [ID#122] Info: timeout on name lookup is not supported
06/03/2015 23:38:47 | climateprediction.net | [http] [ID#122] Info: Hostname was found in DNS cache
06/03/2015 23:38:47 | climateprediction.net | [http] [ID#122] Info: Trying 129.67.195.71...
06/03/2015 23:38:48 | climateprediction.net | [http] [ID#121] Info: connect to 129.67.195.71 port 80 failed: Connection refused
06/03/2015 23:38:48 | climateprediction.net | [http] [ID#121] Info: Failed to connect to cpdn-downloads.oerc.ox.ac.uk port 80: Connection refused
06/03/2015 23:38:48 | climateprediction.net | [http] [ID#121] Info: Closing connection 561
06/03/2015 23:38:48 | climateprediction.net | [http] [ID#122] Info: connect to 129.67.195.71 port 80 failed: Connection refused
06/03/2015 23:38:48 | climateprediction.net | [http] [ID#122] Info: Failed to connect to cpdn-downloads.oerc.ox.ac.uk port 80: Connection refused
ID: 51559

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 51560 - Posted: 7 Mar 2015, 10:56:48 UTC - in response to Message 51559.  

That's progress, of a sort - yesterday it was "Failed to connect: Timed out". The server is also responding to pings today, which it wasn't yesterday.

All of which suggests somebody is working on the problem. How long it'll take depends on what failed, how much of the operating system needs to be reloaded/migrated/configured, and whether any data needs to be copied. I haven't heard any news on those fronts.
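For anyone who wants to check the same thing without reading the http_debug output, here is a minimal sketch of the "refused" versus "timed out" distinction. It assumes plain TCP to port 80, as in the log above. The function names and the wording of the labels are made up for illustration and are not part of BOINC.

```python
import socket

def classify_connect_error(exc):
    """Map a connection exception to a rough diagnosis (labels are illustrative)."""
    if isinstance(exc, ConnectionRefusedError):
        # The machine answered, but nothing accepted the connection on that port.
        return "refused: host reachable, port closed"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        # No answer at all within the timeout.
        return "timed out: no response from host"
    if isinstance(exc, socket.gaierror):
        return "dns: hostname could not be resolved"
    return "other: " + type(exc).__name__

def probe(host, port=80, timeout=5.0):
    """Attempt one TCP connection and return a diagnosis string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok: port is accepting connections"
    except OSError as exc:
        return classify_connect_error(exc)

# Example call (needs network access):
# print(probe("cpdn-downloads.oerc.ox.ac.uk"))
```

A "refused" result usually means the machine is back on the network but the web server process is not listening yet, which fits the change from yesterday's timeouts.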
ID: 51560

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 51561 - Posted: 7 Mar 2015, 17:42:14 UTC

I am getting this even though the status page says everything is up.

07-Mar-2015 12:32:59 [climateprediction.net] Started download of RF4_91.astart.gz
07-Mar-2015 12:32:59 [climateprediction.net] Started download of HadISST_SI_N96_1991_12_2002_12g.gz
07-Mar-2015 12:33:00 [---] Project communication failed: attempting access to reference site
07-Mar-2015 12:33:00 [climateprediction.net] Temporarily failed download of RF4_91.astart.gz: connect() failed
07-Mar-2015 12:33:00 [climateprediction.net] Backing off 00:13:12 on download of RF4_91.astart.gz
07-Mar-2015 12:33:00 [climateprediction.net] Temporarily failed download of HadISST_SI_N96_1991_12_2002_12g.gz: connect() failed
07-Mar-2015 12:33:00 [climateprediction.net] Backing off 00:11:21 on download of HadISST_SI_N96_1991_12_2002_12g.gz
07-Mar-2015 12:33:00 [climateprediction.net] Started download of HadISST_SST_N96_1991_12_2002_12g.gz
07-Mar-2015 12:33:01 [---] Internet access OK - project servers may be temporarily down.
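If it helps to see when the client will retry, the "Backing off" lines can be pulled out of a saved copy of the log. This is only a sketch: it assumes the message wording shown above, and the names `BACKOFF_RE` and `parse_backoffs` are my own.

```python
import re
from datetime import timedelta

# Matches e.g. "Backing off 00:13:12 on download of RF4_91.astart.gz"
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on download of (\S+)")

def parse_backoffs(lines):
    """Return {filename: timedelta} for every 'Backing off' line found."""
    result = {}
    for line in lines:
        m = BACKOFF_RE.search(line)
        if m:
            hours, minutes, seconds = (int(g) for g in m.groups()[:3])
            result[m.group(4)] = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return result

# Two lines copied from the log above.
sample = [
    "07-Mar-2015 12:33:00 [climateprediction.net] Backing off 00:13:12 on download of RF4_91.astart.gz",
    "07-Mar-2015 12:33:00 [climateprediction.net] Backing off 00:11:21 on download of HadISST_SI_N96_1991_12_2002_12g.gz",
]
backoffs = parse_backoffs(sample)
for name, delay in backoffs.items():
    print(name, "retries in", delay)
```

Since a file may back off more than once, the dictionary keeps only the most recent delay seen for each file.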

ID: 51561

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 51562 - Posted: 7 Mar 2015, 20:05:37 UTC - in response to Message 51561.  

The trouble is that the server status page only monitors climateapps2, which so far as I know is still working. The file you are trying to download comes from cpdn-downloads.oerc, which failed early on Friday morning (and by your report, hasn't been repaired yet). I've suggested that both download servers should be monitored on the SSP to avoid exactly this confusion, but fixing the broken server comes first.
ID: 51562

Jonathan Miller
Joined: 27 Jul 12
Posts: 21
Credit: 269,602
RAC: 0
Message 51578 - Posted: 9 Mar 2015, 13:02:39 UTC - in response to Message 51562.  

The download server is now running again.

The reason for the lack of monitoring for the downloads server on the Server Status page is that we used a load-balancer to distribute work between climateapps2 and other servers (cpdn-downloads and uploader1.atm), but because of the way it was implemented, all requests had to come to climateapps2.

This is yet another legacy issue that is due to be overhauled soon.

Apologies for the inconvenience.

Jonathan

CPDN SysAdmin
ID: 51578

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 51579 - Posted: 9 Mar 2015, 14:43:40 UTC

Thanks Jonathan, my downloads have just completed. :)
ID: 51579

Professor Desty Nova
Joined: 19 Sep 04
Posts: 92
Credit: 2,013,293
RAC: 392
Message 51580 - Posted: 9 Mar 2015, 15:25:53 UTC
Last modified: 9 Mar 2015, 15:26:15 UTC

It seems like some files got moved or deleted from the servers.

I have at least two WUs (for now) with an error like this:

1st
09/03/2015 14:07:15 | climateprediction.net | Started download of hadam3p_eu_0sj2_2013_0_009560497.zip
09/03/2015 14:07:16 | climateprediction.net | Giving up on download of hadam3p_eu_0sj2_2013_0_009560497.zip: permanent HTTP error


2nd
09/03/2015 15:00:45 | climateprediction.net | Requesting new tasks for CPU
09/03/2015 15:00:47 | climateprediction.net | Scheduler request completed: got 1 new tasks
09/03/2015 15:00:50 | climateprediction.net | Started download of hadam3p_eu_0t7x_2013_0_009561392.zip
09/03/2015 15:00:50 | climateprediction.net | Started download of ic19610507_10_N96.gz
09/03/2015 15:00:51 | climateprediction.net | Giving up on download of hadam3p_eu_0t7x_2013_0_009561392.zip: permanent HTTP error
09/03/2015 15:00:51 | climateprediction.net | Started download of atmos_n0h9.day.gz
09/03/2015 15:00:53 | climateprediction.net | Finished download of ic19610507_10_N96.gz
09/03/2015 15:00:53 | climateprediction.net | Started download of region_n0h9.day.gz
09/03/2015 15:01:16 | climateprediction.net | Finished download of region_n0h9.day.gz
09/03/2015 15:01:16 | climateprediction.net | Started download of delta_GloSea5_2013_mem027_GFDL-CM3_all.gz
09/03/2015 15:01:18 | climateprediction.net | Finished download of atmos_n0h9.day.gz
09/03/2015 15:01:18 | climateprediction.net | Finished download of delta_GloSea5_2013_mem027_GFDL-CM3_all.gz


Professor Desty Nova
Researching Karma the Hard Way
ID: 51580

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 51581 - Posted: 9 Mar 2015, 15:46:03 UTC

I suspect it may just be pressure on the server as all the machines try and download at once.
ID: 51581

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 51582 - Posted: 9 Mar 2015, 16:07:19 UTC - in response to Message 51581.  

> I suspect it may just be pressure on the server as all the machines try and download at once.

Unlikely. I had four failures (out of 19 files) in the set of four tasks I was allocated on Friday, and which prompted me to open this thread. The files which failed - one per task - exactly match the template which Professor Desty Nova has illustrated.

I reported that by email nearly two hours ago, but the mail server hasn't yet redistributed it to other subscribers.

I suspect that it's more likely that one or more 'mount points' - where files stored on one physical server are made available to other servers in the collection - may not have re-established themselves correctly after the download server was re-activated.
ID: 51582

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 51583 - Posted: 9 Mar 2015, 16:24:56 UTC
Last modified: 9 Mar 2015, 16:26:17 UTC

Richard, I was just coming back to say I had changed my mind.
Mon 09 Mar 2015 15:46:58 GMT | climateprediction.net | Started download of hadam3pm2_k2za_1959_10_009465089.zip
Mon 09 Mar 2015 15:47:00 GMT | climateprediction.net | Giving up on download of hadam3pm2_k2za_1959_10_009465089.zip: permanent HTTP error

Had I read the post more carefully to start with I would have seen it was a permanent HTTP error rather than a transient one.

Which brings me to my next question. If BOINC has given up on this download, should I abort the task to allow it to be reissued?
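To make the permanent versus transient distinction easier to spot in a long log, something like the sketch below would sort the failure lines into two lists. It assumes only the two message wordings that appear in this thread ("Temporarily failed download of" and "Giving up on download of"). The function and pattern names are invented for the example.

```python
import re

# Wordings taken from the event-log lines quoted in this thread.
TRANSIENT_RE = re.compile(r"Temporarily failed download of (\S+): (.+)$")
PERMANENT_RE = re.compile(r"Giving up on download of (\S+): (.+)$")

def classify_failures(lines):
    """Split download failures into (transient, permanent) lists of (file, reason)."""
    transient, permanent = [], []
    for line in lines:
        line = line.rstrip()
        m = PERMANENT_RE.search(line)
        if m:
            permanent.append((m.group(1), m.group(2)))
            continue
        m = TRANSIENT_RE.search(line)
        if m:
            transient.append((m.group(1), m.group(2)))
    return transient, permanent

# One line of each kind, copied from earlier posts.
sample = [
    "07-Mar-2015 12:33:00 [climateprediction.net] Temporarily failed download of RF4_91.astart.gz: connect() failed",
    "Mon 09 Mar 2015 15:47:00 GMT | climateprediction.net | Giving up on download of hadam3pm2_k2za_1959_10_009465089.zip: permanent HTTP error",
]
transient, permanent = classify_failures(sample)
print("transient:", transient)
print("permanent:", permanent)
```

Files in the transient list will be retried by the client after a backoff. Files in the permanent list will not be retried.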
ID: 51583

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 51584 - Posted: 9 Mar 2015, 17:19:14 UTC - in response to Message 51583.  
Last modified: 9 Mar 2015, 17:24:56 UTC

> Richard, I was just coming back to say I had changed my mind.
>
> Mon 09 Mar 2015 15:46:58 GMT | climateprediction.net | Started download of hadam3pm2_k2za_1959_10_009465089.zip
> Mon 09 Mar 2015 15:47:00 GMT | climateprediction.net | Giving up on download of hadam3pm2_k2za_1959_10_009465089.zip: permanent HTTP error
>
> Had I read the post more carefully to start with I would have seen it was a permanent HTTP error rather than a transient one.
>
> Which brings me to my next question. If BOINC has given up on this download, should I abort the task to allow it to be reissued?

It'll be marked as an error already, and will report itself at the end of the 1-hour server RPC backoff.

Edit - probably best to hang on to it for as long as your BOINC client will allow. If it's reissued before the missing file has been made accessible to the download server, it'll just waste someone's bandwidth and fail again.
ID: 51584

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 51585 - Posted: 9 Mar 2015, 17:24:20 UTC
Last modified: 9 Mar 2015, 17:25:18 UTC

I noticed that the download files that were stuck in my transfer tab all weekend are now gone, but the 2 tasks now show as "download failed." Is the download server fixed?
ID: 51585

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 51586 - Posted: 9 Mar 2015, 17:26:22 UTC - in response to Message 51585.  

> I noticed that the download files that were stuck in my transfer tab all weekend are now gone, but the 2 tasks now show as "download failed." Is the download server fixed?

The server is fine, but it looks as if some of the files it should be serving are inaccessible or otherwise AWOL.
ID: 51586

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 51587 - Posted: 9 Mar 2015, 20:18:39 UTC

The one with the failed download has disappeared from my tasks list now.
ID: 51587

Professor Desty Nova
Joined: 19 Sep 04
Posts: 92
Credit: 2,013,293
RAC: 392
Message 51589 - Posted: 9 Mar 2015, 22:56:54 UTC

I'm guessing, with this mass of errors, that they will have to reissue the batch of WU from last week...

(All the WUs that gave me "Error while downloading" had the same result for the other two wingmen.)


Professor Desty Nova
Researching Karma the Hard Way
ID: 51589

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 51592 - Posted: 10 Mar 2015, 14:46:01 UTC

It turns out that the folder where the task files are stored for downloading got corrupted in the server crash, and couldn't be recovered. So the tasks awaiting allocation last week have been blocked and won't be reissued just now.

A new batch of EU tasks has been released today, and downloaded to my machine without problems. No doubt the researchers will be looking into how far the last batch got before the crash, and working out how many, if any, they need to re-submit.
ID: 51592

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 51597 - Posted: 10 Mar 2015, 17:00:32 UTC - in response to Message 51592.  

Do we need to report models that need to be reissued? I aborted two during the download failure, but they still show as in progress on the web.

By the way, I've just got "Server error: feeder not running" when sending an update from my BOINC client.

Cheers
ID: 51597

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 51598 - Posted: 10 Mar 2015, 17:18:30 UTC - in response to Message 51597.  

I don't think you need to report which tasks failed and which succeeded - it will be easier for the staff to run bulk queries against the database and get the overall picture from that. After all, the vast majority of participants never read, let alone post on, these message boards, so they'd get a very sketchy picture.

I'd wait and see how the feeder fault sorts itself out overnight. The staff are probably still mopping up the debris, and the server is busy catching up on delayed trickles. If there's still a problem in the morning (UK time - UTC), we can alert them then.
ID: 51598