Thread 'ANOTHER UPLOAD PROBLEM'

Message boards : Number crunching : ANOTHER UPLOAD PROBLEM
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50741 - Posted: 5 Nov 2014, 19:29:50 UTC - in response to Message 50739.  
Last modified: 5 Nov 2014, 19:55:43 UTC

I think this may be a lot worse than the Australian server being down.
Or, in the case of the 2 models listed, the re-start server at Oxford.

That computer has a LOT of "still running" models showing on its list from way back.
1263799

So, some questions:

Is that computer also running work for other projects?
Are there 12 climate models showing in its Tasks tab?
What message(s) is/are showing in the Event Log when there's an upload attempt?

And I think that you should start a new thread for this, as it may take several posts to sort out.
ID: 50741
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,722,381
RAC: 7,664
Message 50742 - Posted: 5 Nov 2014, 23:49:34 UTC - in response to Message 50741.  

Les, don't forget the possibility of 'ghost WUs'.
ID: 50742
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50743 - Posted: 6 Nov 2014, 0:04:26 UTC - in response to Message 50742.  

Ah, yes. I did. Thanks.
Thought of something else, but I've forgotten since. :(

ID: 50743
rjs5
Joined: 16 Jun 05
Posts: 16
Credit: 19,497,093
RAC: 9,315
Message 50775 - Posted: 10 Nov 2014, 14:53:00 UTC - in response to Message 50741.  

Les,
Sorry, but I could not figure out how to start a new thread.
Answers at end.



I think this may be a lot worse than the Australian server being down.
Or, in the case of the 2 models listed, the re-start server at Oxford.

That computer has a LOT of "still running" models showing on its list from way back.
1263799

So, some questions:

Is that computer also running work for other projects?
Are there 12 climate models showing in its Tasks tab?
What message(s) is/are showing in the Event Log when there's an upload attempt?

And I think that you should start a new thread for this, as it may take several posts to sort out.



1. Yes, the computer is also running work for other projects. PrimeGrid, World Community Grid, MilkyWay (GPU only), Rosetta, and Einstein (GPU only).
2. Yes, there are 12 climate models showing in the TASKs tab. Two have completed after 200 hours each (a week or so ago) and 10 more are in the QUEUE. It looks like about 300 compute hours left.
3. When I push the RETRY NOW button to upload the completed tasks, I get the sequence of messages:

11/10/2014 5:57:32 AM | climateprediction.net | Started upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip
11/10/2014 5:57:36 AM | climateprediction.net | Started upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip
11/10/2014 6:02:38 AM | climateprediction.net | Temporarily failed upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip: transient HTTP error
11/10/2014 6:02:38 AM | climateprediction.net | Backing off 05:03:32 on upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip
11/10/2014 6:02:39 AM | | Project communication failed: attempting access to reference site
11/10/2014 6:02:40 AM | | Internet access OK - project servers may be temporarily down.
11/10/2014 6:02:43 AM | climateprediction.net | Temporarily failed upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip: transient HTTP error
11/10/2014 6:02:43 AM | climateprediction.net | Backing off 03:47:13 on upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip
11/10/2014 6:02:44 AM | | Project communication failed: attempting access to reference site
11/10/2014 6:02:45 AM | | Internet access OK - project servers may be temporarily down.
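
For anyone who wants to pull the failures out of a longer Event Log, here is a minimal Python sketch. It assumes the log lines keep the `date time | project | message` layout shown above. The function name `failed_uploads` is made up for illustration and is not part of BOINC.

```python
import re

# Message part of "... | Temporarily failed upload of <file>: <reason>"
FAILED = re.compile(r"Temporarily failed upload of (?P<file>\S+): (?P<reason>.+)$")
# Message part of "... | Backing off HH:MM:SS on upload of <file>"
BACKOFF = re.compile(
    r"Backing off (?P<h>\d+):(?P<m>\d+):(?P<s>\d+) on upload of (?P<file>\S+)$"
)


def failed_uploads(log_text):
    """Return {file name: {"reason": str, "backoff_seconds": int}}.

    Hypothetical helper: it only understands the two message shapes above.
    """
    result = {}
    for line in log_text.splitlines():
        # The message is whatever follows the last "|" on the line.
        message = line.split("|")[-1].strip()
        failed = FAILED.search(message)
        if failed:
            result.setdefault(failed["file"], {})["reason"] = failed["reason"]
            continue
        backoff = BACKOFF.search(message)
        if backoff:
            seconds = (
                int(backoff["h"]) * 3600 + int(backoff["m"]) * 60 + int(backoff["s"])
            )
            result.setdefault(backoff["file"], {})["backoff_seconds"] = seconds
    return result
```

Pasting the Event Log text into a string and calling `failed_uploads(text)` should give one entry per stuck zip, which makes it easier to see whether every failure has the same reason.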

ID: 50775
rjs5
Joined: 16 Jun 05
Posts: 16
Credit: 19,497,093
RAC: 9,315
Message 50777 - Posted: 10 Nov 2014, 16:35:26 UTC - in response to Message 50741.  

The "eu" path seems to be working fine for trickle up.

11/10/2014 7:32:32 AM | climateprediction.net | Started upload of hadam3p_eu_h6e3_2013_1_008862108_0_3.zip
11/10/2014 7:36:25 AM | climateprediction.net | Finished upload of hadam3p_eu_h6e3_2013_1_008862108_0_3.zip

ID: 50777
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 50816 - Posted: 13 Nov 2014, 20:26:27 UTC - in response to Message 50777.  
Last modified: 13 Nov 2014, 20:35:06 UTC

A bit of (picky) terminology (which might help others; terms changed over CPDN's years) -- this is the 'current final' version:

Trickles are small files with minimal science, at Checkpoints, used to award credit (credit not held until the end, as with other projects) and keep the head shed updated on tasks' progress. They do not show in the log.

Task is a single job, in the queue or in progress.

Work Unit is the total of identical Tasks allowed in case initial and subsequent attempts fail; current number of attempts allowed is five.

For most recent Tasks, twelve .zip files go to servers in England, Oregon (US), or Australia. The last .zip, #13, goes to its own server at Oxford. This is the restart file which allows the next increment in the sequence of years to be sent out. (We used to run 160-year HADcm3 tasks as a single task on one computer. After numerous complaints, the work was chopped into pieces -- shorter runs, with a penalty of longer uploads and downloads.)
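
As a rough illustration of the zip numbering, the Python sketch below reads the zip number off an upload file name and flags #13 as the restart file. It assumes the number just before `.zip` is the zip number, as in the `..._0_13.zip` names in the logs earlier in this thread, and it only applies to the twelve-plus-one layout described here. The function names are made up for the example.

```python
import re

# Assumption from the description above: zips 1-12 carry data, #13 is the restart file.
RESTART_ZIP = 13


def zip_number(file_name):
    """Return the trailing zip number of an upload file name, or None if absent."""
    match = re.search(r"_(\d+)\.zip$", file_name)
    return int(match.group(1)) if match else None


def is_restart_zip(file_name):
    """True when the name looks like the final (restart) zip of such a task."""
    return zip_number(file_name) == RESTART_ZIP
```

Task types with a different number of zips would need a different value for `RESTART_ZIP`, so treat the constant as specific to the tasks described in this post.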

[Edited for typo.]
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 50816
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50897 - Posted: 28 Nov 2014, 11:09:08 UTC

Task 17391789 is failing to upload its second zip file. It starts and gets as far as 0.91MB (1.43%). The only message is Temporarily failed upload of....... Backing off....

Uploads to the beta site are working, though I do need to nurse them through by hitting the Retry Now button a few times.
ID: 50897
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50898 - Posted: 28 Nov 2014, 12:53:52 UTC

Server now fixed.
ID: 50898
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,053,019
RAC: 14,719
Message 51212 - Posted: 14 Jan 2015, 16:17:18 UTC - in response to Message 50898.  

I'm getting an upload failure on a "short" model which has completed and uploaded two trickles but is failing to upload the two zip files:

14/01/2015 08:09:15 | climateprediction.net | Started upload of hadcm3s_73wm_1980_2_009364084_1_1.zip
14/01/2015 08:09:17 | climateprediction.net | Temporarily failed upload of hadcm3s_73wm_1980_2_009364084_1_1.zip: connect() failed
14/01/2015 08:09:17 | climateprediction.net | Backing off 04:35:55 on upload of hadcm3s_73wm_1980_2_009364084_1_1.zip
14/01/2015 08:09:20 | | Project communication failed: attempting access to reference site
14/01/2015 08:09:21 | | Internet access OK - project servers may be temporarily down.
14/01/2015 10:34:01 | climateprediction.net | Sending scheduler request: To send trickle-up message.
14/01/2015 10:34:01 | climateprediction.net | Not requesting tasks: some task is suspended via Manager
14/01/2015 10:34:03 | climateprediction.net | Scheduler request completed
14/01/2015 10:34:11 | climateprediction.net | Started upload of hadcm3s_73wm_1980_2_009364084_1_2.zip
14/01/2015 10:34:14 | climateprediction.net | Temporarily failed upload of hadcm3s_73wm_1980_2_009364084_1_2.zip: connect() failed
14/01/2015 10:34:14 | climateprediction.net | Backing off 00:02:01 on upload of hadcm3s_73wm_1980_2_009364084_1_2.zip
14/01/2015 10:34:17 | | Project communication failed: attempting access to reference site
14/01/2015 10:34:18 | | Internet access OK - project servers may be temporarily down.


This has been going on since yesterday evening and the zips are stuck in the Transfers tab. Curiously, I cannot find a folder in the projects folder in the BOINC data directory that corresponds to the model that has been run. Is there a file missing somewhere?
ID: 51212
peterfilla
Joined: 27 Sep 04
Posts: 27
Credit: 11,115,003
RAC: 0
Message 51215 - Posted: 14 Jan 2015, 17:03:18 UTC

Same problem with several tasks !
ID: 51215
Kevin
Joined: 5 Jul 09
Posts: 63
Credit: 6,091,274
RAC: 0
Message 51218 - Posted: 14 Jan 2015, 18:53:15 UTC - in response to Message 51215.  

Same problem with several tasks !


same here.

ID: 51218
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51219 - Posted: 14 Jan 2015, 18:55:14 UTC - in response to Message 51212.  

Alan

The description of what has happened needs to be very specific here, because it could be a BOINC "feature".

So: You said that the model had completed. Do you mean that it has Reported?
Because if so, that's the end of that model, as below.

**********

There's a situation where BOINC has a problem when some of the servers are slow, or "down".
This is when the "Network" in the BOINC menu has been turned Off, and all zips / trickles from that point are allowed to accumulate on the computer. If there is then a failure of the model at/near the end, BOINC gets a message of the failure, and flags it internally as such.

So, when the "Network" is turned back on, BOINC runs through its ToDo list, (client_state.xml), which is to start sending back the trickles and the first 2 zips. Then it gets to the part where it has written that the model has failed, and begins to clear everything from its ToDo list, so it sends back the error messages, and then deletes everything pertaining to that model.

At this point, all of the zips, (and trickles if they are still waiting on a slow server), suddenly disappear. You can see this happen if you're looking at the Transfers tab at the right moment.


ID: 51219
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,722,381
RAC: 7,664
Message 51220 - Posted: 14 Jan 2015, 19:16:35 UTC - in response to Message 51219.  

Another way of looking at it: BOINC regards a task as finished, dealt with, removable, when:

* either it has reached a successful completion and uploaded all its zip files.

* or it has exited abnormally with an error.

The second situation doesn't wait for uploads to complete before doing the housekeeping - the developers didn't consider the 'middle way' where the early, intermediate, data is still valuable, even if the task couldn't make it all the way through to the end.
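
The two rules above can be written down as a toy model in Python. This is only a sketch of the behaviour as described in this post, not BOINC's actual code, and the `Task` fields and the function name are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    finished_ok: bool        # reached a successful completion
    exited_with_error: bool  # exited abnormally with an error
    pending_uploads: int     # zips still waiting in the Transfers tab


def client_will_clean_up(task):
    """Toy version of the two rules; hypothetical, not taken from BOINC."""
    if task.exited_with_error:
        # Error path: housekeeping does not wait for outstanding uploads.
        return True
    # Success path: only once every zip has been uploaded.
    return task.finished_ok and task.pending_uploads == 0
```

In this model a failed task with zips still queued is cleaned up straight away, which matches the zips vanishing from the Transfers tab.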
ID: 51220
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,053,019
RAC: 14,719
Message 51221 - Posted: 14 Jan 2015, 19:58:41 UTC - in response to Message 51220.  

Hi

The model is showing in the Tasks tab at 100% progress. There are two zip files waiting in the Transfers tab, since some time yesterday evening. Also, two trickle files have been successfully uploaded. There is nothing in the stderr tab on the task on my account. Does this give you any clues? I was wondering if a server was down...
ID: 51221
totoshi
Joined: 27 Nov 14
Posts: 3
Credit: 678,458
RAC: 0
Message 51223 - Posted: 14 Jan 2015, 22:31:00 UTC

Same here.

The task is still running ( ~ 61%) but BOINC cannot upload the trickle (for a few days).

I.e.:

14.01.2015 23:20:25 | climateprediction.net | Started upload of hadcm3s_78ub_1980_2_009370481_1_1.zip
14.01.2015 23:20:28 | climateprediction.net | Temporarily failed upload of hadcm3s_78ub_1980_2_009370481_1_1.zip: connect() failed
14.01.2015 23:20:28 | climateprediction.net | Backing off 03:56:17 on upload of hadcm3s_78ub_1980_2_009370481_1_1.zip
ID: 51223
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,722,381
RAC: 7,664
Message 51224 - Posted: 14 Jan 2015, 23:56:43 UTC - in response to Message 51223.  

Same here.

The task is still running ( ~ 61%) but BOINC cannot upload the trickle (for a few days).

I.e.:

14.01.2015 23:20:25 | climateprediction.net | Started upload of hadcm3s_78ub_1980_2_009370481_1_1.zip
14.01.2015 23:20:28 | climateprediction.net | Temporarily failed upload of hadcm3s_78ub_1980_2_009370481_1_1.zip: connect() failed
14.01.2015 23:20:28 | climateprediction.net | Backing off 03:56:17 on upload of hadcm3s_78ub_1980_2_009370481_1_1.zip

To help the staff debug your problem, it should be said that what you've reported isn't a trickle problem. They look like

14/01/2015 04:40:32 | climateprediction.net | Sending scheduler request: To send trickle-up message.
14/01/2015 04:40:35 | climateprediction.net | Scheduler request completed

Yours is a data upload problem, and to solve it, we first need to know which upload server the file is being sent to.
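
If it helps to tell the two apart when reading a long log, here is a small Python sketch that labels a line as trickle or upload. It relies only on the wording of the messages quoted in this thread ("trickle-up" and "upload of"), and `classify` is a made-up name, not a BOINC function.

```python
def classify(line):
    """Label one Event Log line as 'trickle', 'upload', or 'other'."""
    # The message is whatever follows the last "|" on the line.
    message = line.split("|")[-1].strip()
    if "trickle-up" in message:
        return "trickle"
    if "upload of" in message:
        return "upload"
    return "other"
```

Note that the "Scheduler request completed" line that follows a trickle comes out as "other", because that message is not specific to trickles.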
ID: 51224
ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,352,975
RAC: 13,071
Message 51225 - Posted: 15 Jan 2015, 0:19:58 UTC

I've been having an upload problem since last night.
My log reports

14/01/2015 22:08:50 | climateprediction.net | Started upload of hadcm3n_xc8x_1940_40_009152167_1_4.zip
14/01/2015 22:08:53 | | Project communication failed: attempting access to reference site
14/01/2015 22:08:53 | climateprediction.net | Temporarily failed upload of hadcm3n_xc8x_1940_40_009152167_1_4.zip: connect() failed
14/01/2015 22:08:53 | climateprediction.net | Backing off 03:33:32 on upload of hadcm3n_xc8x_1940_40_009152167_1_4.zip
14/01/2015 22:08:55 | | Internet access OK - project servers may be temporarily down.
ID: 51225
peterfilla
Joined: 27 Sep 04
Posts: 27
Credit: 11,115,003
RAC: 0
Message 51226 - Posted: 15 Jan 2015, 1:47:14 UTC

The server for the hadcm3c result files (63 - 64 MB) is giving "connect() failed". I do not know the name of the server. (Also the server for the hadcm3n result files?!)
ID: 51226
peterfilla
Joined: 27 Sep 04
Posts: 27
Credit: 11,115,003
RAC: 0
Message 51227 - Posted: 15 Jan 2015, 1:48:44 UTC - in response to Message 51226.  

sorry: correction : not "hadcm3c . . ." -> "hadcm3s . . ."
ID: 51227
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51229 - Posted: 15 Jan 2015, 4:58:58 UTC

The server name will be listed in client_state.xml

1) Copy this file.
2) Paste it somewhere outside of the BOINC structure.
3) Open the copy with Notepad.
4) Search for the 4 character model name.
For this one of mine: hadam3p_eu_zwyt_2013_0_009436587_2, the 4 character name is zwyt

5) Keep searching until, a few lines below where you end up, you see the word upload_handler to the right of the string. (The line will be enclosed in upload_url at each end.)
6) Copy that upload line, and paste it here.
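
For those who would rather not search by hand, the steps above can be sketched in Python. This assumes the URL sits on one line between `<upload_url>` and `</upload_url>`, as described in step 5, and that "a few lines below" means within 20 lines of the model name. That limit, the function name, and the file name in the usage note are my own choices, not anything defined by BOINC. Run it on a copy of the file, as in steps 1 and 2.

```python
import re

UPLOAD_URL = re.compile(r"<upload_url>(.*?)</upload_url>")


def find_upload_urls(state_text, model_name, max_gap=20):
    """Return upload URLs that appear shortly after a mention of model_name.

    max_gap is an assumed limit on how many lines below the mention to accept.
    """
    urls = []
    lines_since_mention = None  # None until the model name has been seen
    for line in state_text.splitlines():
        if model_name in line:
            lines_since_mention = 0
        elif lines_since_mention is not None:
            lines_since_mention += 1
        match = UPLOAD_URL.search(line)
        if match and lines_since_mention is not None:
            if lines_since_mention <= max_gap:
                url = match.group(1).strip()
                if url not in urls:
                    urls.append(url)
            # This mention is used up; wait for the next one.
            lines_since_mention = None
    return urls
```

To use it, read the copy into a string, for example `text = open("client_state_copy.xml", errors="replace").read()`, then call `find_upload_urls(text, "zwyt")` with your own 4 character model name and paste the result here.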


ID: 51229

©2024 cpdn.org