Message boards : Number crunching : ANOTHER UPLOAD PROBLEM
Author | Message |
---|---|
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I think this may be a lot worse than the Australian server being down. Or, in the case of the 2 models listed, the re-start server at Oxford. That computer has a LOT of "still running" models showing on its list from way back. 1263799 So, some questions: Is that computer also running work for other projects? Are there 12 climate models showing in its Tasks tab? What message(s) is/are showing in the Event Log when there's an upload attempt? And I think that you should start a new thread for this, as it may take several posts to sort out. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
Les, don't forget the possibility of 'ghost WUs'. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Ah, yes. I did. Thanks. Thought of something else, but I've forgotten since. :( |
Send message Joined: 16 Jun 05 Posts: 16 Credit: 19,471,415 RAC: 10,424 |
Les, Sorry, but I could not figure out how to start a new thread. Answers at end. I think this may be a lot worse than the Australian server being down. 1. Yes, the computer is also running work for other projects. PrimeGrid, World Community Grid, MilkyWay (GPU only), Rosetta, and Einstein (GPU only). 2. Yes, there are 12 climate models showing in the TASKs tab. Two have completed after 200 hours each (a week or so ago) and 10 more are in the QUEUE. It looks like about 300 compute hours left. 3. When I push the RETRY NOW button to upload the completed tasks, I get the sequence of messages: 11/10/2014 5:57:32 AM | climateprediction.net | Started upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip 11/10/2014 5:57:36 AM | climateprediction.net | Started upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip 11/10/2014 6:02:38 AM | climateprediction.net | Temporarily failed upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip: transient HTTP error 11/10/2014 6:02:38 AM | climateprediction.net | Backing off 05:03:32 on upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip 11/10/2014 6:02:39 AM | | Project communication failed: attempting access to reference site 11/10/2014 6:02:40 AM | | Internet access OK - project servers may be temporarily down. 11/10/2014 6:02:43 AM | climateprediction.net | Temporarily failed upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip: transient HTTP error 11/10/2014 6:02:43 AM | climateprediction.net | Backing off 03:47:13 on upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip 11/10/2014 6:02:44 AM | | Project communication failed: attempting access to reference site 11/10/2014 6:02:45 AM | | Internet access OK - project servers may be temporarily down. |
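For anyone who wants to pull the back-off intervals out of an Event Log dump like the one above, here is a minimal Python sketch. It assumes the log lines look exactly like the ones quoted in this post; the name `latest_backoffs` is made up for illustration and is not part of BOINC.

```python
import re
from datetime import timedelta

# Matches e.g. "Backing off 05:03:32 on upload of some_file.zip"
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")

def latest_backoffs(log_lines):
    """Map each file name to the most recent back-off interval seen for it."""
    result = {}
    for line in log_lines:
        match = BACKOFF_RE.search(line)
        if match:
            hours, minutes, seconds, name = match.groups()
            result[name] = timedelta(
                hours=int(hours), minutes=int(minutes), seconds=int(seconds)
            )
    return result

# Two lines copied from the log quoted above.
sample_log = [
    "11/10/2014 6:02:38 AM | climateprediction.net | Backing off 05:03:32 on upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip",
    "11/10/2014 6:02:43 AM | climateprediction.net | Backing off 03:47:13 on upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip",
]

for name, delay in latest_backoffs(sample_log).items():
    print(name, "retries in", delay)
```

Paste the Event Log lines into a list (or read them from a text file) and the function reports how long BOINC intends to wait before each retry.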
Send message Joined: 16 Jun 05 Posts: 16 Credit: 19,471,415 RAC: 10,424 |
The "eu" path seems to be working fine for trickle up. 11/10/2014 7:32:32 AM | climateprediction.net | Started upload of hadam3p_eu_h6e3_2013_1_008862108_0_3.zip 11/10/2014 7:36:25 AM | climateprediction.net | Finished upload of hadam3p_eu_h6e3_2013_1_008862108_0_3.zip |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
A bit of (picky) terminology (which might help others; terms changed over CPDN's years) -- this is the 'current final' version: Trickles are small files with minimal science, at Checkpoints, used to award credit (credit not held until the end, as with other projects) and keep the head shed updated on tasks' progress. They do not show in the log. Task is a single job, in the queue or in progress. Work Unit is the total of identical Tasks allowed in case initial and subsequent attempts fail; current number of attempts allowed is five. For most recent Tasks, twelve .zip files go to servers in England, Oregon (US), or Australia. The last .zip, #13, goes to its own server at Oxford. This is the restart file which allows the next increment in the sequence of years to be sent out. (We used to run 160-year HADcm3 tasks as a single task on one computer. After numerous complaints, the work was chopped into pieces -- shorter runs, with a penalty of longer uploads and downloads.) [Edited for typo.] "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
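To tell at a glance whether a stuck file is one of the twelve ordinary zips or the #13 restart file, a small Python sketch along these lines may help. It assumes that the last number before `.zip` is the zip's sequence number, which is inferred from the file names quoted in this thread rather than from any CPDN documentation, and that 13 is the restart zip only for the task types described above. The helper names are hypothetical.

```python
import re

# Assumption: the trailing "_<number>.zip" is the zip's sequence number.
ZIP_INDEX_RE = re.compile(r"_(\d+)\.zip$")

def zip_index(file_name):
    """Return the trailing sequence number of an upload file name."""
    match = ZIP_INDEX_RE.search(file_name)
    if match is None:
        raise ValueError(f"no zip sequence number in {file_name!r}")
    return int(match.group(1))

def is_restart_zip(file_name, restart_index=13):
    """True if the file looks like the restart zip (number 13 by default)."""
    return zip_index(file_name) == restart_index

for name in (
    "hadam3p_anz_r719_2012_1_008738731_0_13.zip",
    "hadam3p_eu_h6e3_2013_1_008862108_0_3.zip",
):
    print(name, "-> zip", zip_index(name), "restart:", is_restart_zip(name))
```

Shorter models that send fewer zips would need a different `restart_index`, if they have a restart file at all.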
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
Task 17391789 is failing to upload its second zip file. It starts, gets as far as 0.91MB (1.43%). The only message is "Temporarily failed upload of....... Backing off...." Uploads to the beta site are working, though I do need to nurse them through by hitting the Retry Now button a few times. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
Server now fixed. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,994,950 RAC: 14,359 |
I'm getting an upload failure on a "short" model which has completed and uploaded two trickles but is failing to upload the two zip files: 14/01/2015 08:09:15 | climateprediction.net | Started upload of hadcm3s_73wm_1980_2_009364084_1_1.zip 14/01/2015 08:09:17 | climateprediction.net | Temporarily failed upload of hadcm3s_73wm_1980_2_009364084_1_1.zip: connect() failed 14/01/2015 08:09:17 | climateprediction.net | Backing off 04:35:55 on upload of hadcm3s_73wm_1980_2_009364084_1_1.zip 14/01/2015 08:09:20 | | Project communication failed: attempting access to reference site 14/01/2015 08:09:21 | | Internet access OK - project servers may be temporarily down. 14/01/2015 10:34:01 | climateprediction.net | Sending scheduler request: To send trickle-up message. 14/01/2015 10:34:01 | climateprediction.net | Not requesting tasks: some task is suspended via Manager 14/01/2015 10:34:03 | climateprediction.net | Scheduler request completed 14/01/2015 10:34:11 | climateprediction.net | Started upload of hadcm3s_73wm_1980_2_009364084_1_2.zip 14/01/2015 10:34:14 | climateprediction.net | Temporarily failed upload of hadcm3s_73wm_1980_2_009364084_1_2.zip: connect() failed 14/01/2015 10:34:14 | climateprediction.net | Backing off 00:02:01 on upload of hadcm3s_73wm_1980_2_009364084_1_2.zip 14/01/2015 10:34:17 | | Project communication failed: attempting access to reference site 14/01/2015 10:34:18 | | Internet access OK - project servers may be temporarily down. This has been going on since yesterday evening and the zips are stuck in the Transfers tab. Curiously, I cannot find a folder in the projects folder in BOINC data that corresponds to the model that has been run. Is there a file missing somewhere? |
Send message Joined: 27 Sep 04 Posts: 27 Credit: 11,115,003 RAC: 0 |
Same problem with several tasks ! |
Send message Joined: 5 Jul 09 Posts: 63 Credit: 6,091,274 RAC: 0 |
Same problem with several tasks ! same here. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Alan, The description of what has happened needs to be very specific here, because it could be a BOINC "feature". So: You said that the model had completed. Do you mean that it has Reported? Because if so, that's the end of that model, as below. ********** There's a situation where there's a problem with BOINC when some of the servers are slow, or "down". This is when the "Network" in the BOINC menu has been turned Off, and all zips / trickles from that point are allowed to accumulate on the computer. If there is then a failure of the model at/near the end, BOINC gets a message of the failure, and flags it internally as such. So, when the "Network" is turned back on, BOINC runs through its ToDo list (client_state.xml), which is to start sending back the trickles and the first 2 zips. Then it gets to the part where it has written that the model has failed, and begins to clear everything from its ToDo list, so it sends back the error messages, and then deletes everything pertaining to that model. At this point, all of the zips (and trickles, if they are still waiting on a slow server) suddenly disappear. You can see this happen if you're looking at the Transfers tab at the right moment. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
Another way of looking at it: BOINC regards a task as finished, dealt with, removable, when: * either it has reached a successful completion and uploaded all its zip files. * or it has exited abnormally with an error. The second situation doesn't wait for uploads to complete before doing the housekeeping - the developers didn't consider the 'middle way' where the early, intermediate, data is still valuable, even if the task couldn't make it all the way through to the end. |
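The two cases above can be written down as a toy Python model. This is only a sketch of the rule as described in this post, not BOINC's actual code; the `Task` class, its fields, and `is_removable` are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Task:
    # Hypothetical fields, chosen only to illustrate the rule described above.
    finished_ok: bool        # reached a successful completion
    exited_with_error: bool  # exited abnormally
    pending_uploads: int     # zip files still waiting in the Transfers tab

def is_removable(task):
    """Toy version of 'finished, dealt with, removable' as described above."""
    if task.exited_with_error:
        # The error case does not wait for uploads before housekeeping,
        # so any intermediate zips still queued go with it.
        return True
    return task.finished_ok and task.pending_uploads == 0

print(is_removable(Task(finished_ok=True, exited_with_error=False, pending_uploads=2)))
print(is_removable(Task(finished_ok=False, exited_with_error=True, pending_uploads=2)))
```

The second call is the case the post describes: the task failed, two zips are still queued, and the task counts as removable anyway.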
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,994,950 RAC: 14,359 |
Hi, The model is showing in the Tasks tab as 100% on the progress. There are two zip files waiting in the Transfers tab - since some time yesterday evening. Also two trickle files have been successfully uploaded. There is nothing in the stderr tab on the task on my account. Does this give you any clues? I was wondering if a server was down... |
Send message Joined: 27 Nov 14 Posts: 3 Credit: 678,458 RAC: 0 |
Same here. The task is still running ( ~ 61%) but BOINC cannot upload the trickle (for a few days). I.e.: 14.01.2015 23:20:25 | climateprediction.net | Started upload of hadcm3s_78ub_1980_2_009370481_1_1.zip 14.01.2015 23:20:28 | climateprediction.net | Temporarily failed upload of hadcm3s_78ub_1980_2_009370481_1_1.zip: connect() failed 14.01.2015 23:20:28 | climateprediction.net | Backing off 03:56:17 on upload of hadcm3s_78ub_1980_2_009370481_1_1.zip |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
Same here. To help the staff debug your problem, it should be said that what you've reported isn't a trickle problem. They look like 14/01/2015 04:40:32 | climateprediction.net | Sending scheduler request: To send trickle-up message. 14/01/2015 04:40:35 | climateprediction.net | Scheduler request completed Yours is a data upload problem, and to solve it, we first need to know which upload server the file is being sent to. |
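To sort Event Log lines into "trickle" and "data upload" before posting them, a rough Python sketch like this one could be used. It relies only on the message wording quoted in this thread ("trickle-up" for trickles, "upload of" for zip transfers); the function name `classify` is made up, and other BOINC versions may word their messages differently.

```python
def classify(line):
    """Label an Event Log line as 'trickle', 'upload', or 'other'."""
    # The message text is the last "|"-separated field of the log line.
    message = line.split("|")[-1].strip()
    if "trickle-up" in message:
        return "trickle"
    if "upload of" in message:
        return "upload"
    return "other"

# Lines copied from posts in this thread.
lines = [
    "14/01/2015 04:40:32 | climateprediction.net | Sending scheduler request: To send trickle-up message.",
    "14.01.2015 23:20:28 | climateprediction.net | Temporarily failed upload of hadcm3s_78ub_1980_2_009370481_1_1.zip: connect() failed",
    "14/01/2015 04:40:35 | climateprediction.net | Scheduler request completed",
]

for line in lines:
    print(classify(line), "<-", line)
```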
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,304,104 RAC: 13,021 |
I've been having an upload problem since last night. My log reports 14/01/2015 22:08:50 | climateprediction.net | Started upload of hadcm3n_xc8x_1940_40_009152167_1_4.zip 14/01/2015 22:08:53 | | Project communication failed: attempting access to reference site 14/01/2015 22:08:53 | climateprediction.net | Temporarily failed upload of hadcm3n_xc8x_1940_40_009152167_1_4.zip: connect() failed 14/01/2015 22:08:53 | climateprediction.net | Backing off 03:33:32 on upload of hadcm3n_xc8x_1940_40_009152167_1_4.zip 14/01/2015 22:08:55 | | Internet access OK - project servers may be temporarily down. |
Send message Joined: 27 Sep 04 Posts: 27 Credit: 11,115,003 RAC: 0 |
The server for the hadcm3c-result-files (63 - 64 MB) has a "connect() failed" - I do not know the name of the server. (also the server for the hadcm3n-result-files ?!). |
Send message Joined: 27 Sep 04 Posts: 27 Credit: 11,115,003 RAC: 0 |
sorry: correction : not "hadcm3c . . ." -> "hadcm3s . . ." |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The server name will be listed in client_state.xml. 1) Copy this file. 2) Paste it somewhere outside of the BOINC structure. 3) Open the copy with Notepad. 4) Search for the 4-character model name. For this one of mine, hadam3p_eu_zwyt_2013_0_009436587_2, the 4-character name is zwyt. 5) Keep searching until, a few lines below where you end up, you see the word upload_handler to the right of the string. (The line will be enclosed in upload_url at each end.) 6) Copy that upload line, and paste it here. |
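If searching by hand in Notepad is tedious, the same search can be sketched in Python. This is a plain text scan that mirrors the steps above: find a line containing the 4-character model name, then look a few lines further down for the `upload_url` line. The function name `find_upload_urls`, the `lookahead` limit, and the sample snippet (including its host name and the `<file>` / `<name>` layout) are all made up for illustration; a real client_state.xml may be laid out differently.

```python
import re

UPLOAD_RE = re.compile(r"<upload_url>(.*?)</upload_url>")

def find_upload_urls(state_text, model_name, lookahead=20):
    """Return the upload URLs found shortly after lines mentioning model_name."""
    lines = state_text.splitlines()
    found = []
    for i, line in enumerate(lines):
        if model_name not in line:
            continue
        # Look at this line and the next few for an upload_url entry.
        for later in lines[i:i + lookahead]:
            match = UPLOAD_RE.search(later)
            if match:
                url = match.group(1).strip()
                if url not in found:
                    found.append(url)
                break
    return found

# Invented snippet for demonstration only; not copied from a real file.
SAMPLE = """<file>
    <name>hadam3p_eu_zwyt_2013_0_009436587_2_1.zip</name>
    <upload_url>http://upload.example.invalid/cgi-bin/file_upload_handler</upload_url>
</file>"""

print(find_upload_urls(SAMPLE, "zwyt"))
```

As in the steps above, run this against a copy of client_state.xml kept outside the BOINC folders (read it with `open(path).read()`), never against the live file.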