Message boards : Number crunching : Upload problems
Author | Message |
---|---|
Send message Joined: 23 Oct 05 Posts: 22 Credit: 526,746 RAC: 0 |
I noticed that my 24/7 Linux server had not gotten any credits from climateprediction for a while, so I thought I'd look and see what's happening. It seems it cannot upload result files despite the server status on your page saying ok. The messages from BOINC are posted below. As you can see, other projects communicate just fine. Any suggestions? Thanks!
2011-04-09 16:48:01 rosetta@home Reporting 1 completed tasks, not requesting new tasks
2011-04-09 16:48:02 rosetta@home Started upload of mem_tid3_run06_A_1afo_SAVE_ALL_OUT_IGNORE_THE_REST_22930_14947_0_0
2011-04-09 16:48:06 rosetta@home Scheduler request completed
2011-04-09 16:48:09 rosetta@home Finished upload of mem_tid3_run06_A_1afo_SAVE_ALL_OUT_IGNORE_THE_REST_22930_14947_0_0
2011-04-09 16:48:11 World Community Grid Sending scheduler request: To fetch work.
2011-04-09 16:48:11 World Community Grid Reporting 3 completed tasks, requesting new tasks
2011-04-09 16:48:16 World Community Grid Scheduler request completed: got 1 new tasks
2011-04-09 16:48:18 World Community Grid Started download of E201784_622_C.22.C20H14N2.00010713.3.set1d06_C.22.C20H14N2.00010713.3.zip
2011-04-09 16:48:21 World Community Grid Finished download of E201784_622_C.22.C20H14N2.00010713.3.set1d06_C.22.C20H14N2.00010713.3.zip
2011-04-09 17:05:00 climateprediction.net Started upload of famous_xaby_1999_200_007075221_2_10.zip
2011-04-09 17:05:00 climateprediction.net Started upload of famous_voj9_999_200_006734889_5_2.zip
2011-04-09 17:05:24 Project communication failed: attempting access to reference site
2011-04-09 17:05:24 climateprediction.net Temporarily failed upload of famous_xaby_1999_200_007075221_2_10.zip: connect() failed
2011-04-09 17:05:24 climateprediction.net Backing off 1 hr 38 min 10 sec on upload of famous_xaby_1999_200_007075221_2_10.zip
2011-04-09 17:05:24 climateprediction.net Temporarily failed upload of famous_voj9_999_200_006734889_5_2.zip: connect() failed
2011-04-09 17:05:24 climateprediction.net Backing off 2 hr 42 min 44 sec on upload of famous_voj9_999_200_006734889_5_2.zip
2011-04-09 17:05:47 BOINC can't access Internet - check network connection or proxy configuration.
2011-04-09 18:06:01 World Community Grid Computation for task dg01_c002_pr56b1_0 finished
2011-04-09 18:06:01 World Community Grid Starting E201784_622_C.22.C20H14N2.00010713.3.set1d06_0
2011-04-09 18:06:01 World Community Grid Starting task E201784_622_C.22.C20H14N2.00010713.3.set1d06_0 using cep2 version 640
2011-04-09 18:06:03 World Community Grid Started upload of dg01_c002_pr56b1_0_0
2011-04-09 18:06:03 World Community Grid Started upload of dg01_c002_pr56b1_0_1
2011-04-09 18:06:10 World Community Grid Finished upload of dg01_c002_pr56b1_0_0
2011-04-09 18:06:10 World Community Grid Started upload of dg01_c002_pr56b1_0_2
2011-04-09 18:06:11 World Community Grid Finished upload of dg01_c002_pr56b1_0_1
2011-04-09 18:06:11 World Community Grid Finished upload of dg01_c002_pr56b1_0_2 |
Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
Somewhere on these forums there is a discussion about this problem, but I can't find it either. :(
What I would do is this: open a terminal and 'cd' to the BOINC directory. (On my system it is /var/lib/boinc.)
grep -A 2 "<upload_when_present/>" client_state.xml
See if there is anything funny about any of the URLs listed. Sometimes the word "handler" (in the URL) has been corrupted to "handlrr" or "hanndlr" or some other variation. If this is the problem, you must shut down BOINC before editing client_state.xml, otherwise the file can get horribly corrupted.
If everything looks OK with the URLs, I would try pinging each of the listed servers. Note that ping takes a bare host name, not a URL, e.g.:
ping -c 3 boinc1.coas.oregonstate.edu
Hope this helps.
EDIT: There is some discussion that might be relevant on the PHPBB message board, here
EDIT 2: If you need to change client_state.xml, do NOT change anything between <signed_xml> and </signed_xml>. Only change the URL that is above the <signed_xml> line for each file. |
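For anyone who would rather not eyeball the grep output, here is a rough Python sketch of the same check. It is not part of BOINC. It assumes the corruption leaves the `file_upload_` prefix intact (as in the variants quoted in this thread) and only mangles the word that follows. The function name and the sample URLs in the usage note are made up for illustration.

```python
import re
from urllib.parse import urlparse

# Assumption: a healthy upload URL ends in exactly this path segment.
EXPECTED_HANDLER = "file_upload_handler"
HANDLER_PREFIX = "file_upload_"

# Matches the contents of every <url>...</url> element in the file.
URL_TAG = re.compile(r"<url>\s*(.*?)\s*</url>", re.IGNORECASE | re.DOTALL)


def suspicious_upload_urls(xml_text):
    """Return <url> values whose last path segment looks like a misspelt handler."""
    flagged = []
    for url in URL_TAG.findall(xml_text):
        last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
        # Flag only segments that start like the handler name but are not exact.
        if last_segment.startswith(HANDLER_PREFIX) and last_segment != EXPECTED_HANDLER:
            flagged.append(url)
    return flagged
```

To use it, read client_state.xml into a string (for example with `open("/var/lib/boinc/client_state.xml").read()`, adjusting the path to your system) and pass it to `suspicious_upload_urls`. An empty list means no URL of this shape looks misspelt. The script only reads the file, so BOINC does not need to be stopped to run it.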
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Somehow client_state.xml gets these missplllllng problems: handddlr, handler, handlrr, whatever. Perhaps the new crew might try some kind of spell checker? Me, I have done at least 30 or 50 spelling fixes in client_state.xml in the last two years. Burbblmm--qqb;eep. I mean glorbb sneeel pp. Actually, how can we trust the "scientists" when they can't spell? Probably they get all the models more or less right, give or take a mpb or bom or snoo, or whatever. So, please explain how easy it is to get the name wrong but get the science right? Eh? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
This problem only happens with Linux systems, and there's nothing wrong with the data when it leaves Oxford. It's only when it arrives on certain Linux-based computers that it gets messed up. And I think that it also only happens with the PNW versions of the hadam3p models.
And the researchers / scientists / climatologists DON'T create the files that get sent to people's computers; this is done by the project people, using scripts that create the necessary files according to the parameter specifications of the researchers. Milo has spent a lot of time searching the scripts, and the files still in the data pool, to try and find where this is happening, without success.
Backups: Here |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Good to know that it only affects Linux systems. But what a weird glitch it is. It only seems to affect 2 or 3 chars out of the whole works. I will continue to process work units from Climateprediction, at least until the cows come home. No worries about the science. But what a strange anomaly. I have no clue how this XML data goes wrong. Very strange indeed. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,842,730 RAC: 5,006 |
What might be helpful, though rather onerous, is if some Linux user were to:
(a) download a PNW and suspend it before it starts (the download will still complete)
(b) make a backup of client_state.xml
(c) run the model
Repeat ad nauseam until there's an upload failure, then compare the current file with the backup. This would at least confirm what we suppose - that the corruption happens at the client end. (Or it might show the corruption happens before the model starts.) |
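The "compare the current file with the backup" step could be narrowed to just the URLs, since that is where the misspelling shows up. A minimal sketch, assuming both files are available as text; the function name is made up for illustration. It reports only `<url>` values that appear in one file and not in the other, so unrelated changes to client_state.xml (timestamps, progress figures) are ignored.

```python
import re

# Matches the contents of every <url>...</url> element.
URL_TAG = re.compile(r"<url>\s*(.*?)\s*</url>", re.DOTALL)


def changed_urls(backup_text, current_text):
    """Return (only_in_backup, only_in_current) as sorted lists of <url> values."""
    before = set(URL_TAG.findall(backup_text))
    after = set(URL_TAG.findall(current_text))
    return sorted(before - after), sorted(after - before)
```

If a correctly spelt URL turns up in the first list and a mangled twin in the second, the rewrite happened between the backup and the failure. New tasks arriving in between will also add URLs to the second list, so the output still needs a human look.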
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
If I can catch a PNW download I will try to do this. To make one thing perfectly clear: the spelling problem has to do with the BOINC infrastructure, not with the climate models. |
Send message Joined: 23 Oct 05 Posts: 22 Credit: 526,746 RAC: 0 |
Thank you for the replies. I ran grep on the client_state.xml file. I couldn't find any misspelt file_upload_handler. All of the references (and there were many) were to kraken, so I pinged kraken from the server, and it worked perfectly. I made the client_state file available here: http://staffannilsson.eu/Unrelated/client_state.xml I'm at a loss as to how to proceed. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The model type mentioned in your manager listing is FAMOUS, not a PNW model type, so none of this discussion should apply. It's just a red herring. (And the message associated with the spelling problem is something like: file handler is missing.)
But what IS in the list is:
2011-04-09 17:05:47 BOINC can't access Internet - check network connection or proxy configuration.
So at the time the list was created, the problem was that your computer couldn't get to the internet. And as it's been happening to you for a while, you'd have to look earlier in the messages for other reasons for the upload failures.
One possibility is described in this post by Thyme Lawn. There's another type of BOINC problem, also discussed by Thyme Lawn, where the large cpdn zips can cause a 'log jam' if a large number of them build up and there are also files from other projects in the transfers queue. That post would probably be from January / February, when the servers were having a problem. The cure involves inserting a few lines into cc_config.xml to limit the number of files that BOINC is allowed to transfer simultaneously during its upload attempts. |
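The post above does not say which lines go into cc_config.xml. As far as I know the relevant client options are `max_file_xfers` and `max_file_xfers_per_project`, but treat those names as an assumption and check them against the client configuration documentation for your BOINC version. The values 4 and 1 below are purely illustrative, and the function name is made up. A small sketch that builds such a file as text:

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_total=4, max_per_project=1):
    """Return cc_config.xml text limiting simultaneous file transfers.

    Option names are assumed from the BOINC client configuration docs;
    the numeric defaults here are illustrative, not recommendations.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "max_file_xfers").text = str(max_total)
    ET.SubElement(options, "max_file_xfers_per_project").text = str(max_per_project)
    return ET.tostring(root, encoding="unicode")
```

The function only returns a string. If you already have a cc_config.xml, merge the two option lines into its existing `<options>` section by hand rather than overwriting the file, and restart the client (or have it re-read its config file) afterwards.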
Send message Joined: 23 Oct 05 Posts: 22 Credit: 526,746 RAC: 0 |
Yes, there is the message about not being able to access the internet. It still appears after every attempt to upload a climateprediction file, but at the same time all other projects communicate just fine. If I use ssh to log onto the server, I see that I can access the internet from it (and even ping kraken, as previously mentioned). There are now 31 files waiting to upload, all from climateprediction. |
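A successful ping only shows that the host answers ICMP; the log line that failed was a TCP `connect()`. A closer test is to resolve the name afresh and open a TCP connection to the upload port. A rough sketch, assuming port 80 because the upload URLs quoted in this thread are plain http; the function name is made up and the host in the usage note is a placeholder, not the real upload server.

```python
import socket


def can_connect(host, port=80, timeout=10.0):
    """Resolve host afresh and try a TCP connection; return (ok, detail)."""
    try:
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, f"DNS lookup failed: {exc}"
    last_error = "no addresses returned"
    # Try every address the resolver returned, not just the first one.
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True, f"connected to {sockaddr[0]}"
        except OSError as exc:
            last_error = f"{sockaddr[0]}: {exc}"
    return False, last_error
```

Calling `can_connect("upload.example.invalid")` with the real upload host substituted in shows whether a fresh lookup plus connect works from the same machine. If it succeeds while the running client keeps failing, the client's own state becomes the suspect rather than the network.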
Send message Joined: 23 Oct 05 Posts: 22 Credit: 526,746 RAC: 0 |
The files are uploading now. All that was needed was to restart the daemon:
sudo /etc/init.d/boinc-client restart
No idea why it was necessary, but happy that it worked. Thanks for the help. |
Send message Joined: 5 Aug 04 Posts: 127 Credit: 24,518,815 RAC: 17,472 |
The files are uploading now. All that was needed was to restart the daemon:
You seem to be running v6.10.17. There was a bug, fixed around v6.10.3x, that had to do with DNS lookup, where the client would always use the same, possibly bad, IP address. Restarting the client was the only way to fix this problem. |
Send message Joined: 5 Aug 04 Posts: 127 Credit: 24,518,815 RAC: 17,472 |
What might be helpful, though rather onerous, is if some Linux user were to:
I'd recommend one additional step:
a0: Immediately after being assigned a PNW task, suspend network and make a backup of sched_reply_climateprediction.net.xml before enabling network again. It's important that the client doesn't contact the CPDN scheduling server again before the backup is made.
If there's already a misspelling in sched_reply*, it's either a server-side problem, a problem during transfer from the scheduling server, or a problem introduced by the client in handling the scheduler reply. If sched_reply* has everything spelled correctly but a spelling error shows up in client_state.xml, it's a client problem. |
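The reasoning in the last paragraph can be written down as a small decision function. This is only a sketch of that logic, assuming (as with the variants reported in this thread) that a misspelling always appears as `file_upload_` followed by something other than `handler`. The function names and return strings are made up for illustration.

```python
import re

EXPECTED = "file_upload_handler"
# Any token that starts like the handler name, whatever follows.
HANDLER_TOKEN = re.compile(r"file_upload_\w+")


def has_misspelt_handler(text):
    """True if any file_upload_* token in the text is not the expected name."""
    return any(token != EXPECTED for token in HANDLER_TOKEN.findall(text))


def locate_corruption(sched_reply_text, client_state_text):
    """Classify where the misspelling was introduced, per the two-file test."""
    if has_misspelt_handler(sched_reply_text):
        # Already wrong in the scheduler reply: server, transfer, or the
        # client's handling of the reply.
        return "server, transfer, or client handling of scheduler reply"
    if has_misspelt_handler(client_state_text):
        # Reply was clean, state file is not: introduced on the client.
        return "client"
    return "no misspelling found"
```

Feed it the backed-up sched_reply_climateprediction.net.xml and the client_state.xml from the moment an upload fails. A result of "client" matches the second case described above.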
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Set prefs to PNW only. Allowed new tasks and got this z25a
Didn't stop the download quickly enough, but sched_reply_climateprediction.net.xml had no misspellings and client_state.xml had <file_info> for the first 12 uploads, the 13th upload was spelled ok but goes to <url>http://climateapps1.oucs.ox.ac.uk/cgi-bin/file_upload_handler</url>
Tried again with this z25h
This time stopped network before the first file downloaded, got exact same results in both sched_reply and client_state.xml: 12 instances of 'hnndler' in client_state.xml and no obvious errors in sched_reply. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,842,730 RAC: 5,006 |
That confirms that the problem isn't at the server end; at least it isn't immediately a server spelling error. There might conceivably be some wrong context that causes the erroneous rewrite at the client. It does look rather like an off-by-one or buffer flushing error somewhere, with the 'n' creeping backward twice in the 'hnndler' case (there are other spelling errors too). I guess the next step is to find where/when the rewrite happens. I assume the science application doesn't rewrite client_state.xml, whereas BOINC Manager does. |
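To make the "off-by-one" impression a little more concrete, the misspelt variants can be compared with the correct word position by position. A tiny sketch; the function name is made up, and it says nothing about where in the client such an error would occur.

```python
def differing_positions(expected, observed):
    """List (index, expected_char, observed_char) where equal-length strings differ."""
    if len(expected) != len(observed):
        raise ValueError("strings must have the same length")
    return [
        (i, e, o)
        for i, (e, o) in enumerate(zip(expected, observed))
        if e != o
    ]


print(differing_positions("handler", "hnndler"))  # [(1, 'a', 'n')]
print(differing_positions("handler", "handlrr"))  # [(5, 'e', 'r')]
```

In both of these variants a single character has been replaced by a copy of the one that follows it, which fits the idea of a byte being copied one position out of place. That is only an observation about the strings, not a diagnosis.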
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
The BOINC core client does the updating, merging the data from sched_reply_climateprediction.net.xml into client_state.xml. I can't see anything in the code which would account for the corruption. The 1024 character working buffer is definitely long enough, and the only modifications made outside the <signed_xml> block are deletion of leading and trailing spaces and decoding of XML escape strings; neither applies to the <url> tag.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
Send message Joined: 5 Aug 04 Posts: 127 Credit: 24,518,815 RAC: 17,472 |
Tried again with this z25h
Ok, so the problem is clearly on the client side of things, and not on the server side or during communication to the client. This is at least a starting point for tracking down the problem...
Looking at my own PNW tasks, it seems PNW is the only CPDN task type that uses http://boinc1.coas.oregonstate.edu/ as upload server, but it seems strange that this should have any effect. The only other difference that seems to be present is that the upload handler is at /cpdn_cgi_main/, while the other CPDN URLs are both shorter here and have only one _ or -, so maybe this has some effect even though it really shouldn't...
Then it comes to the size: /cpdn_cgi_main/ is 13 characters in total (not counting the slashes), while some BOINC projects use longer names. For example, SIMAP uses /boincsimap_cgi/ at 14 characters, while Einstein@home for the non-Arecibo tasks uses /EinsteinAtHome_cgi/, meaning 18 characters. SIMAP also has a longer total URL length than the PNW models, in case this has any meaning.
So, while I know this is probably not the reason for the corruption, could you also try to attach to SIMAP and Einstein@home, and see if you get URL corruption from these projects as well? |
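The character counts above are easy to double-check. A throwaway sketch, using only the three directory names quoted in the post; the function name is made up, and the counts exclude the surrounding slashes.

```python
def segment_lengths(paths):
    """Map each path to the length of its name with the slashes stripped."""
    return {path: len(path.strip("/")) for path in paths}


lengths = segment_lengths([
    "/cpdn_cgi_main/",       # CPDN PNW upload handler directory
    "/boincsimap_cgi/",      # SIMAP
    "/EinsteinAtHome_cgi/",  # Einstein@home
])
print(lengths)
```

This gives 13, 14 and 18 respectively, so the PNW directory name is the shortest of the three and length alone does not single it out.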
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
I've seen this error a few times on my Core i7 920 in Fedora 13 64 bit. That PC has never had a PNW model, so I don't think that can be a common factor to all these. In fact, it's only run SAF for less than a month, and I don't think I've had that error in the last month. Before that, it only ran FAMOUS and hadsm3 models for the previous year. |
Send message Joined: 28 Mar 11 Posts: 35 Credit: 82,588 RAC: 0 |
My feeling is to agree that it is not merely a PNW issue. I have found references to malformed URLs on more than one upload server. I think the best strategy would be for the CPDN sysadmins to search the apache logs on our servers and try to find out which models are failing, and use them to pull out the info about the client types that are failing. I do agree that it would be good to find out whether this happens with other BOINC projects. ...so I have another item on my to-do list! Jonathan CPDN SysAdmin |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I upgraded two hosts from 6.10.56 to 6.10.58. Also have at least one running 6.6.40. Got 3 PNW with no errors before the recent outage. Will try for more soon.
Beginning to think it might be a problem in glibc, since Windows hosts don't seem to be seeing this problem. Found in old logs what might be a similar problem back last October on Beta FAMOUS. Unfortunately I don't have the details or files any more.
Is this possibly a problem for Uli Drepper and the glibc crew? Or?
Any advice on how to trap this problem is welcome here. Any ideas on BOINC versions or Linux versions or glibc versions? |