Thread 'Upload problems'

staffann
Joined: 23 Oct 05
Posts: 22
Credit: 526,746
RAC: 0
Message 41939 - Posted: 9 Apr 2011, 17:28:59 UTC

I noticed that my 24/7 Linux server had not gotten any credits from climateprediction for a while, so I thought I'd look and see what's happening. It seems it cannot upload result files despite the server status on your page saying OK. The messages from BOINC are posted below. As you can see, other projects communicate just fine. Any suggestions? Thanks!

2011-04-09 16:48:01	rosetta@home	Reporting 1 completed tasks, not requesting new tasks
2011-04-09 16:48:02	rosetta@home	Started upload of mem_tid3_run06_A_1afo_SAVE_ALL_OUT_IGNORE_THE_REST_22930_14947_0_0
2011-04-09 16:48:06	rosetta@home	Scheduler request completed
2011-04-09 16:48:09	rosetta@home	Finished upload of mem_tid3_run06_A_1afo_SAVE_ALL_OUT_IGNORE_THE_REST_22930_14947_0_0
2011-04-09 16:48:11	World Community Grid	Sending scheduler request: To fetch work.
2011-04-09 16:48:11	World Community Grid	Reporting 3 completed tasks, requesting new tasks
2011-04-09 16:48:16	World Community Grid	Scheduler request completed: got 1 new tasks
2011-04-09 16:48:18	World Community Grid	Started download of E201784_622_C.22.C20H14N2.00010713.3.set1d06_C.22.C20H14N2.00010713.3.zip
2011-04-09 16:48:21	World Community Grid	Finished download of E201784_622_C.22.C20H14N2.00010713.3.set1d06_C.22.C20H14N2.00010713.3.zip
2011-04-09 17:05:00	climateprediction.net	Started upload of famous_xaby_1999_200_007075221_2_10.zip
2011-04-09 17:05:00	climateprediction.net	Started upload of famous_voj9_999_200_006734889_5_2.zip
2011-04-09 17:05:24		Project communication failed: attempting access to reference site
2011-04-09 17:05:24	climateprediction.net	Temporarily failed upload of famous_xaby_1999_200_007075221_2_10.zip: connect() failed
2011-04-09 17:05:24	climateprediction.net	Backing off 1 hr 38 min 10 sec on upload of famous_xaby_1999_200_007075221_2_10.zip
2011-04-09 17:05:24	climateprediction.net	Temporarily failed upload of famous_voj9_999_200_006734889_5_2.zip: connect() failed
2011-04-09 17:05:24	climateprediction.net	Backing off 2 hr 42 min 44 sec on upload of famous_voj9_999_200_006734889_5_2.zip
2011-04-09 17:05:47		BOINC can't access Internet - check network connection or proxy configuration.
2011-04-09 18:06:01	World Community Grid	Computation for task dg01_c002_pr56b1_0 finished
2011-04-09 18:06:01	World Community Grid	Starting E201784_622_C.22.C20H14N2.00010713.3.set1d06_0
2011-04-09 18:06:01	World Community Grid	Starting task E201784_622_C.22.C20H14N2.00010713.3.set1d06_0 using cep2 version 640
2011-04-09 18:06:03	World Community Grid	Started upload of dg01_c002_pr56b1_0_0
2011-04-09 18:06:03	World Community Grid	Started upload of dg01_c002_pr56b1_0_1
2011-04-09 18:06:10	World Community Grid	Finished upload of dg01_c002_pr56b1_0_0
2011-04-09 18:06:10	World Community Grid	Started upload of dg01_c002_pr56b1_0_2
2011-04-09 18:06:11	World Community Grid	Finished upload of dg01_c002_pr56b1_0_1
2011-04-09 18:06:11	World Community Grid	Finished upload of dg01_c002_pr56b1_0_2

Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 41941 - Posted: 9 Apr 2011, 21:22:21 UTC - in response to Message 41939.  
Last modified: 9 Apr 2011, 21:45:43 UTC

Somewhere on these forums there is a discussion about this problem, but I can't find it either. :(

What I would do is this: Open a terminal and 'cd' to the boinc directory. (On my system it is /var/lib/boinc.)

grep -A 2 "<upload_when_present/>" client_state.xml

See if there is anything funny about any of the URLs listed. Sometimes the word "handler" (in the URL) has been corrupted to "handlrr" or "hanndlr" or some other variation.

If this is the problem, before editing client_state.xml you must shut down Boinc -- otherwise the file can get horribly corrupted.

If everything looks OK with the URLs, I would try 'pinging' each of the listed servers, e.g.:-

ping -c 3 boinc1.coas.oregonstate.edu

Hope this helps.

EDIT: There is some discussion that might be relevant on the PHPBB message board, here

EDIT 2: If you need to change client_state.xml, do NOT change anything between <signed_xml> and </signed_xml>.

Only change the URL that is above the <signed_xml> line for each file.
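
A minimal Python sketch of the check described above, assuming each <file_info> block in client_state.xml carries an editable <url> followed by a <signed_xml> block that holds an untouched copy of the same URL. The function name and the regex-based parsing are illustrative assumptions, not part of BOINC; the sketch only inspects text and changes nothing.

```python
import re

# Assumed layout: <file_info> ... <url>...</url> <signed_xml> ... </signed_xml> </file_info>
FILE_INFO = re.compile(r"<file_info>(.*?)</file_info>", re.DOTALL)
SIGNED = re.compile(r"<signed_xml>(.*?)</signed_xml>", re.DOTALL)
URL = re.compile(r"<url>\s*(.*?)\s*</url>", re.DOTALL)


def find_corrupted_upload_urls(state_text):
    """Return (working_url, signed_url) pairs that disagree (hypothetical helper)."""
    mismatches = []
    for block in FILE_INFO.findall(state_text):
        signed = SIGNED.search(block)
        if signed is None:
            continue  # no signed copy to compare against
        signed_urls = URL.findall(signed.group(1))
        # Drop the signed block so only the editable URLs remain.
        working_urls = URL.findall(SIGNED.sub("", block))
        for working, reference in zip(working_urls, signed_urls):
            if working != reference:
                mismatches.append((working, reference))
    return mismatches
```

To use it, read the contents of client_state.xml from the BOINC directory and pass the text to find_corrupted_upload_urls; an empty list means no mismatch was found.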

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 41954 - Posted: 10 Apr 2011, 9:48:41 UTC

Somehow the client.state.xml gets these missplllllng problem.
handddlr handler handlrr whatever.
Perhaps the new crew might try some kind of spell checker?
Me, I have done at least 30- or 50 spelling fixes in the client_state.xml in the last two years. Burbblmm--qqb;eep.
I mean glorbb sneeel pp.
Actually -- how can we trust the "scientists" when they can't spell ?
Probably they get all the models more or less right, give or take a mpb or bom or snoo; or whatever.
So -- please explain how easy it is to get the name wrong but get the science right? == Eh?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41955 - Posted: 10 Apr 2011, 10:19:32 UTC - in response to Message 41954.  

This problem only happens with Linux systems, and there's nothing wrong with the data when it leaves Oxford.
It's only when it arrives on certain Linux based computers that it gets messed up.
And I think that it also only happens with the PNW versions of the hadam3p models.

And the researchers / scientists / climatologists DON'T create the files that get sent to people's computers; this is done by the project people, using scripts that create the necessary files, according to the parameter specifications of the researchers.

Milo has spent a lot of time searching the scripts, and the files still in the data pool, to try and find where this is happening, without success.


Backups: Here

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 41956 - Posted: 10 Apr 2011, 10:53:43 UTC - in response to Message 41955.  

Good to know that it only affects linux systems. But what a weird glitch it is. Only seems to affect 2 or 3 chars out of the whole works.
I will continue to process work-units from Climateprediction -- at least until the cows come home. No worries about the science. But what a strange anomaly.
I have no clue how this XML data goes wrong.
Very strange indeed.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,827,799
RAC: 5,038
Message 41957 - Posted: 10 Apr 2011, 11:11:29 UTC

What might be helpful, though rather onerous, is if some Linux user were to:

(a) download a PNW and suspend it before it starts (the download will still complete)

(b) make a backup of client_state.xml

(c) run the model

Repeat ad nauseam until there's an upload failure, then compare the current file with the backup. This would at least confirm what we suppose - that the corruption happens at the client end. (Or it might show the corruption happens before the model starts.)
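
The comparison step above could be done with a short Python sketch along these lines. It assumes the backup and the current client_state.xml have been read into strings; the function name, the file labels, and the choice to report only lines containing "<url>" are illustrative assumptions.

```python
import difflib


def changed_lines(backup_text, current_text, needle="<url>"):
    """Return removed ('-') and added ('+') diff lines that contain `needle`."""
    diff = difflib.unified_diff(
        backup_text.splitlines(),
        current_text.splitlines(),
        fromfile="client_state.backup.xml",  # labels only, hypothetical names
        tofile="client_state.xml",
        lineterm="",
        n=0,  # no context lines, just the changes
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))  # skip the diff header lines
        and needle in line
    ]
```

A removed line paired with an added line that differs only in the handler spelling would show the corruption arriving between the backup and the failure.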

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 41958 - Posted: 10 Apr 2011, 11:22:20 UTC - in response to Message 41955.  

If I can catch a PNW download I will try to do this.

Make one thing perfectly clear -- the spelling problem has to do with the BOINC infrastructure. Not with the climate models.

staffann
Joined: 23 Oct 05
Posts: 22
Credit: 526,746
RAC: 0
Message 41959 - Posted: 10 Apr 2011, 12:47:27 UTC

Thank you for the replies. I ran grep on the client_state.xml file. I couldn't find any misspelt file_upload_handler. All of the references (and there were many) were to kraken so I pinged kraken from the server, and it worked perfectly.

I made the client_state file available here: http://staffannilsson.eu/Unrelated/client_state.xml

I'm at a loss as to how to proceed.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41960 - Posted: 10 Apr 2011, 16:11:00 UTC - in response to Message 41959.  

The model type mentioned in your manager listing is for FAMOUS, not a PNW model type.
So none of this discussion should apply. It's just a red herring.
(And the message associated with the spelling problem is something like: file handler is missing.)

But what IS in the list is: 2011-04-09 17:05:47 BOINC can't access Internet - check network connection or proxy configuration.

So at the time the list was created, the problem was that your computer couldn't get to the internet.
And as it's been happening to you for a while, then you'd have to look earlier in the messages for other reasons for the upload failures.
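
As a rough aid for looking through earlier messages, the following Python sketch tallies upload failures per project and reason. It assumes the log has been saved as text with tab-separated timestamp, project, and message fields, as in the listing pasted at the top of this thread; the function name is hypothetical.

```python
from collections import Counter


def failed_uploads(log_text):
    """Count 'Temporarily failed upload' messages per (project, reason)."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # not a timestamp / project / message line
        _timestamp, project, message = parts
        if message.startswith("Temporarily failed upload of "):
            # Assumption: the reason is whatever follows the last ': '.
            _, _, reason = message.rpartition(": ")
            counts[(project, reason)] += 1
    return counts
```

A reason other than "connect() failed" in the older entries would point away from a plain connectivity problem.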

One possibility is described in this post by Thyme Lawn.
There's another type of BOINC problem, also discussed by Thyme Lawn, where the large cpdn zips can cause a 'log jam' if a large number of them build up, and there are also files from other projects in the transfers queue.

This post would probably be from January / February, when the servers were having a problem. The cure involves a few lines to be inserted into cc_config.xml, to limit the number of simultaneous files that BOINC is allowed to try during its upload attempts.
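
The exact lines from that post are not reproduced here. As a hedged sketch, the BOINC client options that I believe control this are max_file_xfers and max_file_xfers_per_project inside the <options> section of cc_config.xml; the Python below just builds such a fragment. The function name and the values 2 and 1 are placeholders, not the values Thyme Lawn recommended.

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_total=2, max_per_project=1):
    """Build a cc_config.xml fragment limiting simultaneous file transfers."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    # Placeholder limits; adjust to taste.
    ET.SubElement(options, "max_file_xfers").text = str(max_total)
    ET.SubElement(options, "max_file_xfers_per_project").text = str(max_per_project)
    return ET.tostring(root, encoding="unicode")
```

If cc_config.xml already exists, the two elements would need to be merged into its existing <options> section rather than overwriting the file, and the client has to re-read the config file or be restarted before the change takes effect.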


Backups: Here

staffann
Joined: 23 Oct 05
Posts: 22
Credit: 526,746
RAC: 0
Message 41962 - Posted: 10 Apr 2011, 19:26:58 UTC - in response to Message 41960.  
Last modified: 10 Apr 2011, 19:33:14 UTC

Yes, there is the message about not being able to access the internet. It still appears after every attempt to upload a climateprediction file, but at the same time all other projects communicate just fine. If I use ssh to log onto the server, I see that I can access the internet from it (and even ping kraken as previously mentioned).

There are now 31 files waiting to upload, all from climateprediction.

staffann
Joined: 23 Oct 05
Posts: 22
Credit: 526,746
RAC: 0
Message 41963 - Posted: 10 Apr 2011, 19:49:38 UTC - in response to Message 41962.  
Last modified: 10 Apr 2011, 19:50:03 UTC

The files are uploading now. All that was needed was to restart the daemon:
sudo /etc/init.d/boinc-client restart

No idea why it was necessary, but happy that it worked.
Thanks for the help

Ingleside
Joined: 5 Aug 04
Posts: 127
Credit: 24,498,085
RAC: 21,454
Message 41965 - Posted: 10 Apr 2011, 23:22:31 UTC - in response to Message 41963.  

The files are uploading now. All that was needed was to restart the daemon:
sudo /etc/init.d/boinc-client restart

No idea why it was necessary, but happy that it worked.
Thanks for the help

You seem to be running v6.10.17. There was a bug, fixed around v6.10.3x, that had to do with DNS lookup: the client would always use the same, possibly bad, IP address. Restarting the client was the only way to work around this problem.
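
A successful ping only shows that ICMP reaches the host, whereas "connect() failed" concerns a TCP connection to whatever address the client is using. As a rough cross-check from outside BOINC, the Python sketch below resolves a host name afresh and tries a TCP connection to each returned address. The function name is hypothetical, and the host would be whichever upload server appears in client_state.xml.

```python
import socket


def resolve_and_connect(host, port=80, timeout=10.0):
    """Resolve `host` afresh and attempt a TCP connection to every address."""
    results = {}
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        address = sockaddr[0]
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results[address] = "ok"
        except OSError as exc:
            results[address] = f"failed: {exc}"
    return results
```

If every freshly resolved address connects while the running client keeps failing, that would be consistent with the client holding on to a stale address.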

Ingleside
Joined: 5 Aug 04
Posts: 127
Credit: 24,498,085
RAC: 21,454
Message 41966 - Posted: 10 Apr 2011, 23:40:28 UTC - in response to Message 41957.  

What might be helpful, though rather onerous, is if some Linux user were to:

(a) download a PNW and suspend it before it starts (the download will still complete)

(b) make a backup of client_state.xml

(c) run the model

Repeat ad nauseam until there's an upload failure, then compare the current file with the backup. This would at least confirm what we suppose - that the corruption happens at the client end. (Or it might show the corruption happens before the model starts.)

I'll recommend one additional step:

a0: Immediately after being assigned a PNW-task, suspend network, and make a backup of sched_reply_climateprediction.net.xml, before enabling network again.

It's important that CPDN doesn't contact the scheduling-server again before making the backup.


If there's now a mis-spelling in sched_reply* it's either a server-side-problem, a problem during transfer from the scheduling-server, or a problem made by the client in handling of the scheduler-reply.

If sched_reply* had everything spelled correctly, but a spelling error shows up in client_state.xml, it's a client problem.
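
The two cases above can be written down as a small Python sketch. It assumes a healthy upload URL ends in /file_upload_handler, and it only catches misspellings in the word after "file_upload_"; the function names and the returned strings are illustrative.

```python
import re

# Matches <url> values whose last path component starts with 'file_upload_'.
HANDLER_URL = re.compile(r"<url>\s*(\S*?/file_upload_\w+)\s*</url>")


def misspelt_handlers(xml_text):
    """Return upload URLs that do not end in '/file_upload_handler'."""
    return [
        url
        for url in HANDLER_URL.findall(xml_text)
        if not url.endswith("/file_upload_handler")
    ]


def locate_corruption(sched_reply_text, client_state_text):
    """Apply the two-case reasoning to the saved scheduler reply and state file."""
    if misspelt_handlers(sched_reply_text):
        return "server, transfer, or client handling of the scheduler reply"
    if misspelt_handlers(client_state_text):
        return "client"
    return "no misspelling found"
```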

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 41971 - Posted: 12 Apr 2011, 1:56:15 UTC - in response to Message 41966.  

Set prefs to pnw only.
Allowed new tasks and got this z25a
Didn't stop download quick enough but sched_reply_climateprediction.net.xml had no misspellings and client_state.xml had
<file_info>
<name>hadam3p_pnw_z25a_2005_1_006914102_1_1.zip</name>
<nbytes>0.000000</nbytes>
<max_nbytes>150000000.000000</max_nbytes>
<generated_locally/>
<status>0</status>
<upload_when_present/>
<url>http://boinc1.coas.oregonstate.edu/cpdn_cgi_main/file_upload_hnndler</url>
<signed_xml>
<name>hadam3p_pnw_z25a_2005_1_006914102_1_1.zip</name>
<generated_locally/>
<upload_when_present/>
<max_nbytes>150000000</max_nbytes>
<url> http://boinc1.coas.oregonstate.edu/cpdn_cgi_main/file_upload_handler </url>
</signed_xml>

for the first 12 uploads. The 13th upload was spelled OK but goes to
<url>http://climateapps1.oucs.ox.ac.uk/cgi-bin/file_upload_handler</url>


Tried again with this z25h
This time stopped network before first file downloaded, got exact same results in both sched_reply and client_state.xml -- 12 instances of 'hnndler' in client_state.xml and no obvious errors in sched_reply.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,827,799
RAC: 5,038
Message 41974 - Posted: 12 Apr 2011, 14:03:52 UTC

That confirms that the problem isn't at the server end; at least it isn't immediately a server spelling error - there might conceivably be some wrong context that causes the erroneous re-write at the client.

It does look rather like an off-by-one or buffer flushing error somewhere, with the 'n' creeping backward twice in the 'hnndler' case (there are other spelling errors too). I guess the next step is to find where/when the rewrite happens. I assume the science application doesn't rewrite client_state.xml, whereas BOINC Manager does.

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 41975 - Posted: 12 Apr 2011, 14:16:19 UTC - in response to Message 41974.  

The BOINC core client does the updating, merging the data from sched_reply_climateprediction.net.xml into client_state.xml.

I can't see anything in the code which would account for the corruption. The 1024 character working buffer is definitely long enough and the only modifications made outside the <signed_xml> block are deletion of leading and trailing spaces and decoding of XML escape strings; neither applies to the <url> tag.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Ingleside
Joined: 5 Aug 04
Posts: 127
Credit: 24,498,085
RAC: 21,454
Message 41978 - Posted: 12 Apr 2011, 21:10:21 UTC - in response to Message 41971.  
Last modified: 12 Apr 2011, 21:16:26 UTC

Tried again with this z25h
This time stopped network before first file downloaded, got exact same results in both sched_reply and client_state.xml -- 12 instances of 'hnndler' in client_state.xml and no obvious errors in sched_reply.

OK, so the problem is clearly on the client side of things, and not on the server side or during communication to the client. This is at least a starting point for tracking down the problem...

Looking at my own PNW tasks, it seems PNW is the only CPDN task type that uses http://boinc1.coas.oregonstate.edu/ as upload server, but why this should have any effect seems strange.

The only other difference that seems to be present is that the upload handler is at /cpdn_cgi_main/, while the other CPDN URLs are shorter here and have only one _ or -, so maybe this has some effect even though it really shouldn't...

When it comes to the size, /cpdn_cgi_main/ is 13 characters in total (not counting the slashes), while some BOINC projects use longer paths. For example, SIMAP uses /boincsimap_cgi/ at 14 characters, while Einstein@home uses /EinsteinAtHome_cgi/ for the non-Arecibo tasks, meaning 18 characters. SIMAP also has a longer total URL length than the PNW models, in case this has any meaning.
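
The counts above can be checked with a couple of lines of Python; the counts include underscores and leave out the surrounding slashes. The variable names are arbitrary.

```python
# Path segments quoted above, without the surrounding slashes.
segments = ["cpdn_cgi_main", "boincsimap_cgi", "EinsteinAtHome_cgi"]

# Character count of each segment, underscores included.
lengths = {name: len(name) for name in segments}
```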


So, while I know this is probably not the reason for the corruption, could you also try to attach to SIMAP and Einstein@home, and see if you get URL corruption from these projects as well?

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 41980 - Posted: 12 Apr 2011, 22:48:41 UTC

I've seen this error a few times on my Core i7 920 in Fedora 13 64 bit. That PC has never had a PNW model, so I don't think that can be a common factor to all these. In fact, it's only run SAF for less than a month, and I don't think I've had that error in the last month. Before that, it only ran FAMOUS and hadsm3 models for the previous year.

old_user651284
Joined: 28 Mar 11
Posts: 35
Credit: 82,588
RAC: 0
Message 41988 - Posted: 21 Apr 2011, 8:01:36 UTC - in response to Message 41980.  

My feeling is to agree that it is not merely a PNW issue.
I have found references to malformed URLs on more than one upload server.

I think the best strategy would be for the CPDN sysadmins to search the apache logs on our servers and try to find out which models are failing, and use them to pull out the info about the client types that are failing.

I do agree that it would be good to find out whether this happens with other BOINC projects.

...so I have another item on my to-do list!

Jonathan
CPDN SysAdmin

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 41989 - Posted: 21 Apr 2011, 9:09:50 UTC

I upgraded two hosts from 6.10.56 to 6.10.58
Also have at least one running 6.6.40
Got 3 pnw with no errors before the recent outage.
Will try for more soon.
Beginning to think it might be a problem in glibc, since Windows hosts don't seem to be seeing this problem.
Found in old logs what might be similar back last October on Beta Famous. Unfortunately don't have details or files any more.
Is this possibly a problem for Uli Drepper and the glibc crew?
Or?
Any advice on how to trap this problem welcome here.
Any ideas on BOINC versions or linux versions or glibc versions?
©2024 cpdn.org