climateprediction.net (CPDN) home page
Thread 'Output file absent & Too many errors (may have bug)'

skgiven
Joined: 5 Jun 06
Posts: 28
Credit: 2,790,048
RAC: 0
Message 44562 - Posted: 22 Jul 2012, 10:49:56 UTC
Last modified: 22 Jul 2012, 10:52:57 UTC

Output file absent:

22/07/2012 10:38:50 | climateprediction.net | Computation for task hadam3p_eu_634j_2009_1_008071304_2 finished
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_2.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_3.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_4.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_5.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_6.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_7.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_8.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_9.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_10.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_11.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_12.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
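
For anyone who wants to pull the list of missing uploads out of a long event log, here is a minimal sketch. It assumes the line layout shown above (timestamp | project | message); the function name `absent_outputs` is made up for this example and is not part of BOINC.

```python
import re

# Matches the "Output file ... for task ... absent" lines in the event log.
# The layout is assumed from the log excerpt quoted above.
ABSENT_RE = re.compile(
    r"\| climateprediction\.net \| Output file (?P<file>\S+) "
    r"for task (?P<task>\S+) absent$"
)


def absent_outputs(log_text):
    """Return {task name: [missing output file names]} (hypothetical helper)."""
    missing = {}
    for line in log_text.splitlines():
        match = ABSENT_RE.search(line.strip())
        if match:
            missing.setdefault(match["task"], []).append(match["file"])
    return missing


sample = """\
22/07/2012 10:38:50 | climateprediction.net | Computation for task hadam3p_eu_634j_2009_1_008071304_2 finished
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_2.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_3.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
"""

result = absent_outputs(sample)
for task, files in result.items():
    print(task, "is missing", len(files), "output file(s):", ", ".join(files))
```

Lines that are not "absent" messages (such as the "Computation ... finished" line) are simply skipped.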

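Task 14973021 | Work unit 8226418 | Computer 1212547 | Sent 22 Jul 2012 0:47:15 UTC | Reported 22 Jul 2012 10:30:11 UTC | Status: Error while computing | Run time 26,180.15 s | CPU time 25,922.02 s | Claimed credit 0.00 | Granted credit --- | Application: UK Met Office HADAM3P European Region v6.09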

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<stderr_txt>

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>

-161 is a File Not Found error.
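
As a rough illustration, the per-file upload failures can be extracted from a task report like the one above with a few lines of Python. This is only a sketch: the report is not a single well-formed XML document, so it uses a regular expression rather than an XML parser, and the function name `upload_failures` and the label text are assumptions made for the example.

```python
import re

# -161 is the code identified above as "File Not Found".
# The label wording is this sketch's own, not BOINC's.
ERROR_LABELS = {-161: "file not found"}

# One <file_xfer_error> entry: a file name followed by an error code.
XFER_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"</file_xfer_error>"
)


def upload_failures(report):
    """Return a list of (file name, error code) pairs (hypothetical helper)."""
    return [(m["name"], int(m["code"])) for m in XFER_RE.finditer(report)]


report = """\
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
"""

failures = upload_failures(report)
for name, code in failures:
    print(name, code, ERROR_LABELS.get(code, "unknown code"))
```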

My System.

My Task

The WorkUnit

Notes: The Ethernet connection to the Internet was disconnected at the time. I was also running POEM (GPU), RNA World and yoyo tasks. Only 4 CPU threads were in use (due to POEM requirements/setup). Write-to-disk interval: 900 s. No other system or BOINC issues.

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 44563 - Posted: 22 Jul 2012, 11:26:11 UTC - in response to Message 44562.  

Yes, I've seen maybe a half-dozen of these in the last few weeks. Malformed tasks that have been automatically re-issued but won't ever work because of the
"REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH"
Just let them die and don't worry about it.

skgiven
Joined: 5 Jun 06
Posts: 28
Credit: 2,790,048
RAC: 0
Message 44564 - Posted: 22 Jul 2012, 13:03:15 UTC - in response to Message 44563.  
Last modified: 22 Jul 2012, 13:04:01 UTC

Thanks for the confirmation. This sort of issue occurs at other projects too, usually when the researchers make a mistake while building the tasks, though deprecated clients have also caused it for auto-generated tasks.

Might it be possible/worthwhile to add an early trickle point or a file-check routine to reduce the loss in such situations, so that these tasks fail earlier rather than after, say, 10 h?

wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,405,498
RAC: 10,268
Message 44565 - Posted: 22 Jul 2012, 13:33:56 UTC

A quick check shows six out of 330 AM3P that I have run this year on three PCs (two XP, one Linux) have zonked out with an error, including this 'output file absent'. That's an attrition rate of less than 2%, which is very low compared with the attrition rates on the longer models.

(I lost an AM3p and a CM3 yesterday to a very short power brownout that caused one PC and the internet router to reboot. The other PC, two laptops, monitors and a printer didn't blink.)

For an ensemble methodology, 2% attrition rate is probably not worth the effort of delving further into the reasons for the error. I simply accept there will be an attrition rate.
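
The arithmetic behind that figure, as a quick check using the counts quoted above (six failures out of 330 tasks):

```python
failed = 6    # tasks that errored out, from the post above
total = 330   # AM3P tasks run this year, from the post above

rate = failed / total
print(f"{failed} of {total} tasks failed: {rate:.1%}")  # 6 of 330 tasks failed: 1.8%
```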

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 44566 - Posted: 22 Jul 2012, 17:09:41 UTC - in response to Message 44563.  

Thanks, that saves me searching for answers. I had two PNW tasks go like this yesterday, though there was a power cut involved as well, so I can't be 100% sure of the cause.

Any typing errors are due to not being used to the tiny netbook keyboard. - Atom slowly making its way through two EU units. I will have to get the extra GB of memory to see if it makes any difference.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 44567 - Posted: 22 Jul 2012, 17:55:28 UTC

This REPLANCA thing is an error in the model. It happened a few months ago so we need to check whether there's a new batch of models with the same problem. It looks as if the headers on ancillary files don't match:

http://cms.ncas.ac.uk/trac/UMHelpdesk/ticket/399

It's a real nuisance that the web pages for these regional models take ages to open up so it's not easy to see what's happening with different WUs.

skgiven
Joined: 5 Jun 06
Posts: 28
Credit: 2,790,048
RAC: 0
Message 44568 - Posted: 22 Jul 2012, 20:06:53 UTC - in response to Message 44567.  
Last modified: 22 Jul 2012, 20:09:51 UTC

From WU 8226400 to 8226430 there are 15 failed tasks; several have failed more than once and none has reported successfully.
All are UK Met Office HADAM3P European Region and all were created at around the same time (20 Jul 2012 5:50:00 to 5:59:00 UTC)

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226418
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226422
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226419
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226418
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226417
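
To step through the whole range rather than a handful of links, the work unit URLs can be generated from the pattern used above. A minimal sketch: it only builds the URLs and does not fetch anything, and the function name `workunit_urls` is invented for the example.

```python
# URL pattern copied from the links above; {} is replaced by the work unit ID.
BASE_URL = "http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid={}"


def workunit_urls(first, last):
    """Return the work unit page URLs for IDs first..last inclusive."""
    return [BASE_URL.format(wuid) for wuid in range(first, last + 1)]


urls = workunit_urls(8226400, 8226430)
print(len(urls), "work unit pages")
for url in urls[:3]:
    print(url)
```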

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 44569 - Posted: 24 Jul 2012, 0:26:34 UTC

I can't get the task pages to open for me at all, even after hours. I can only look at the WU and computer pages. So I can't see whether all the computers are crashing the models with the same error. (I'm discounting computers that can't run any climate models at all and need to have their daily quota minussed until their owners put things right.)

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226420

Now why can this computer with Windows complete one of this batch of models?




mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 44570 - Posted: 24 Jul 2012, 0:51:17 UTC

I've found some Windows machines with the error and two that have now completed their model. There's a single Mac that seems to be crunching one OK. All the other Macs I've found are crashing everything with the usual problem.



mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 44571 - Posted: 24 Jul 2012, 1:25:37 UTC
Last modified: 24 Jul 2012, 1:25:59 UTC

I wonder whether something else unrelated (?) to the REPLANCA error is going on with the EU models. Look at Paolo's computer and its tasks.

It can process Hadcm, Hadam PNW and Hadam SA nicely. But it crashes every Hadam EU in less than a minute as if the computer was misconfigured. These can't all be REPLANCA crashes.

Belfry
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 44572 - Posted: 24 Jul 2012, 13:05:38 UTC

Ah yes, Replanca. I gambled away a small fortune at its beach-side casinos; where I wined and dined an Italian woman whose name I cannot remember....

Where was I? Oh yes, task 14903295, a PNW, just turned up this error at around 98% completion. I have another PNW finishing up shortly; we'll see what happens.

Belfry
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 44573 - Posted: 24 Jul 2012, 15:15:56 UTC
Last modified: 24 Jul 2012, 15:20:02 UTC

No, there is no Replanca ..., nor Italian women whose names I cannot remember for that matter. Just sounded like an exotic place name, like Pollenca or Menorca ;)

Edit: my other PNW finished fine.

skgiven
Joined: 5 Jun 06
Posts: 28
Credit: 2,790,048
RAC: 0
Message 44574 - Posted: 24 Jul 2012, 22:20:08 UTC - in response to Message 44571.  
Last modified: 24 Jul 2012, 23:10:24 UTC

Paolo's Hadam EU tasks on that computer are all crashing with an exit status of -2:

Outcome Client error
Client state Compute error
Exit status -2 (0xfffffffffffffffe)

I think this is an issue with the task or app and nothing to do with Windows, Boinc, manager or client or other apps.


Some of Paolo's other computers are failing due to the REPLANCA issue with Exit status 0, error_code -161 (file_xfer_error):

Exit status 0 (0x0)
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH

Some of these don't seem to run (Error while downloading) but others do run (file_xfer_error):

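Task 14947580 | Work unit 8213025 | Sent 19 Jul 2012 17:57:28 UTC | Reported 21 Jul 2012 3:16:32 UTC | Status: Error while computing | Run time 102,580.61 s | CPU time 100,825.80 s | Claimed credit 399.11 | Granted credit 399.11 | Application: UK Met Office HADAM3P European Region v6.09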

In this case, could the trickle result in a failure (file_xfer_error) that in turn causes the task to be killed, and could all this be linked to the server's availability/responsiveness (pages not loading)?

- More likely one of the ranges is out!
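
On the exit status quoted above, -2 (0xfffffffffffffffe): the hex value is just the same number shown as an unsigned 64-bit integer (two's complement), not a second error code. A small sketch to convert between the two forms; the function name is made up for the example.

```python
def as_unsigned_hex(status, bits=64):
    """Show a signed exit status as unsigned hex of the given width."""
    return hex(status & ((1 << bits) - 1))


print(as_unsigned_hex(-2))  # 0xfffffffffffffffe
print(as_unsigned_hex(0))   # 0x0
```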

[boinc.at] Nowi
Joined: 16 Jul 05
Posts: 32
Credit: 10,513,155
RAC: 0
Message 44577 - Posted: 25 Jul 2012, 11:49:33 UTC

I have the Replanca problem, too. Four models in a row have a computation error after about 13000 s of computation time.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44578 - Posted: 26 Jul 2012, 0:28:29 UTC

Lots of people seem to be getting this. I'm up to my 4th or 5th failure. :(

Information that would be useful:
The actual name of the failed model.
Roughly when it failed.
If you have noticed a mysterious "zip 13" file has been created.

e.g. For one of mine:
hadam3p_eu_8aow_2005_1_008058020_0
REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH

This was between zips 2 and 3, at 25 hours 47 minutes 39 seconds, and a zip 13 was created.


[boinc.at] Nowi
Joined: 16 Jul 05
Posts: 32
Credit: 10,513,155
RAC: 0
Message 44580 - Posted: 26 Jul 2012, 8:38:20 UTC - in response to Message 44578.  

Here are my failed WUs. All failed with Replanca in the stderr.out.

hadam3p_eu_cqxv_2000_1_008083091_2 zip1, zip13 uploaded
hadam3p_eu_ctbo_2009_1_008084522_1 zip1, zip13 uploaded
hadam3p_eu_ctx1_2008_1_008084858_0 zip1, zip13 uploaded
hadam3p_eu_a74l_1990_1_008067608_1 crashed after 8.79 s no zips uploaded
hadam3p_eu_ct79_2004_1_008084440_0 zip1, zip13 uploaded
hadam3p_eu_csgf_2006_1_008083996_0 zip1, zip13 uploaded
hadam3p_eu_crlu_2005_1_008083482_0 zip1, zip13 uploaded
hadam3p_eu_cr5j_2001_1_008083225_0 zip1, zip13 uploaded

I hope that will help.



Dave Roberts
Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 44581 - Posted: 26 Jul 2012, 9:05:53 UTC

I don't know if this info. is useful for comparison/investigative purposes - but just in case...
One of my computers (ID: 1142892 ) has been running tasks of this model successfully for a while, the latest (Task ID 8210373) successfully completing yesterday. The previous run was Task ID 14734712, which completed successfully on 31st May.

Nigel Garvey
Joined: 5 May 10
Posts: 69
Credit: 1,169,103
RAC: 2,258
Message 44582 - Posted: 26 Jul 2012, 9:09:45 UTC - in response to Message 44578.  

hadam3p_eu_cqgw_2005_1_008082804_0 and hadam3p_eu_cqgu_2003_1_008082803_0. Downloaded at 09:14 BST yesterday and run in parallel from then until they apparently "completed" within seconds of each other at getting on for 01:00 this morning. Files _2 to _12 were reported missing and there was indeed a file _13 apparently waiting to be uploaded when network activity resumed. I only remember there being one such _13 file, but I wasn't paying particular attention at the time. Although supposedly several MB in size, it disappeared instantly from the Transfers window when the BOINC client contacted the server.

"REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH" error in both cases.

Mac OS 10.6.8. BOINC 7.0.28.


NG

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 44583 - Posted: 26 Jul 2012, 10:07:26 UTC

Dave, the two models you mentioned were sent to you on 23 May and 17 July, so they were from earlier batches of EU models. The batch generating so many REPLANCA errors was, I think, generated starting on 22 July. I still can't get any task pages for the regional models to open, though, so I can't check what I say against the stderr files of crashed models.



Dave Roberts
Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 44586 - Posted: 26 Jul 2012, 10:14:06 UTC
Last modified: 26 Jul 2012, 10:18:05 UTC

Re my previous post on successful completions - I've just had a look at the messages and found the following, regarding successful uploads of zip 13 files after successful uploads of zips 1-12.

Wed Jul 25 22:34:37 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:39:48 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:52:59 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
Wed Jul 25 22:53:02 2012 climateprediction.net Computation for task hadam3p_eu_9xz6_1991_1_008055259_0 finished
Wed Jul 25 22:53:03 2012 climateprediction.net Starting hadam3p_eu_ctxm_2007_1_008084866_0
Wed Jul 25 22:53:03 2012 climateprediction.net Starting task hadam3p_eu_ctxm_2007_1_008084866_0 using hadam3p_eu version 609
Wed Jul 25 23:05:50 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip

mo. v - Was preparing this before I saw your post.
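
For comparing upload behaviour between machines, the Started/Finished pairs in a log like the one above can be turned into upload durations. This is a sketch under stated assumptions: it expects the "Wed Jul 25 22:34:37 2012" timestamp style shown above with English day and month names, and `upload_durations` is a name invented for the example.

```python
import re
from datetime import datetime

# Timestamp, project name, then "Started/Finished upload of <file>".
# The layout is assumed from the log excerpt quoted above.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) climateprediction\.net "
    r"(?P<event>Started|Finished) upload of (?P<file>\S+)$"
)


def upload_durations(log_text):
    """Return {file name: seconds between Started and Finished upload}."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an upload line (e.g. "Starting task ...")
        when = datetime.strptime(m["stamp"], "%a %b %d %H:%M:%S %Y")
        if m["event"] == "Started":
            started[m["file"]] = when
        elif m["file"] in started:
            delta = when - started.pop(m["file"])
            durations[m["file"]] = delta.total_seconds()
    return durations


sample = """\
Wed Jul 25 22:34:37 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:39:48 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:52:59 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
Wed Jul 25 22:53:02 2012 climateprediction.net Computation for task hadam3p_eu_9xz6_1991_1_008055259_0 finished
Wed Jul 25 23:05:50 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
"""

durations = upload_durations(sample)
for name, seconds in durations.items():
    print(f"{name}: {seconds:.0f} s")
```

On the lines quoted above this gives 311 s for zip 12 and 771 s for zip 13.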

©2024 cpdn.org