Message boards : Number crunching : Output file absent & Too many errors (may have bug)
Author | Message |
---|---|
Joined: 5 Jun 06 Posts: 28 Credit: 2,790,048 RAC: 0 |
Output file absent:
22/07/2012 10:38:50 | climateprediction.net | Computation for task hadam3p_eu_634j_2009_1_008071304_2 finished
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_2.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_3.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_4.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_5.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_6.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_7.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_8.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_9.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_10.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_11.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_12.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
14973021 8226418 1212547 22 Jul 2012 0:47:15 UTC 22 Jul 2012 10:30:11 UTC Error while computing 26,180.15 25,922.02 0.00 --- UK Met Office HADAM3P European Region v6.09
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<stderr_txt>
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish
</stderr_txt>
<message>
upload failure:
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_2.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_3.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_4.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_5.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_6.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_7.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_8.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_9.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_10.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_11.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_12.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
</message>
]]>
-161 is a File Not Found error.
My System. My Task The WorkUnit
Notes. The Ethernet to Internet connection was disconnected at the time. Also running POEM (GPU), RNA world and yoyo tasks. Only 4 CPU threads used (due to POEM requirements/setup). Write to disk @900sec.
No other system or Boinc issues. |
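The upload failures quoted in the stderr output above can be summarised with a few lines of script. This is only a minimal sketch, assuming the `<message>` text has been copied from the task page into a string; `parse_xfer_errors` and `ERROR_LABELS` are illustrative names, not part of BOINC, and the only code given a label is the one the post itself explains.

```python
import re

# Excerpt of the <message> section quoted in the post (two of the eleven entries).
message = """
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_2.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_634j_2009_1_008071304_2_3.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
"""

# Label for -161 taken from the post ("File Not Found"); other codes stay unlabelled.
ERROR_LABELS = {-161: "file not found"}

# One match per <file_xfer_error> block: captures the file name and the numeric code.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*<file_name>(.*?)</file_name>\s*"
    r"<error_code>(-?\d+)</error_code>\s*</file_xfer_error>"
)


def parse_xfer_errors(text):
    """Return a list of (file_name, error_code) pairs found in the text."""
    return [(name, int(code)) for name, code in XFER_ERROR.findall(text)]


errors = parse_xfer_errors(message)
for name, code in errors:
    print(name, code, ERROR_LABELS.get(code, "unknown"))
```

Every entry in the post carries the same code, which fits the reading that the zips were never written rather than that individual uploads failed.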
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Yes, I've seen maybe a half-dozen of these in the last few weeks. Malformed tasks that have been automatically re-issued but won't ever work because of the "REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH" error. Just let them die and don't worry about it. |
Joined: 5 Jun 06 Posts: 28 Credit: 2,790,048 RAC: 0 |
Thanks for the confirmation. This sort of issue occurs at other projects too, usually when the researchers make a mistake when building the tasks, but it has also been caused by deprecated clients for auto-generated tasks. Might it be possible/worthwhile to do an early trickle point, or add a file check routine, in order to reduce the loss in such situations, so that they would fail earlier rather than after, say, 10 hours? |
Joined: 6 Aug 04 Posts: 195 Credit: 28,374,828 RAC: 10,749 |
A quick check shows six out of 330 AM3P that I have run this year on three PCs (two XP, one Linux) have zonked out with an error, including this 'output file absent'. That's less than 2% attrition rate, which is very low compared to the much higher attrition rates on the longer models. (I lost an AM3p and a CM3 yesterday to a very short power brownout that caused one PC and the internet router to reboot. The other PC, two laptops, monitors and a printer didn't blink.) For an ensemble methodology, 2% attrition rate is probably not worth the effort of delving further into the reasons for the error. I simply accept there will be an attrition rate. |
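For anyone wanting to check the figure quoted above, the arithmetic is a one-liner. A small sketch using only the two counts given in the post; the variable names are illustrative.

```python
# Counts quoted in the post: six failures out of 330 AM3P models run this year.
failed = 6
total = 330

rate = failed / total
print(f"attrition rate: {rate:.2%}")
```

This works out at a little over 1.8%, consistent with the "less than 2%" stated in the post.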
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Thanks, that saves me searching for answers. I had two pnw tasks go like this for me yesterday, though there was a power cut involved as well so I can't be 100% sure of the cause. Any typing errors are due to not being used to the tiny netbook keyboard. - The Atom is slowly making its way through two eu units. I will have to get the extra GB of memory to see if it makes any difference. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
This REPLANCA thing is an error in the model. It happened a few months ago so we need to check whether there's a new batch of models with the same problem. It looks as if the headers on ancillary files don't match: http://cms.ncas.ac.uk/trac/UMHelpdesk/ticket/399 It's a real nuisance that the web pages for these regional models take ages to open up so it's not easy to see what's happening with different WUs. Cpdn news |
Joined: 5 Jun 06 Posts: 28 Credit: 2,790,048 RAC: 0 |
From WU 8226400 to 8226430 there are 15 failed tasks, several have failed more than once, none have reported successfully. All are UK Met Office HADAM3P European Region and all were created at around the same time (20 Jul 2012 5:50:00 to 5:59:00 UTC) http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226418 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226422 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226419 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226418 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226417 |
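To look over the whole range mentioned above rather than a handful of links, the workunit page addresses can be generated. A minimal sketch, assuming the URL pattern seen in the links in this post holds for every ID in the range; `workunit_urls` is an illustrative helper and nothing here fetches the pages.

```python
# URL pattern copied from the workunit links quoted in the post.
BASE = "http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid={}"


def workunit_urls(first, last):
    """Build one workunit page URL per ID in the inclusive range first..last."""
    return [BASE.format(wuid) for wuid in range(first, last + 1)]


urls = workunit_urls(8226400, 8226430)
for url in urls:
    print(url)
```

Given how slowly the server pages were reported to load at the time, opening these by hand a few at a time may be kinder than fetching them all at once.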
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I can't get the task pages to open for me at all, even after hours. I can only look at the WU and computer pages. So I can't see whether all the computers are crashing the models with the same error. (I'm discounting computers that can't run any climate models at all and need to have their daily quota minussed until their owners put things right.) http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226420 Now why can this computer with Windows complete one of this batch of models? Cpdn news |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I've found some Windows machines with the error and two that have now completed their model. There's a single Mac that seems to be crunching one OK. All the other Macs I've found are crashing everything with the usual problem. Cpdn news |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I wonder whether something else unrelated (?) to the REPLANCA error is going on with the EU models. Look at Paolo's computer and its tasks. It can process Hadcm, Hadam PNW and Hadam SA nicely. But it crashes every Hadam EU in less than a minute as if the computer was misconfigured. These can't all be REPLANCA crashes. Cpdn news |
Joined: 19 Apr 08 Posts: 179 Credit: 4,306,992 RAC: 0 |
Ah yes, Replanca. I gambled away a small fortune at its beach-side casinos; where I wined and dined an Italian woman whose name I cannot remember.... Where was I? Oh yes, task 14903295 a PNW, just turned this error up at around 98% completion. I have another PNW finishing up shortly, we'll see what happens. |
Joined: 19 Apr 08 Posts: 179 Credit: 4,306,992 RAC: 0 |
No, there is no Replanca ..., nor Italian women whose names I cannot remember for that matter. Just sounded like an exotic place name, like Pollenca or Menorca ;) Edit: my other PNW finished fine. |
Joined: 5 Jun 06 Posts: 28 Credit: 2,790,048 RAC: 0 |
Paolo's Hadam EU tasks on that computer are all crashing with an exit status of -2:
Outcome: Client error
Client state: Compute error
Exit status: -2 (0xfffffffffffffffe)
I think this is an issue with the task or app and nothing to do with Windows, Boinc (manager or client) or other apps.
Some of Paolo's other computers are failing due to the REPLANCA issue with Exit status 0, error_code -161 (file_xfer_error):
Exit status: 0 (0x0)
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
Some of these don't seem to run (Error while downloading) but others do run (file_xfer_error):
14947580 8213025 19 Jul 2012 17:57:28 UTC 21 Jul 2012 3:16:32 UTC Error while computing 102,580.61 100,825.80 399.11 399.11 UK Met Office HADAM3P European Region v6.09
In this case, could the trickle result in a failure (file_xfer_error) that in turn causes the task to be killed, and could all this be linked to the server's availability/responsiveness (pages not loading)? - More likely one of the ranges is out! |
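The two forms of the exit status quoted above are the same number: the hex value is the two's-complement encoding of -2. A short sketch, assuming the status is shown as a 64-bit value (the sixteen hex digits in the post suggest so); both helper names are illustrative.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # all 64 bits set


def as_unsigned_64(value):
    """Two's-complement bit pattern of a signed integer, as an unsigned 64-bit value."""
    return value & MASK_64


def as_signed_64(value):
    """Interpret an unsigned 64-bit bit pattern as a signed integer."""
    return value - (1 << 64) if value >= (1 << 63) else value


print(hex(as_unsigned_64(-2)))           # bit pattern for -2
print(as_signed_64(0xFFFFFFFFFFFFFFFE))  # back to the signed form
```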
Joined: 16 Jul 05 Posts: 32 Credit: 10,513,155 RAC: 0 |
I have the Replanca problem, too. Four models in a row have a computation error after about 13000 s of computation time. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Lots of people seem to be getting this. I'm up to my 4th or 5th failure. :(
Information that would be useful:
The actual name of the failed model.
Roughly when it failed.
Whether you have noticed that a mysterious "zip 13" file has been created.
e.g. For one of mine:
hadam3p_eu_8aow_2005_1_008058020_0
REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
This was between zips 2 and 3, at 25 hours 47 minutes 39 seconds, and a zip 13 was created.
Backups: Here |
Joined: 16 Jul 05 Posts: 32 Credit: 10,513,155 RAC: 0 |
Here are my failed WU. All failed with Replanca in the stderr.out
hadam3p_eu_cqxv_2000_1_008083091_2 zip1, zip13 uploaded
hadam3p_eu_ctbo_2009_1_008084522_1 zip1, zip13 uploaded
hadam3p_eu_ctx1_2008_1_008084858_0 zip1, zip13 uploaded
hadam3p_eu_a74l_1990_1_008067608_1 crashed after 8.79 s no zips uploaded
hadam3p_eu_ct79_2004_1_008084440_0 zip1, zip13 uploaded
hadam3p_eu_csgf_2006_1_008083996_0 zip1, zip13 uploaded
hadam3p_eu_crlu_2005_1_008083482_0 zip1, zip13 uploaded
hadam3p_eu_cr5j_2001_1_008083225_0 zip1, zip13 uploaded
I hope that will help. |
Joined: 15 Jan 11 Posts: 175 Credit: 6,242,691 RAC: 699 |
I don't know if this info. is useful for comparison/investigative purposes - but just in case... One of my computers (ID: 1142892 ) has been running tasks of this model successfully for a while, the latest (Task ID 8210373) successfully completing yesterday. The previous run was Task ID 14734712, which completed successfully on 31st May. |
Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258 |
hadam3p_eu_cqgw_2005_1_008082804_0 and hadam3p_eu_cqgu_2003_1_008082803_0. Downloaded at 09:14 BST yesterday and run in parallel from then until they apparently "completed" within seconds of each other at getting on for 01:00 this morning. Files _2 to _12 were reported missing and there was indeed a file _13 apparently waiting to be uploaded when network activity resumed. I only remember there being one such _13 file, but I wasn't paying particular attention at the time. Although supposedly several MB in size, it disappeared instantly from the Transfers window when the BOINC client contacted the server. "REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH" error in both cases. Mac OS 10.6.8. BOINC 7.0.28. NG |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Dave, the two models you mentioned were sent to you on 23 May and 17 July so they were from earlier batches of EU models. The batch generating so many REPLANCA errors was I think generated starting on 22 July. I still can't get any task pages for the regional models to open up though so I can't check what I say from the stderr files of crashed models. Cpdn news |
Joined: 15 Jan 11 Posts: 175 Credit: 6,242,691 RAC: 699 |
Re my previous post on successful completions - I've just had a look at the messages and found the following, regarding successful uploads of zip 13 files after successful uploads of zips 1-12.
Wed Jul 25 22:34:37 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:39:48 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:52:59 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
Wed Jul 25 22:53:02 2012 climateprediction.net Computation for task hadam3p_eu_9xz6_1991_1_008055259_0 finished
Wed Jul 25 22:53:03 2012 climateprediction.net Starting hadam3p_eu_ctxm_2007_1_008084866_0
Wed Jul 25 22:53:03 2012 climateprediction.net Starting task hadam3p_eu_ctxm_2007_1_008084866_0 using hadam3p_eu version 609
Wed Jul 25 23:05:50 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
mo. v - Was preparing this before I saw your post. |
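The upload lines in the log above can be paired up to see how long each zip took. A minimal sketch, assuming the event log uses the timestamp layout shown in this post and that the script runs under an English locale (the day and month names are parsed by `strptime`); `upload_durations` is an illustrative helper, not a BOINC tool.

```python
import re
from datetime import datetime

# The four upload lines quoted in the post.
log = """\
Wed Jul 25 22:34:37 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:39:48 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:52:59 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
Wed Jul 25 23:05:50 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
"""

# Timestamp, project name, event type, file name.
LINE = re.compile(
    r"^(\w{3} \w{3} +\d+ [\d:]{8} \d{4}) \S+ (Started|Finished) upload of (\S+)$"
)


def upload_durations(text):
    """Map each file name to the seconds between its Started and Finished lines."""
    started = {}
    durations = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # ignore lines that are not upload events
        stamp = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        event, name = match.group(2), match.group(3)
        if event == "Started":
            started[name] = stamp
        elif name in started:
            durations[name] = int((stamp - started[name]).total_seconds())
    return durations


durations = upload_durations(log)
for name, seconds in durations.items():
    print(name, seconds, "s")
```

On these lines the zip 13 upload on a successful run took several minutes, unlike the zip 13 reported elsewhere in the thread that vanished from the Transfers window instantly.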
©2024 cpdn.org