Message boards : Number crunching : Multiple failures
Author | Message |
---|---|
Send message Joined: 15 Jan 11 Posts: 175 Credit: 6,242,691 RAC: 699 |
Computer ID: 1128383. All tasks that I have run recently have ended in some sort of failure - not just 'Short' tasks. I don't really know where to start showing the error messages, there are so many. Can anyone tell me whether this is a general Mac problem, a problem in general, or a problem with my machine? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Dave, I think that it's Macs and Windows in general, for a lot of the current model types. Only Linux seems to be free of the high attrition rates. |
Send message Joined: 15 Jan 11 Posts: 175 Credit: 6,242,691 RAC: 699 |
Hi Les, Thanks for the reply. I've just been looking at one task in particular, where the last zip file was created and uploaded without an apparent error. You've probably had this kind of data before, but just in case: Task ID 17464219. The messages on my Mac showed this as finishing OK: Sun 30 Nov 07:54:02 2014 climateprediction.net Computation for task hadam3p_pnw_h25q_2012_1_009207513_0 finished; Sun 30 Nov 07:54:02 2014 climateprediction.net Started upload of hadam3p_pnw_h25q_2012_1_009207513_0_13.zip; Sun 30 Nov 08:09:13 2014 climateprediction.net Finished upload of hadam3p_pnw_h25q_2012_1_009207513_0_13.zip. But the site info showed a failure: hadam3p_pnw_h25q_2012_1_009207513_0; Workunit: 9333135; Created: 21 Nov 2014 17:42:01 UTC; Sent: 21 Nov 2014 20:52:20 UTC; Received: 30 Nov 2014 8:50:37 UTC; Server state: Over; Outcome: Client error; Client state: Compute error; Exit status: 9 (0x9); Computer ID: 1128383; Report deadline: 4 Nov 2015 2:12:20 UTC; Run time: 355,591.26; CPU time: 347,667.40; Validate state: Invalid; Claimed credit: 0.00; Granted credit: 507.13; Application version: UK Met Office HadAM3P-HadRM3P Pacific North West v7.22. Are there any clues as to what is going on here? David |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Exit status 9 is a common problem with Macs. But all of the data gets sent back, so it's more of an irritant on the tasks list. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
It's actually more than an irritant because the task gets reissued, which means that the Mac effort is duplicated by someone else. I've stopped running any 7-series model types on my Mac since they all fail with "code 9" and I don't want to encourage the project to consume twice as much electricity as is actually needed. The project has withdrawn one Mac application but, in my opinion, the 7-series Mac applications should all be withdrawn unless someone can turn up a class of Macs that do run them successfully - and perhaps from that identify a solution or workaround. I think I've reached my nag quota on this one. |
Send message Joined: 15 Jan 11 Posts: 175 Credit: 6,242,691 RAC: 699 |
Thanks guys - I'll keep a close watch on what I download. PS Which of the models aren't 7 series? |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
6-series: HadAM3P-HadRM3P Europe 6.09; HadAM3P-HadRM3P Australia New Zealand 6.10; Coupled Model Full Resolution Ocean 6.07 (i.e. HADCM3N). There are some new ANZ models out at the moment, and HADCM3N tasks appear from time to time. |
Send message Joined: 24 Nov 14 Posts: 1 Credit: 53,815 RAC: 0 |
Just downloaded a UK MET Office HadCm3short 7.24 task, hadcm3s_4zuz_2007.2.009199733,4: again, Computation error! All HadCm3short 7.24 tasks have had errors. I do not know why. Using Boinc Manager 7.4.27 (X86), wxWidgets 3.0.1. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
Re "Just downloaded a UK MET Office HadCm3short 7.24": there is an article in the news thread about this series of models. It is a good idea to subscribe to that thread to keep up to date. |
Send message Joined: 15 Jan 11 Posts: 175 Credit: 6,242,691 RAC: 699 |
Re: Coupled Model Full Resolution Ocean v6.07 (Hadcm3n). There seem to be a lot of early failures with this model recently. Is there a known problem that needs to be resolved, or are these cases of pushing the boundaries? Is it worth downloading tasks? The following are just WUs from which I've had tasks and which also had early failures on other boxes. Workunits: 9244652, 9245944, 9245646, 9278183 |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
Looks like a combination of reissues, machines that don't finish much at all and the known decade sensitivity for the work units that crashed on your machine. Since you have a Mac it might be worth stocking up on ANZ and EU models until a clearer picture emerges about the new HADCM3N batch. |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
Definitely something strange is going on. My Windows 8.1 box is working through lots of Hadcm3n's that have failed on other boxes. Likewise, it is successful with lots of PNWs that have failed elsewhere. On the other hand, every ANZ for my box results in a failed download, so I've changed my preferences to exclude them. As for the EUs, although included in my preferences, I've yet to receive one! |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
I forgot to add that I've yet to receive an African task either. |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
A further thought. The work unit numbers of 3 of your failures may be significant. Those numbered 924---- were issued in late October and had multiple failures back then. You perhaps had some that had been sitting around on a slow box before they failed and were re-issued to you. Regarding your other work unit, I've been successful with work units in the 927---- range. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
ed2353: The Windows 8.1 machine completed an ANZ task on 10 October, but is now failing because it can't download the executable file - which it must have had at the time the earlier ANZ completed. So the question is how has the ANZ executable disappeared from that machine? |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
Perhaps some clumsy "tidying" by me! I may have an old enough backup of BOINC, but if not, how can I get it back? |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
hadam3p_anz_6.10_windows_intelx86.exe is present in the project folder. |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
The download error given is: <core_client_version>7.2.33</core_client_version> <![CDATA[ <message> app_version download error: couldn't get input files: <file_xfer_error> <file_name>hadam3p_anz_se_6.10_windows_intelx86.zip</file_name> <error_code>-200</error_code> </file_xfer_error> </message> ]]> |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There are a LOT of files downloaded for each data set, which you can see by looking at the Event Log. Usually, the message "app_version download error: couldn't get input files" means that the file is no longer on the server, because that data set is from way back and the relevant files have been deleted there. Also, the missing file is hadam3p_anz_se_6.10_windows_intelx86.zip, whereas the file that you found is hadam3p_anz_6.10_windows_intelx86.exe. |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
Happy New Year to you, Les. hadam3p_anz_se_6.10_windows_intelx86.zip is also present in the folder.
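
For anyone comparing file names by eye as in the posts above, here is a minimal Python sketch of the same check. It is not part of BOINC or the project. The function names and the `project_folder` placeholder are made up for illustration, and the parsing assumes the error text looks like the message quoted in this thread. It extracts the file name and error code from the message and reports whether a file with exactly that name exists in a folder you point it at.

```python
import re
from pathlib import Path

# The error text as quoted in the thread, used here as sample input.
SAMPLE_ERROR = (
    "<core_client_version>7.2.33</core_client_version> <![CDATA[ <message> "
    "app_version download error: couldn't get input files: <file_xfer_error> "
    "<file_name>hadam3p_anz_se_6.10_windows_intelx86.zip</file_name> "
    "<error_code>-200</error_code> </file_xfer_error> </message> ]]>"
)

# Matches one <file_xfer_error> block: a file name followed by an error code.
# The text is not a single well-formed XML document, so a regex is used
# instead of an XML parser.
_XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)


def parse_transfer_errors(text):
    """Return a list of (file_name, error_code) pairs found in the text."""
    return [
        (m.group("name").strip(), int(m.group("code")))
        for m in _XFER_ERROR.finditer(text)
    ]


def files_present(names, folder):
    """Map each file name to True if a file of that exact name is in folder."""
    folder = Path(folder)
    return {name: (folder / name).is_file() for name in names}


if __name__ == "__main__":
    # Hypothetical placeholder: replace with the actual project folder.
    project_folder = Path(".")
    errors = parse_transfer_errors(SAMPLE_ERROR)
    present = files_present([name for name, _ in errors], project_folder)
    for name, code in errors:
        status = "present" if present[name] else "not found"
        print(f"{name}: error code {code}, {status} in {project_folder}")
```

Matching on the exact name matters here, since the file in the error (`..._anz_se_6.10_..._intelx86.zip`) and the one first found in the folder (`..._anz_6.10_..._intelx86.exe`) differ only slightly. The sketch only checks for the file's presence and makes no attempt to interpret the error code.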
©2024 cpdn.org