Thread 'Multiple failures'

Dave Roberts
Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 50903 - Posted: 1 Dec 2014, 0:05:09 UTC

Computer ID: 1128383

All tasks that I have run recently have ended in some sort of failure - not just 'Short' tasks.

I don't really know where to start with showing the error messages, as there are so many.

Can anyone tell me whether this is a general Mac problem, a problem in general, or a problem with my machine?

ID: 50903
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50904 - Posted: 1 Dec 2014, 0:51:56 UTC - in response to Message 50903.  

Hi Dave

I think that it's Macs and Windows in general, for a lot of the current model types.

Only Linux seems to be free of the high attrition rates.

ID: 50904
Dave Roberts
Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 50910 - Posted: 1 Dec 2014, 17:59:45 UTC - in response to Message 50904.  

Hi Les,

Thanks for the reply.
I've just been looking at one task in particular, where the last zip file was created and uploaded without any apparent error. You've probably had this kind of data before, but just in case:

Task ID: 17464219
The messages on my Mac showed this as finishing OK:

Sun 30 Nov 07:54:02 2014 climateprediction.net Computation for task hadam3p_pnw_h25q_2012_1_009207513_0 finished

Sun 30 Nov 07:54:02 2014 climateprediction.net Started upload of hadam3p_pnw_h25q_2012_1_009207513_0_13.zip

Sun 30 Nov 08:09:13 2014 climateprediction.net Finished upload of hadam3p_pnw_h25q_2012_1_009207513_0_13.zip
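
For anyone wanting to check the same thing in their own event log, here is a minimal Python sketch that reads the three lines above and works out how long the upload took. It assumes every line follows the layout shown (weekday, day, month, time, year, project, message) and that month and weekday names are in English; the function names are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# Event-log lines copied from the post above.
LOG_LINES = [
    "Sun 30 Nov 07:54:02 2014 climateprediction.net Computation for task hadam3p_pnw_h25q_2012_1_009207513_0 finished",
    "Sun 30 Nov 07:54:02 2014 climateprediction.net Started upload of hadam3p_pnw_h25q_2012_1_009207513_0_13.zip",
    "Sun 30 Nov 08:09:13 2014 climateprediction.net Finished upload of hadam3p_pnw_h25q_2012_1_009207513_0_13.zip",
]


def parse_line(line):
    """Split one log line into (timestamp, project, message).

    Assumes the first five space-separated fields form the timestamp and
    the sixth is the project name, as in the lines quoted above.
    """
    parts = line.split(" ", 6)
    stamp = datetime.strptime(" ".join(parts[:5]), "%a %d %b %H:%M:%S %Y")
    return stamp, parts[5], parts[6]


def upload_seconds(lines, file_name):
    """Return the seconds between 'Started upload' and 'Finished upload'.

    Returns None if either message is missing for the given file.
    """
    started = finished = None
    for line in lines:
        stamp, _project, message = parse_line(line)
        if message == f"Started upload of {file_name}":
            started = stamp
        elif message == f"Finished upload of {file_name}":
            finished = stamp
    if started is None or finished is None:
        return None
    return (finished - started).total_seconds()


duration = upload_seconds(LOG_LINES, "hadam3p_pnw_h25q_2012_1_009207513_0_13.zip")
print(duration)  # 911.0 seconds, i.e. 15 min 11 s
```

A `None` result would mean the log never recorded the upload finishing, which is a different situation from the one described here.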

But the site info showed a failure:

Task: hadam3p_pnw_h25q_2012_1_009207513_0
Workunit: 9333135
Created: 21 Nov 2014 17:42:01 UTC
Sent: 21 Nov 2014 20:52:20 UTC
Received: 30 Nov 2014 8:50:37 UTC
Server state: Over
Outcome: Client error
Client state: Compute error
Exit status: 9 (0x9)
Computer ID: 1128383
Report deadline: 4 Nov 2015 2:12:20 UTC
Run time: 355,591.26
CPU time: 347,667.40
Validate state: Invalid
Claimed credit: 0.00
Granted credit: 507.13
Application version: UK Met Office HadAM3P-HadRM3P Pacific North West v7.22
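
To put the run time and CPU time figures in perspective, here is a rough Python sketch of the arithmetic. It assumes both figures are in seconds, which the task page quoted here does not state, and the function name is only illustrative.

```python
# Figures copied from the task page above; the unit (seconds) is an assumption.
RUN_TIME = 355_591.26
CPU_TIME = 347_667.40

SECONDS_PER_DAY = 86_400


def summarize(run_time, cpu_time):
    """Convert run time to days and express CPU time as a fraction of run time."""
    return {
        "run_days": run_time / SECONDS_PER_DAY,
        "cpu_fraction": cpu_time / run_time,
    }


summary = summarize(RUN_TIME, CPU_TIME)
print(f"{summary['run_days']:.2f} days, {summary['cpu_fraction']:.1%} CPU")
# 4.12 days, 97.8% CPU
```

Under that assumption the task had about four days of almost uninterrupted computing behind it when it was marked as a compute error.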

Are there any clues as to what is going on here?

David
ID: 50910
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50915 - Posted: 1 Dec 2014, 21:37:28 UTC - in response to Message 50910.  

Exit status 9 is a common problem with Macs.
But all of the data gets sent back, so it's more of an irritant on the tasks list.

ID: 50915
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 50916 - Posted: 1 Dec 2014, 23:59:24 UTC

It's actually more than an irritant because the task gets reissued, which means that the Mac effort is duplicated by someone else. I've stopped running any 7-series model types on my Mac since they all fail with "code 9" and I don't want to encourage the project to consume twice as much electricity as is actually needed.

The project has withdrawn one Mac application but, in my opinion, the 7-series Mac applications should all be withdrawn unless someone can turn up a class of Macs that do run them successfully - and perhaps from that identify a solution or workaround. I think I've reached my nag quota on this one.
ID: 50916
Dave Roberts
Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 50920 - Posted: 2 Dec 2014, 11:54:17 UTC - in response to Message 50916.  
Last modified: 2 Dec 2014, 12:01:55 UTC

Thanks guys - I'll keep a close watch on what I download.

PS Which of the models aren't 7 series?
ID: 50920
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 50921 - Posted: 2 Dec 2014, 12:36:48 UTC

6-series:

HadAM3P-HadRM3P Europe 6.09
HadAM3P-HadRM3P Australia New Zealand 6.10
Coupled Model Full Resolution Ocean 6.07 (i.e. HADCM3N)

There are some new ANZ models out at the moment, and HADCM3N tasks appear from time to time.
ID: 50921
Boots
Joined: 24 Nov 14
Posts: 1
Credit: 53,815
RAC: 0
Message 50933 - Posted: 4 Dec 2014, 4:54:07 UTC

Just downloaded a UK Met Office HadCm3short 7.24 task,
hadcm3s_4zuz_2007.2.009199733,4, and again: computation error! All HadCm3short 7.24 tasks have had errors. I do not know why.

Using Boinc Manager 7.4.27 (X86) wxWidgets 3.0.1

ID: 50933
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 50934 - Posted: 4 Dec 2014, 7:35:39 UTC - in response to Message 50933.  
Last modified: 4 Dec 2014, 7:36:18 UTC

Just downloaded a UK Met Office HadCm3short 7.24 task,
hadcm3s_4zuz_2007.2.009199733,4, and again: computation error! All HadCm3short 7.24 tasks have had errors. I do not know why.

Using Boinc Manager 7.4.27 (X86) wxWidgets 3.0.1


There is an article in the news thread, here, about this series of models. It is a good idea to subscribe to that thread to keep up to date.
ID: 50934
Dave Roberts
Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 51086 - Posted: 31 Dec 2014, 10:34:52 UTC

Re: Coupled Model Full Resolution Ocean v6.07 (Hadcm3n)

There seem to be a lot of early failures with this model recently. Is there a known problem that needs to be resolved, or are these cases of pushing the boundaries? Is it worth downloading tasks?
The following are just the WUs from which I've had tasks and which also had early failures on other boxes.
Workunits
9244652
9245944
9245646
9278183
ID: 51086
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 51087 - Posted: 31 Dec 2014, 15:23:29 UTC - in response to Message 51086.  

Looks like a combination of reissues, machines that don't finish much at all and the known decade sensitivity for the work units that crashed on your machine. Since you have a Mac it might be worth stocking up on ANZ and EU models until a clearer picture emerges about the new HADCM3N batch.
ID: 51087
ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,290,001
RAC: 13,288
Message 51088 - Posted: 31 Dec 2014, 18:07:21 UTC - in response to Message 51087.  

Definitely something strange is going on.

My Windows 8.1 box is working through lots of Hadcm3n tasks that have failed on other boxes.
Likewise, it is successful with lots of PNWs that have failed elsewhere.

On the other hand, every ANZ for my box results in a failed download, so I've changed my preferences to exclude them.

As for the EUs, although included in my preferences, I've yet to receive one!

ID: 51088
ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,290,001
RAC: 13,288
Message 51089 - Posted: 31 Dec 2014, 18:10:57 UTC - in response to Message 51088.  

I forgot to add that I've yet to receive an African task either.
ID: 51089
ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,290,001
RAC: 13,288
Message 51090 - Posted: 31 Dec 2014, 19:04:46 UTC - in response to Message 51089.  

A further thought.

The work unit numbers of three of your failures may be significant.
Those numbered 924---- were issued in late October and had multiple failures back then.
You perhaps had some that had been sitting around on a slow box before they failed and were reissued to you.

Regarding your other work unit, I've been successful with work units in the 927---- range.
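
The grouping described here can be sketched in a few lines of Python. The work unit numbers are the four listed earlier in the thread; treating the first three digits as a batch marker is only the working assumption of this post, not something the project has confirmed, and the function name is illustrative.

```python
from collections import Counter

# Work unit numbers listed earlier in the thread.
WORKUNITS = [9244652, 9245944, 9245646, 9278183]


def count_by_prefix(workunit_ids, digits=3):
    """Count work units by their leading digits (assumed to hint at issue date)."""
    return Counter(str(wu)[:digits] for wu in workunit_ids)


counts = count_by_prefix(WORKUNITS)
print(dict(counts))  # {'924': 3, '927': 1}
```

Three of the four fall in the 924---- range, with the remaining one in 927----.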
ID: 51090
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 51091 - Posted: 31 Dec 2014, 19:28:56 UTC - in response to Message 51088.  
Last modified: 31 Dec 2014, 19:29:10 UTC

ed2353: The Windows 8.1 machine completed an ANZ task on 10 October, but is now failing because it can't download the executable file - which it must have had at the time the earlier ANZ completed. So the question is how has the ANZ executable disappeared from that machine?
ID: 51091
ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,290,001
RAC: 13,288
Message 51093 - Posted: 1 Jan 2015, 1:25:10 UTC - in response to Message 51091.  

Perhaps some clumsy "tidying" by me!

I may have an old enough backup of BOINC, but if not, how can I get it back?
ID: 51093
ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,290,001
RAC: 13,288
Message 51094 - Posted: 1 Jan 2015, 1:38:59 UTC - in response to Message 51091.  
Last modified: 1 Jan 2015, 1:39:50 UTC

hadam3p_anz_6.10_windows_intelx86.exe

is present in the project folder.
ID: 51094
ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,290,001
RAC: 13,288
Message 51095 - Posted: 1 Jan 2015, 1:59:52 UTC - in response to Message 51091.  

The download error given is:

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>hadam3p_anz_se_6.10_windows_intelx86.zip</file_name>
<error_code>-200</error_code>
</file_xfer_error>

</message>
]]>
ID: 51095
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51096 - Posted: 1 Jan 2015, 2:45:04 UTC

There are a LOT of files downloaded for each data set, which you can see by looking at the Event Log.
Usually, the message "app_version download error: couldn't get input files:" means that the file is no longer on the server: the data set is from way back, and the relevant files have since been deleted there.

Also, the missing file is:
hadam3p_anz_se_6.10_windows_intelx86.zip
whereas the file that you found is:
hadam3p_anz_6.10_windows_intelx86.exe
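
To make that comparison less error-prone, here is a small Python sketch that pulls the file name and error code out of the error text quoted above and checks them against a list of files. The regular expression only assumes the tag layout shown in that post; `found_in_folder` is a hand-typed set holding the one file reported as found at this point in the thread, and the function name is illustrative rather than anything from BOINC.

```python
import re

# Error text copied from the earlier post.
STDERR = """<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>hadam3p_anz_se_6.10_windows_intelx86.zip</file_name>
<error_code>-200</error_code>
</file_xfer_error>

</message>
]]>"""


def failed_transfers(text):
    """Return (file_name, error_code) pairs for each <file_xfer_error> block."""
    pattern = re.compile(
        r"<file_xfer_error>\s*<file_name>(.*?)</file_name>\s*"
        r"<error_code>(-?\d+)</error_code>",
        re.DOTALL,
    )
    return [(name, int(code)) for name, code in pattern.findall(text)]


# Assumption: only the file reported as found so far in the thread.
found_in_folder = {"hadam3p_anz_6.10_windows_intelx86.exe"}

for name, code in failed_transfers(STDERR):
    status = "present" if name in found_in_folder else "not in the listed files"
    print(name, code, status)
```

The check is only as good as the list of files fed into it, so it is worth listing the whole project folder rather than a single file.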

ID: 51096
ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,290,001
RAC: 13,288
Message 51097 - Posted: 1 Jan 2015, 11:02:41 UTC - in response to Message 51096.  

Happy New Year to you Les.

hadam3p_anz_se_6.10_windows_intelx86.zip is also present in the folder
ID: 51097