Thread 'Compute Errors on HadCM3 short Tasks'

Message boards : Number crunching : Compute Errors on HadCM3 short Tasks

ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49782 - Posted: 19 Aug 2014, 23:43:38 UTC

My FX8150/16GB/Win7-64 host has errored out on every HadCM3 short task it has crunched today. And, all with the dreaded "INVALID THETA DETECTED".

I wouldn't be too concerned if it was only a few tasks, but it's all of them. And, all my other hosts seem to be running them just fine. Wingmen have thrown errors, too, but only one resulting with the same error, that I can see. This host has had its share of errors on CPDN work, but I think it has been a fairly reliable cruncher in the past.
ID: 49782
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49783 - Posted: 20 Aug 2014, 0:46:39 UTC - in response to Message 49782.  

That error may indicate that the researchers are looking at an area of parameter space near the edge of what's stable. In which case, these failures could be just what they're looking for.
It's a bit reminiscent of what was being found back in the early days in 2003/2004, when tests were being run "all over the place".

Perhaps check the model's 4 character code, and see if failures/successes are in the same name areas.

PS
The only one that I've run so far, was a Success.


ID: 49783
KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 49784 - Posted: 20 Aug 2014, 1:07:04 UTC

i have 2 of the new batch (cm3s)...and 3 of the cm3p's...

when running, each one runs for 10 or 11 seconds, exits with zero status, and immediately starts over...and runs again for 10 or 11 seconds and starts over...

any ideas ???

frank
ID: 49784
KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 49785 - Posted: 20 Aug 2014, 1:25:19 UTC

just checked my wingmen...2 of them seem to be having the same problem (10 seconds and EOJ)...

frank
ID: 49785
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49786 - Posted: 20 Aug 2014, 1:32:37 UTC - in response to Message 49783.  

Les Bayliss wrote:
Perhaps check the model's 4 character code, and see if failures/successes are in the same name areas...

If you mean the hadcm3s_XXXX designation, only 2 of the 12 models attempted are the same, "hadcm3s_1jul". Four are similar with hadcm3s_1pmX. None of those still in progress share either of those designations.
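
A minimal sketch of the check Les suggests, in Python: group task outcomes by the 4-character code in the task name and see whether failures cluster. It assumes task names follow the pattern quoted in this thread (application, code, year, and so on, separated by underscores). The helper names `model_code` and `tally_by_code` are hypothetical, and the outcomes would have to be copied by hand from the task pages; the two sample entries are task names quoted elsewhere in this thread.

```python
from collections import Counter, defaultdict


def model_code(task_name):
    """Return the 4-character code, i.e. the second underscore-separated field."""
    parts = task_name.split("_")
    if len(parts) < 2 or len(parts[1]) != 4:
        raise ValueError(f"unexpected task name: {task_name!r}")
    return parts[1]


def tally_by_code(results, prefix_len=4):
    """Count outcomes per code, or per code prefix if prefix_len < 4.

    results is an iterable of (task_name, outcome) pairs, where outcome is
    any label you choose, such as "success" or "error".
    """
    tally = defaultdict(Counter)
    for task_name, outcome in results:
        tally[model_code(task_name)[:prefix_len]][outcome] += 1
    return {key: dict(counts) for key, counts in tally.items()}


# Two task names quoted in this thread, with the outcomes reported for them.
sample = [
    ("hadcm3s_19gg_1980_2_008916538", "success"),
    ("hadcm3s_19ad_1980_2_008916319", "error"),
]

print(tally_by_code(sample))
print(tally_by_code(sample, prefix_len=2))
```

Passing a shorter `prefix_len` groups "similar" codes together (for example everything starting with `1p`), which is one way to look at "name areas" rather than exact matches. With only a dozen tasks per host the counts will be too small to prove much, so this is only useful once results from several hosts are pooled.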

ID: 49786
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49787 - Posted: 20 Aug 2014, 3:46:10 UTC
Last modified: 20 Aug 2014, 3:50:57 UTC

Mine was 17ov.

Some of the failures were due to the faulty API code from BOINC.
Others seem to be varied.

The only thing that I can suggest is what 'they' were getting ready to say in England during WW II:

Stay calm and carry on.

A late thought:
If there's a high failure rate, Andy can:
A) issue a huge batch in the hope that enough will survive,
or
B) Find and fix a few common denominators.
ID: 49787
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49830 - Posted: 22 Aug 2014, 1:54:31 UTC - in response to Message 49787.  

Finished 3 more successfully.
I think that the problems will be on Windows, especially if run as a service install.

ID: 49830
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49832 - Posted: 22 Aug 2014, 2:58:51 UTC

It looks like I may have just been extremely unlucky at first, as my problem host is more than a day into two new tasks. Wingmen on most of the other tasks that I crashed also bombed out for various reasons; curiously, though, a couple were completed successfully.
ID: 49832
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,083,753
RAC: 15,077
Message 49835 - Posted: 22 Aug 2014, 9:28:27 UTC - in response to Message 49832.  

Running a couple of the "short" runs at the moment. Every 10 secs or so the elapsed and remaining time counters "hiccup", staying on the same time for a second before continuing. Otherwise they look OK. Won't know more until next Wednesday - we are closed for a Bank Holiday on Monday and an extra Uni closed day on Tuesday. Fingers crossed!
ID: 49835
KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 49837 - Posted: 22 Aug 2014, 17:26:02 UTC - in response to Message 49835.  

chavk: are you seeing any increase in completion percentage ???

frank
ID: 49837
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,083,753
RAC: 15,077
Message 49840 - Posted: 22 Aug 2014, 21:41:59 UTC - in response to Message 49837.  

Got one completed - hadcm3s_19gg_1980_2_008916538. At home now so can't check on the other one but the elapsed time was increasing and the remaining time decreasing so -- fingers crossed. I'll know more Wednesday.
ID: 49840
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49841 - Posted: 22 Aug 2014, 23:10:51 UTC - in response to Message 49832.  

Earlier I wrote:
It looks like I may have just been extremely unlucky at first, as my problem host is more than a day into two new tasks...

And now they have finished with no apparent problem. No credit, yet, but that's another issue... ;-)

ID: 49841
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,083,753
RAC: 15,077
Message 49850 - Posted: 24 Aug 2014, 16:36:01 UTC - in response to Message 49841.  
Last modified: 24 Aug 2014, 17:24:14 UTC

Two running on my home machine completed OK. Checking my tasks on my account page it looks as if three running on my work computer have completed OK as well.
ID: 49850
Niall
Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 49856 - Posted: 25 Aug 2014, 16:26:13 UTC

I have now crashed 8 of these units, with no successful completions.

Since these seemed to be glitching for others I paused other work and let them run for a bit to see if they were stable. All crashed after about 20 min. One hadcm3n unit, one hadam3p_eu unit, and two _pnw units are running normally.

Win7, 64 bit, BOINC release 7.2.42.

It appears some are getting better results than others, so I'm leaving these units for those. I have unchecked the box next to these units in my CPDN user preferences.
ID: 49856
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 49857 - Posted: 25 Aug 2014, 16:45:32 UTC

The one I have running has got past 10% so I presume it is OK. I imagine there should be enough information out there now to look for commonalities between the machines that crash them. Also the commonalities between those that succeed.
ID: 49857
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 49858 - Posted: 25 Aug 2014, 16:58:27 UTC

I have one of these short tasks waiting to run ... four (4) other computers have already returned "Error while computing" ... my computer is running 24/7 ... but it probably won't get to this short task until a week or more from now. Should I abort this short task or just leave it?

name hadcm3s_19ad_1980_2_008916319
application UK Met Office HadCM3 short
created 18 Aug 2014 21:38:36 UTC
minimum quorum 1
initial replication 1
max # of error/total/success tasks 5, 5, 1

Error while computing

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9060494
ID: 49858
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,841,902
RAC: 5,047
Message 49859 - Posted: 25 Aug 2014, 17:49:19 UTC

The ones I have had (which have all crashed) have not taken long to crash, so you might as well run the one you have. There are plenty of examples of models running successfully even though all the other models in that work unit have crashed.
ID: 49859
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 49860 - Posted: 25 Aug 2014, 18:13:01 UTC
Last modified: 25 Aug 2014, 18:19:01 UTC

For what it's worth:

My machines completed 108 HadCM3s tasks so far -- plus ten failures, seven of which failed within seconds. If memory serves, the other three fell victim to a power-interruption & restart (along with a HadCM3n with about 300 hours).

About two dozen HadCM3s running now in various stages of completion; barring power problems, all should finish okay.

The machines all run Intel CPUs (Q9300 to i5-4670 Haswell), all with 32-bit boinc v.6.*, most with 6.2.19. All run stock speed. [EDIT: All run Windows_64: Vista, Win7, and (UGH!) one Win8.]
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 49860
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 49861 - Posted: 25 Aug 2014, 18:40:36 UTC - in response to Message 49859.  

Iain Inglis wrote:
The ones I have had (which have all crashed) have not taken long to crash, so you might as well run the one you have. There are plenty of examples of models running successfully even though all the other models in that work unit have crashed.

thank you Iain. I will let all the ones I have run.
ID: 49861
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 49862 - Posted: 25 Aug 2014, 18:49:07 UTC - in response to Message 49860.  

astroWX wrote:
For what it's worth:

My machines completed 108 HadCM3s tasks so far -- plus ten failures, seven of which failed within seconds. If memory serves, the other three fell victim to a power-interruption & restart (along with a HadCM3n with about 300 hours).

About two dozen HadCM3s running now in various stages of completion; barring power problems, all should finish okay.

The machines all run Intel CPUs (Q9300 to i5-4670 Haswell), all with 32-bit boinc v.6.*, most with 6.2.19. All run stock speed. [EDIT: All run Windows_64: Vista, Win7, and (UGH!) one Win8.]

Hi astroWX,

thank you for that information.

My computer also has Intel CPUs --- a dual-socket Dell workstation.

So hopefully my short tasks will finish OK also :)
ID: 49862