Compute Errors on HadCM3 short Tasks
Author | Message |
---|---|
Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0 |
My FX8150/16GB/Win7-64 host has errored out on every HadCM3 short task it has crunched today, all with the dreaded "INVALID THETA DETECTED". I wouldn't be too concerned if it were only a few tasks, but it's all of them, and all my other hosts seem to be running them just fine. Wingmen have thrown errors, too, but only one with the same error, as far as I can see. This host has had its share of errors on CPDN work, but I think it has been a fairly reliable cruncher in the past. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
That error may indicate that the researchers are looking at an area of parameter space near the edge of what's stable. In which case, these failures could be just what they're looking for. It's a bit reminiscent of what was being found back in the early days in 2003/2004, when tests were being run "all over the place". Perhaps check the model's 4 character code, and see if failures/successes are in the same name areas. PS The only one that I've run so far, was a Success. |
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
i have 2 of the new batch (cm3s)...and 3 of the cm3p's... when running, each one runs for 10 or 11 seconds, exits with zero status, and immediately starts over...and runs again for 10 or 11 seconds and starts over... any ideas ??? frank |
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
just checked my wingmen...2 of them seem to be having the same problem (10 seconds and EOJ)... frank |
Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0 |
Les Bayliss wrote: Perhaps check the model's 4 character code, and see if failures/successes are in the same name areas... If you mean the hadcm3s_XXXX designation, only 2 of the 12 models attempted are the same, "hadcm3s_1jul". Four are similar with hadcm3s_1pmX. None of those still in progress share either of those designations. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Mine was 17ov. Some of the failures were due to the faulty API code from BOINC. Others seem to be varied. The only thing that I can suggest is what 'they' were getting ready to say in England during WW II: Stay calm and carry on. A late thought: If there's a high failure rate, Andy can: A) issue a huge batch in the hope that enough will survive, or B) Find and fix a few common denominators. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Finished 3 more successfully. I think that the problems will be on Windows, especially if run as a service install. |
Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0 |
It looks like I may have just been extremely unlucky at first, as my problem host is more than a day into two new tasks. Wingmen on most of the other tasks that I crashed also bombed out for various reasons; curiously, though, a couple were completed successfully. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,036,409 RAC: 14,604 |
Running a couple of the "short" runs at the moment. Every 10 secs or so the elapsed and remaining time counters "hiccup", staying on the same time for a second before continuing. Otherwise they look OK. Won't know more until next Wednesday - we are closed for a Bank Holiday on Monday and an extra Uni closed day on Tuesday. Fingers crossed! |
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
chavk: are you seeing any increase in completion percentage ??? frank |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,036,409 RAC: 14,604 |
Got one completed - hadcm3s_19gg_1980_2_008916538. At home now so can't check on the other one but the elapsed time was increasing and the remaining time decreasing so -- fingers crossed. I'll know more Wednesday. |
Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0 |
Earlier I wrote: It looks like I may have just been extremely unlucky at first, as my problem host is more than a day into two new tasks... And now they have finished with no apparent problem. No credit, yet, but that's another issue... ;-) |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,036,409 RAC: 14,604 |
Two running on my home machine completed OK. Checking my tasks on my account page it looks as if three running on my work computer have completed OK as well. |
Send message Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
I have now crashed 8 of these units, with no successful completions. Since these seemed to be glitching for others, I paused other work and let them run for a bit to see if they were stable. All crashed after about 20 min. One hadcm3n unit, one hadam3p_eu unit, and two _pnw units are running normally. Win7, 64 bit, BOINC release 7.2.42. It appears some people are getting better results than others, so I'm leaving these units for them. I have unchecked the box next to these units in my CPDN user preferences. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
The one I have running has got past 10% so I presume it is OK. I imagine there should be enough information out there now to look for commonalities between the machines that crash them. Also the commonalities between those that succeed. |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
I have one of these short tasks waiting to run. Four (4) other computers have already returned "Error while computing". My computer is running 24/7, but it probably won't get to this short task until a week or more from now. Should I abort this short task or just leave it? name: hadcm3s_19ad_1980_2_008916319; application: UK Met Office HadCM3 short; created: 18 Aug 2014 21:38:36 UTC; minimum quorum: 1; initial replication: 1; max # of error/total/success tasks: 5, 5, 1; Error while computing http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9060494 |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,826,970 RAC: 5,066 |
The ones I have had (which have all crashed) have not taken long to crash, so you might as well run the one you have. There are plenty of examples of models running successfully even though all the other models in that work unit have crashed. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
For what it's worth: My machines completed 108 HadCM3s tasks so far -- plus ten failures, seven of which failed within seconds. If memory serves, the other three fell victim to a power-interruption & restart (along with a HadCM3n with about 300 hours). About two dozen HadCM3s running now in various stages of completion; barring power problems, all should finish okay. The machines all run Intel CPUs (Q9300 to i5-4670 Haswell), all with 32-bit boinc v.6.*, most with 6.2.19. All run stock speed. [EDIT: All run Windows_64: Vista, Win7, and (UGH!) one Win8.] "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
Iain wrote: The ones I have had (which have all crashed) have not taken long to crash, so you might as well run the one you have. There are plenty of examples of models running successfully even though all the other models in that work unit have crashed. Thank you, Iain. I will let all the ones I have run. |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
astroWX wrote: For what it's worth: ... Hi astroWX, thank you for that information. My computer also has Intel CPUs; it is a dual-socket Dell workstation, so hopefully my short tasks will finish OK too :) |
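
Editor's sketch, not part of the original thread: early on it was suggested to check each model's 4-character code and see whether failures and successes cluster in the same name areas. A minimal way to do that tally is shown below in Python. The task-name layout is an assumption inferred from the names quoted in the thread (for example hadcm3s_19gg_1980_2_008916538), the function name `tally_by_code` is hypothetical, and the sample outcomes are illustrative only, not real results.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from names quoted in the thread:
#   <application>_<4-character code>_<year>_<number>_<id>
TASK_NAME = re.compile(
    r"^(?P<app>[a-z0-9]+)_(?P<code>[a-z0-9]{4})_\d{4}_\d+_\d+$"
)


def tally_by_code(results):
    """Count successes and errors per 4-character model code.

    `results` is an iterable of (task_name, succeeded) pairs.
    Names that do not match the assumed layout are skipped.
    """
    tally = defaultdict(lambda: {"ok": 0, "error": 0})
    for name, succeeded in results:
        match = TASK_NAME.match(name)
        if match is None:
            continue
        outcome = "ok" if succeeded else "error"
        tally[match.group("code")][outcome] += 1
    return dict(tally)


# Illustrative data only: the third task name is made up.
sample = [
    ("hadcm3s_19gg_1980_2_008916538", True),
    ("hadcm3s_19ad_1980_2_008916319", False),
    ("hadcm3s_19ad_1980_2_000000001", False),
]

for code, counts in sorted(tally_by_code(sample).items()):
    print(code, counts)
```

If errors pile up under a few codes across many hosts, that would point at the model parameters; if they are spread evenly across codes but concentrated on certain machines, the host setup is the more likely suspect.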