climateprediction.net (CPDN) home page
Thread 'HadCM3 short errors'

Blurf
Joined: 13 Jun 08
Posts: 6
Credit: 1,372,493
RAC: 0
Message 51322 - Posted: 27 Jan 2015, 3:46:43 UTC

Downloads have failed on every one of them for a couple of days now.
ID: 51322

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 51327 - Posted: 27 Jan 2015, 6:13:32 UTC

Your failures are all for work units that were made back in Sept '14, plus a few from Oct '14. No one has been able to run these work units; they are faulty. The successful ones you have run come from Jan '15.

We seem to have to wait till they all cycle through the system to get rid of them.

I have been getting similar errors.

Conan
ID: 51327

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 51334 - Posted: 27 Jan 2015, 15:52:38 UTC - in response to Message 51327.  

Andy says that the Sep/Oct 2014 batches are no longer needed by the researchers and have been removed from the server this morning - they should not trouble us any further.

There is a later batch still in progress, which can be identified by task names starting hadcm3s_7. The researchers do still need this new batch, and its tasks should be allowed to run.
ID: 51334

Blurf
Joined: 13 Jun 08
Posts: 6
Credit: 1,372,493
RAC: 0
Message 51336 - Posted: 28 Jan 2015, 0:05:27 UTC

Got it. Thank you for the update!
ID: 51336

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,034,731
RAC: 14,558
Message 51776 - Posted: 6 Apr 2015, 22:16:48 UTC - in response to Message 51336.  

Have just had some computation errors on hadcm3s-4 models with a year date of 2007. Checking on my account, these were marked as no-resubmission, so they have been aborted.
ID: 51776

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51777 - Posted: 6 Apr 2015, 23:06:02 UTC - in response to Message 51776.  

I have seen that too, and unfortunately they run for 8 to 9 hours before they fail.
ID: 51777

Bob Browett
Joined: 31 Aug 04
Posts: 11
Credit: 2,558,802
RAC: 0
Message 51779 - Posted: 7 Apr 2015, 7:30:49 UTC

Ah!
That explains why all my units are currently going "phut" after wasting my time for 10 hours...
Do they know how many more there are? I will go and crunch for World Community Grid for a couple of weeks, and then come back to CPDN.

Regards

Bob
ID: 51779

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51781 - Posted: 7 Apr 2015, 8:24:16 UTC - in response to Message 51779.  

There are currently no more "short" models, as can be seen on the Server Status page, 5th from the bottom in the blue menu to the left.

(Except, of course, the usual few re-tries that have failed on other computers.)

ID: 51781

Professor Desty Nova
Joined: 19 Sep 04
Posts: 92
Credit: 2,013,293
RAC: 392
Message 51787 - Posted: 7 Apr 2015, 16:35:54 UTC
Last modified: 7 Apr 2015, 16:39:48 UTC

It seems the server decided to send sure-to-fail "No Resubmission" models from November 2014 (or someone pushed the wrong button :-P ). I just received two: one failed (error: Out Of Memory), and the other I aborted because it had already failed for three other people (also error: Out Of Memory).


Professor Desty Nova
Researching Karma the Hard Way
ID: 51787

MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 51789 - Posted: 7 Apr 2015, 23:24:17 UTC - in response to Message 51787.  

Yup, just noticed a bad run, all "no resubmission" from 19 Nov workunits.

Looks like things will be quiet with no more models for the moment, but hey that's life. Might be a good time to blow the dust out of the PC :-)
ID: 51789

alanb1951
Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 51947 - Posted: 12 May 2015, 16:57:59 UTC

There seems to be another batch of "No resubmission" jobs (originally from 22nd December 2014) - I'd had several of these fail (memory allocation error) before I realized...

Since the first of these turned up, I've not seen a single hadcm3s job that isn't from that bad batch, though I presume not all 35,000+ jobs available according to the server status page are bad jobs. So I'm left wondering whether to babysit BOINC/CPDN to watch for bad jobs or to [temporarily] stop taking hadcm3s jobs at all...

Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead...

ID: 51947

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51948 - Posted: 12 May 2015, 20:29:16 UTC - in response to Message 51947.  
Last modified: 12 May 2015, 20:31:13 UTC

Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead...

It could be worse. I had both of my Win7 64-bit machines that were doing the shorts do BSODs on me in the last 24 hours, the first time they have ever done that. I could recover from one (with a CHKDSK for errors), but had to reload the OS on the other. The only thing I can see is that I was picking up a lot of HadCM3 short errors at that time. I didn't know that they could crash machines, but you learn something every day.
ID: 51948

ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,337,237
RAC: 12,975
Message 51949 - Posted: 12 May 2015, 23:23:33 UTC - in response to Message 51948.  

All the 1980s seem to be "No Resubmission".
It would have been helpful if notice had been given that there was a "Rogue" batch, so that we could abort them.
ID: 51949

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 51950 - Posted: 13 May 2015, 4:54:54 UTC - in response to Message 51948.  

Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead...

It could be worse. I had both of my Win7 64-bit machines that were doing the shorts do BSODs on me in the last 24 hours, the first time they have ever done that. I could recover from one (with a CHKDSK for errors), but had to reload the OS on the other. The only thing I can see is that I was picking up a lot of HadCM3 short errors at that time. I didn't know that they could crash machines, but you learn something every day.


That's strange. My Win7 machine went BSOD on me last week. That's the first time it's done that in about 18 months. Do you think it was a hadcm3s that caused it?


ID: 51950

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 51951 - Posted: 13 May 2015, 9:13:10 UTC - in response to Message 51950.  

That's strange. My Win7 machine went BSOD on me last week. That's the first time it's done that in about 18 months. Do you think it was a hadcm3s that caused it?

Well it is a bit of speculation on my part, since it sounds rather unlikely. I also run other BOINC projects on those machines (ATLAS, WCG), but they have thus far never caused BSODs.

I am wondering, though, whether a failed CPDN work unit might leave something behind in memory (I know it does on the disk drive). In that case, before the memory can be cleaned up by the OS, a new work unit could start up and exceed the working memory limit, so it would overflow into virtual memory. If that happened too much, I could see that a BSOD might result. But someone who is more of an expert than I am will have to judge whether that is possible. I don't think that two machines failing (really failing) at once is a coincidence, though; there must be some common connection.
ID: 51951

Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 51979 - Posted: 23 May 2015, 11:30:47 UTC

I seem to be getting a real spate of hadcm3s models which are marked "No Resubmission". It has got to the point where, when one of these starts coming to me from the server, I look to see if it is so marked and abort it before it has fully arrived. Saves bandwidth and time.
ID: 51979

SolarSurfer
Joined: 10 Dec 04
Posts: 15
Credit: 4,870,098
RAC: 0
Message 51980 - Posted: 23 May 2015, 19:01:07 UTC

It seems to me that the error rate for short models is significantly greater on my AMD machines than it is on the Intel ones.
"Nothing will benefit human health and increase chances for survival of life on Earth as much as the evolution to a vegetarian diet."
- Einstein
ID: 51980

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,034,731
RAC: 14,558
Message 51981 - Posted: 23 May 2015, 22:30:32 UTC - in response to Message 51979.  

These are all a batch of 1980 models which will fail due to an error. They should have been stopped from release but have slipped through. I expect Les will confirm that they should be aborted - I have had about 12 in the last few days and have aborted them all. There was a post in another thread about the same problem some months ago.
ID: 51981

Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 51982 - Posted: 24 May 2015, 7:31:11 UTC - in response to Message 51981.  
Last modified: 24 May 2015, 7:31:36 UTC

Alan, While most of my "No Resubmission" tasks are the 1980s batch, a few are not, so it is necessary to check them all.
ID: 51982

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,034,731
RAC: 14,558
Message 51983 - Posted: 24 May 2015, 9:42:59 UTC - in response to Message 51982.  

Just been checking some of my other tasks, with dates of 1994 and 2004, which are giving compute errors on other machines: either no heartbeat or invalid theta errors. I guess these are going to fail at some point. These are not "no resubmission" tasks.
ID: 51983

©2024 cpdn.org