Thread 'HadCM3 short errors'

Author	Message
Blurf Send message Joined: 13 Jun 08 Posts: 6 Credit: 1,372,493 RAC: 0	Message 51322 - Posted: 27 Jan 2015, 3:46:43 UTC Downloads have failed on every one of them for a couple of days now. ID: 51322 · Reply Quote

Conan Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420	Message 51327 - Posted: 27 Jan 2015, 6:13:32 UTC Your failures are all for work units that were made back in Sept '14, plus a few from Oct '14. No one has been able to run these work units; they are faulty. The successful ones you have run come from Jan '15. We seem to have to wait until they all cycle through the system to get rid of them. I have been getting similar errors. Conan ID: 51327 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,718,239 RAC: 8,054	Message 51334 - Posted: 27 Jan 2015, 15:52:38 UTC - in response to Message 51327. Andy says that the Sep/Oct 2014 batches are no longer needed by the researchers and have been removed from the server this morning - they should not trouble us any further. There is a later batch still in progress, which can be identified by task names starting hadcm3s_7 - the researchers do still need this new batch, and they should be allowed to run. ID: 51334 · Reply Quote

Blurf Send message Joined: 13 Jun 08 Posts: 6 Credit: 1,372,493 RAC: 0	Message 51336 - Posted: 28 Jan 2015, 0:05:27 UTC Got it. Thank you for the update! ID: 51336 · Reply Quote

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,033,903 RAC: 14,766	Message 51776 - Posted: 6 Apr 2015, 22:16:48 UTC - in response to Message 51336. Have just had some computation errors on hadcm3s-4 models with a year date of 2007. Checking on my account, these were marked as no-resubmission, so they have been aborted. ID: 51776 · Reply Quote

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 51777 - Posted: 6 Apr 2015, 23:06:02 UTC - in response to Message 51776. I have seen that too, and unfortunately they run for 8 to 9 hours before they fail. ID: 51777 · Reply Quote

Bob Browett Send message Joined: 31 Aug 04 Posts: 11 Credit: 2,558,802 RAC: 0	Message 51779 - Posted: 7 Apr 2015, 7:30:49 UTC Ah! That explains why all my units are currently going "phut" after wasting my time for 10 hours... Do they know how many more there are? I will go and crunch for World Community Grid for a couple of weeks, and then come back to CPDN. Regards Bob ID: 51779 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 51781 - Posted: 7 Apr 2015, 8:24:16 UTC - in response to Message 51779. There are currently no more "short" models, as can be seen on the Server Status page, 5th from the bottom in the blue menu to the left. (Except, of course, the usual few re-tries that have failed on other computers.) ID: 51781 · Reply Quote

Professor Desty Nova Send message Joined: 19 Sep 04 Posts: 92 Credit: 2,013,293 RAC: 392	Message 51787 - Posted: 7 Apr 2015, 16:35:54 UTC Last modified: 7 Apr 2015, 16:39:48 UTC It seems the server decided to send sure-to-fail "No Resubmission" models from November 2014 (or someone pushed the wrong button :-P ). I just received two: one failed (error: Out Of Memory), and the other I aborted because it had already failed for three other people (also error: Out Of Memory). Professor Desty Nova Researching Karma the Hard Way ID: 51787 · Reply Quote

MartinNZ Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 51789 - Posted: 7 Apr 2015, 23:24:17 UTC - in response to Message 51787. Yup, just noticed a bad run, all "no resubmission" from 19 Nov workunits. Looks like things will be quiet with no more models for the moment, but hey that's life. Might be a good time to blow the dust out of the PC :-) ID: 51789 · Reply Quote

alanb1951 Send message Joined: 31 Aug 04 Posts: 37 Credit: 9,581,380 RAC: 3,853	Message 51947 - Posted: 12 May 2015, 16:57:59 UTC There seems to be another batch of "No resubmission" jobs (originally from 22nd December 2014) - I'd had several of these fail (memory allocation error) before I realized... Since the first of these turned up, I've not seen a single hadcm3s job that isn't from that bad batch, though I presume not all 35,000+ jobs available according to the server status page are bad jobs. So I'm left wondering whether to babysit BOINC/CPDN to watch for bad jobs or to [temporarily] stop taking hadcm3s jobs at all... Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead... ID: 51947 · Reply Quote

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 51948 - Posted: 12 May 2015, 20:29:16 UTC - in response to Message 51947. Last modified: 12 May 2015, 20:31:13 UTC Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead... It could be worse. I had both of my Win7 64-bit machines that were doing the shorts do BSODs on me in the last 24 hours, the first time they have ever done that. I could recover from one (with a CHKDSK for errors), but had to reload the OS on the other. The only thing I can see is that I was picking up a lot of HadCM3 short errors at that time. I didn't know that they could crash machines, but you learn something every day. ID: 51948 · Reply Quote

ed2353 Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,336,409 RAC: 12,947	Message 51949 - Posted: 12 May 2015, 23:23:33 UTC - in response to Message 51948. All the 1980s seem to be "No Resubmission". It would have been helpful if notice could have been given that there was a "Rogue" batch, so that we could abort them. ID: 51949 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 51950 - Posted: 13 May 2015, 4:54:54 UTC - in response to Message 51948. Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead... It could be worse. I had both of my Win7 64-bit machines that were doing the shorts do BSODs on me in the last 24 hours, the first time they have ever done that. I could recover from one (with a CHKDSK for errors), but had to reload the OS on the other. The only thing I can see is that I was picking up a lot of HadCM3 short errors at that time. I didn't know that they could crash machines, but you learn something every day. That's strange. My Win7 machine went BSOD on me last week. That's the first time it's done that in about 18 months. Do you think it was a hadcm3s that caused it? ID: 51950 · Reply Quote

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 51951 - Posted: 13 May 2015, 9:13:10 UTC - in response to Message 51950. That's strange. My Win7 machine went BSOD on me last week. That's the first time it's done that in about 18 months. Do you think it was a hadcm3s that caused it? Well, it is a bit of speculation on my part, since it sounds rather unlikely. I also run other BOINC projects on those machines (ATLAS, WCG), but they have thus far never caused BSODs. I am wondering, though, whether a CPDN work unit that fails might leave something behind in memory (I know it does on the disk drive). In that case, before the memory can be cleaned up by the OS, a new work unit could start up and exceed the working memory limit, and it would overflow into virtual memory. If that happened too much, I could see that a BSOD might follow. But someone who is more of an expert than myself will have to judge whether that is possible. I don't think that two machines failing (really failing) at once is a coincidence though; there must be some common connection. ID: 51951 · Reply Quote

Lockleys Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 51979 - Posted: 23 May 2015, 11:30:47 UTC I seem to be getting a real spate of hadcm3s models which are marked "No Resubmission". It has got to the point where, when one of these starts coming to me from the server, I look to see if it is so marked and abort it before it has fully arrived. Saves bandwidth and time. ID: 51979 · Reply Quote

SolarSurfer Send message Joined: 10 Dec 04 Posts: 15 Credit: 4,870,098 RAC: 0	Message 51980 - Posted: 23 May 2015, 19:01:07 UTC It seems to me that the error rate for short models is significantly greater on my AMD machines than it is on the Intel ones. "Nothing will benefit human health and increase chances for survival of life on Earth as much as the evolution to a vegetarian diet." - Einstein ID: 51980 · Reply Quote

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,033,903 RAC: 14,766	Message 51981 - Posted: 23 May 2015, 22:30:32 UTC - in response to Message 51979. These are all a batch of 1980 models which will fail due to an error. They should have been stopped from release but have slipped through. I expect Les will confirm that they should be aborted - I have had about 12 in the last few days and have aborted them all. There was a post in another thread about the same problem some months ago. ID: 51981 · Reply Quote

Lockleys Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 51982 - Posted: 24 May 2015, 7:31:11 UTC - in response to Message 51981. Last modified: 24 May 2015, 7:31:36 UTC Alan, While most of my "No Resubmission" tasks are the 1980s batch, a few are not, so it is necessary to check them all. ID: 51982 · Reply Quote

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,033,903 RAC: 14,766	Message 51983 - Posted: 24 May 2015, 9:42:59 UTC - in response to Message 51982. Just been checking some of my other tasks which have dates 1994 and 2004 which are giving compute errors on other machines. Either no heartbeat or invalid theta errors. I guess these are going to fail at some point. These are not "no resubmission" tasks. ID: 51983 · Reply Quote