Message boards : Number crunching : HadCM3 short errors
Author | Message |
---|---|
Joined: 13 Jun 08 Posts: 6 Credit: 1,372,493 RAC: 0 |
Downloads have failed on every one of them for a couple of days now. |
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
Your failures are all for work units that were made back in Sept '14, plus a few from Oct '14. No one has been able to run these work units; they are faulty. The successful ones you have run come from Jan '15. We seem to have to wait until they all cycle through the system to get rid of them. I have been getting similar errors. Conan |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,703,308 RAC: 9,860 |
Andy says that the Sep/Oct 2014 batches are no longer needed by the researchers and have been removed from the server this morning - they should not trouble us any further. There is a later batch still in progress, which can be identified by task names starting hadcm3s_7 - the researchers do still need this new batch, and they should be allowed to run. |
Joined: 13 Jun 08 Posts: 6 Credit: 1,372,493 RAC: 0 |
Got it. Thank you for the update! |
Joined: 22 Feb 06 Posts: 491 Credit: 30,975,898 RAC: 14,500 |
Have just had some computation errors on hadcm3s-4 models with a year date of 2007. Checking on my account, these were marked as no-resubmission, so they have been aborted. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I have seen that too, and unfortunately they run for 8 to 9 hours before they fail. |
Joined: 31 Aug 04 Posts: 11 Credit: 2,558,802 RAC: 0 |
Ah! That explains why all my units are currently going "phut" after wasting my time for 10 hours... Do they know how many more there are? I will go and crunch for World Community Grid for a couple of weeks, and then come back to CPDN. Regards Bob |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There are currently no more "short" models, as can be seen on the Server Status page, 5th from the bottom in the blue menu to the left. (Except, of course, the usual few re-tries that have failed on other computers.) |
Joined: 19 Sep 04 Posts: 92 Credit: 2,010,809 RAC: 335 |
It seems the server decided to send sure-to-fail "No Resubmission" models from November 2014 (or someone pushed the wrong button :-P). I just received two: one failed (error: Out Of Memory), and the other I aborted because it has already failed for three other people (also error: Out Of Memory). Professor Desty Nova Researching Karma the Hard Way |
Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Yup, just noticed a bad run, all "no resubmission" from 19 Nov workunits. Looks like things will be quiet with no more models for the moment, but hey that's life. Might be a good time to blow the dust out of the PC :-) |
Joined: 31 Aug 04 Posts: 37 Credit: 9,581,380 RAC: 3,853 |
There seems to be another batch of "No resubmission" jobs (originally from 22nd December 2014) - I'd had several of these fail (memory allocation error) before I realized... Since the first of these turned up, I've not seen a single hadcm3s job that isn't from that bad batch, though I presume not all 35,000+ jobs available according to the server status page are bad jobs. So I'm left wondering whether to babysit BOINC/CPDN to watch for bad jobs or to [temporarily] stop taking hadcm3s jobs at all... Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead... |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead..." It could be worse. Both of my Win7 64-bit machines that were doing the shorts BSODed on me in the last 24 hours, the first time they have ever done that. I could recover one (with a CHKDSK for errors), but had to reload the OS on the other. The only thing I can see is that I was picking up a lot of HadCM3 short errors at that time. I didn't know that they could crash machines, but you learn something every day. |
Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
All the 1980s seem to be "No Resubmission". It would have been helpful if notice could have been given that there was a "Rogue" batch, so that we could abort them. |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
"Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead..." That's strange. My Win7 machine went BSOD on me last week. That's the first time it's done that in about 18 months. Do you think it was a hadcm3s that caused it? |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"That's strange. My Win7 machine went BSOD on me last week. That's the first time it's done that in about 18 months. Do you think it was a hadcm3s that caused it?" Well, it is a bit of speculation on my part, since it sounds rather unlikely. I also run other BOINC projects on those machines (ATLAS, WCG), but they have thus far never caused BSODs. I am wondering, though, whether a failed CPDN work unit might leave something behind in memory (I know it does on the disk drive). In that case, before the memory can be cleaned up by the OS, a new work unit could start up and exceed the working memory limit, overflowing into virtual memory. If that happened too much, I could see a BSOD happening. But someone who is more of an expert than myself will have to judge whether that is possible. I don't think that two machines failing (really failing) at once is a coincidence though; there must be some common connection. |
Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
I seem to be getting a real spate of hadcm3s models which are marked "No Resubmission". It has got to the point where, when one of these starts coming to me from the server, I look to see if it is so marked and abort it before it has fully arrived. Saves bandwidth and time. |
Joined: 10 Dec 04 Posts: 15 Credit: 4,870,098 RAC: 0 |
It seems to me that the error rate for short models is significantly greater on my AMD machines than it is on the Intel ones. "Nothing will benefit human health and increase chances for survival of life on Earth as much as the evolution to a vegetarian diet." - Einstein |
Joined: 22 Feb 06 Posts: 491 Credit: 30,975,898 RAC: 14,500 |
These are all a batch of 1980 models which will fail due to an error. They should have been stopped from release but have slipped through. I expect Les will confirm that they should be aborted - I have had about 12 in the last few days and have aborted them all. There was a post in another thread about the same problem some months ago. |
Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Alan, While most of my "No Resubmission" tasks are the 1980s batch, a few are not, so it is necessary to check them all. |
Joined: 22 Feb 06 Posts: 491 Credit: 30,975,898 RAC: 14,500 |
Just been checking some of my other tasks which have dates 1994 and 2004 which are giving compute errors on other machines. Either no heartbeat or invalid theta errors. I guess these are going to fail at some point. These are not "no resubmission" tasks. |
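
The rule of thumb given earlier in this thread (task names starting hadcm3s_7 belong to the batch the researchers still need) can be sketched as a small filter. This is only an illustration: the function name `split_tasks` and the sample task names are made up, and the sketch does not look up the "No Resubmission" marking, which the posters above checked on the website by hand.

```python
# Hypothetical helper: sorts task names using the prefix rule quoted in this thread.
# It does NOT detect "No Resubmission" workunits; those still need a manual check.
NEEDED_PREFIX = "hadcm3s_7"  # prefix named in the thread for the batch still wanted


def split_tasks(task_names):
    """Return (keep, check): names with the needed prefix, and everything else."""
    keep = [name for name in task_names if name.startswith(NEEDED_PREFIX)]
    check = [name for name in task_names if not name.startswith(NEEDED_PREFIX)]
    return keep, check


if __name__ == "__main__":
    # Made-up task names for illustration only; real names will differ.
    sample = [
        "hadcm3s_7abc_example_1",
        "hadcm3s_4xyz_example_2",
        "other_model_example_3",
    ]
    keep, check = split_tasks(sample)
    print("let run:", keep)
    print("check on the website before running:", check)
```

Anything that lands in the second list is not necessarily bad; as noted above, some failing tasks are not marked "No Resubmission" at all, so each one still has to be looked at individually.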
©2024 cpdn.org