hadam3p_eu crash 45 seconds in.
Author | Message |
---|---|
Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368 |
Following the post suggesting that a different model series be aborted, I aborted two of those models and downloaded two hadam3p_eu models. Both crashed, one at 44 seconds and one at 45 seconds. I have successfully completed regional models in the past without issue on my Intel i5 Linux box. Any ideas? |
Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258 |
I picked up one this morning which had previously errored after a few seconds on two other machines. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7523642 It may not be started for a while on my machine as I have a hadcm3n running in high priority mode and three other projects. |
Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
Basically the same here. Workunit hadam3p_eu_4hvn_1999_1_007334252_1 ran for 1 min 5 sec. It was the last of 4 hadam3p_eu workunits that failed in about the same length of time. |
Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
Make that 5 workunits that failed in 63 to 74 seconds each. I just noticed the 5th one. |
Joined: 5 Sep 07 Posts: 9 Credit: 10,783,131 RAC: 0 |
Well, my Hadam3p_eu_4g31_etc. started this morning at 05h 33m and finished at 05h 36m. That's a real quickie for more than 200 hrs of work, isn't it? :-) |
Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Me too. Two EU Regionals failed after a few seconds each. I have a CM3i still going along successfully, so it doesn't look like my BOINC installation is at fault. I have set CPDN to No New Tasks until there is some clarity. Is this just early life forcing parameters being a bit fierce or is there something more fundamental wrong with this batch? |
Joined: 15 Feb 06 Posts: 16 Credit: 7,136,835 RAC: 8,361 |
Same here. 9 failed immediately. I have set the project to No New Tasks and, since I now have only 3 models for my 8-core machine, restarted WCG. |
Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Quote: "Is this just early life forcing parameters being a bit fierce or is there something more fundamental wrong with this batch?" Reply: There is definitely something amiss with this batch of hadam3p_eu work. Andy has been investigating the problem all day. From the speed of the crash and the stderr messages on the failed tasks I'd hazard a guess that it's related to the parameters being passed to the global worker (hadam3p_eu_um_*) process. Signature: "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0 |
After having reattached to CPDN, 3 of 4 EU models downloaded together this morning also errored in less than 1 min. Strangely, one is still running with correct graphics. The failed ones showed blue planets while stating "no model is running". Hope this helps. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Darmok, I see that your computer(s) are hidden. Could you please give us a link to the webpage for your EU model that is running? Cpdn news |
Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0 |
Mo, Computer(s) now showing. Hasn't been any trickle yet. Model now running for 1.5 hours. |
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Quote: "After having reattached to CPDN, 3 of 4 eu's downloaded together this am erred also in less than 1 min. Strangely, one is still running with correct graphics. Failed ones showed blue planets while stating 'no model is running'." Reply: The one that keeps on running is from a different batch -- hadam3p_eu2**** not hadam3p_eu4**. The 4* failures are all SIGSEGV or signal 11 -- same thing mostly -- not usually a model problem, more likely a compile problem or something more esoteric. It is happening not just on your or my machines, but everywhere, as far as I can see from checking a few dozen leading CPUs. I'm turning off EU downloads until early next week -- I have a bunch of hadcm3n that will run for a long time. Keep crunching, peace, happiness, this IS the bleeding edge sometimes -- noli perspirare. Eric |
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Mo, Oh, and PS thanks for posting Eric |
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Why never test these loser WUs before testing the volunteers' bandwidth? A few hundred thousand times 100 or so MB -- what's that to a volunteer? These obviously never-tested WUs -- EU4 -- yeah, you can figure it out later, after wasting my time and bandwidth. Why never test before sending a gazillion to us? Huh? Don't you newb clowns test anything before you send a few bazillion WUs out? Forgive me, I've been volunteering my machine's time for a decade -- Did you try even one of these loser models at home before you sent a few hundred thou out to us? Don't think so. It's obvious. Please -- don't abuse the volunteers. Do some minimal testing before you send a totally wasteful broken model times 300,000 to us crunchers. OK? Actually, I'm really annoyed by this last batch of broken s*** that I download, it breaks, ----- Do you do ANY testing before sending this stuff? No, obviously not. And yes, I, and a few others, are annoyed. If you dare, apologize. Eric |
Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368 |
My understanding is that the WUs are autogenerated to a large extent. I guess it would be possible to put a few through some testing before putting out large numbers, but that would imply a delay, in some cases of a few weeks, before it was known that the units don't complete. Sure, in this case, where they fail after a few seconds, perhaps it could have been picked up, but this is the first batch I have come across that has failed that fast. I think, Eric, that this sort of thing is inevitable if a project of this magnitude is running on a shoestring budget. Dave |
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
"In this case" all hundred thou of them fail after a few seconds, not counting the download time. These are regional EU models -- a new batch. Not regen, as far as anyone can see. A very short-term test could easily have caught this problem. And when the lot of them started failing Thursday -- who was watching? Shoestring budget, yeah, I've posted before about that, it's true. But this batch looks like pure slovenly slop to me. Sorry to be offensive, but that's how it looks. Quote: "My understanding is that the wu's are autogenerated to a large extent. I guess it would be possible to put a few through some testing before putting out large numbers but that would imply a delay in some cases of a few weeks before it was known the units don't complete. Sure in this case where they fail after a few seconds perhaps it could have been picked up but this is the first batch I have come across that have failed that fast. I think Erik that this sort of thing is inevitable if a project of this magnitude is running on a shoestring budget." |
Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368 |
In this case that does seem to be true. In the last batch that crashed a lot of tasks, they often did not crash until 100 hours in or more. Testing sufficient tasks to be sure it was universal would not have been realistic. This is the first time I have had almost instant crashes, though. Dave |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
GET OVER IT, ERIC. In case you have forgotten, only last week all the complaints on this forum were about no work being available. |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Let's keep it civil, please. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
It is not normal practice to test batches of work units that are expected to run without problems. The HADAM3P application has been run uneventfully thousands of times and no discussion of which I am aware has suggested that this batch is novel in a way that would have justified a beta test. The CPDN beta site is used to test models that are not expected to work first time, so that wouldn't be the place to test a routine release. Clearly, however, something is badly wrong. The units were released at close of play UK time on Thursday. Some failures were reported overnight and a formal report of that was made by a moderator to the project team early on Friday morning. That's about as quick as it could get, I think. Investigations then went on through Friday. It has been the case in the past that batches sometimes fail because a file is missing or there is some correctable problem server-side or only part of a batch is affected. The first response is therefore not to pull the batch but to determine the cause and fix the batch if possible. A definitive cause has not yet been reported anywhere that I can see. The fact that so many models are affected probably means that the cause will be relatively easy to find. In retrospect, of course, a test batch of a few tens of units would have saved a lot of wasted bandwidth. That has been done in the past for transfers from beta, for example, or other potentially suspect changes, but this batch was thought to be routine. Past work unit problems have ranged from culpable failures in configuration management through spelling errors to blameless weirdnesses in the HADSM/CM/AM models. It doesn't look like there'll be any EU models left by Monday morning, so we'll just have to wait for a fix to be applied and a new batch to be released - and perhaps as a mercy it will be released a bit at a time. |
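
The idea raised in the last post, sending out a test batch of a few tens of units first and the rest only if those survive, could be sketched roughly as below. This is an illustration only: `staged_release`, `run_unit`, `canary_size` and `max_failure_rate` are hypothetical names, the thresholds are arbitrary, and nothing here reflects how CPDN actually generates or releases work units.

```python
from typing import Callable, List


def staged_release(
    units: List[str],
    run_unit: Callable[[str], bool],
    canary_size: int = 20,
    max_failure_rate: float = 0.5,
) -> List[str]:
    """Return the units cleared for general release (hypothetical sketch).

    The first `canary_size` units are tried first. `run_unit` is an assumed
    callback that returns True if a unit survives its first few minutes.
    If too many canary units fail, the rest of the batch is held back.
    """
    canary = units[:canary_size]
    if not canary:
        return []

    # Count canary units that crashed early.
    failures = sum(1 for unit in canary if not run_unit(unit))

    if failures / len(canary) > max_failure_rate:
        return []  # hold the batch back for investigation

    # Canary looked healthy: release everything that was not already tried.
    return units[canary_size:]
```

With a batch that crashes within a minute, as in this thread, every canary unit would fail and the function would return an empty list, so only the first twenty downloads would be wasted. It would not help with batches that fail 100 hours in, which matches the point made earlier in the thread.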