Thread 'hadam3p_eu crash 45 seconds in.'

Author	Message
Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 42512 - Posted: 1 Jul 2011, 10:12:18 UTC Following the post suggesting that a different model series be aborted, I aborted two of those models and downloaded 2 hadam3p_eu models. Both crashed, one at 44 seconds and one at 45 seconds. I have successfully completed regional models in the past without issue on my Intel i5 Linux box. Any ideas? ID: 42512 · Reply Quote

Nigel Garvey Send message Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258	Message 42513 - Posted: 1 Jul 2011, 11:15:05 UTC I picked up one this morning which had previously errored after a few seconds on two other machines. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7523642 It may not be started for a while on my machine as I have a hadcm3n running in high priority mode and three other projects. ID: 42513 · Reply Quote

KWSN - Sir Frank of the Wood Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0	Message 42514 - Posted: 1 Jul 2011, 11:18:38 UTC - in response to Message 42512. Last modified: 1 Jul 2011, 11:32:28 UTC basically the same here...workunit hadam3p_eu_4hvn_1999_1_007334252_1 ran for 1min 5sec...it was the last of 4 hadam3p_eu workunits that failed in about the same length of time... ID: 42514 · Reply Quote

KWSN - Sir Frank of the Wood Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0	Message 42516 - Posted: 1 Jul 2011, 16:27:24 UTC - in response to Message 42514. Make that 5 workunits that failed in 63 to 74 seconds each...just noticed the 5th one... ID: 42516 · Reply Quote

Koert Send message Joined: 5 Sep 07 Posts: 9 Credit: 10,783,131 RAC: 0	Message 42518 - Posted: 1 Jul 2011, 18:41:35 UTC - in response to Message 42516. Well, my Hadam3p_eu_4g31_etc. started this morning at 05h 33m and finished at 05h 36m. That's a real quickie for more than 200hrs work, isn't it? :-) ID: 42518 · Reply Quote

Lockleys Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 42519 - Posted: 1 Jul 2011, 19:10:54 UTC Me too. Two EU Regionals failed after a few seconds each. I have a CM3i still going along successfully, so it doesn't look like my BOINC installation is at fault. I have set CPDN to No New Tasks until there is some clarity. Is this just early life forcing parameters being a bit fierce or is there something more fundamental wrong with this batch? ID: 42519 · Reply Quote

JohnofWem Send message Joined: 15 Feb 06 Posts: 16 Credit: 7,164,170 RAC: 7,950	Message 42520 - Posted: 1 Jul 2011, 19:33:25 UTC Same here. 9 failed immediately. I have set to no new tasks and, since I now have only 3 models for my 8-core, restarted WCG. ID: 42520 · Reply Quote

Thyme Lawn Volunteer moderator Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0	Message 42521 - Posted: 1 Jul 2011, 20:30:21 UTC - in response to Message 42519. Is this just early life forcing parameters being a bit fierce or is there something more fundamental wrong with this batch? There is definitely something amiss with this batch of hadam3p_eu work. Andy has been investigating the problem all day. From the speed of the crash and the stderr messages on the failed tasks I'd hazard a guess that it's related to the parameters being passed to the global worker (hadam3p_eu_um_) process. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer* ID: 42521 · Reply Quote

Darmok Send message Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0	Message 42524 - Posted: 1 Jul 2011, 21:38:38 UTC - in response to Message 42521. Last modified: 1 Jul 2011, 21:46:02 UTC After having reattached to CPDN, 3 of 4 eu's downloaded together this am erred also in less than 1 min. Strangely, one is still running with correct graphics. Failed ones showed blue planets while stating "no model is running". Hope this helps. ID: 42524 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 42525 - Posted: 1 Jul 2011, 22:25:45 UTC Darmok, I see that your computer(s) are hidden. Could you please give us a link to the webpage for your EU model that is running? Cpdn news ID: 42525 · Reply Quote

Darmok Send message Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0	Message 42526 - Posted: 1 Jul 2011, 22:50:19 UTC - in response to Message 42525. Last modified: 1 Jul 2011, 22:51:05 UTC Mo, Computer(s) now showing. Hasn't been any trickle yet. Model now running for 1.5 hours. ID: 42526 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 42527 - Posted: 2 Jul 2011, 7:09:52 UTC - in response to Message 42524. Last modified: 2 Jul 2011, 7:24:00 UTC After having reattached to CPDN, 3 of 4 eu's downloaded together this am erred also in less than 1 min. Strangely, one is still running with correct graphics. Failed ones showed blue planets while stating "no model is running". Hope this helps. The one that keeps on running is from a different batch -- hadam3p_eu2** not hadam3p_eu4. The 4* failures are all sigsegv or signal 11 -- same thing mostly -- not usually a model problem, more likely a compile problem or something more esoteric. Happening not on just your or my machines, but everywhere, as far as I can see from checking a few dozen leading cpus. I'm turning off eu downloads until early next week -- have a bunch of hadcm3n that will run for a long time. Keep crunching, peace, happiness, this IS the bleeding edge sometimes -- noli perspirare. Eric ID: 42527 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 42528 - Posted: 2 Jul 2011, 7:50:24 UTC - in response to Message 42526. Mo, Computer(s) now showing. Hasn't been any trickle yet. Model now running for 1.5 hours. Oh, and PS thanks for posting Eric ID: 42528 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 42529 - Posted: 2 Jul 2011, 11:04:21 UTC - in response to Message 42528. Last modified: 2 Jul 2011, 11:05:51 UTC Why never test these loser WUs before testing the volunteers bandwidth? A few hundred thousand times 100 or so MB -- what's that to a volunteer? These obviously never tested WU -- EU4 -- yeah, you can figure it out later, after wasting my time and bandwidth. Why never test before sending a gazillion to us? Huh? Don't you newb clowns test anything before you send a few bazillion WUs out? Forgive me, I've been volunteering my machine's time for a decade -- Did you try even one of these loser models at home before you sent a few hundred thou out to us? Don't think so.. It's obvious. Please -- don't abuse the volunteers. Do some minimal testing before you send a totally wasteful broken model times 300,000 to us crunchers. OK? Actually, I'm really annoyed by this last batch of broken s*** that I download, it breaks, ----- Do you do ANY testing before sending this stuff? No, obviously not. And yes, I, and a few others, are annoyed. If you dare, apologize. Eric ID: 42529 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 42531 - Posted: 2 Jul 2011, 14:45:11 UTC - in response to Message 42529. My understanding is that the wu's are autogenerated to a large extent. I guess it would be possible to put a few through some testing before putting out large numbers but that would imply a delay in some cases of a few weeks before it was known the units don't complete. Sure in this case where they fail after a few seconds perhaps it could have been picked up but this is the first batch I have come across that have failed that fast. I think Erik that this sort of thing is inevitable if a project of this magnitude is running on a shoestring budget. Dave ID: 42531 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 42532 - Posted: 2 Jul 2011, 14:52:52 UTC - in response to Message 42531. Last modified: 2 Jul 2011, 14:53:39 UTC "In this case" they all hundred thou fail after a few seconds. Not counting the download time. These are regional EU models -- a new batch. Not regen as far as anyone can see. A very short-term test could easily have caught this problem. And when the lot of them started failing Thursday -- who was watching? Shoestring budget, yeah, I've posted before about that, it's true. But this batch looks like pure slovenly slop to me. Sorry to be offensive, but that's how it looks. My understanding is that the wu's are autogenerated to a large extent. I guess it would be possible to put a few through some testing before putting out large numbers but that would imply a delay in some cases of a few weeks before it was known the units don't complete. Sure in this case where they fail after a few seconds perhaps it could have been picked up but this is the first batch I have come across that have failed that fast. I think Erik that this sort of thing is inevitable if a project of this magnitude is running on a shoestring budget. Dave ID: 42532 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 42533 - Posted: 2 Jul 2011, 17:30:18 UTC - in response to Message 42532. In this case that does seem to be true - in the last batch that crashed a lot of tasks, they often did not crash until 100 hours in or more. Testing sufficient tasks to be sure it was universal would not have been realistic. This is the first time I have had almost instant crashes, though. Dave ID: 42533 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 42534 - Posted: 2 Jul 2011, 17:35:59 UTC - in response to Message 42532. GET OVER IT EIRIK. In case you have forgotten, only last week all the complaints on this forum were about no work being available. ID: 42534 · Reply Quote

astroWX Volunteer moderator Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 42535 - Posted: 2 Jul 2011, 22:13:00 UTC Let's keep it civil, please. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 42535 · Reply Quote

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038	Message 42536 - Posted: 2 Jul 2011, 23:09:28 UTC Last modified: 2 Jul 2011, 23:23:10 UTC It is not normal practice to test batches of work units that are expected to run without problems. The HADAM3P application has been run uneventfully thousands of times and no discussion of which I am aware has suggested that this batch is novel in a way that would have justified a beta test. The CPDN beta site is used to test models that are not expected to work first time, so that wouldn't be the place to test a routine release. Clearly, however, something is badly wrong. The units were released at close of play UK time on Thursday. Some failures were reported overnight and a formal report of that made by a moderator to the project team early Friday morning. That's about as quick as it could get, I think. Investigations then went on through Friday. It has been the case in the past that batches sometimes fail because a file is missing or there is some correctable problem server-side or only part of a batch is affected. The first response is not therefore to pull the batch but to determine the cause and fix the batch if possible. A definitive cause has not yet been reported anywhere that I can see. The fact that so many models are affected probably means that the cause will be relatively easy to find. In retrospect, of course, a test batch of a few tens of units would have saved a lot of wasted bandwidth. That has been done in the past for transfers from beta, for example, or other potentially suspect changes, but this batch was thought to be routine. Past work unit problems have ranged from culpable failures in configuration management through spelling errors to blameless weirdnesses in the HADSM/CM/AM models. 
It doesn't look like there'll be any EU models left by Monday morning, so we'll just have to wait for a fix to be applied and a new batch to be released - and perhaps as a mercy it will be released a bit at a time. ID: 42536 · Reply Quote