climateprediction.net (CPDN) home page
Thread 'hadam3p_eu crash 45 seconds in.'

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,012,300
RAC: 21,119
Message 42512 - Posted: 1 Jul 2011, 10:12:18 UTC

Following the post suggesting that a different model series be aborted, I aborted two of those models and downloaded two hadam3p_eu models. Both crashed, one at 44 seconds and one at 45 seconds. I have successfully completed regional models in the past without issue on my Intel i5 Linux box. Any ideas?
ID: 42512
Nigel Garvey

Joined: 5 May 10
Posts: 69
Credit: 1,169,103
RAC: 2,258
Message 42513 - Posted: 1 Jul 2011, 11:15:05 UTC

I picked up one this morning which had previously errored after a few seconds on two other machines.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=7523642

It may not be started for a while on my machine as I have a hadcm3n running in high priority mode and three other projects.
ID: 42513
KWSN - Sir Frank of the Wood

Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 42514 - Posted: 1 Jul 2011, 11:18:38 UTC - in response to Message 42512.  
Last modified: 1 Jul 2011, 11:32:28 UTC

basically the same here...workunit hadam3p_eu_4hvn_1999_1_007334252_1 ran for 1min 5sec...it was the last of 4 hadam3p_eu workunits that failed in about the same length of time...
ID: 42514
KWSN - Sir Frank of the Wood

Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 42516 - Posted: 1 Jul 2011, 16:27:24 UTC - in response to Message 42514.  

Make that 5 workunits that failed in 63 to 74 seconds each...just noticed the 5th one...
ID: 42516
Koert

Joined: 5 Sep 07
Posts: 9
Credit: 10,783,131
RAC: 0
Message 42518 - Posted: 1 Jul 2011, 18:41:35 UTC - in response to Message 42516.  

Well, my Hadam3p_eu_4g31_etc. started this morning at 05h 33m and had finished
at 05h 36m. That's a real quickie for more than 200 hrs of work, isn't it? :-)
ID: 42518
Lockleys

Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 42519 - Posted: 1 Jul 2011, 19:10:54 UTC

Me too. Two EU Regionals failed after a few seconds each. I have a CM3i still going along successfully, so it doesn't look like my BOINC installation is at fault. I have set CPDN to No New Tasks until there is some clarity. Is this just early life forcing parameters being a bit fierce or is there something more fundamental wrong with this batch?
ID: 42519
JohnofWem

Joined: 15 Feb 06
Posts: 16
Credit: 7,140,148
RAC: 8,562
Message 42520 - Posted: 1 Jul 2011, 19:33:25 UTC

Same here. 9 failed immediately. I have set CPDN to No New Tasks and, since I now have only 3 models for my 8-core machine, restarted WCG.
ID: 42520
Thyme Lawn
Volunteer moderator

Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 42521 - Posted: 1 Jul 2011, 20:30:21 UTC - in response to Message 42519.  

Lockleys wrote:
Is this just early life forcing parameters being a bit fierce or is there something more fundamental wrong with this batch?

There is definitely something amiss with this batch of hadam3p_eu work. Andy has been investigating the problem all day. From the speed of the crash and the stderr messages on the failed tasks I'd hazard a guess that it's related to the parameters being passed to the global worker (hadam3p_eu_um_*) process.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 42521
Darmok

Joined: 29 Dec 09
Posts: 34
Credit: 18,395,130
RAC: 0
Message 42524 - Posted: 1 Jul 2011, 21:38:38 UTC - in response to Message 42521.  
Last modified: 1 Jul 2011, 21:46:02 UTC

After having reattached to CPDN, 3 of 4 eu's downloaded together this am erred also in less than 1 min. Strangely, one is still running with correct graphics. Failed ones showed blue planets while stating "no model is running".
Hope this helps.
ID: 42524
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 42525 - Posted: 1 Jul 2011, 22:25:45 UTC

Darmok, I see that your computer(s) are hidden. Could you please give us a link to the webpage for your EU model that is running?
ID: 42525
Darmok

Joined: 29 Dec 09
Posts: 34
Credit: 18,395,130
RAC: 0
Message 42526 - Posted: 1 Jul 2011, 22:50:19 UTC - in response to Message 42525.  
Last modified: 1 Jul 2011, 22:51:05 UTC

Mo,
Computer(s) now showing. Hasn't been any trickle yet. Model now running for 1.5 hours.
ID: 42526
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 42527 - Posted: 2 Jul 2011, 7:09:52 UTC - in response to Message 42524.  
Last modified: 2 Jul 2011, 7:24:00 UTC

Darmok wrote:
After having reattached to CPDN, 3 of 4 eu's downloaded together this am erred also in less than 1 min. Strangely, one is still running with correct graphics. Failed ones showed blue planets while stating "no model is running".
Hope this helps.


the one that keeps on running is from a different batch -- hadam3p_eu2**** not hadam3p_eu4**

the 4* failures are all SIGSEGV, i.e. signal 11 (a segmentation fault) -- not usually a model problem, more likely a compile problem or something more esoteric.
This is happening not just on your or my machines, but everywhere, as far as I can see from checking a few dozen leading CPUs.
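
For anyone wanting to check a failed task's exit code themselves, here is a minimal sketch of how a signal number maps to a name. It assumes the Python `subprocess` convention, where a negative return code means the process was killed by that signal on a POSIX system; it is not the format of the BOINC or CPDN stderr logs, and `describe_exit` is a made-up helper name.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Describe a subprocess-style return code.

    Assumption: a negative value -N means "killed by signal N",
    which is the convention Python's subprocess module uses on POSIX.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


# On Linux, signal 11 is SIGSEGV, so the two names refer to the same failure.
print(describe_exit(-11))
print(describe_exit(0))
```

A normal exit is reported with its status; anything negative is looked up in the platform's signal table.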

I'm turning off eu downloads until early next week -- have a bunch of hadcm3n that will run for a long time

Keep crunching, peace, happiness, this IS the bleeding edge sometimes --

noli perspirare

Eric
ID: 42527
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 42528 - Posted: 2 Jul 2011, 7:50:24 UTC - in response to Message 42526.  

Darmok wrote:
Mo,
Computer(s) now showing. Hasn't been any trickle yet. Model now running for 1.5 hours.


Oh, and PS

thanks for posting

Eric
ID: 42528
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 42529 - Posted: 2 Jul 2011, 11:04:21 UTC - in response to Message 42528.  
Last modified: 2 Jul 2011, 11:05:51 UTC

Why never test these loser WUs before testing the volunteers' bandwidth?
A few hundred thousand times 100 or so MB -- what's that to a volunteer?
These obviously never-tested WUs -- EU4 -- yeah, you can figure it out later, after wasting my time and bandwidth.
Why never test before sending a gazillion to us?
Huh?

Don't you newb clowns test anything before you send a few bazillion WUs out?

Forgive me, I've been volunteering my machine's time for a decade --
Did you try even one of these loser models at home before you sent a few hundred thou out to us? Don't think so.. It's obvious.

Please -- don't abuse the volunteers.

Do some minimal testing before you send a totally wasteful broken model times 300,000 to us crunchers. OK?

Actually, I'm really annoyed by this last batch of broken s*** that I download, it breaks, -----

Do you do ANY testing before sending this stuff?

No, obviously not.

And yes, I, and a few others, are annoyed.

If you dare, apologize.

Eric
ID: 42529
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,012,300
RAC: 21,119
Message 42531 - Posted: 2 Jul 2011, 14:45:11 UTC - in response to Message 42529.  

My understanding is that the WUs are autogenerated to a large extent. I guess it would be possible to put a few through some testing before putting out large numbers, but that would imply a delay, in some cases of a few weeks, before it was known that the units don't complete. Sure, in this case, where they fail after a few seconds, perhaps it could have been picked up, but this is the first batch I have come across that has failed that fast. I think, Erik, that this sort of thing is inevitable if a project of this magnitude is running on a shoestring budget.

Dave
ID: 42531
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 42532 - Posted: 2 Jul 2011, 14:52:52 UTC - in response to Message 42531.  
Last modified: 2 Jul 2011, 14:53:39 UTC

"In this case" all hundred thou of them fail after a few seconds, not counting the download time.
These are regional EU models -- a new batch. Not regen as far as anyone can see.
A very short-term test could easily have caught this problem.
And when the lot of them started failing Thursday -- who was watching?
Shoestring budget, yeah, I've posted before about that, it's true.
But this batch looks like pure slovenly slop to me.
Sorry to be offensive, but that's how it looks.


Dave Jackson wrote:
My understanding is that the WUs are autogenerated to a large extent. I guess it would be possible to put a few through some testing before putting out large numbers, but that would imply a delay, in some cases of a few weeks, before it was known that the units don't complete. Sure, in this case, where they fail after a few seconds, perhaps it could have been picked up, but this is the first batch I have come across that has failed that fast. I think, Erik, that this sort of thing is inevitable if a project of this magnitude is running on a shoestring budget.

Dave

ID: 42532
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,012,300
RAC: 21,119
Message 42533 - Posted: 2 Jul 2011, 17:30:18 UTC - in response to Message 42532.  

In this case that does seem to be true. In the last batch that crashed a lot of tasks, they often did not crash until 100 hours in or more. Testing sufficient tasks to be sure it was universal would not have been realistic. This is the first time I have had almost instant crashes, though.

Dave
ID: 42533
JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42534 - Posted: 2 Jul 2011, 17:35:59 UTC - in response to Message 42532.  

GET OVER IT EIRIK. In case you have forgotten, only last week all the complaints on this forum were about no work being available.
ID: 42534
astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 42535 - Posted: 2 Jul 2011, 22:13:00 UTC

Let's keep it civil, please.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 42535
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 42536 - Posted: 2 Jul 2011, 23:09:28 UTC
Last modified: 2 Jul 2011, 23:23:10 UTC

It is not normal practice to test batches of work units that are expected to run without problems. The HADAM3P application has been run uneventfully thousands of times and no discussion of which I am aware has suggested that this batch is novel in a way that would have justified a beta test. The CPDN beta site is used to test models that are not expected to work first time, so that wouldn't be the place to test a routine release. Clearly, however, something is badly wrong.

The units were released at close of play UK time on Thursday. Some failures were reported overnight and a formal report of that was made by a moderator to the project team early Friday morning. That's about as quick as it could get, I think. Investigations then went on through Friday.

It has been the case in the past that batches sometimes fail because a file is missing or there is some correctable problem server-side or only part of a batch is affected. The first response is not therefore to pull the batch but to determine the cause and fix the batch if possible. A definitive cause has not yet been reported anywhere that I can see. The fact that so many models are affected probably means that the cause will be relatively easy to find.

In retrospect, of course, a test batch of a few tens of units would have saved a lot of wasted bandwidth. That has been done in the past for transfers from beta, for example, or other potentially suspect changes, but this batch was thought to be routine. Past work unit problems have ranged from culpable failures in configuration management through spelling errors to blameless weirdnesses in the HADSM/CM/AM models. It doesn't look like there'll be any EU models left by Monday morning, so we'll just have to wait for a fix to be applied and a new batch to be released - and perhaps as a mercy it will be released a bit at a time.
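
The "bit at a time" idea can be sketched in a few lines. This is only an illustration of a pilot-batch gate, not how the CPDN servers actually release work: `PilotResult`, `should_release_rest` and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PilotResult:
    """Outcome of one work unit from a small pilot batch (hypothetical record)."""
    workunit: str
    succeeded: bool


def should_release_rest(
    results: Sequence[PilotResult],
    min_results: int = 20,          # assumed: wait for "a few tens" of returns
    max_failure_rate: float = 0.2,  # assumed tolerance, not a project figure
) -> bool:
    """Return True only if enough pilot results are in and few of them failed."""
    if len(results) < min_results:
        return False  # too early to judge, keep holding the main batch
    failures = sum(1 for r in results if not r.succeeded)
    return failures / len(results) <= max_failure_rate


# A pilot in which every unit fails within a minute would hold the main batch back.
all_failed = [PilotResult(f"pilot_{i}", False) for i in range(30)]
print(should_release_rest(all_failed))
```

A gate like this only catches fast failures such as this one; models that crash 100 hours in would still pass a short pilot.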
ID: 42536