Thread 'Compute Errors on Pacific North West v7.22 Tasks'

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 49605 - Posted: 22 Jul 2014, 10:39:32 UTC - in response to Message 49598.  

If you have a restriction on your Internet connection then you may be eating into your allowance to no great purpose and at a rather higher rate than would apply if the tasks were valid. Otherwise, there's no harm.

There is no concept, on this project, of an automatically-detected unreliable computer (or indeed a reliable one). There is the 'minussing' procedure but that's manual, in which badly behaved computers have their task download limit manually set to -1 as the result of a report here on the message boards.

To add to what Iain said, the moderators always filter out reports before asking the project team to 'minus' a computer. Computers with a bunch of INITTIME (or other workunit parameter) errors would never end up being 'minussed', but it would be flagged up if it looked like a previously unidentified problem with a batch of work.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 49605
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 49609 - Posted: 23 Jul 2014, 14:19:17 UTC
Last modified: 23 Jul 2014, 14:27:04 UTC

To put it more succinctly -

When one of the "scientist" groups submits a batch of misconfigured impossible models - it may waste volunteers' bandwidth - but that's not a problem.

The model software is complicated and difficult. You expect "scientists" to understand the details? Huh?

Our cpu time and bandwidth is free - for the "amazing scientists" --
Oops - left out a few parameters -- "sorry" actually not sorry at all - it costs the "scientists" zero. They don't care about our bandwidth. They maybe catch a bit of grief if they waste cpu time on their local supercomputer.
This kind of "bad batch" will keep on happening -

And I do not expect any of the "scientists" to ever apologize for submitting a bad batch.
But - I keep on crunching

Cause most of the models seem worthwhile.

These stupid waste-of-time-and-bandwidth blunderbatches will keep on happening.
Because the "scientists" reasonably don't care.

Not a real problem.

Scientifically.

Politically, maybe .

PS - obviously the mods can "minus"
a broken computer --

but nobody can "minus" a serious researcher who just happens to not understand the software, and keeps wasting everybody's time and
bandwidth.
ID: 49609
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 49612 - Posted: 23 Jul 2014, 15:11:50 UTC - in response to Message 49609.  

it costs the "scientists" zero. They don't care about our bandwidth.


Not entirely sure that is true, Eirik.

The scientists pay (not out of their own pocket, of course) Oxford University to put the models onto the system for them and to get the results that they need. Whether they have to pay more to the project if they submit a duff batch I don't know, and the information may be "commercially sensitive"!
ID: 49612
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 49613 - Posted: 23 Jul 2014, 16:25:12 UTC

Yeah, right - actually --"researchers" can use my CPU time.
What we will never know - is --

is the "research center" mostly competent?? no way a contributor can tell.
No way other researchers will tell us volunteers - have to wait for "peer reviewed"

Or is their submitting a "blunderbatch" evidence that their "research center" is disorganized and incompetent?

No way for me to tell.

Wish there were some way to refuse tasks from researchers that don't care about getting their models to meet minimum standards of computability.

Obviously -no way any researcher will give us a blacklist- no way no how.

But - some batches have been total misconfigured crap. AND I want to get back at the obviously incompetent clowns that wasted my CPU time.
BUT - won't ever happen

The incompetent bozos that submit these broken batches are obviously careless, or worse. WHO they are... well, we shouldn't have hired this or that clown.

I'm only moderately annoyed that some incompetent bozo wastes my CPU time.

It happens.

ID: 49613
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 49614 - Posted: 23 Jul 2014, 18:47:24 UTC - in response to Message 49609.  

... it costs the "scientists" zero. They don't care about our bandwidth. ... This kind of "bad batch" will keep on happening ...

This won't come as a surprise to those people interested in environmental matters: abuse of the environmental "resource" will no doubt also keep on happening until it costs significantly more than zero. Not being able to see beyond the end of our noses appears to be an enduring characteristic of the human animal: we just don't seem to be able to be that little bit better - either here or more generally. :-(
ID: 49614
MossyRock
Joined: 4 Oct 13
Posts: 27
Credit: 2,301,681
RAC: 7,632
Message 49638 - Posted: 25 Jul 2014, 12:51:40 UTC - in response to Message 49613.  
Last modified: 25 Jul 2014, 12:55:56 UTC

Wish there were some way to refuse tasks from researchers that don't care about getting their models to meet minimum standards of computability.

Eirik,

If you experience a set of models that aren't running correctly, report it and then just untick the offending application in your preferences and be done with it. When the problem is reported as fixed go back in and re-tick it.

If you've ever done any programming, you will know that even the best programmers in the world make mistakes. Given the complexity of the code for the CPDN models, I'm amazed that they run as well as they do.
ID: 49638
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 49648 - Posted: 26 Jul 2014, 11:13:48 UTC - in response to Message 49638.  

Wish there were some way to refuse tasks from researchers that don't care about getting their models to meet minimum standards of computability.

Eirik,

If you experience a set of models that aren't running correctly, report it and then just untick the offending application in your preferences and be done with it. When the problem is reported as fixed go back in and re-tick it.

If you've ever done any programming, you will know that even the best programmers in the world make mistakes. Given the complexity of the code for the CPDN models, I'm amazed that they run as well as they do.


I also -- understand that submitting models to the old, tested, million lines-of-code FORTRAN thing -- requires skills with ancient Fortran code that most noob underpaid underlings in academia just don't have.

I think I have a clue -- based on word from a long-time friend who used to run short-term forecasts for the USAF -

You can't have enough observations.

You also can't have enough models.

I believe we are doing the best with what we've got.

I will keep on crunching.

The occasional batches of "work" that are totally misconfigured with mismatching params -- hey expected, happens

But I reserve the right to complain loudly when some misconfigured batch
plugs up my limited download bandwidth. As a volunteer, I think I have a right to complain about that.

ID: 49648
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49651 - Posted: 26 Jul 2014, 14:25:47 UTC - in response to Message 49600.  
Last modified: 26 Jul 2014, 14:26:03 UTC

Earlier I wrote:
Progress, perhaps? Only a few minutes in on these new tasks that were created earlier today, but at least they didn't crash right away.

Workunit 9031281
Workunit 9031276

Workunit 9031281 finished okay. This is one of four PNWs I got that were created about 5 days after the first batch. Two of them are still running, but Workunit 9031276 crashed after about 20 hours.
ID: 49651
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 49653 - Posted: 27 Jul 2014, 6:17:28 UTC

Remember that even with correctly configured models, some tasks will crash because an impossible climate is created, for example a negative pressure.

However, looking at the task that crashed after 20 hours, this is not one of those.
ID: 49653
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 49664 - Posted: 30 Jul 2014, 23:21:58 UTC

Twelve PNW models on two Windows machines of mine have just failed, so it looks like the 30 July batch might have problems ...
ID: 49664
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49665 - Posted: 31 Jul 2014, 0:57:41 UTC - in response to Message 49664.  
Last modified: 31 Jul 2014, 1:06:35 UTC

Iain Inglis wrote:
...so it looks like the 30 July batch might have problems ...

Not what I was hoping to see when I checked this thread... :-( I've got 11 of the new ones running (on 5 Win7-64 hosts) and so far, so good, after between 4 and 10 hours of run time. We'll see how long that lasts.

[edit]My hosts are making the second go around at about half of the tasks I've got. Of the hosts that tried and failed the first time, there's a mix of Linux and Windows OSs. Linux machines didn't even get off the ground (missing libraries?) and the Windows machines failed early on with various errors.[/edit]
ID: 49665
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 49674 - Posted: 31 Jul 2014, 14:20:29 UTC

... there may be a BOINC version angle on the PNW failures, such that earlier BOINC versions could succeed. No doubt someone who understands the details of that problem will explain in due course.
ID: 49674
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 49675 - Posted: 31 Jul 2014, 16:05:09 UTC - in response to Message 49674.  

... there may be a BOINC version angle on the PNW failures, such that earlier BOINC versions could succeed. No doubt someone who understands the details of that problem will explain in due course.

A BOINC version problem (well, the one I know a little about, anyway) would only apply to the Windows OS application.

As always when bug-hunting, information is the key. If you come across a failure that you can't explain for yourself (some are obvious, like "Model crashed: INITTIME: Atmosphere basis time mismatch"), please always give us:

Operating system
Version of BOINC in use
Whether BOINC is installed 'as a service' or in user mode
Any error message you can find

and a link to an example of the task(s) in question, in case there's any more to be gleaned there.
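
The checklist above could be captured as a small template so that nothing gets left out of a report. This is only a minimal Python sketch: the names `FailureReport`, `missing_fields` and `as_post` are hypothetical and not part of BOINC or CPDN, and the sample values are placeholders apart from the error message quoted above.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    """Hypothetical container for the items requested in the checklist."""
    operating_system: str
    boinc_version: str
    install_mode: str      # "as a service" or "user mode"
    error_message: str
    task_link: str

    def missing_fields(self):
        """Return the names of any items that were left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def as_post(self):
        """Return a plain-text block suitable for pasting into a forum post."""
        labels = {
            "operating_system": "Operating system",
            "boinc_version": "Version of BOINC in use",
            "install_mode": "Installed as a service or in user mode",
            "error_message": "Error message",
            "task_link": "Link to example task",
        }
        return "\n".join(
            f"{labels[name]}: {value}" for name, value in vars(self).items()
        )


# Placeholder values for illustration; two items are left blank on purpose.
report = FailureReport(
    operating_system="Windows 7 64-bit",
    boinc_version="",
    install_mode="user mode",
    error_message="Model crashed: INITTIME: Atmosphere basis time mismatch",
    task_link="",
)
print("Still needed:", report.missing_fields())
print(report.as_post())
```

Checking `missing_fields()` before posting is just a reminder to fill in every item; it does not validate that the values themselves are correct.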

ID: 49675
Waldmeister
Joined: 13 Jun 11
Posts: 34
Credit: 1,418,371
RAC: 444
Message 49761 - Posted: 18 Aug 2014, 15:01:13 UTC

Anybody having first experience with the newest PNW batch (I haven't started my three tasks yet)?
ID: 49761
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49762 - Posted: 18 Aug 2014, 15:31:19 UTC - in response to Message 49761.  

Waldmeister wrote:
Anybody having first experience with the newest PNW-Batch...?

I ran 3 for about an hour or more without any problems. (However, I aborted them because I don't want the work right now. I thought I had my hosts set to no new tasks.)

ID: 49762
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 49763 - Posted: 18 Aug 2014, 15:43:42 UTC

I just downloaded 6 new Pacific North West tasks successfully.

sample:

Workunit: 9042697
Name: hadam3p_pnw_f1e5_2010_1_008896914
Application: UK Met Office HadAM3P-HadRM3P Pacific North West
Created: 18 Aug 2014 12:11:11 UTC

I'll keep an eye on them and see how they go.

18/08/2014 3:12:38 AM | climateprediction.net | Reporting 2 completed tasks
18/08/2014 3:12:38 AM | climateprediction.net | Requesting new tasks for CPU
18/08/2014 3:12:43 AM | climateprediction.net | Scheduler request completed: got 6 new tasks
18/08/2014 3:12:45 AM | climateprediction.net | Started download of hadam3p_pnw_uitn_1996_1_008890995.zip
18/08/2014 3:12:45 AM | climateprediction.net | Started download of ic19610318_16_N96.gz
18/08/2014 3:12:47 AM | climateprediction.net | Finished download of hadam3p_pnw_uitn_1996_1_008890995.zip
ID: 49763
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 49767 - Posted: 18 Aug 2014, 20:06:36 UTC

Just an update: my computer downloaded an EU workunit.

18/08/2014 10:58:01 AM | climateprediction.net | Requesting new tasks for CPU
18/08/2014 10:58:05 AM | climateprediction.net | Scheduler request completed: got 1 new tasks
18/08/2014 10:58:07 AM | climateprediction.net | Started download of hadam3p_eu_jc5u_2013_1_008795238.zip
18/08/2014 10:58:07 AM | climateprediction.net | Started download of ic19610718_10_N96.gz
18/08/2014 10:58:09 AM | climateprediction.net | Finished download of hadam3p_eu_jc5u_2013_1_008795238.zip
18/08/2014 10:58:09 AM | climateprediction.net | Started download of atmos_n0ep.day.gz
18/08/2014 10:58:15 AM | climateprediction.net | Finished download of ic19610718_10_N96.gz
18/08/2014 10:58:15 AM | climateprediction.net | Started download of region_n0ep.day.gz
18/08/2014 10:59:13 AM | climateprediction.net | Finished download of atmos_n0ep.day.gz
18/08/2014 10:59:13 AM | climateprediction.net | Started download of ancil_OSTIA_deltaSST_2014_GISS-E2-H.gz
18/08/2014 10:59:17 AM | climateprediction.net | Finished download of region_n0ep.day.gz
18/08/2014 10:59:29 AM | climateprediction.net | Finished download of ancil_OSTIA_deltaSST_2014_GISS-E2-H.gz
18/08/2014 11:04:20 AM | climateprediction.net | Finished upload of hadam3p_eu_lcex_2013_1_008827314_0_8.zip
18/08/2014 12:53:23 PM | climateprediction.net | Started upload of hadam3p_eu_ld2s_2013_1_008828173_0_13.zip
18/08/2014 12:53:26 PM | climateprediction.net | Computation for task hadam3p_eu_ld2s_2013_1_008828173_0 finished
18/08/2014 12:53:26 PM | climateprediction.net | Output file hadam3p_eu_ld2s_2013_1_008828173_0_12.zip for task hadam3p_eu_ld2s_2013_1_008828173_0 absent
18/08/2014 12:53:26 PM | climateprediction.net | Starting task hadam3p_eu_n95e_2013_1_008806354_0
18/08/2014 12:57:01 PM | climateprediction.net | Finished upload of hadam3p_eu_ld2s_2013_1_008828173_0_13.zip
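
For anyone worried about bandwidth, log lines like the ones above can be scanned to see how long each download took. The following is a rough Python sketch, assuming the layout shown in this post (a `dd/mm/yyyy h:mm:ss AM/PM` timestamp, the project name, then the message, separated by `|`). Other BOINC installations may format the timestamp differently, and the function names are made up for this example.

```python
from datetime import datetime

# Assumed timestamp layout, taken from the log lines quoted in this post.
LOG_TIME_FORMAT = "%d/%m/%Y %I:%M:%S %p"
STARTED = "Started download of "
FINISHED = "Finished download of "


def parse_event_line(line):
    """Split one log line into (timestamp, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, LOG_TIME_FORMAT), project, message


def download_durations(lines):
    """Return {filename: seconds} for downloads that both started and finished."""
    started = {}
    durations = {}
    for line in lines:
        if line.count("|") < 2:
            continue  # skip anything that is not a log line
        when, _project, message = parse_event_line(line)
        if message.startswith(STARTED):
            started[message[len(STARTED):]] = when
        elif message.startswith(FINISHED):
            name = message[len(FINISHED):]
            if name in started:
                durations[name] = (when - started[name]).total_seconds()
    return durations


sample = [
    "18/08/2014 10:58:09 AM | climateprediction.net | Started download of atmos_n0ep.day.gz",
    "18/08/2014 10:59:13 AM | climateprediction.net | Finished download of atmos_n0ep.day.gz",
]
print(download_durations(sample))
```

The durations only show elapsed time, not file size, so they are a hint about which files dominate a download rather than a measure of bandwidth used.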

ID: 49767
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49768 - Posted: 18 Aug 2014, 20:56:31 UTC - in response to Message 49761.  

Waldmeister

I have 6 PNWs running, but that's on Linux.
They've just uploaded their first zips.


ID: 49768
Albert H.
Joined: 18 Feb 06
Posts: 73
Credit: 61,754,697
RAC: 46,486
Message 49775 - Posted: 19 Aug 2014, 19:24:50 UTC

Not only PNW tasks have compute errors!
Today I had 9 HadCM3 short 7.24 models running... but only for a "short" time, each for 20 to 34 minutes at most, then an error.
Is this normal or experimental?

Thanks
ID: 49775
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 49776 - Posted: 19 Aug 2014, 19:40:10 UTC - in response to Message 49775.  

Not only PNW tasks have compute errors!
Today I had 9 HadCM3 short 7.24 models running... but only for a "short" time, each for 20 to 34 minutes at most, then an error.
Is this normal or experimental?

The models of yours that have failed are reporting "invalid theta" errors. That is an unphysical simulation, which happens occasionally. To have so many at one time suggests either a very unfortunate set of model parameters or a model configuration error.

In any event, there's nothing you can do about that particular problem (unless you're massively overclocked so that the calculations are invalid).
ID: 49776