Message boards : Number crunching : Compute Errors on Pacific North West v7.22 Tasks
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
If you have a restriction on your Internet connection then you may be eating into your allowance to no great purpose and at a rather higher rate than would apply if the tasks were valid. Otherwise, there's no harm. To add to what Iain said, the moderators always filter out reports before asking the project team to 'minus' a computer. Computers with a bunch of INITTIME (or other workunit parameter) errors would never end up being 'minussed', but it would be flagged up if it looked like a previously unidentified problem with a batch of work. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
To put it more succinctly - When one of the "scientist" groups submits a batch of misconfigured impossible models - it may waste volunteers' bandwidth - but that's not a problem. The model software is complicated and difficult. You expect "scientists" to understand the details? Huh? Our cpu time and bandwidth is free - for the "amazing scientists" -- Oops - left out a few parameters -- "sorry" actually not sorry at all - it costs the "scientists" zero. They don't care about our bandwidth. They maybe catch a bit of grief if they waste cpu time on their local supercomputer. This kind of "bad batch" will keep on happening - And I do not expect any of the "scientists" to ever apologize for submitting a bad batch. But - I keep on crunching, 'cause most of the models seem worthwhile. These stupid waste-of-time-and-bandwidth blunderbatches will keep on happening. Because the "scientists" reasonably don't care. Not a real problem. Scientifically. Politically, maybe. PS - obviously the mods can "minus" a broken computer -- but nobody can "minus" a serious researcher who just happens to not understand the software, and keeps wasting everybody's time and bandwidth. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
it costs the "scientists" zero. They don't care about our bandwidth. Not entirely sure that is true Erik. The scientists pay (not out of their own pocket of course) Oxford University to put the models onto the system for them and to get the results that they need. Whether they have to pay more to the project if they submit a duff batch I don't know and the information may be, "commercially sensitive!" |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Yeah, right - actually -- "researchers" can use my CPU time. What we will never know - is -- is the "research center" mostly competent?? No way a contributor can tell. No way other researchers will tell us volunteers - have to wait for "peer reviewed". Or is their submitting a "blunderbatch" evidence that their "research center" is disorganized and incompetent? No way for me to tell. Wish there was some way to refuse tasks from researchers that don't care about getting their models to meet minimum standards of computability. Obviously - no way any researcher will give us a blacklist - no way no how. But - some batches have been totally misconfigured crap. AND I want to get back at the obviously incompetent clowns that wasted my CPU time. BUT - won't ever happen. The incompetent bozos that submit these broken batches are obviously careless, or worse. WHO they are... well, we shouldn't have hired this or that clown. I'm only moderately annoyed that some incompetent bozo wastes my CPU time. It happens. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,855,177 RAC: 4,773 |
... it costs the "scientists" zero. They don't care about our bandwidth. ... This kind of "bad batch" will keep on happening ... This won't come as a surprise to those people interested in environmental matters: abuse of the environmental "resource" will no doubt also keep on happening until it costs significantly more than zero. Not being able to see beyond the end of our noses appears to be an enduring characteristic of the human animal: we just don't seem to be able to be that little bit better - either here or more generally. :-( |
Send message Joined: 4 Oct 13 Posts: 27 Credit: 2,301,681 RAC: 7,632 |
Wish there was some way to refuse tasks from researchers that don't care about getting their models to meet minimum standards of computability. Erik, if you experience a set of models that aren't running correctly, report it, then just untick the offending application in your preferences and be done with it. When the problem is reported as fixed, go back in and re-tick it. If you've ever done any programming, you will know that even the best programmers in the world make mistakes. Given the complexity of the code for the CPDN models, I'm amazed that they run as well as they do. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Wish there was some way to refuse tasks from researchers that don't care about getting their models to meet minimum standards of computability. I also -- understand that submitting models to the old, tested, million-lines-of-code FORTRAN thing -- requires skills with ancient Fortran code that most noob underpaid underlings in academia just don't have. I think I have a clue -- based on word from a long-time friend who used to run short-term forecasts for the USAF - You can't have enough observations. You also can't have enough models. I believe we are doing the best with what we've got. I will keep on crunching. The occasional batches of "work" that are totally misconfigured with mismatching params -- hey, expected, happens. But I reserve the right to complain loudly when some misconfigured batch plugs up my limited download bandwidth. As a volunteer, I think I have a right to complain about that. |
Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0 |
Earlier I wrote: Progress, perhaps? Only a few minutes in on these new tasks that were created earlier today, but at least they didn't crash right away. Workunit 9031281 finished okay. This is one of four PNWs I got that were created about 5 days after the first batch. Two of them are still running, but Workunit 9031276 crashed after about 20 hours. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Remember that even with correctly configured models, some tasks will crash due to an impossible climate being created, e.g. a negative pressure. However, looking at the task which crashed after 20 hours, this is not one of those. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,855,177 RAC: 4,773 |
Twelve PNW models on two Windows machines of mine have just failed, so it looks like the 30 July batch might have problems ... |
Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0 |
Iain Inglis wrote: ...so it looks like the 30 July batch might have problems ... Not what I was hoping to see when I checked this thread... :-( I've got 11 of the new ones running (on 5 Win7-64 hosts) and so far, so good, after between 4 and 10 hours of run time. We'll see how long that lasts. [edit]My hosts are making the second go-around at about half of the tasks I've got. Of the hosts that tried and failed the first time, there's a mix of Linux and Windows OSs. Linux machines didn't even get off the ground (missing libraries?) and the Windows machines failed early on with various errors.[/edit] |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,855,177 RAC: 4,773 |
... there may be a BOINC version angle on the PNW failures, such that earlier BOINC versions could succeed. No doubt someone who understands the details of that problem will explain in due course. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,743,089 RAC: 6,177 |
... there may be a BOINC version angle on the PNW failures, such that earlier BOINC versions could succeed. No doubt someone who understands the details of that problem will explain in due course. A BOINC version problem (well, the one I know a little about, anyway) would only apply to the Windows OS application. As always when bug-hunting, information is the key. If you come across a failure that you can't explain for yourself (some are obvious, like "Model crashed: INITTIME: Atmosphere basis time mismatch"), please always give us: the operating system; the version of BOINC in use; whether BOINC is installed 'as a service' or in user mode; any error message you can find; and a link to an example of the task(s) in question, in case there's any more to be gleaned there. |
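The checklist in the post above can be captured as a small report template. This is only a sketch: the `FailureReport` class, its field names, and the `render` method are hypothetical helpers invented for illustration, not part of BOINC or the CPDN site, and the values in the usage lines are placeholders.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    """The five items the post asks for when reporting an unexplained failure."""
    operating_system: str
    boinc_version: str
    installed_as_service: bool
    error_message: str
    task_link: str

    def render(self) -> str:
        """Return the report as plain text, one item per line, ready to paste."""
        mode = "as a service" if self.installed_as_service else "user mode"
        return "\n".join([
            f"Operating system: {self.operating_system}",
            f"BOINC version: {self.boinc_version}",
            f"BOINC install mode: {mode}",
            f"Error message: {self.error_message}",
            f"Example task: {self.task_link}",
        ])


# Placeholder values only; substitute your own details.
report = FailureReport(
    operating_system="Windows 7 64-bit",
    boinc_version="<your BOINC version>",
    installed_as_service=False,
    error_message="<error text copied from the task page>",
    task_link="<link to the task>",
)
print(report.render())
```

Keeping all five fields mandatory means a report cannot be built with one of the requested items missing.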
Send message Joined: 13 Jun 11 Posts: 34 Credit: 1,429,139 RAC: 1,020 |
Anybody having first experience with the newest PNW-Batch (haven't started my three tasks yet)? |
Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0 |
Waldmeister wrote: Anybody having first experience with the newest PNW-Batch...? I ran 3 for about an hour or more without any problems. (However, I aborted them because I don't want the work right now. I thought I had my hosts set to no new tasks.) |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
I just downloaded 6 new Pacific North West tasks successfully. Sample:
Workunit 9042697
name hadam3p_pnw_f1e5_2010_1_008896914
application UK Met Office HadAM3P-HadRM3P Pacific North West
created 18 Aug 2014 12:11:11 UTC
I'll keep an eye on them and see how they go.
18/08/2014 3:12:38 AM | climateprediction.net | Reporting 2 completed tasks
18/08/2014 3:12:38 AM | climateprediction.net | Requesting new tasks for CPU
18/08/2014 3:12:43 AM | climateprediction.net | Scheduler request completed: got 6 new tasks
18/08/2014 3:12:45 AM | climateprediction.net | Started download of hadam3p_pnw_uitn_1996_1_008890995.zip
18/08/2014 3:12:45 AM | climateprediction.net | Started download of ic19610318_16_N96.gz
18/08/2014 3:12:47 AM | climateprediction.net | Finished download of hadam3p_pnw_uitn_1996_1_008890995.zip |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
Just an update: my computer downloaded an EU workunit.
18/08/2014 10:58:01 AM | climateprediction.net | Requesting new tasks for CPU
18/08/2014 10:58:05 AM | climateprediction.net | Scheduler request completed: got 1 new tasks
18/08/2014 10:58:07 AM | climateprediction.net | Started download of hadam3p_eu_jc5u_2013_1_008795238.zip
18/08/2014 10:58:07 AM | climateprediction.net | Started download of ic19610718_10_N96.gz
18/08/2014 10:58:09 AM | climateprediction.net | Finished download of hadam3p_eu_jc5u_2013_1_008795238.zip
18/08/2014 10:58:09 AM | climateprediction.net | Started download of atmos_n0ep.day.gz
18/08/2014 10:58:15 AM | climateprediction.net | Finished download of ic19610718_10_N96.gz
18/08/2014 10:58:15 AM | climateprediction.net | Started download of region_n0ep.day.gz
18/08/2014 10:59:13 AM | climateprediction.net | Finished download of atmos_n0ep.day.gz
18/08/2014 10:59:13 AM | climateprediction.net | Started download of ancil_OSTIA_deltaSST_2014_GISS-E2-H.gz
18/08/2014 10:59:17 AM | climateprediction.net | Finished download of region_n0ep.day.gz
18/08/2014 10:59:29 AM | climateprediction.net | Finished download of ancil_OSTIA_deltaSST_2014_GISS-E2-H.gz
18/08/2014 11:04:20 AM | climateprediction.net | Finished upload of hadam3p_eu_lcex_2013_1_008827314_0_8.zip
18/08/2014 12:53:23 PM | climateprediction.net | Started upload of hadam3p_eu_ld2s_2013_1_008828173_0_13.zip
18/08/2014 12:53:26 PM | climateprediction.net | Computation for task hadam3p_eu_ld2s_2013_1_008828173_0 finished
18/08/2014 12:53:26 PM | climateprediction.net | Output file hadam3p_eu_ld2s_2013_1_008828173_0_12.zip for task hadam3p_eu_ld2s_2013_1_008828173_0 absent
18/08/2014 12:53:26 PM | climateprediction.net | Starting task hadam3p_eu_n95e_2013_1_008806354_0
18/08/2014 12:57:01 PM | climateprediction.net | Finished upload of hadam3p_eu_ld2s_2013_1_008828173_0_13.zip |
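For anyone watching limited bandwidth, event log entries like the ones quoted above can be paired up to see how long each download took. The sketch below assumes every entry has the shape `timestamp | project | message`, with a day-first date and a 12-hour clock as in the quoted log; other BOINC installations may format timestamps differently. The names `parse_line` and `download_durations` are made up for this example.

```python
from datetime import datetime

STARTED = "Started download of "
FINISHED = "Finished download of "


def parse_line(line):
    """Split one log entry into (timestamp, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    # Assumed format: dd/mm/yyyy h:mm:ss AM/PM, as in the quoted log.
    when = datetime.strptime(stamp, "%d/%m/%Y %I:%M:%S %p")
    return when, project, message


def download_durations(lines):
    """Map each file name to the seconds between its start and finish entries."""
    started = {}
    durations = {}
    for line in lines:
        when, _project, message = parse_line(line)
        if message.startswith(STARTED):
            started[message[len(STARTED):]] = when
        elif message.startswith(FINISHED):
            name = message[len(FINISHED):]
            if name in started:
                durations[name] = (when - started.pop(name)).total_seconds()
    return durations


sample = [
    "18/08/2014 10:58:07 AM | climateprediction.net | Started download of ic19610718_10_N96.gz",
    "18/08/2014 10:58:09 AM | climateprediction.net | Started download of atmos_n0ep.day.gz",
    "18/08/2014 10:58:15 AM | climateprediction.net | Finished download of ic19610718_10_N96.gz",
    "18/08/2014 10:59:13 AM | climateprediction.net | Finished download of atmos_n0ep.day.gz",
]
for name, seconds in download_durations(sample).items():
    print(f"{name}: {seconds:.0f} s")
```

Finish entries with no matching start entry are ignored, so a log excerpt that begins mid-download does not raise an error.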
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Waldmeister, I have 6 PNWs running, but that's on Linux. They've just uploaded their first zips. |
Send message Joined: 18 Feb 06 Posts: 73 Credit: 62,054,875 RAC: 47,616 |
Not only PNW tasks have compute errors! Today I had 9 HadCM3 short 7.24 models running... but only for a "short" time, each for 20 to 34 minutes at most, then an error. Is this normal or experimental? Thanks |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,855,177 RAC: 4,773 |
Not only PNW tasks have compute errors! The models of yours that have failed are reporting "invalid theta" errors. That is an unphysical simulation, which happens occasionally. To have so many at one time suggests either a very unfortunate set of model parameters or a model configuration error. In any event, there's nothing you can do about that particular problem (unless you're massively overclocked so that the calculations are invalid). |
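The two error signatures named in this thread (INITTIME and "invalid theta") can be sorted before deciding whether a failure is worth reporting. This is a rough sketch: the substring matching and the category labels are assumptions that paraphrase the posts above, real task output may word these errors differently, and `classify_failure` is a hypothetical helper.

```python
# Signatures quoted in this thread, paired with how the posters describe them.
KNOWN_SIGNATURES = [
    ("INITTIME", "workunit parameter error"),
    ("invalid theta", "unphysical simulation"),
]


def classify_failure(error_text: str) -> str:
    """Return a rough category for a task's error text (case-insensitive match)."""
    lowered = error_text.lower()
    for signature, category in KNOWN_SIGNATURES:
        if signature.lower() in lowered:
            return category
    # Anything else is the kind of failure the thread asks volunteers to report.
    return "unexplained: report it"


print(classify_failure("Model crashed: INITTIME: Atmosphere basis time mismatch"))
print(classify_failure("some other message"))
```

A plain list of pairs keeps the table easy to extend as further signatures are identified.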
©2024 cpdn.org