100 hour bug?
Author | Message |
---|---|
Joined: 3 Oct 06 Posts: 43 Credit: 8,017,057 RAC: 0 |
These 3 models failed on my computer after running for a bit over 100 hours. The last trickle was for timestep 233280. Coincidence? http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12991505 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12990998 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12990504 They failed with the "invalid theta" message. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
|
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Hadcm3n_s3h8_1940_40_007299462_1 crashed due to “invalid theta” after timestep 233,000 (about 150 hours). These seem to be as unstable as the Famous models were. Does anyone know what percentage of them are failing? Given the length of the HadCM3n models, if they are too unstable the attrition rate could be so high that they are not worth running. |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
The ' s* ' series was/is unstable. New work is from the ' t* ' series and should be okay. (I hope so! I have several of them.) The fresh lot can be identified in the ID by ' _t***_ ' and we have high hopes for them . . . "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
The number of available WUs on the server status page has kept growing these last few hours, from a few hundred to 2500 as of now. I hope these new t*** series WUs run longer and make it to completion. In any case, the thing to do is keep on crunching. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
ALL climate models run to completion. It's just that "completion" isn't necessarily the full possible length. :) It's stated on the RAPIT/RAPID explanation page that some models are expected to fail, because of the extreme forcings and parameter values used. It all depends on how adventurous the researchers decide to get. :) |
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Very clear -- models that die with "NEGATIVE THETA" or other similar failures are NOT wasted -- the researchers can learn what possible combinations are consistent with plausible scenarios, and what are not possible. Please keep processing whatever models you get -- again -- keep on crunching ALL climate models run to completion. It's just that "completion" isn't necessarily the full possible length. :) |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Thanks for the info. I still have 1 of the “S” series WU’s running on my slower machine at about 19%. I was wondering, given the instability of this series, whether it was worth continuing it. Now that I know that even the “failures” yield useful data I will continue to crunch it as far as it goes. |
Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Thanks for the info. I still have 1 of the “S” series WU’s running on my slower machine at about 19%. I was wondering, given the instability of this series, whether it was worth continuing it. Now that I know that even the “failures” yield useful data I will continue to crunch it as far as it goes. All of the hadcm3n_sXXX_1940_40_ series workunits have been cancelled on the server so I'd abort it Jim. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
I aborted my "s" WU after 57 hours of running in high priority and got one "t". Let's hope it can get some result. My other 5 projects all share one core, including a Virtual Machine from CERN which does not run in high priority. Tullio |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
All of the hadcm3n_sXXX_1940_40_ series workunits have been cancelled on the server so I'd abort it Jim. Thanks for the advice, Thyme. I have aborted the “S”. |
Joined: 3 Oct 06 Posts: 43 Credit: 8,017,057 RAC: 0 |
This is explained 2 posts before yours, in Hadcm3n INVALID THETA ?. :) I just thought it was weird that all 3 failed at the exact same point. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Apart from unintentional failures such as those mentioned a few hours ago in the News thread, models are created in batches, with each one having a slight offset in starting values to the one that preceded it, and to the one that follows it. If someone gets a bunch of datasets that are of these adjacent values, then if one fails at a certain point, it's possible that others around it in parameter space will fail at a similar point. Luck of the draw. As they said during WWII, (or were ready to say): Stay calm and carry on. Or, as some people say these days: Stay calm and carry yarn. (That's a knitting joke. :) ) |
©2024 cpdn.org