Thread 'Is "Invalid Theta Detected" always due to bad work units?'

Author	Message
Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 51187 - Posted: 13 Jan 2015, 3:50:01 UTC Last modified: 13 Jan 2015, 3:50:54 UTC I am a bit out of my depth here, but I understand that an "INVALID THETA DETECTED" error usually means a model ran with the wrong parameters. In that case, the scientists know that those parameters are not realistic, and so they try again with something else. However, a while ago I completed a hadcm3n long work unit where all three others who got it failed with the "INVALID THETA DETECTED" error. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9277901 So it seems that the parameters may not have been wrong in that case, and so those parameters might be marked as unrealistic when in fact they are not. Someone may need to rethink the relevant assumptions, so I pass it along in the hope that it will get to the right person.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 51189 - Posted: 13 Jan 2015, 6:08:06 UTC - in response to Message 51187. The full message is "ATM_DYN : INVALID THETA DETECTED", where ATM_DYN stands for Atmospheric Dynamics; it means that the physics has gone outside the set limits. This is one of the two things that the researchers are looking for, so that they know how long the initial conditions remain stable. Several "sections" of short models may have to be run before a model gets to that point. (The other thing they look for is a model that runs OK to completion. This is where they say: "Oh well, let's run the next section and see if we can crash it".)

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 51190 - Posted: 13 Jan 2015, 6:55:48 UTC Last modified: 13 Jan 2015, 6:58:58 UTC About the case where one machine fails with ATM_DYN : INVALID THETA DETECTED and another completes, what I understand is this. When the researchers are "pushing the envelope" and testing the Hadley model to its limits, even the tiniest differences between volunteer host machines (like a cosmic ray that flips a bit, or the bigger ones, like slightly different math libraries on different hardware or software versions) might, after the thousands of steps in any model, add up and cause a difference in the final result. The researchers have to know the limits of reproducibility, that is, how closely different runs of the model agree, or whether the modelling goes "out of bounds" as in the INV THETA case. ANY tiny difference in the initial conditions could push a model "out of bounds" when combined with those software, hardware, etc. differences. The researchers have to test the limits of their tools to know how much change in the model parameters will lead to unverifiable results. It is like any experiment in undergrad chemistry: you gotta test your measuring system as well as the thing you are trying to measure. (And then there are all the clerical errors, BOINC software dependencies, and similar things that confuse matters even more.) Any comments?
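The way a tiny difference can grow over many steps can be illustrated with a toy example. The sketch below is only an assumption-laden stand-in: it uses the logistic map, a textbook chaotic system, and has nothing to do with the actual Hadley model code. The function name first_divergence_step and all of its parameters are made up for this illustration. It runs two copies of the same iteration whose starting values differ by about one part in 10^15 and reports the step at which they first disagree noticeably.

```python
def first_divergence_step(x0, eps, threshold=0.1, max_steps=1000, r=4.0):
    """Iterate the logistic map x -> r*x*(1-x) from x0 and from x0+eps.

    Returns the first step at which the two runs differ by more than
    `threshold`, or None if that never happens within `max_steps`.
    All names here are hypothetical; this is a toy, not climate code.
    """
    a, b = x0, x0 + eps
    for step in range(1, max_steps + 1):
        a = r * a * (1.0 - a)
        b = r * b * (1.0 - b)
        if abs(a - b) > threshold:
            return step
    return None


if __name__ == "__main__":
    # A perturbation near the limit of double precision.
    print("diverged at step:", first_divergence_step(0.3, 1e-15))
    # With no perturbation the two runs are bit-identical and never diverge.
    print("identical start:", first_divergence_step(0.3, 0.0))
```

In this toy the gap can grow by at most a factor of four per step, so it takes a few dozen steps to become visible, but it does become visible. With identical starting values the two runs never separate, which is the behaviour one would hope for when the same work unit is run on identical hardware and software.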

Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 51195 - Posted: 13 Jan 2015, 16:13:48 UTC - in response to Message 51190. Quoting Message 51190: "Even the tiniest differences between volunteer host machines -- like a cosmic ray that flips a bit, or the bigger ones, like slightly different math libraries on different hardware or software versions -- after the thousands of steps in any model, might add up and cause a difference in the final result. The researchers have to know the limits of reproducibility -- of how close different runs of the model agree. Or if the modelling goes "out of bounds" like the INV THETA case." Very interesting. I had associated variations in the results more with GPUs than CPUs, but I guess for this project anything can change the results. I will add a Haswell machine to try to get more hadcm3n long work units, and see if I can get it stable enough to see the same sort of tiny differences.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4558 Credit: 19,039,635 RAC: 18,944	Message 51197 - Posted: 13 Jan 2015, 17:10:03 UTC - in response to Message 51195. Quoting Message 51195: "Very interesting. I had associated variations in the results more with GPUs than CPUs, but I guess for this project anything can change the results." Also differences between operating systems. Dave

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1084 Credit: 7,944,701 RAC: 2,164	Message 51198 - Posted: 13 Jan 2015, 17:34:21 UTC I think the influence of variations is overstated. Models whose parameters differ by a small amount may produce very different results because of the chaotic development of the simulated climate. Models run on machines with different processor types (e.g. Intel vs AMD) will differ too, as will models run on different operating systems (particularly Linux). My impression some time ago from running multiple slab models on multiple computers was that results only differed in understandable ways. And my own professional experience of endlessly regression-testing Monte Carlo simulations with billions of trials is that, happily, the tests succeed, i.e. the results don't change. However, if events local to the machine (such as flipped bits) affected the simulation outcomes, then there would be widespread crashes not only in CPDN but in the operating system itself. (It would make a nice study for someone, though: "Distributed Computing result variability with latitude" and suchlike.) Personally, the very high error rate on HADCM3S, with errors that, as Les says, are conventionally physics errors, raises questions for me about whether something else is wrong with that model. Is there any reason to suppose that the parameter-space sampling in this group of models is more aggressive than usual? The project description doesn't say that, but it could be the case.
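The kind of regression test mentioned above can be sketched in a few lines. This is a minimal illustration, assuming a seeded pseudo-random generator and a toy Monte Carlo estimate of pi; the function name estimate_pi and its parameters are invented for the example and are not taken from CPDN or from any real test suite. It only demonstrates repeatability on one machine with one Python build, not agreement across processors or operating systems.

```python
import random


def estimate_pi(seed, trials):
    """Toy Monte Carlo estimate of pi using a private, seeded generator.

    Hypothetical helper for illustration only. The same seed and the same
    number of trials replay exactly the same sequence of random points.
    """
    rng = random.Random(seed)  # private generator, unaffected by global state
    inside = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / trials


if __name__ == "__main__":
    first = estimate_pi(seed=1, trials=10_000)
    second = estimate_pi(seed=1, trials=10_000)
    # The regression check: a repeated run must reproduce the result exactly.
    print("runs identical:", first == second)
    print("estimate:", first)
```

The comparison uses exact equality on purpose. If a repeated run with the same seed ever gave a different number, that would point to a change in the code or the environment rather than to ordinary statistical noise.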

Conan Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420	Message 51358 - Posted: 4 Feb 2015, 4:15:19 UTC Last modified: 4 Feb 2015, 4:16:19 UTC Well, they are still happening: I just had about 6 fail over the last day with this error. They run for about 21 minutes and then fail. All are new work units, not ones from last October. This is on a Windows XP 32-bit machine. Conan