Thread 'HadCM3n release'

Author	Message
JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 53576 - Posted: 5 Mar 2016, 16:46:54 UTC - in response to Message 53575. Last modified: 5 Mar 2016, 16:49:20 UTC If I remember correctly, there should be a trickle every time this type of model finishes a model year. Zips are sent every 10 model years. Have you looked at the graphics? Are the time steps advancing?

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 53577 - Posted: 5 Mar 2016, 16:47:09 UTC These are 40-year models. They trickle every model year, so every 2.5% of progress, and upload files every decade, so every 25%.
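The percentages in the post above follow directly from the model length. A minimal sketch of the arithmetic, where the constant and function names are illustrative and not taken from any CPDN or BOINC code:

```python
# Figures from the post: 40 model years, a trickle every model year,
# an upload (zip) every 10 model years. Names here are hypothetical.
MODEL_YEARS = 40
TRICKLE_EVERY_YEARS = 1
UPLOAD_EVERY_YEARS = 10


def progress_per_event(total_years, interval_years):
    """Percentage of the whole run completed between two consecutive events."""
    return 100.0 * interval_years / total_years


trickle_step = progress_per_event(MODEL_YEARS, TRICKLE_EVERY_YEARS)
upload_step = progress_per_event(MODEL_YEARS, UPLOAD_EVERY_YEARS)

# Progress values at which an upload would be expected.
upload_points = [
    upload_step * n for n in range(1, MODEL_YEARS // UPLOAD_EVERY_YEARS + 1)
]

print(f"trickle every {trickle_step}% of progress")
print(f"upload every {upload_step}% of progress, at {upload_points}")
```

So a task whose progress bar has passed 2.5% without a trickle appearing is worth a closer look at the graphics or the event log.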

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 53578 - Posted: 5 Mar 2016, 18:07:06 UTC The last one of these on my machines has now completed, so it's back to running Windows tasks via Wine.

KWSN - Sir Frank of the Wood Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0	Message 53581 - Posted: 5 Mar 2016, 21:18:27 UTC ok - just checked graphics...no progress at all..."step 1 of 1,039,392"...for all three WUs... odd thing is when they each started, things looked normal for a day or two... a puzzle indeed...

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53585 - Posted: 6 Mar 2016, 1:22:21 UTC - in response to Message 53581. OK, one last thing to try: Shut down BOINC. Restart BOINC. Are they running now? If not, then you may as well abort them. ****************** For reference, this is one of mine. Different batch, from last December: here on a Haswell. This is one still running on an AMD: here, same batch as yours.

KWSN - Sir Frank of the Wood Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0	Message 53586 - Posted: 6 Mar 2016, 2:23:41 UTC Last modified: 6 Mar 2016, 2:34:31 UTC suspended tasks, shut down BOINC, restarted BOINC, resumed tasks...same as before: no visible CPU activity...but clock running and progress climbing... aborted... i noticed that wingmen all errored out on these WUs... thanks to everyone for offering info and suggestions...a real puzzle... frank

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53589 - Posted: 6 Mar 2016, 7:37:36 UTC - in response to Message 53586. Random thought: These models are apparently looking at the very edge of Known Parameter Space. (Cue dramatic drum beat). So perhaps all of the ones that you had that wouldn't even start, "failed" before starting. The program may have been starting/stopping, starting/stopping, starting/stopping, etc.

bernard_ivo Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846	Message 53595 - Posted: 6 Mar 2016, 20:26:47 UTC I also had one. It errored out with this message: <message> Bad command. (0x16) - exit code 22 (0x16) </message> <stderr_txt> Model crashed: ATM_DYN : INVALID THETA DETECTED. (x5) I ran it under WINE, but my wingman seems to crunch it just fine. So I'm not sure whether the error was on my side or not.

MartinNZ Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 53733 - Posted: 22 Mar 2016, 8:20:25 UTC I see the notice about aborting batches 350-3 in the news. I have one that seems to have 352 in the name, so I assume it is to be culled. As it has already run 320hrs, will someone please confirm this for me before I quit the task? See my tasks here

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 53734 - Posted: 22 Mar 2016, 8:37:45 UTC - in response to Message 53733. Unless you want to edit client_state.xml to allow the larger upload size. The tasks that were still to go out were culled, but those that missed the first cull, I think, get reissued automatically until they have had their three goes.

MartinNZ Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 53742 - Posted: 22 Mar 2016, 20:09:22 UTC - in response to Message 53734. Last modified: 22 Mar 2016, 20:30:38 UTC Hi Dave, quite happy to alter client_state.xml, but not sure I understand your answer. I've nothing waiting to be transferred, trickles seem to be going up OK (you may note a week with nothing when I was on holiday), and there doesn't seem to be much in the project folders. So, does this mean I leave it running or abort? The news article from Sarah posted by Les on 15 March implies that all those batch numbers get aborted. BTW, that should have been 220hrs in my original post, not 320hrs. Currently 166hrs to go.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53744 - Posted: 22 Mar 2016, 20:35:32 UTC - in response to Message 53742. Hi Martin. If you continue, then you'll get an error message for one of the zips saying something like: "File too big - truncated". This is because the wrong stash file was used. But the processing will continue until the end. So: abort. Sorry about this problem. There are so many different experiments going on now that it has become hard to remember which accessory goes with which outfit. Sort of. :)

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53745 - Posted: 22 Mar 2016, 20:42:28 UTC - in response to Message 53595. Hi Bernard. "ATM_DYN : INVALID THETA DETECTED." That's normal. It's an abbreviated message about the atmospheric physics going outside acceptable values. That is what is expected for these models, which are looking at just how sensitive they are when pushed too far. At least they don't bite. :)

bernard_ivo Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846	Message 53746 - Posted: 22 Mar 2016, 20:49:56 UTC - in response to Message 53745. Hi Les, the strange thing is that the same WU managed to get at least 3 trickles (haven't followed up) after crashing on my machine.

Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 53747 - Posted: 22 Mar 2016, 21:22:53 UTC - in response to Message 53746. Quote: "Hi Les, the strange thing is that the same WU managed to get at least 3 trickles (haven't followed up) after crashing on my machine." I have seen the same thing, except that it completed on my machine even after erroring out on three others with "INVALID THETA DETECTED". So it is presumably some sort of extreme case where slight differences in hardware cause some machines to error out but not others. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=8003#51187

bernard_ivo Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846	Message 53748 - Posted: 22 Mar 2016, 21:26:53 UTC - in response to Message 53747. Quote: "I have seen the same thing, except that it completed on my machine even after erroring out on three others with "INVALID THETA DETECTED". So it is presumably some sort of extreme case where slight differences in hardware cause some machines to error out but not others. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=8003#51187" Thanks, kind of missed that one.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53749 - Posted: 22 Mar 2016, 21:48:32 UTC I think that it's mostly the FPU maths, and how, e.g., Intel and AMD use different algorithms. And even slight overclocking might change the results of these, because they are so close to the edge of stability. And perhaps how many tasks have been crammed into the machine. Different processor types also have different cache sizes, different amounts of memory, and differences in how often data gets shoved out onto the HD. Everyone is probably on their own here; what happens to others on the same workunit won't count.
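The point about tiny numerical differences deciding the outcome can be illustrated with a toy chaotic system. This is only a sketch: it uses the logistic map as a stand-in, not anything from the HadCM3 code, and the 1e-15 offset is an assumed stand-in for a last-bit difference in floating-point rounding between two machines.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate the logistic map x -> r*x*(1-x), a standard toy chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs whose starting values differ by roughly one part in 10^14,
# about the size of a rounding difference in double precision.
a = logistic_orbit(0.1, 200)
b = logistic_orbit(0.1 + 1e-15, 200)

gaps = [abs(p - q) for p, q in zip(a, b)]

# Index of the first step at which the two runs visibly disagree.
first_big = next((i for i, g in enumerate(gaps) if g > 0.01), None)

print(f"initial gap: {gaps[0]:.3e}")
print(f"first step with gap > 0.01: {first_big}")
print(f"largest gap over the run: {max(gaps):.3f}")
```

The two runs track each other for a few dozen steps and then part company completely. A model deliberately run near the edge of stability could plausibly behave the same way, with one machine crossing an "invalid value" threshold that another just misses.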

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53750 - Posted: 22 Mar 2016, 22:58:01 UTC - in response to Message 53746. Bernard. Quote: "the strange thing is that the same WU managed to get at least 3 trickles (haven't followed up) after crashing on my machine." I think that's to do with the way that the BOINC client works. If you study the Event log when uploads happen, it's a bit like this:
1) Send all trickles.
2) Start sending the zips, in the order in which they were created. Once this starts, then
3) Report any failed tasks. At which point, any zips from the failed task(s) will suddenly disappear from the Upload queue.
It's possible for trickle_up files and zips to still be on a computer for a while for some reason. One reason is that the Network setting in the BOINC manager is Off. Another is that there may be problems contacting the server. And you have been experimenting. :)
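The sequence described in the post above can be sketched as a small simulation. This models only the order as observed in the Event log, not the BOINC client's actual implementation; the `Zip` class, the `upload_sequence` function, and the task and file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Zip:
    task: str      # task the zip belongs to (hypothetical label)
    created: int   # creation order, lower means older


def upload_sequence(trickles, zips, failed_tasks):
    """Return the order of events as described in the post (a sketch only)."""
    # 1) All trickles go first.
    events = [f"trickle {t}" for t in trickles]

    # 2) Zips start going up in the order in which they were created.
    queue = sorted(zips, key=lambda z: z.created)
    if queue:
        first = queue.pop(0)
        events.append(f"zip {first.task}#{first.created}")

    # 3) Once zip uploads have started, failed tasks are reported...
    for task in sorted(failed_tasks):
        events.append(f"report_failed {task}")

    # ...and their remaining zips vanish from the upload queue.
    queue = [z for z in queue if z.task not in failed_tasks]
    events += [f"zip {z.task}#{z.created}" for z in queue]
    return events


events = upload_sequence(
    trickles=["A-year1", "B-year1"],
    zips=[Zip("B", 2), Zip("A", 1), Zip("B", 3)],
    failed_tasks={"B"},
)
for e in events:
    print(e)
```

In this example the trickle for the failed task B still goes up before the failure is reported, which matches the observation that trickles can appear for a task that has already crashed.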

bernard_ivo Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846	Message 53751 - Posted: 22 Mar 2016, 23:16:01 UTC - in response to Message 53750. Thanks, though I meant that it crashed on my machine after a few hundred seconds (no trickles produced at all), and then on another machine it was producing trickles steadily, went beyond thousands of seconds, and may even have finished successfully. So I guess it is more like the "case where slight differences in hardware cause some machines to error out but not others." So I will wait a bit more before jumping on the HadCM3n wagon.

Alex Plantema Joined: 3 Sep 04 Posts: 126 Credit: 26,610,380 RAC: 3,377	Message 53755 - Posted: 23 Mar 2016, 9:17:12 UTC - in response to Message 53749. Quote: "I think that it's mostly the FPU maths, and how Intel and AMD e.g. use different algorithms. And even slight overclocking might change the results of these, because they are so close to the edge of stability." CPDN apparently uses very unstable algorithms if they depend upon the exact rounding.