Message boards : Number crunching : HadCM3n release
Author | Message |
---|---|
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
If I remember correctly, there should be a trickle every time that this type of model finishes a model year. Zips are sent every 10 model years. Have you looked at the graphics? Are the time steps advancing? |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
These are 40-year models. They trickle every model year, so every 2.5% of progress, and upload files every decade, so every 25%. |
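As a quick check of those percentages, here is a minimal sketch. It assumes a 40-model-year run with one trickle per model year and one zip upload per 10 model years, as described in the post above. The function and variable names are illustrative only and are not part of any CPDN or BOINC tool.

```python
def progress_marks(total_years, every_years):
    """Return the progress percentages at which an event fires every `every_years` model years."""
    return [100.0 * year / total_years
            for year in range(every_years, total_years + 1, every_years)]

TOTAL_YEARS = 40  # model length stated in the thread

trickles = progress_marks(TOTAL_YEARS, 1)   # one trickle per model year
uploads = progress_marks(TOTAL_YEARS, 10)   # one zip upload per decade

print(trickles[:4])  # [2.5, 5.0, 7.5, 10.0]
print(uploads)       # [25.0, 50.0, 75.0, 100.0]
```

So a task whose progress has passed 2.5% without sending a trickle has, on these assumptions, not completed its first model year.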
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
The last one of these on my machines has now completed, so it's back to running Windows tasks via Wine. |
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
ok - just checked graphics...no progress at all..."step 1 of 1,039,392"...for all three WUs... odd thing is when they each started, things looked normal for a day or two... a puzzle indeed... |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
|
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
suspended tasks, shut down BOINC, restarted BOINC, resumed tasks...same as before: no visible CPU activity...but clock running and progress climbing... aborted... i noticed that wingmen all errored out on these WUs... thanks to everyone for offering info and suggestions...a real puzzle... frank |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Random thought: These models are apparently looking at the very edge of Known Parameter Space. (Cue dramatic drum beat). So perhaps all of the ones that you had that wouldn't even start, "failed" before starting. The program may have been Starting/stopping. Starting/stopping. Starting/stopping. etc. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846 |
|
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
I see the notice about aborting batches 350-3 in the news. I have one that seems to have 352 in the name, so I assume it is to be culled. As it has already run 320hrs, will someone please confirm this for me before I quit the task? See my tasks here |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Unless you want to edit client_state.xml to allow the larger upload size. The tasks that were still to go out were culled, but I think those that missed the first cull get reissued automatically until they have had their three goes. |
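For anyone taking the editing route, a rough sketch of the idea follows. It assumes that the upload limit is stored as a `<max_nbytes>` value inside each file entry of client_state.xml. It works on a made-up in-memory fragment rather than a real file: the file names, sizes and the `<file>` tag are illustrative assumptions, so check them against your own client_state.xml. BOINC should be shut down completely before the real file is touched, and a backup copy kept.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; a real client_state.xml is much larger and may use different tag names.
SAMPLE = """
<client_state>
  <file>
    <name>example_result_1.zip</name>
    <max_nbytes>100000000.000000</max_nbytes>
  </file>
  <file>
    <name>example_result_2.zip</name>
    <max_nbytes>100000000.000000</max_nbytes>
  </file>
</client_state>
"""

def raise_upload_limit(root, file_name, new_limit):
    """Set <max_nbytes> for the entry whose <name> matches file_name; return True if found."""
    for entry in root.iter("file"):
        if entry.findtext("name") == file_name:
            limit = entry.find("max_nbytes")
            if limit is not None:
                limit.text = f"{new_limit:.6f}"
                return True
    return False

root = ET.fromstring(SAMPLE)
changed = raise_upload_limit(root, "example_result_2.zip", 200_000_000)
print(changed)
print(ET.tostring(root, encoding="unicode"))
```

Only the matching entry is changed; every other entry keeps its original limit.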
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Hi Dave, quite happy to alter client_state.xml, but not sure I understand your answer. I've nothing waiting to be transferred, trickles seem to be going up OK (you may note a week with nothing when I was on holiday), and there doesn't seem to be much in the project folders. So, does this mean I leave it running or abort? The news article from Sarah posted by Les on 15 March implies that all those batch numbers get aborted. BTW, that should have been 220hrs in my original post, not 320hrs. Currently 166hrs to go. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Martin. If you continue, then you'll get an error message for one of the zips saying something like: "File too big - truncated". This is because the wrong stash file was used. The processing will still continue until the end, but that zip won't upload. So: abort. Sorry about this problem. There are so many different experiments going on now that it has become hard to remember which accessory goes with which outfit. Sort of. :) |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Bernard. "ATM_DYN : INVALID THETA DETECTED": that's normal. It's an abbreviated message about the atmospheric physics going outside acceptable values, which is what is expected for these models, since they are looking at just how sensitive they are when pushed too far. At least they don't bite. :) |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846 |
Hi Les, the strange thing is that the same WU managed to get at least 3 trickles (haven't followed up) after crashing on my machine. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Hi Les, I have seen the same thing, except that it completed on my machine even after erroring out on three others with "INVALID THETA DETECTED". So it is presumably some sort of extreme case where slight differences in hardware cause some machines to error out but not others. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=8003#51187 |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846 |
Thanks, kind of missed that one. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I think that it's mostly the FPU maths, and how Intel and AMD, for example, use different algorithms. Even slight overclocking might change the results, because these models are so close to the edge of stability. It may also depend on how many tasks have been crammed into the machine. Different processor types also have different cache sizes and different amounts of memory, which affects how often data gets shoved out onto the HD. Everyone is probably on their own here; what happens to others on the same workunit won't count. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Bernard: "the strange thing is that the same WU managed to get at least 3 trickles (haven't followed up) after crashing on my machine." I think that's to do with the way that the BOINC client works. If you study the Event log when uploads happen, it's a bit like this: 1) Send all trickles. 2) Start sending the zips, in the order in which they were created. Once this starts, then 3) Report any failed tasks. At which point, any zips from the failed task(s) will suddenly disappear from the Upload queue. It's possible for trickle_up files and zips to still be on a computer for a while for some reason. One reason is that the Network setting in the BOINC manager is Off. Another is that there may be problems contacting the server. And you have been experimenting. :) |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846 |
Thanks, though I meant that after it crashed on my machine at a few hundred seconds (no trickles produced at all), on another machine it was producing trickles steadily, went beyond thousands of seconds and may even have finished successfully. So I guess it is more like the "case where slight differences in hardware cause some machines to error out but not others." For now I will wait a bit more before jumping on the HadCM3n wagon. |
Send message Joined: 3 Sep 04 Posts: 126 Credit: 26,610,380 RAC: 3,377 |
"I think that it's mostly the FPU maths, and how Intel and AMD e.g. use different algorithms. And even slight overclocking might change the results of these, because they are so close to the edge of stability." CPDN apparently uses very unstable algorithms if they depend upon the exact rounding. |
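Sensitivity to rounding of this kind is a general property of chaotic systems and does not by itself point to a coding fault. A minimal sketch, not CPDN code: it uses the logistic map as a stand-in for a chaotic model and perturbs the starting value by about one rounding error, to mimic two machines that round an intermediate result differently. The map, the parameter r = 4.0 and the step count are assumptions chosen purely for illustration.

```python
import sys

def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r*x*(1-x) and return all visited states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Floating-point addition is not associative, so evaluation order alone can change a result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False on IEEE 754 doubles

# Two runs whose starting states differ by roughly one rounding error.
x0 = 0.3
run_a = logistic_trajectory(x0, 100)
run_b = logistic_trajectory(x0 + sys.float_info.epsilon, 100)

# Gap between the two runs at every step: tiny at first, then comparable to the state itself.
gaps = [abs(p - q) for p, q in zip(run_a, run_b)]
print(gaps[0], max(gaps))
```

The initial gap is far below anything visible in the output, yet within 100 iterations the two runs no longer resemble each other. A model pushed to the edge of its stable parameter range can, in the same way, cross an error threshold on one machine and not on another.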
©2024 cpdn.org