climateprediction.net (CPDN) home page
Thread 'HadCM3n release'

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 53576 - Posted: 5 Mar 2016, 16:46:54 UTC - in response to Message 53575.  
Last modified: 5 Mar 2016, 16:49:20 UTC

If I remember correctly, there should be a trickle every time that this type of model finishes a model year. Zips are sent every 10 model years.

Have you looked at the graphics? Are the time steps advancing?
ID: 53576
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 53577 - Posted: 5 Mar 2016, 16:47:09 UTC

These are 40-year models. They trickle every model year, so every 2.5% of progress, and upload files every decade, so every 25%.
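
The arithmetic behind those percentages, as a quick sketch. The constant and variable names are made up for illustration and are not taken from any CPDN code; only the 40-year length and the yearly/decadal intervals come from the posts above.

```python
# Progress fractions for a 40-model-year task, using the figures quoted above.
MODEL_YEARS = 40          # total length of the run, in model years
YEARS_PER_TRICKLE = 1     # a trickle is sent after each model year
YEARS_PER_UPLOAD = 10     # a zip is uploaded after each model decade

trickle_step_pct = 100 * YEARS_PER_TRICKLE / MODEL_YEARS
upload_step_pct = 100 * YEARS_PER_UPLOAD / MODEL_YEARS
trickle_count = MODEL_YEARS // YEARS_PER_TRICKLE
upload_count = MODEL_YEARS // YEARS_PER_UPLOAD

print(f"trickle every {trickle_step_pct}% ({trickle_count} in total)")
print(f"upload every {upload_step_pct}% ({upload_count} in total)")
```

So a task whose progress has moved by a few percent without any trickle appearing is worth a closer look.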
ID: 53577
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 53578 - Posted: 5 Mar 2016, 18:07:06 UTC

Last one of these on my machines now completed, so now back to running Windows tasks via Wine.
ID: 53578
KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 53581 - Posted: 5 Mar 2016, 21:18:27 UTC

ok - just checked graphics...no progress at all..."step 1 of 1,039,392"...for all three WUs...

odd thing is when they each started, things looked normal for a day or two...

a puzzle indeed...

ID: 53581
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53585 - Posted: 6 Mar 2016, 1:22:21 UTC - in response to Message 53581.  

OK, one last thing to try:
Shut down BOINC.
Restart BOINC.

Are they running now?
If not, then you may as well abort them.

******************

For reference, this is one of mine. Different batch, from last December:
here on a Haswell.

This is one still running on an AMD:
here, same batch as yours.

ID: 53585
KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 53586 - Posted: 6 Mar 2016, 2:23:41 UTC
Last modified: 6 Mar 2016, 2:34:31 UTC

suspended tasks, shut down BOINC, restarted BOINC, resumed tasks...same as before: no visible CPU activity...but clock running and progress climbing...

aborted...

i noticed that wingmen all errored out on these WUs...

thanks to everyone for offering info and suggestions...a real puzzle...

frank
ID: 53586
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53589 - Posted: 6 Mar 2016, 7:37:36 UTC - in response to Message 53586.  

Random thought:
These models are apparently looking at the very edge of Known Parameter Space. (Cue dramatic drum beat).
So perhaps all of the ones that you had that wouldn't even start "failed" before starting. The program may have been starting/stopping, starting/stopping, starting/stopping, etc.

ID: 53589
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 53595 - Posted: 6 Mar 2016, 20:26:47 UTC

I also had one. It errored out with this message:

<message>
Bad command.
(0x16) - exit code 22 (0x16)
</message>
<stderr_txt>

Model crashed: ATM_DYN : INVALID THETA DETECTED.
(x5)

I ran it under WINE, but my wingman seems to crunch it just fine. So I'm not sure whether the error was on my side or not.
ID: 53595
MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 53733 - Posted: 22 Mar 2016, 8:20:25 UTC

I see the notice about aborting batches 350-3 in the news. I have one that seems to have 352 in the name, so assume it is to be culled. As it has already run 320hrs, will someone please confirm this for me before I quit the task.
See my tasks here
ID: 53733
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 53734 - Posted: 22 Mar 2016, 8:37:45 UTC - in response to Message 53733.  

Unless you want to edit client_state.xml to allow the larger upload size. The tasks that were still to go out were culled, but I think those that missed the first cull get reissued automatically until they have had their three goes.
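
A minimal sketch of what such an edit amounts to, under stated assumptions. It assumes the per-file upload limit is the `<max_nbytes>` element of each file entry in the client state file, and it works on a made-up fragment rather than a real file. The file names, sizes and the `raise_upload_limit` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up fragment in the style of a client state file entry.
# Names and numbers are hypothetical; <max_nbytes> is assumed here
# to be the per-file upload size limit.
SAMPLE = """
<client_state>
  <file>
    <name>example_task_1.zip</name>
    <max_nbytes>150000000.000000</max_nbytes>
  </file>
  <file>
    <name>example_task_2.zip</name>
    <max_nbytes>150000000.000000</max_nbytes>
  </file>
</client_state>
"""

def raise_upload_limit(xml_text: str, new_limit: float) -> str:
    """Return xml_text with every <max_nbytes> below new_limit raised to it."""
    root = ET.fromstring(xml_text)
    for node in root.iter("max_nbytes"):
        if float(node.text) < new_limit:
            node.text = f"{new_limit:.6f}"
    return ET.tostring(root, encoding="unicode")

patched = raise_upload_limit(SAMPLE, 300_000_000)
print(patched)
```

On a real machine, an edit like this would only be sensible with BOINC fully shut down and a backup copy of the file kept, since the client rewrites the file while running.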
ID: 53734
MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 53742 - Posted: 22 Mar 2016, 20:09:22 UTC - in response to Message 53734.  
Last modified: 22 Mar 2016, 20:30:38 UTC

Hi Dave, quite happy to alter client_state.xml, but not sure I understand your answer.

I've nothing waiting to be transferred, trickles seem to be going up OK (you may note a week with nothing when I was on holiday), there doesn't seem to be much in the project folders.

So, does this mean I leave it running or abort? The news article from Sarah posted by Les on 15 March implies that all those batch numbers get aborted.

BTW, that should have been 220hrs in my original post, not 320hrs. Currently 166hrs to go.
ID: 53742
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53744 - Posted: 22 Mar 2016, 20:35:32 UTC - in response to Message 53742.  

Hi Martin

If you continue, then you'll get an error message for one of the zips saying something like: File too big - truncated. This is because the wrong stash file was used.
But the processing will continue until the end.

So Abort

Sorry about this problem.
There are so many different experiments going on now, that it has become hard to remember which accessory goes with which outfit. Sort of. :)

ID: 53744
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53745 - Posted: 22 Mar 2016, 20:42:28 UTC - in response to Message 53595.  

Hi Bernard

ATM_DYN : INVALID THETA DETECTED.


That's normal. It's an abbreviated message about the atmospheric physics going out of acceptable values. Which is what is expected for those models, which are looking at just how sensitive they are when pushed too far.

At least they don't bite. :)

ID: 53745
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 53746 - Posted: 22 Mar 2016, 20:49:56 UTC - in response to Message 53745.  

Hi Les,
the strange thing is that the same WU managed to get at least 3 trickles (haven't followed up) after crashing on my machine.
ID: 53746
Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53747 - Posted: 22 Mar 2016, 21:22:53 UTC - in response to Message 53746.  

bernard_ivo wrote: "Hi Les, the strange thing is that the same WU managed to get at least 3 trickles (haven't followed up) after crashing on my machine."

I have seen the same thing, except that it completed on my machine even after erroring out on three others with "INVALID THETA DETECTED". So it is presumably some sort of extreme case where slight differences in hardware cause some machines to error out but not others.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=8003#51187
ID: 53747
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 53748 - Posted: 22 Mar 2016, 21:26:53 UTC - in response to Message 53747.  

Jim1348 wrote: "I have seen the same thing, except that it completed on my machine even after erroring out on three others with "INVALID THETA DETECTED". So it is presumably some sort of extreme case where slight differences in hardware cause some machines to error out but not others."
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=8003#51187

Thanks, kind of missed that one.
ID: 53748
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53749 - Posted: 22 Mar 2016, 21:48:32 UTC

I think that it's mostly the FPU maths, and how Intel and AMD, for example, use different algorithms. And even slight overclocking might change the results of these, because they are so close to the edge of stability.

And perhaps how many tasks have been crammed into the machine.
Different processor types also have different cache values.
Different amounts of memory, and how often data gets shoved out onto the HD.

Everyone is probably on their own here; what happens to others on the same workunit won't count.
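
A toy illustration of that kind of sensitivity, assuming nothing about the actual model code. The logistic map below is a standard textbook chaotic system, not part of HadCM3, and the starting values are arbitrary. Two runs that differ by about one part in 10^15, roughly the size of a rounding difference in double precision, agree at first and then drift apart completely.

```python
def logistic_run(x0: float, steps: int, r: float = 3.9) -> list[float]:
    """Iterate x -> r * x * (1 - x) and return every value visited."""
    values = [x0]
    for _ in range(steps):
        values.append(r * values[-1] * (1.0 - values[-1]))
    return values

run_a = logistic_run(0.2, 200)
run_b = logistic_run(0.2 + 1e-15, 200)   # differs only in the last bits

# Absolute difference between the two runs at every step.
gaps = [abs(a - b) for a, b in zip(run_a, run_b)]
first_big = next(i for i, g in enumerate(gaps) if g > 0.1)

print(f"gap after 10 steps: {gaps[10]:.3e}")
print(f"gap first exceeds 0.1 at step {first_big}")
```

A model pushed towards the edge of stability could plausibly behave in a similar way, with one machine staying inside acceptable values and another crossing the limit on the same workunit.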

ID: 53749
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53750 - Posted: 22 Mar 2016, 22:58:01 UTC - in response to Message 53746.  

Bernard

You wrote: "the strange thing is that the same WU managed to get at least 3 trickles (haven't followed up) after crashing on my machine."

I think that's to do with the way that the BOINC client works.

If you study the Event log when uploads happen, it's a bit like this:

1) Send all trickles
2) Start sending the zips, in the order in which they were created. Once this starts, then
3) Report any failed tasks.

At which point, any zips from the failed task(s) will suddenly disappear from the Upload queue.
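
The sequence described above can be written down as a toy model. This is only a sketch of the order as observed in the Event log, not BOINC's real logic; the `contact_server` function and all file and task names are invented for illustration.

```python
from collections import deque

def contact_server(trickles, zips, failed_tasks):
    """Replay one server contact in the order described above.

    trickles:     list of trickle file names waiting to go
    zips:         list of (task_name, zip_name) pairs, oldest first
    failed_tasks: set of task names that have errored out
    """
    # Step 1: every pending trickle goes first, failed tasks included.
    events = [f"send trickle {name}" for name in trickles]
    # Step 2: zip uploads begin, oldest first (only the first is modelled).
    queue = deque(zips)
    if queue:
        _task, name = queue.popleft()
        events.append(f"start upload {name}")
    # Step 3: failed tasks are reported.
    for task in sorted(failed_tasks):
        events.append(f"report failed task {task}")
    # Zips still queued for a failed task vanish from the upload queue.
    dropped = [name for task, name in queue if task in failed_tasks]
    remaining = [name for task, name in queue if task not in failed_tasks]
    events += [f"drop {name}" for name in dropped]
    return events, remaining

events, remaining = contact_server(
    trickles=["good_wu.trickle_3", "bad_wu.trickle_1"],
    zips=[("good_wu", "good_wu_1.zip"), ("bad_wu", "bad_wu_1.zip"),
          ("good_wu", "good_wu_2.zip")],
    failed_tasks={"bad_wu"},
)
for line in events:
    print(line)
```

In this model a trickle from a task that later fails still reaches the server, because trickles go out before the failure is reported.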

It's possible for trickle_up files and zips to still be on a computer for a while for some reason. One reason is that the Network setting in the BOINC manager is Off. Another is that there may be problems contacting the server.
And you have been experimenting. :)

ID: 53750
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 53751 - Posted: 22 Mar 2016, 23:16:01 UTC - in response to Message 53750.  

Thanks, though I meant that after it crashed on my machine at a few hundred seconds (no trickles produced at all), on another machine it was producing trickles steadily, went beyond thousands of seconds, and may even have finished successfully. So I guess it is more like the "case where slight differences in hardware cause some machines to error out but not others." For now I will wait a bit more before jumping on the HadCM3n wagon.
ID: 53751
Alex Plantema
Joined: 3 Sep 04
Posts: 126
Credit: 26,610,380
RAC: 3,377
Message 53755 - Posted: 23 Mar 2016, 9:17:12 UTC - in response to Message 53749.  

Les Bayliss wrote: "I think that it's mostly the FPU maths, and how Intel and AMD e.g. use different algorithms. And even slight overclocking might change the results of these, because they are so close to the edge of stability."

CPDN apparently uses very unstable algorithms if they depend upon the exact rounding.
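
For context, a small sketch of how results can change with evaluation order alone, independent of any CPDN code. Floating-point addition is not associative, so two builds or processors that group the same sum differently can disagree in the last bits. The three values are arbitrary.

```python
import math

values = [0.1, 0.2, 0.3]

# The same three numbers, summed with two different groupings.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)       # the two groupings differ
print(abs(left_to_right - right_to_left))   # size of the disagreement
print(math.fsum(values))                    # correctly rounded sum
```

A difference of that size is harmless in a stable calculation; it only matters when later steps amplify it.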
ID: 53755