climateprediction.net (CPDN) home page
Thread 'How to determine checkpoint intervals WAH2 and ANZ?'

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 53383 - Posted: 3 Feb 2016, 13:01:05 UTC
Last modified: 3 Feb 2016, 14:00:39 UTC

Hi there,

I'm trying to determine the checkpoint intervals for WAH2 and ANZ models running under Win7 on an i5-2020M. I need to shut down the machine, but want to suspend each WU only after a checkpoint has been written.

For WAH2 I have
Checkpoint: 292:32
CPU time: 293:07
Elapsed time: 304:56

For ANZ I have
Checkpoint: 98:18
CPU time: 98:42
Elapsed time: 102:08

Do these figures mean that for WAH2 no checkpoint has been recorded for 12 h (elapsed minus checkpoint), or should I look at the difference between CPU time and checkpoint?

edit: I guess I should look at CPU time, not elapsed, and then check a few times per hour to see when it checkpoints.
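For anyone wanting to check the arithmetic, here is a minimal sketch of the CPU-time-minus-checkpoint difference for the figures above. It assumes the Properties values are hours:minutes, which is what the "12 h" reading implies; the function names are made up for illustration.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'hours:minutes' string such as '292:32' to total minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def minutes_since_checkpoint(cpu_time: str, checkpoint: str) -> int:
    """CPU time accumulated since the last recorded checkpoint, in minutes."""
    return to_minutes(cpu_time) - to_minutes(checkpoint)


# Figures quoted in the post (assumed to be hours:minutes)
wah2_gap = minutes_since_checkpoint("293:07", "292:32")
anz_gap = minutes_since_checkpoint("98:42", "98:18")

print(f"WAH2: {wah2_gap} min of CPU time since the last checkpoint")
print(f"ANZ:  {anz_gap} min of CPU time since the last checkpoint")
```

This only says how long ago the last checkpoint was, not how long the interval is. For the interval you still need two consecutive checkpoint readings.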

Cheers
ID: 53383

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 53384 - Posted: 3 Feb 2016, 14:52:41 UTC - in response to Message 53383.  

Depending on the BOINC version, go to either Advanced or Options; there you can set the event log flags to include checkpoint debug. You will then be able to see all the checkpoints in the event log.
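As far as I know, the same flag can also be set by hand as checkpoint_debug under log_flags in the client's cc_config.xml, which may help where the Manager menu lacks the option. The sketch below only builds that XML text. The function name is illustrative, and where the file lives and how the client re-reads it depend on the installation, so those steps are left out.

```python
import xml.etree.ElementTree as ET


def build_cc_config() -> str:
    """Build a minimal cc_config.xml body that turns on checkpoint logging."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    # 1 enables the flag; 0 would disable it
    ET.SubElement(log_flags, "checkpoint_debug").text = "1"
    return ET.tostring(root, encoding="unicode")


config_xml = build_cc_config()
print(config_xml)
```

If a cc_config.xml already exists, merge the checkpoint_debug element into it by hand rather than overwriting the file.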
ID: 53384

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,756,611
RAC: 3,303
Message 53385 - Posted: 3 Feb 2016, 15:23:10 UTC

For wah_units:

I think dividing the percentage of progress by 0.286 or 0.277 will give you approx. the next trickle.

greetings

bonsai911
ID: 53385

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53389 - Posted: 3 Feb 2016, 22:27:49 UTC - in response to Message 53383.  

Hi Bernard

If you look at an ANZ model's graphics option, there's a counter at the bottom. This is counting down the number of Timesteps until the next checkpoint process starts.
5-6 years ago, when all of the models had graphics, there were 2 max values, which I think were 48 and 72. Now there are hundreds, which shows how much more detailed the models have become.

Back then, I'd look at the counter to see if it was well below the relevant max, but also well above zero, then Suspend that model. Then I'd repeat for the next model. After which I'd Suspend BOINC itself, then Exit BOINC.
The WaH (EU) models don't have a graphics option so you'll need to guess from the Properties option.

ID: 53389

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 53390 - Posted: 3 Feb 2016, 23:45:45 UTC - in response to Message 53389.  

Thank you all,

Looking at the ANZ graphics, the Timestep line only shows where I am now out of 138,624 timesteps. This is useful for knowing how much is left until the next trickle, but I'm not sure how to determine the next checkpoint from it.

I will enable checkpoint debug and then calculate the time between two checkpoints. (This time I suspended 10-15 minutes after the checkpoints, as I did not know when they would occur.)

It seems that before (about a year ago) I was mostly checking the Properties menu: "CPU time at last checkpoint" and CPU time. Under Linux there was no option (at least no obvious one) to set checkpoint debug for the event log.

Thanks again.

ID: 53390

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,033,903
RAC: 14,766
Message 53391 - Posted: 3 Feb 2016, 23:58:16 UTC - in response to Message 53390.  

There are checkpoints every 24 hrs of model time, which for the ANZ models is 384 timesteps. The timesteps go by more quickly as the model approaches midnight. Trickles are every 11,520 timesteps after the first one, which takes a few more. These work out at roughly every 8.3% of model completion time.
Hope this helps.
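The numbers above can be sanity-checked with a short sketch. It uses the 384 and 11,520 figures from this post and the 138,624 total reported earlier in the thread. It also assumes that checkpoints fall on exact multiples of 384 counted from timestep 0, which the thread does not confirm; the function name is illustrative.

```python
STEPS_PER_MODEL_DAY = 384   # ANZ checkpoint spacing quoted in the thread
STEPS_PER_TRICKLE = 11_520  # trickle spacing quoted in the thread
TOTAL_STEPS = 138_624       # total timesteps shown in the ANZ graphics


def steps_to_next_checkpoint(timestep: int) -> int:
    """Timesteps left until the next checkpoint.

    Assumption: checkpoints land on exact multiples of 384 from step 0.
    """
    return STEPS_PER_MODEL_DAY - (timestep % STEPS_PER_MODEL_DAY)


model_days_per_trickle = STEPS_PER_TRICKLE // STEPS_PER_MODEL_DAY
trickle_share = 100 * STEPS_PER_TRICKLE / TOTAL_STEPS

print(model_days_per_trickle)          # model days between trickles
print(round(trickle_share, 1))         # percent of the run per trickle
print(steps_to_next_checkpoint(1000))  # example: at timestep 1000
```

So a trickle covers 30 model days, and 11,520 out of 138,624 is indeed about 8.3%.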
ID: 53391

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53392 - Posted: 4 Feb 2016, 0:02:45 UTC - in response to Message 53390.  
Last modified: 4 Feb 2016, 0:11:43 UTC

In the graphics, there are 2 Timestep listings: one is called Timestep and counts up continuously; the other is a counter at the bottom of the list, labelled Save point.
This second one starts at a number which is the maximum timesteps in a model's day, and counts down to zero. When it reaches zero, the saving of all of the necessary files begins. This can take several seconds, which is why you need to have this counter less than the maximum before Suspending.

edit
Must type faster. :)

However, if there are 384 per day, then anywhere below 350 should be safe.
The reason they speed up later in the "day" is that this is the global part of the model, which is less detailed than the regional part, so there are fewer timesteps in global.
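The rule of thumb can be written down as a small check. This is only a sketch: the 350 upper bound is the figure suggested above for a 384-step day, while the lower margin of 30 is a made-up placeholder for "well above zero" and should be adjusted to taste.

```python
def safe_to_suspend(save_point: int, upper: int = 350, lower: int = 30) -> bool:
    """True if the Save point counter is clear of both ends of the cycle.

    upper: counter must have dropped below this, so the previous save is done.
    lower: hypothetical margin so the next save is not about to start.
    """
    return lower < save_point < upper


for counter in (380, 200, 5):
    print(counter, safe_to_suspend(counter))
```

A counter near 384 means a save may still be in progress, and one near zero means the next save is about to begin, so both ends are rejected.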
ID: 53392

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 53393 - Posted: 4 Feb 2016, 0:16:21 UTC

The problem is that if you are running several CPDN tasks on the same machine, unless they are in perfect sync there are always going to be some that are in the middle of a save cycle. You are always going to lose some work no matter when you shut down.

ID: 53393

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53394 - Posted: 4 Feb 2016, 0:40:04 UTC - in response to Message 53393.  

Which is why you need to check each one individually, then Suspend that model. One after the other.
Or take your chances.

ID: 53394

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 53398 - Posted: 4 Feb 2016, 7:32:55 UTC

Under Linux there was no option (at least no obvious one) to set checkpoint debug for the event log.


In 7.4.22 at least (not sure about earlier versions), <Advanced> <Event Log Diagnostic Flags> lets you look at checkpoints. I think this is the first Linux version of BOINC I have seen it on, but it could have been there before and I missed it.
ID: 53398

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 53399 - Posted: 4 Feb 2016, 8:18:56 UTC - in response to Message 53391.  

Hi folks

There are checkpoints every 24 hrs of model time, which for the ANZ models is 384 timesteps...


Where can one see the model timesteps, or what can they be calculated from?

In the graphics, there are 2 Timestep listings: one is called Timestep and counts up continuously; the other is a counter at the bottom of the list, labelled Save point.


No such counter on the graphics I see. The bottom one is Progress (BOINC 7.4.42)

Which is why you need to check each one individually, then Suspend that model.


As for the models, I do check and suspend them individually. On a 4-thread machine running both WAH2s and ANZs, that is a bit demanding if I do not know exactly when to expect the checkpoint. After a restart I do not resume them simultaneously, just to be sure that they do not all write to disk at the same time. Well, with mixed types of models that might be overkill.

In 7.4.22 at least (not sure about earlier versions), <Advanced> <Event Log Diagnostic Flags>


This is still labelled as Development (since 2014) and may be unstable. I run 7.2.47 and 7.2.42, which are distribution-specific and don't have that function.

Thank you all
ID: 53399

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53402 - Posted: 4 Feb 2016, 11:23:57 UTC - in response to Message 53399.  

My 2 "Windows" machines are running BOINC 7.6.9
They are actually Linux Mint, running Wine, running a Windows version of BOINC.

The BOINC version doesn't matter, it's the climate model version which does, which for the ANZ is 6.10

After the Progress line, there's a blank line/black area, then a line showing Next Savepoint, then another blank line/black area, before the bottom border.

If you don't have this, then there's something wrong, perhaps a missing cpdn file/program.

ID: 53402

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 53404 - Posted: 4 Feb 2016, 15:20:25 UTC - in response to Message 53402.  

My 2 "Windows" machines are running BOINC 7.6.9
They are actually Linux Mint, running Wine, running a Windows version of BOINC.

The BOINC version doesn't matter, it's the climate model version which does, which for the ANZ is 6.10

After the Progress line, there's a blank line/black area, then a line showing Next Savepoint, then another blank line/black area, before the bottom border.

If you don't have this, then there's something wrong, perhaps a missing cpdn file/program.


Here is how the ANZ v 6.10 graphic looks. The machine runs Lubuntu, Wine 1.6.2 (Windows 7 environment), BOINC 7.4.42 (x86), wxWidgets version: 3.0.1.

The Windows 7 (BOINC 7.6.9) machine I was initially referring to displays the graphic exactly the same as the one above under Wine.
ID: 53404

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53406 - Posted: 4 Feb 2016, 23:27:52 UTC - in response to Message 53404.  

That's the picture all right. It just lacks the last line, which is the one in which you're interested. Odd.

Could be a missing/corrupt file.

So you'll have to guess.


ID: 53406

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,033,903
RAC: 14,766
Message 53408 - Posted: 5 Feb 2016, 10:46:36 UTC - in response to Message 53406.  

I think the savepoint is at midnight model time, so if you can see the model time (top line), then any time between 10:00 and 15:00 should be OK to suspend.
ID: 53408

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 53410 - Posted: 5 Feb 2016, 19:38:56 UTC

So far the easiest and perhaps safest way (if the graphics are incomplete) might be to set the checkpoint debug flag in BOINC when under Windows or Wine, then do some basic math. On Linux, either check the model's Properties tab over a few hours and calculate the checkpoint intervals, or upgrade to the development version 7.4.22, where the flag can be set.
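The "basic math" could look something like the sketch below. The timestamps are invented examples, typed in by hand as if copied from the event log for one task, and the ISO-style format is an assumption, so change fmt to match whatever your log actually shows.

```python
from datetime import datetime

# Hypothetical checkpoint times for one task, copied by hand from the event log
stamps = [
    "2016-02-04 10:02:00",
    "2016-02-04 10:31:30",
    "2016-02-04 11:00:00",
]


def checkpoint_intervals(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Minutes of wall-clock time between consecutive checkpoints."""
    times = [datetime.strptime(t, fmt) for t in timestamps]
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]


intervals = checkpoint_intervals(stamps)
print(intervals)
print(sum(intervals) / len(intervals))  # average interval in minutes
```

Note that these are wall-clock intervals. If the machine is busy or the task is throttled, the CPU time between checkpoints will be shorter than what this reports.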
ID: 53410