Message boards : Number crunching : How to determine checkpoint intervals WAH2 and ANZ?
Author | Message |
---|---|
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Hi there, I'm trying to determine the checkpoint intervals for WAH2 and ANZ models running under Win7 on an i5-2020M. I need to shut down the machine, but want to suspend each WU just after a checkpoint has been written. For WAH2 I have Checkpoint: 292:32, CPU time: 293:07, Elapsed time: 304:56. For ANZ I have Checkpoint: 98:18, CPU time: 98:42, Elapsed time: 102:08. Do these figures mean that for WAH2 no checkpoint has been recorded for 12 h (Elapsed minus Checkpoint), or should I look at the difference between CPU time and Checkpoint? Edit: I guess I should look at CPU time, not elapsed time, and then check a few times per hour to see when it checkpoints. Cheers |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Depending on your BOINC version, go to either the Advanced or the Options menu, where you can set the event log flags to include checkpoint debug. You will then be able to see all the checkpoints in the event log. |
Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,756,611 RAC: 3,303 |
For wah_units: I think dividing the percentage of progress by 0.286 or 0.277 will give you approx. the next trickle. greetings bonsai911 |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Bernard. If you look at an ANZ model's graphics option, there's a counter at the bottom. This counts down the number of Timesteps until the next checkpoint process starts. 5-6 years ago, when all of the models had graphics, there were 2 max values, which I think were 48 and 72. Now there are hundreds, which shows how much more detailed the models have become. Back then, I'd look at the counter to see if it was well below the relevant max, but also well above zero, then Suspend that model. Then I'd repeat for the next model. After that I'd Suspend BOINC itself, then Exit BOINC. The WaH (EU) models don't have a graphics option, so you'll need to guess from the Properties option. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Thank you all. Looking at the ANZ graphics, the Timestep shows only where I am now out of 138,624 timesteps. This is kind of useful for knowing how much is left until the next trickle, but I'm not sure how to determine the next checkpoint from it. I will enable checkpoint debug and then calculate the time between two checkpoints. (This time I suspended 10-15 minutes after the checkpoints, as I did not know when they would occur.) It seems that before (about a year ago) I was mostly checking the Properties menu: "CPU time at last checkpoint" and CPU time. Under Linux there was no option (at least no obvious one) to set checkpoint debug for the event log. Thanks again. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,033,903 RAC: 14,766 |
There are checkpoints every 24 hours of model time, which for the ANZ models is 384 timesteps - they are more frequent as the model approaches midnight. Trickles are every 11,520 timesteps after the first one, which takes a few more. This works out at roughly one trickle per 8.3% of model completion. Hope this helps. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
In the graphics, there are 2 Timestep listings: one is called Timestep, and counts up continuously; the other is a counter at the bottom of the list, labelled Save point. This second one starts at a number which is the maximum timesteps in a model's day, and counts down to zero. When it reaches zero, the saving of all of the necessary files begins. This can take several seconds, which is why you need to have this counter less than the maximum before Suspending. Edit: Must type faster. :) However, if there are 384 per day, then anywhere below 350 should be safe. The reason they speed up later in the "day" is that that's the global part of the model, which is less detailed than the regional part, so there are fewer timesteps in global. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
The problem is that if you are running several CPDN tasks on the same machine, unless they are in perfect sync there are always going to be some that are in the middle of a save cycle. You are always going to lose some work no matter when you shut down. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Which is why you need to check each one individually, then Suspend that model. One after the other. Or take your chances. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Under Linux there was no option (at least obvious) to set checkpoint debug for the event's log." In 7.4.22 at least (not sure about earlier versions), <advanced> <Event Log Diagnostic Flags> lets you look at checkpoints. I think this is the first Linux version of BOINC I have seen it on, but it could have been there before and I missed it. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Hi folks. "There are checkpoints every 24hrs of model time which for the ANZ models is 384 timesteps..." Where can one see the model timesteps, or what can they be calculated from? "In the graphics, there are 2 Timestep listings: one is called Timestep, and counts up continuously, the other is a counter at the bottom of the list, labelled: Save point" There is no such counter on the graphics I see. The bottom one is Progress (BOINC 7.4.42). "Which is why you need to check each one individually, then Suspend that model." I do check and suspend the models individually. On a 4-thread machine, running WAH2s and ANZs is a bit demanding if I do not know exactly when to expect the checkpoint. After a restart I do not start them simultaneously, just to be sure that they do not all write to the disk at the same time. Well, with mixed types of models that might be overkill. "In 7.4.22 at least, not sure about earlier versions <advanced> <Event Log Diagnostic Flags>" This is still labelled as Development (since 2014) and may be unstable. I run 7.2.47 and 7.2.42, which are distribution-specific and don't have that function. Thank you all |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
My 2 "Windows" machines are running BOINC 7.6.9. They are actually Linux Mint, running Wine, running a Windows version of BOINC. The BOINC version doesn't matter; it's the climate model version which does, which for the ANZ is 6.10. After the Progress line, there's a blank line/black area, then a line showing Next Savepoint, then another blank line/black area, before the bottom border. If you don't have this, then there's something wrong, perhaps a missing cpdn file/program. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
"My 2 'Windows' machines are running BOINC 7.6.9" Here is how the ANZ v 6.10 graphic looks. The machine runs Lubuntu, Wine 1.6.2 (Windows 7 environment), BOINC 7.4.42 (x86), wxWidgets version: 3.0.1. The Windows 7 (BOINC 7.6.9) machine I was initially referring to displays the graphic exactly the same as the one above under Wine. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
That's the picture all right. It just lacks the last line, which is the one in which you're interested. Odd. Could be a missing/corrupt file. So you'll have to guess. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,033,903 RAC: 14,766 |
I think the savepoint is at midnight model time, so if you can see the model time (top line), then any time between 10:00 and 15:00 should be OK to suspend. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
So far the easiest and perhaps safest way (if the graphics are incomplete) might be to set the checkpoint debug flag in BOINC when under Windows or Wine, then do some basic math. As for Linux, either check the model's Properties tab for a few hours and calculate the checkpoint intervals, or upgrade to the developmental 7.4.22, where the flag can be set. |
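
The "basic math" mentioned in this thread can be sketched in a few lines of Python. This is only an illustration, not anything provided by BOINC or CPDN: the function names are made up for the example, and it assumes the Properties figures are in hours:minutes (which matches the "12 h" reading in the first post). The timestep numbers are the ones quoted above for ANZ.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'hours:minutes' string such as '292:32' to minutes.
    Assumption: the Properties values quoted in the thread are hh:mm."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def minutes_since_checkpoint(checkpoint: str, cpu_time: str) -> int:
    """CPU minutes of work done since the last checkpoint was written."""
    return to_minutes(cpu_time) - to_minutes(checkpoint)


def checkpoint_intervals(readings: list[str]) -> list[int]:
    """Gaps in minutes between successive distinct
    'CPU time at last checkpoint' readings noted down by hand.
    If a checkpoint falls between two readings and is missed,
    the gap reported here will be too long."""
    distinct: list[int] = []
    for reading in readings:
        value = to_minutes(reading)
        if not distinct or value != distinct[-1]:
            distinct.append(value)
    return [later - earlier for earlier, later in zip(distinct, distinct[1:])]


# Figures from the first post (Checkpoint vs. CPU time).
wah2_gap = minutes_since_checkpoint("292:32", "293:07")
anz_gap = minutes_since_checkpoint("98:18", "98:42")

# ANZ timestep figures quoted in the thread.
TOTAL_TIMESTEPS = 138_624
STEPS_PER_MODEL_DAY = 384    # one checkpoint per model day
STEPS_PER_TRICKLE = 11_520

model_days = TOTAL_TIMESTEPS // STEPS_PER_MODEL_DAY
days_per_trickle = STEPS_PER_TRICKLE // STEPS_PER_MODEL_DAY
trickle_share = 100 * STEPS_PER_TRICKLE / TOTAL_TIMESTEPS

print(f"WAH2: {wah2_gap} CPU min since last checkpoint")
print(f"ANZ:  {anz_gap} CPU min since last checkpoint")
print(f"ANZ:  {model_days} model days, a trickle every {days_per_trickle} days "
      f"({trickle_share:.1f}% of the run)")
```

On these numbers the time that matters is CPU time minus Checkpoint, not Elapsed minus Checkpoint. To estimate the interval itself, pass a list of readings taken from the Properties window over a few hours to `checkpoint_intervals`.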
©2024 cpdn.org