Thread 'UK Met Office HadAM4 at N216 resolution'

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61311 - Posted: 21 Oct 2019, 9:44:59 UTC - in response to Message 61306.  

It looks like the cache is the culprit..

This will slow down those 64 and 128 core machines.
Unless they're just crashing them because of the missing lib.


Heaven forbid that lack of cache memory should slow down their crashing!

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61312 - Posted: 21 Oct 2019, 11:23:57 UTC - in response to Message 61306.  

I wonder why they put so much cache in my relatively slow processor. I am glad they did.
My kernel is not all that old: 2019 Sep 17 09:53 vmlinuz-2.6.32-754.23.1.el6.x86_64
CPU type 	GenuineIntel
Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
Number of processors 	4
Operating System 	Linux  2.6.32-754.23.1.el6.x86_64
BOINC version 	7.2.33
Memory  	15.5 GB
Cache    	10240 KB
Swap space  	3.91 GB
Total disk space 	117.21 GB
Free Disk Space 	103.25 GB
Measured floating point speed 	1.27 billion ops/sec
Measured integer speed   	3.52 billion ops/sec
Average upload rate   	2174.4 KB/sec
Average download rate 	9265.96 KB/sec


Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61313 - Posted: 21 Oct 2019, 12:12:11 UTC
Last modified: 21 Oct 2019, 12:41:37 UTC

I am trying a single N216 on my Ryzen 3700x, and after 3 1/2 hours the estimated completion time is 7.5 days.
That is pretty good, though how many I will be able to run is to be determined.

The good news is that people won't need so much memory, especially for OpenIFS.
They will be limited by the cache.

(I have to clear out some WCG work before I get back to it in a couple of days. That includes a few MIP1, and I don't want them to interfere.)

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61323 - Posted: 22 Oct 2019, 2:14:16 UTC

Do N216 work units make trickles? If so, when?

My present work unit is 18.96% complete, having run for 101 hours with 257 hours to complete (if you believe these numbers). Last checkpoint was at about 92 hours.

It has not attempted to send any trickles yet.

name hadam4h_a0pg_200811_4_842_011905372
Task 21760249 Workunit 11905372

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61325 - Posted: 22 Oct 2019, 2:18:26 UTC

After a bit of experimenting, I find that setting "use at most 50% of the processors" works best for me on both the i7-8700 and the i7-9700.
That means six virtual cores on the i7-8700 and four full cores on the i7-9700 (they both have 12 MB L3 cache).

The fast way to find how much work is being done is just to check the writes to disk; the more the writes, the more the work done.
I use "iostat -m 7200" to measure the writes (in MB) over a two-hour (7200 second) period.
If you have not used iostat before, you first run "sudo apt install sysstat" to install it.

This implies that with its 32 MB of L3 cache, the Ryzen 3700x should run best on 8 cores, but I would check it to be sure.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 61326 - Posted: 22 Oct 2019, 2:26:11 UTC - in response to Message 61323.  

Do N216 work units make trickles? If so, when?

Yes. The 4 in the task name means a 4 month model so it will trickle after every month, at 25% Progress.

Yep...a long time between checkpoints and a long time between trickles. This was brought up during testing but they kept this as with the other models, one checkpoint per model day and one trickle per model month.
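
As a rough illustration of what "one trickle per model month" means in wall-clock terms, here is a minimal Python sketch. It assumes that the BOINC progress figure advances linearly with model time, which is an assumption and not something confirmed in this thread. The function name and its parameters are made up for the example; the input figures (101 hours elapsed at 18.96%) are the ones quoted in the question above.

```python
def hours_to_trickles(elapsed_hours, percent_done, model_months=4):
    """Estimate the elapsed hours at which each monthly trickle is sent.

    Assumes progress is linear in model time (an assumption), so a
    4-month model trickles at 25%, 50%, 75% and 100%.
    """
    total_hours = elapsed_hours / (percent_done / 100.0)
    hours_per_month = total_hours / model_months
    return [hours_per_month * month for month in range(1, model_months + 1)]


# Figures from the question: 18.96% complete after 101 hours
estimates = hours_to_trickles(101, 18.96)
for month, hours in enumerate(estimates, start=1):
    print(f"trickle {month}: after about {hours:.0f} hours")
```

With those inputs, the first trickle would not be expected until well after the 101-hour mark, which is consistent with none having been sent yet.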

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61333 - Posted: 22 Oct 2019, 12:36:55 UTC - in response to Message 61326.  

Do N216 work units make trickles? If so, when?

Yes. The 4 in the task name means a 4 month model so it will trickle after every month, at 25% Progress.


Thank you. I was beginning to wonder if something was wrong.

P.s.: I am not requesting any change.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61337 - Posted: 22 Oct 2019, 14:32:38 UTC

My i7-4790 running on four cores (50% of processors) is taking 12 days per work unit.

But that reminds me that a simple way of estimating the proper number of cores to use is to look at the CPU % (as in BOINC tasks).
When you are operating properly in the cache, it will be high, up around 99% or so. But if you are running too many work units, then it will take a hit down to 70% or so.
There are more accurate ways of determining what is going on, but that is a quick estimate.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61365 - Posted: 24 Oct 2019, 5:31:37 UTC

My 8 have just reached halfway, after just under 7 days.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61369 - Posted: 24 Oct 2019, 12:37:36 UTC
Last modified: 24 Oct 2019, 12:38:18 UTC

I am having a hard time getting consistent readings using the write-to-disk method. Even with a two hour monitoring period, I get large variations from 300 MB_wrtn to 4000 MB_wrtn on my Ryzen 3700X when running on 9 cores (and see about the same variation on 7 cores). So I will just go with 8 cores, with estimated completion times of about 15 days.

Overall, based on estimated completion times, running on my i7-9700 with 4 cores (50%) does the best, with completion times about 7 days. I think the 32 GB of memory in it will be enough for the OpenIFS, which is convenient.
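
One way to compare machines on an equal footing is throughput (work units finished per day) rather than completion time alone. The sketch below is only illustrative: it assumes one work unit per core in use, and it takes the rough per-task times mentioned in this thread (7, 15 and 12 days) at face value. The dictionary labels and the function name are made up for the example.

```python
def tasks_per_day(concurrent_tasks, days_per_task):
    """Work units completed per day when running several tasks at once."""
    return concurrent_tasks / days_per_task


# (tasks running at once, estimated days per task), as quoted in the thread.
# One task per core in use is an assumption.
machines = {
    "i7-9700, 4 tasks": (4, 7.0),
    "Ryzen 3700X, 8 tasks": (8, 15.0),
    "i7-4790, 4 tasks": (4, 12.0),
}

throughput = {name: tasks_per_day(n, days) for name, (n, days) in machines.items()}
for name, rate in sorted(throughput.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.2f} work units per day")

best = max(throughput, key=throughput.get)
```

On these estimates the i7-9700 comes out slightly ahead of the Ryzen 3700X on eight cores, even though the Ryzen runs twice as many tasks at once.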

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61373 - Posted: 24 Oct 2019, 16:25:08 UTC - in response to Message 61369.  

Do you use iostat? The first time it runs it gives the usage since the system was booted. Then the amounts in subsequent intervals. My machine runs boinc all the time it is up, but I booted it almost 5 days ago, so the totals are relatively small. All boincdata (including programs), and only boinc, are on /dev/sdd1. Boinc client has been running two hadcm3s, one hadam4, and one hadam4h the whole time. There are two other partitions on that drive one of which can be busy if I choose to watch videos.

Normally, one would use a greater interval than 60 seconds for this, and more than two outputs.

$ iostat -p sdd -k 60 2
Linux 2.6.32-754.23.1.el6.x86_64 (DellT7600.localdomain) 	10/24/2019 	_x86_64_	(4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.65   95.38    0.86    0.06    0.00    0.05

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              13.21        22.59       617.35    9491521  259414857
sdd1              0.01         1.34         0.00     564356        560
sdd2              0.00         0.01         0.00       2149         45
sdd3             13.20        21.24       617.34    8924764  259414252

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.52   93.55    1.51    1.37    0.00    0.05

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              93.32         0.33     10521.40         20     631284
sdd1              0.00         0.00         0.00          0          0
sdd2              0.00         0.00         0.00          0          0
sdd3             93.32         0.33     10521.40         20     631284



Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61374 - Posted: 24 Oct 2019, 17:04:09 UTC - in response to Message 61373.  
Last modified: 24 Oct 2019, 17:05:26 UTC

Do you use iostat? The first time it runs it gives the usage since the system was booted. Then the amounts in subsequent intervals.

Yes, I was running "iostat -m 7200" (two hours), and disregarded the first one. The next two gave around 300 MB, and only the third gave a reasonable value of around 4000 MB_wrtn.

So I have to conclude that the work units vary a lot in what they write over time. I would probably have to do a 24-hour interval to get a reliable measurement, and in that amount of time I can just calculate it from the run time and % completed.

I don't need perfect accuracy, just enough to decide which machine to use, and how many cores. I think I have that, for this set of work anyway.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61380 - Posted: 25 Oct 2019, 13:38:34 UTC - in response to Message 61374.  

I had to reboot my machine last night (problems with SELinux OS), and I then started running iostat hourly for 24 hours (not yet completed). Here are the results so far. /dev/sdd3 is the partition with all boinc, and only boinc, in it. (I said something else the other day, but that was a mistake.)

I have to conclude that the work units vary a lot in what they write over time. I would probably have to do a 24-hour interval to get a reliable measurement, and in that amount of time I can just calculate it from the run time and % completed.


I do not notice this. My machine is a 1.8 GHz 64-bit Xeon with 10240 KBytes of Cache. It is running four ClimatePrediction work units, two hadcm3s, one hadam4, and one hadam4h, and no other boinc work units lately.

$ iostat -p sdd -tk 3600 24
Linux 2.6.32-754.23.1.el6.x86_64 (DellT7600.localdomain) 	10/25/2019 	_x86_64_	(4 CPU)

10/25/2019 12:43:51 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.61   27.42    1.76    4.21    0.00   64.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              48.84      4791.36       431.42    2732513     246041
sdd1              1.28         4.71         0.01       2688          8
sdd2              1.05         3.73         0.02       2129          9
sdd3             46.41      4782.49       431.39    2727452     246024

10/25/2019 01:43:51 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.65   94.08    1.16    0.05    0.00    0.06

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              12.24         4.36       527.88      15684    1900384
sdd1              0.00         0.00         0.00          0          0
sdd2              0.00         0.00         0.00          0          0
sdd3             12.24         4.36       527.88      15684    1900384

10/25/2019 02:43:51 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.39   98.16    0.39    0.04    0.00    0.02

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              13.39         0.17       616.22        608    2218392
sdd1              0.00         0.00         0.00          0          0
sdd2              0.00         0.00         0.00          0          0
sdd3             13.39         0.17       616.22        608    2218392

10/25/2019 03:43:51 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.05   98.18    1.70    0.05    0.00    0.01

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              13.23        13.68       602.85      49260    2170276
sdd1              0.01         0.07         0.00        256          0
sdd2              0.00         0.00         0.00          0          0
sdd3             13.22        13.61       602.85      49004    2170276

10/25/2019 04:43:51 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.54   94.81    1.57    0.04    0.00    0.04

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              12.90         0.97       593.14       3488    2135312
sdd1              0.00         0.00         0.00          0          0
sdd2              0.00         0.00         0.00          0          0
sdd3             12.90         0.97       593.14       3488    2135312

10/25/2019 05:43:51 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.99   94.81    1.10    0.06    0.00    0.04

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              13.83         4.71       706.16      16972    2542168
sdd1              0.00         0.00         0.00          0          0
sdd2              0.00         0.00         0.00          0          0
sdd3             13.83         4.71       706.16      16972    2542168

10/25/2019 06:43:51 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.78   95.10    1.03    0.04    0.00    0.05

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              13.04         1.69       591.13       6080    2128076
sdd1              0.00         0.00         0.00          0          0
sdd2              0.00         0.00         0.00          0          0
sdd3             13.04         1.69       591.13       6080    2128076

10/25/2019 07:43:51 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.97   96.05    0.88    0.06    0.00    0.03

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              12.90         0.13       595.69        476    2144480
sdd1              0.00         0.00         0.00          0          0
sdd2              0.00         0.00         0.00          0          0
sdd3             12.90         0.13       595.69        476    2144480

10/25/2019 08:43:51 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.85   96.25    0.82    0.04    0.00    0.04

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              12.68         1.38       582.83       4984    2098184
sdd1              0.00         0.00         0.00          0          0
sdd2              0.00         0.00         0.00          0          0
sdd3             12.68         1.38       582.83       4984    2098184
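
To put a number on how steady the hourly figures above are, here is a minimal Python sketch that pulls the kB_wrtn column for one device out of iostat text and drops the first report, since that one covers the time since boot. The function name is made up for the example, and the parsing assumes the six-column device layout shown above (other sysstat versions print different columns). The sample rows are the sdd3 lines copied from the output above.

```python
import statistics


def writes_per_interval(iostat_text, device):
    """Return kB_wrtn for `device`, one value per report, skipping the
    first report (which holds totals since boot)."""
    values = []
    for line in iostat_text.splitlines():
        fields = line.split()
        # Device rows: name, tps, kB_read/s, kB_wrtn/s, kB_read, kB_wrtn
        if len(fields) == 6 and fields[0] == device:
            values.append(int(fields[5]))
    return values[1:]


# sdd3 rows copied from the hourly reports above
SAMPLE = """\
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd3             46.41      4782.49       431.39    2727452     246024
sdd3             12.24         4.36       527.88      15684    1900384
sdd3             13.39         0.17       616.22        608    2218392
sdd3             13.22        13.61       602.85      49004    2170276
sdd3             12.90         0.97       593.14       3488    2135312
sdd3             13.83         4.71       706.16      16972    2542168
sdd3             13.04         1.69       591.13       6080    2128076
sdd3             12.90         0.13       595.69        476    2144480
sdd3             12.68         1.38       582.83       4984    2098184
"""

hourly_kb = writes_per_interval(SAMPLE, "sdd3")
mean_kb = statistics.mean(hourly_kb)
print(f"intervals: {len(hourly_kb)}")
print(f"min/max kB written per hour: {min(hourly_kb)} / {max(hourly_kb)}")
print(f"mean: {mean_kb:.0f} kB, max/min ratio: {max(hourly_kb) / min(hourly_kb):.2f}")
```

A max/min ratio close to 1 indicates steady writes; the 300 MB to 4000 MB swings reported earlier in the thread would give a ratio above 13.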


Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61381 - Posted: 25 Oct 2019, 15:09:34 UTC - in response to Message 61380.  
Last modified: 25 Oct 2019, 15:11:58 UTC

I do not notice this. My machine is a 1.8 GHz 64-bit Xeon with 10240 KBytes of Cache. It is running four ClimatePrediction work units, two hadcm3s, one hadam4, and one hadam4h, and no other boinc work units lately.

I am running only the hadam4h now (N216), and did not notice the variation on the other ones either, which are considerably smaller.
You have a lot of cache per core also, which may reduce any variations. I think we need to check on each machine to see what works.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61382 - Posted: 25 Oct 2019, 16:15:04 UTC

What the long time between checkpoints does mean for these tasks is that, on computers that get switched off several times a day, they will never finish, because if they have not reached the first checkpoint they will restart from the beginning.

If your computer is one of these, please use suspend either to RAM or to disk instead of just switching off.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61383 - Posted: 25 Oct 2019, 19:31:28 UTC - in response to Message 61382.  
Last modified: 25 Oct 2019, 19:34:21 UTC

What the long time between checkpoints does mean for these tasks is that, on computers that get switched off several times a day, they will never finish, because if they have not reached the first checkpoint they will restart from the beginning.

If your computer is one of these, please use suspend either to RAM or to disk instead of just switching off.


I try to run my machine 24/7 for about a month at a time. Basically, whenever Red Hat send me a new OS kernel. But once in a while I need to do it more often. I had not thought about the problem of restarting a work unit with such a long interval between checkpoints. My last checkpoint was 175:33:54 ago.

I did not think about the effect of restarting a long time after a checkpoint, since my default interval is 600 seconds (not applicable here). But I did set no new tasks for any project, and suspended all the ClimatePrediction work units. I then shut down the boinc client before rebooting my machine. That was just because of those problems with the UK Met Office HadAM4 at N144 resolution v8.08 i686-pc-linux-gnu work units.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61390 - Posted: 26 Oct 2019, 7:58:43 UTC

Just checked, suspending computation and shutting BOINC down does not stop this. You do need to use suspend rather than a hard reboot.

alanb1951
Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 61392 - Posted: 26 Oct 2019, 9:44:51 UTC - in response to Message 61308.  

In reply to Jim1348 in re Ryzen 3700X

I'm about to take delivery of a Ryzen 3700X (32MB L3 cache, though I gather access is constrained to 8MB per 2 cores (4 threads)); I'll be interested to see how that behaves as and when it gets some CPDN work to do (and will probably do some bulk tests with WCG MIP1 to get an idea if there's no CPDN work available!)

Cheers - Al.

[1] Someone over at WCG seemed to think 5MB cache was what a MIP1 job would like. The user offered no justification for that number but 4MB probably isn't enough for near-optimum performance.

Thanks a lot for the cache info. I was beginning to think that the issues were deeper than I had found.
I just happen to have a Ryzen 3700x, and was wondering what its large L3 cache would do here. But I would need to add more memory. So let us know, and I could do it.


I have finally got the beast up and running on Ubuntu 18.04.3 (kernel 5.0.0-32-generic). It has 32GB of 3200MHz RAM, boots from an NVMe SSD, and I've put /var on HDD RAID 1 so that logs and checkpoint files aren't hammering the SSD. (/home is on RAID as well - all my non-laptop builds are done like that...) I haven't done any tuning apart from making sure that the memory clock and fabric clock are fixed at 3200 and 1600 respectively.

It has taken until now to get a decent work-load built up; I'm currently running 12 WCG tasks (with a check to stop MIP1 from running more than two at a time) and 2 CPDN HadAM4h tasks at a time, along with one GPU task from SETI@Home, Einstein@Home or MilkyWay@Home, so the system is getting a fair work-out. It seems to have all clocks at about 3.95GHz, and the machine is drawing about 140W not counting the GPU.

As regards checkpointing and completion times, after the first checkpoint (which seems to cover a few more time steps) it seems to checkpoint about every 60 minutes. I haven't had one generate a trickle yet but at current rate of progress I expect a trickle at about 33 hours 20 minutes, and the tasks to finish in about 5 days 13 hours.

I'm going to let the machine run with that sort of work load for a while to make sure it's behaving consistently, and I plan on doing some experiments with more HadAM4h tasks running at once when the current two have finished. It might be interesting to find out how many I can run at once without serious degradation if the only other work on the machine is WCG MCM1 (which is very cache-friendly!)

I'll try to do some task-level performance stats at some point, but on AMD CPUs there's no direct way of getting a count of L3 cache misses (I think it counts them at the cache level rather than the CPU level...) so one key stat isn't available. Ah, well...

Hope this was of interest - Al.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61393 - Posted: 26 Oct 2019, 11:29:10 UTC - in response to Message 61392.  
Last modified: 26 Oct 2019, 11:48:51 UTC

I'm currently running 12 WCG tasks (with a check to stop MIP1 from running more than two at a time) and 2 CPDN HadAM4h tasks at a time, along with one GPU task from SETI@Home, Einstein@Home or MilkyWay@Home, so the system is getting a fair work-out.

I am finding that my Ryzen 3700x begins to fall off a cliff of sorts beyond three HadAM4h (N216). Above that, the write-rate gets erratic, and begins to fall off.
So I will use an app_config.xml to limit it to three, and run WCG on the other cores.

That is a bit surprising, with the 32 MB L3 cache, so there is some other limiting factor.

EDIT: Of course, with the WCG also running, I may need to limit the HadAM4h even more, down to two. I will be monitoring it for a while.
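
For anyone who has not written an app_config.xml before, here is a minimal Python sketch that generates one limiting an application to three tasks at a time. The `<app_config>`, `<app>`, `<name>` and `<max_concurrent>` elements are standard BOINC client configuration; the app name "hadam4h" is only a guess based on the task names in this thread and needs to be checked against the name in client_state.xml. The function name is made up for the example.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Build the text of an app_config.xml that caps one app's running tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# "hadam4h" is an assumed app name; check client_state.xml for the real one
xml_text = build_app_config("hadam4h", 3)
print(xml_text)
```

The resulting text goes into a file named app_config.xml in the project's directory under the BOINC data directory, and the client picks it up after re-reading its config files or restarting.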

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61408 - Posted: 27 Oct 2019, 16:36:03 UTC
Last modified: 27 Oct 2019, 17:17:34 UTC

I just completed my first four on my i7-9700. They all went swimmingly, completing in a little over 7 days. I ended up on four cores, but they initially ran on eight.
The next group of four will be the same, but the one after that may be a little faster.
https://www.cpdn.org/results.php?hostid=1493890

But there will be more of a learning curve on this one for what works for each machine. I hope it holds true for OpenIFS too, or we have to start all over.