Feedback on running OpenIFS large memory (16-25 GB+) configurations requested

Profile PDW

Joined: 29 Nov 17
Posts: 82
Credit: 14,321,884
RAC: 91,453
Message 71645 - Posted: 16 Oct 2024, 19:10:45 UTC - in response to Message 71644.  

Anyone running Ubuntu 20.04 LTS or anything based on this or earlier, won't have proper glibc versions to run it. Probably also Ubuntu 22.04 LTS will have a problem, but I don't have an installation of that to test it.

Ran fine on Ubuntu 22.04.4 LTS [5.15.0-119-generic|libc 2.35]
ID: 71645 · Report as offensive     Reply Quote
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2185
Credit: 64,822,615
RAC: 5,275
Message 71646 - Posted: 16 Oct 2024, 20:33:30 UTC - in response to Message 71645.  

Anyone running Ubuntu 20.04 LTS or anything based on this or earlier, won't have proper glibc versions to run it. Probably also Ubuntu 22.04 LTS will have a problem, but I don't have an installation of that to test it.

Ran fine on Ubuntu 22.04.4 LTS [5.15.0-119-generic|libc 2.35]

Good to know. Thanks.
ID: 71646 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71647 - Posted: 16 Oct 2024, 20:47:04 UTC - in response to Message 71644.  

I tried running it on a Linux Mint 20.3 PC, fully updated, and it gave the following errors about glibc versions

++ ./oifs_43r3_omp_model.exe
./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./oifs_43r3_omp_model.exe)


I run Red Hat Enterprise Linux 8.10 and it has GLIBC 2.28.

glibc-2.28-251.el8_10.5.x86_64
glibc-2.28-251.el8_10.5.i686

So I hope the tasks distributed by the Boinc server will be statically linked.
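A quick way to tell in advance whether a host would hit the `GLIBC_2.33 not found` error is to compare the installed glibc version with the one the binary asks for. The following is a minimal sketch, assuming the test binary needs nothing newer than 2.33 (the only version named in the error above); the helper names are made up for illustration.

```python
import platform

# Taken from the "GLIBC_2.33 not found" error quoted above (assumption: nothing newer is needed).
REQUIRED_GLIBC = (2, 33)


def parse_version(text):
    """Turn a version string such as '2.28' into a tuple of ints, e.g. (2, 28)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_new_enough(found, required=REQUIRED_GLIBC):
    """Compare version tuples; an empty tuple (unknown libc) counts as too old."""
    return bool(found) and found >= required


if __name__ == "__main__":
    # libc_ver() returns e.g. ('glibc', '2.35'); it may return ('', '') if detection fails.
    lib, version = platform.libc_ver()
    found = parse_version(version) if lib == "glibc" else ()
    print(f"detected libc: {lib or 'unknown'} {version}")
    if glibc_is_new_enough(found):
        print("glibc looks new enough for the test binary")
    else:
        print("glibc is older than 2.33 (or could not be detected)")
```

With the versions reported in this thread, 2.35 (Ubuntu 22.04) passes and 2.28 (RHEL 8.10) does not.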
ID: 71647 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71648 - Posted: 16 Oct 2024, 21:20:07 UTC - in response to Message 71646.  

Thanks. Yes, I forgot about older systems. This is my test executable, built on Ubuntu 22.04. I'll make sure it'll run on older systems when we release it.
---
CPDN Visiting Scientist
ID: 71648 · Report as offensive     Reply Quote
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,555,761
RAC: 58,597
Message 71649 - Posted: 16 Oct 2024, 23:29:17 UTC - in response to Message 71648.  

I have a question for newer systems. :-P
I disassembled the oifs_43r3_omp_model.exe test binary and it doesn't have AVX2 or AVX-512 instructions. Meanwhile, the binaries for past jobs (e.g. oifs_43r3_bl_1.13_x86_64-pc-linux-gnu) had both. I suppose that's just different build flags, and the production release can do runtime detection of ISA support and pick the optimal path? It would be great not to get stuck on 2000s ISA features. :-) It's not a big deal for the purpose of testing memory usage, but it could mean the runtimes we get are conservative on newer machines.
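For anyone wanting to repeat this kind of check, here is a rough sketch that scans `objdump` disassembly for wide vector registers. It is only a heuristic, not necessarily the method used above: `ymm` registers show up in both AVX and AVX2 code, so a `ymm` hit does not prove AVX2 specifically, while `zmm` registers imply AVX-512. It assumes GNU `objdump` is installed; the function names are illustrative.

```python
import re
import subprocess
import sys

# Heuristic patterns: ymm registers imply AVX/AVX2 code, zmm registers imply AVX-512.
YMM = re.compile(r"\bymm\d+\b")
ZMM = re.compile(r"\bzmm\d+\b")


def count_vector_registers(disassembly):
    """Count disassembly lines that mention ymm or zmm registers."""
    counts = {"ymm": 0, "zmm": 0}
    for line in disassembly.splitlines():
        if YMM.search(line):
            counts["ymm"] += 1
        if ZMM.search(line):
            counts["zmm"] += 1
    return counts


def disassemble(path):
    """Return the text disassembly of an executable using GNU objdump."""
    result = subprocess.run(
        ["objdump", "-d", "--no-show-raw-insn", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python scan_isa.py ./oifs_43r3_omp_model.exe
    print(count_vector_registers(disassemble(sys.argv[1])))
```

Zero counts for both would be consistent with a build that does not enable AVX.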
ID: 71649 · Report as offensive     Reply Quote
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,555,761
RAC: 58,597
Message 71650 - Posted: 17 Oct 2024, 3:11:21 UTC - in response to Message 71638.  

Finally got around to running it, and it's nice to be able to answer some of my own questions.

The memory pattern looks exactly like other OpenIFS tasks, just bigger. 90% of the time it's below 20G, but the other 10% goes up to 25G. On my 32GB mini-PC with a Ryzen 7840HS, I was able to use the computer normally with multiple browser tabs open and videos playing. I feel a 32G system should be pretty solid for running one task, at least on native Linux. That's an idle host though. If one also runs other memory-heavy projects like ATLAS, it's probably game over. The Boinc client can still start the task blissfully unaware of the looming 25G spike that's going to OOM the host. :-D

Each run wrote 35-40GB of data to disk. That averages around 30GB per hour with 2 threads. Using 4 threads can get to ~50GB written per hour. On the same host, previous OpenIFS tasks wrote ~50GB in 6 hours. While the disk load is heavier per task, once I account for the memory need, I can only run one task instead of five concurrently. The overall disk IO is similar on average, so no concerns there.
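The "similar on average" claim can be checked with a little arithmetic. The sketch below uses only the rounded figures quoted in this post, so the results are approximate.

```python
def aggregate_write_rate(concurrent_tasks, gb_per_task, hours_per_task):
    """Average GB written per hour across all tasks running at the same time."""
    return concurrent_tasks * gb_per_task / hours_per_task


# Five of the earlier, smaller OpenIFS tasks: ~50 GB each over ~6 hours.
older_five = aggregate_write_rate(5, 50, 6)

# One large-memory task: the post gives ~30 GB/h (2 threads) and ~50 GB/h (4 threads).
large_2_threads = 30.0
large_4_threads = 50.0

print(f"five older tasks     : {older_five:.1f} GB/h")
print(f"one large, 2 threads : {large_2_threads:.1f} GB/h")
print(f"one large, 4 threads : {large_4_threads:.1f} GB/h")
```

Five of the older tasks come to roughly 42 GB per hour in total, which sits between the 2-thread and 4-thread rates of a single large task.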
ID: 71650 · Report as offensive     Reply Quote
klepel

Joined: 9 Oct 04
Posts: 82
Credit: 69,917,734
RAC: 9,024
Message 71651 - Posted: 17 Oct 2024, 6:55:42 UTC

I do not have any problem running OIFS according to your specifications. I might bring on 4 real Linux computers with 32 GB of RAM or more and at least 10 GB of disk space, plus a virtual one.
ID: 71651 · Report as offensive     Reply Quote
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,784,587
RAC: 19,387
Message 71653 - Posted: 17 Oct 2024, 9:09:25 UTC - in response to Message 71636.  

Glenn, have you had a chance to test this multi-core OIFS app on WSL2? I think the only multi-core BOINC app I've run on WSL2 is LHC Atlas and I've never been able to get it to run multi-core and had to resort to running it single core.
Did you set e.g. 'processors=4' in the .wslconfig? I've not tried it. I'll let you know.

Tried the demo app with 2 and 4 cores on WSL2 Ubuntu 22.04 and both worked. The 4 core (thread) setup took 1.5 hours; the rest of the threads were filled with Rosetta.

Is it as important to dedicate whole cores for this app as it is with W@H?
ID: 71653 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71654 - Posted: 17 Oct 2024, 9:57:13 UTC - in response to Message 71649.  

I disassembled the oifs_43r3_omp_model.exe test binary and it doesn't have AVX2 or AVX-512 instructions. Meanwhile, the binaries for past jobs (e.g. oifs_43r3_bl_1.13_x86_64-pc-linux-gnu) had both. I suppose that's just different build flags, and the production release can do runtime detection of ISA support and pick the optimal path?
That is interesting. I deliberately do not build with avx because I have not tested the impact on the model results. I've read articles that enabling it has a detrimental impact on results from atmospheric codes. I didn't build the BL variant - I will check that! We worry about reproducibility with these codes more than outright speed.
---
CPDN Visiting Scientist
ID: 71654 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71655 - Posted: 17 Oct 2024, 10:08:18 UTC - in response to Message 71650.  

I feel 32G system should be pretty solid to run one task, at least on native Linux. That's an idle host though. If one also runs other memory heavy projects like ATLAS, it's probably game over. Boinc client can still start the task blissfully unaware of the looming 25G spike that's going to OOM the host. :-D
This is a point I was going to get to. You might recall we had issues with the boinc client starting more OpenIFS tasks than the machine had memory for. The problem was the client effectively ignored the memory bound we set in the task's XML and instead relied on watching the early memory usage.

We reported this issue and David Anderson has now fixed that bug. Andy@CPDN has tested that it works. It will be rolled out with client 8.0.4 and we'll send out a note encouraging people to upgrade. That will help with the lower resolution OpenIFS tasks and might(?) help with running mixed projects. The fix also needs server changes & enabling in the XML so only the projects with large memory tasks will probably use it.
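To illustrate why the declared memory bound matters, here is a toy admission check. It is not the BOINC client's actual scheduling logic, only a sketch of the idea: the 48 GB host and the 18 GB "early usage" figure are hypothetical, and the 25 GB peak comes from the measurements earlier in the thread.

```python
def tasks_that_fit(memory_estimates_gb, usable_gb):
    """Admit tasks in order while the summed memory estimate stays within usable_gb.

    Returns the indices of the admitted tasks.
    """
    admitted, used = [], 0.0
    for index, estimate in enumerate(memory_estimates_gb):
        if used + estimate <= usable_gb:
            admitted.append(index)
            used += estimate
    return admitted


USABLE_GB = 48  # hypothetical host

# Judging by early observed usage (hypothetical 18 GB each), both tasks are started...
print(tasks_that_fit([18, 18], USABLE_GB))
# ...but judging by the declared 25 GB peak, only the first one is.
print(tasks_that_fit([25, 25], USABLE_GB))
```

In the first case the host would later need about 50 GB when both tasks hit their peak, which is the kind of over-commitment described above.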

And thanks for all the feedback, makes me more confident this won't cause problems.
---
CPDN Visiting Scientist
ID: 71655 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71656 - Posted: 17 Oct 2024, 10:11:12 UTC - in response to Message 71653.  

Is it as important to dedicate whole cores for this app as it is with W@H?
Definitely. Any code with lots of floating-point work runs better on a real core than on a hardware thread.
ID: 71656 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,691,690
RAC: 10,582
Message 71657 - Posted: 17 Oct 2024, 14:25:05 UTC

Some timings from my machines:

Intel i5-14400, Linux Mint 22 - 4 threads 36 minutes, 2 threads 68 minutes
Intel i5-9600KF, Linux Mint 21 - 4 threads 59 minutes, 2 threads 140 minutes
Linux Mint 20 - failed to run because of the GLIBC version mis-match problem.
ID: 71657 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71658 - Posted: 17 Oct 2024, 15:38:20 UTC - in response to Message 71657.  
Last modified: 17 Oct 2024, 15:39:41 UTC

Some timings from my machines:
Intel i5-14400, Linux Mint 22 - 4 threads 36 minutes, 2 threads 68 minutes
Intel i5-9600KF, Linux Mint 21 - 4 threads 59 minutes, 2 threads 140 minutes
Linux Mint 20 - failed to run because of the GLIBC version mis-match problem.
As the demo model is configured for 2 days and the batches will be run for ~6 simulated months, multiply runtimes by 90. Which gives:

i5-14400 : 4 cores, 54 hrs ; 2 cores, 102 hrs (4.25 days)
i5-9600 : 4 cores, 88 hrs ; 2 cores, 210 hrs (8.75 days)

Make the default 4 cores then instead of 2 for this configuration? Especially if we stick to 1 task per host.
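The scaling above can be reproduced with a few lines. This sketch assumes "~6 simulated months" means 180 days, which is what gives the factor of 90 for a 2-day demo.

```python
DEMO_DAYS = 2     # length of the demo model run
BATCH_DAYS = 180  # assumption: ~6 simulated months taken as 180 days


def projected_hours(demo_minutes, demo_days=DEMO_DAYS, batch_days=BATCH_DAYS):
    """Scale a demo runtime in minutes to a full-length run, returned in hours."""
    return demo_minutes * (batch_days / demo_days) / 60


# Demo timings in minutes, from the post being replied to.
timings = {
    "i5-14400, 4 threads": 36,
    "i5-14400, 2 threads": 68,
    "i5-9600KF, 4 threads": 59,
    "i5-9600KF, 2 threads": 140,
}

for label, minutes in timings.items():
    hours = projected_hours(minutes)
    print(f"{label}: {hours:.1f} hrs ({hours / 24:.2f} days)")
```

The i5-9600KF 4-thread case comes out at 88.5 hours, which the list above rounds down to 88.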
---
CPDN Visiting Scientist
ID: 71658 · Report as offensive     Reply Quote
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,784,587
RAC: 19,387
Message 71659 - Posted: 17 Oct 2024, 17:27:05 UTC - in response to Message 71658.  

Make the default 4 cores then instead of 2 for this configuration? Especially if we stick to 1 task per host.

I'd say at the very least make the core count user configurable from the get go, regardless of which default you'll choose.
ID: 71659 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71660 - Posted: 17 Oct 2024, 18:13:50 UTC - in response to Message 71659.  

I'd say at the very least make the core count user configurable from the get go, regardless of which default you'll choose.
I'd prefer not to. From previous experience it's easier to debug remote problems with the setup the same. Options can come later once we are sure it all works satisfactorily.
---
CPDN Visiting Scientist
ID: 71660 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71661 - Posted: 17 Oct 2024, 19:08:39 UTC - in response to Message 71655.  

We reported this issue and David Anderson has now fixed that bug. Andy@CPDN has tested that it works. It will be rolled out with client 8.0.4 and we'll send out a note encouraging people to upgrade.


Well, my RHEL8 system is running 7.20.2 Boinc Client. I examined RHEL9 and it runs the same Boinc Client.
ID: 71661 · Report as offensive     Reply Quote
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,555,761
RAC: 58,597
Message 71662 - Posted: 18 Oct 2024, 3:06:25 UTC - in response to Message 71661.  

Well, my RHEL8 system is running 7.20.2 Boinc Client. I examined RHEL9 and it runs the same Boinc Client.

It's not even officially released yet and it likely would take another major release for non-rolling distros to onboard the new version after official release. RHEL might be further behind.

If we'd encourage people to upgrade, the opt-in page would be a nice place to put all this info. Hopefully 8.0.4 will have been officially released by then. Otherwise, advising people to install an alpha or beta release would be a bit weird, especially when CPDN isn't responsible for any of the potential bugs there.
ID: 71662 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,971,712
RAC: 21,921
Message 71663 - Posted: 18 Oct 2024, 4:58:57 UTC

Otherwise advising people to install an alpha or beta release would be a bit weird, especially when CPDN isn't responsible for any of the potential bugs there.


Agreed. Though my Linux installation is 8.1.0 currently.
ID: 71663 · Report as offensive     Reply Quote
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,555,761
RAC: 58,597
Message 71664 - Posted: 18 Oct 2024, 19:36:57 UTC

With this benchmark, I guess I can finally get cleaner data for my previous analysis about affinity and THP.

TLDR:
* CPU affinity doesn't matter. The timing is the same so I didn't bother to put them into a table.
* THP has major benefits, at least on Zen 4.
* This higher resolution model scales up to 4 cores pretty nicely.

Detailed data below. All timings are in minutes. No thread affinity was applied for these runs, so the OS is doing the scheduling.

AMD Ryzen 7 7840HS, 8C/16T, 45W TDP, 2x16G DDR5-5600, Ubuntu 24.04

4C/4T, THP vs no THP, idle host

Config      Run 1    Run 2    Run 3    Average    Speed Up
--------  -------  -------  -------  ---------  ----------
THP off     40.09    40.04    40.05      40.06        1.00
THP on      34.98    34.99    35.05      35.01        1.14

I also watched the huge page stats a few times during the run. Huge pages covered at least 90% of the OpenIFS memory usage, following its usage pattern. Looks pretty effective.
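For anyone who wants to check their own host, the sketch below reads the two standard Linux interfaces for this: `/sys/kernel/mm/transparent_hugepage/enabled` for the active THP mode and the `AnonHugePages` line of `/proc/meminfo` for how much anonymous memory is currently backed by huge pages. It is an illustration rather than the exact procedure used for the numbers above, and the function names are made up.

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")
MEMINFO = Path("/proc/meminfo")


def active_thp_mode(text):
    """Return the bracketed mode, e.g. 'always [madvise] never' -> 'madvise'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def anon_huge_pages_kb(meminfo_text):
    """Return the AnonHugePages value in kB from /proc/meminfo text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("AnonHugePages:"):
            return int(line.split()[1])
    return None


if __name__ == "__main__":
    try:
        print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
        kb = anon_huge_pages_kb(MEMINFO.read_text())
        if kb is not None:
            print(f"AnonHugePages: {kb / 1024 / 1024:.2f} GiB")
    except OSError as err:
        # Not Linux, or the kernel was built without THP support.
        print("could not read THP information:", err)
```

Note that `AnonHugePages` is a system-wide figure, so on a busy host it includes other processes as well.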

Thread scaling, idle host, THP on

Config      Run 1    Run 2    Run 3    Average    Speed Up
--------  -------  -------  -------  ---------  ----------
1C1T       109.00   108.50   107.90     108.47        1.00
2C2T        59.52    59.73    59.74      59.66        1.82
4C4T        34.83    35.00    35.01      34.95        3.10
8C8T        25.32    25.26    25.21      25.26        4.29

Thread scaling, busy host, THP on

Config      Run 1    Run 2    Run 3    Average    Speed Up
--------  -------  -------  -------  ---------  ----------
1C1T       133.10   132.80   132.73     132.88        1.00
2C2T        69.33    69.37    69.35      69.35        1.92
4C4T        38.23    38.33    38.26      38.27        3.47
8C8T        25.32    25.26    25.21      25.26        5.26

To keep the host busy, I used SiDock@Home to fill the remaining cores, one task per core. Thus the host always had 8 threads running at 100%.

Compared to Glenn's previous data, the higher resolution model scales quite a bit better. If all cores are busy anyway, I only lose about half a core's worth of compute at 4 threads. What's interesting is that in Glenn's data a busy host scales worse, but mine scales better, even though the actual runtimes are all longer. I suspect that my idle-host scaling is worse because of the much higher boost frequency when fewer cores are active. The gap between single-core and all-core boost is usually much larger on power-limited mobile chips, which could explain this. I didn't take frequency readings though.
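The "Speed Up" columns and the "half a core" remark follow from the averages in the tables. A small sketch of the calculation, using those averages as input:

```python
def speedup(t_single, t_parallel):
    """Speed-up relative to the single-thread runtime."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, threads):
    """Parallel efficiency: speed-up divided by the number of threads."""
    return speedup(t_single, t_parallel) / threads


# Average runtimes in minutes, keyed by thread count, from the tables above.
idle = {1: 108.47, 2: 59.66, 4: 34.95, 8: 25.26}
busy = {1: 132.88, 2: 69.35, 4: 38.27, 8: 25.26}

for name, runs in (("idle", idle), ("busy", busy)):
    for threads, minutes in runs.items():
        s = speedup(runs[1], minutes)
        e = efficiency(runs[1], minutes, threads)
        print(f"{name} host, {threads} threads: speed-up {s:.2f}, "
              f"efficiency {e:.0%}, cores lost {threads - s:.2f}")
```

On the busy host at 4 threads this gives a speed-up of about 3.47, i.e. roughly 0.53 of a core lost.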
ID: 71664 · Report as offensive     Reply Quote
Vato

Joined: 4 Oct 19
Posts: 15
Credit: 9,174,915
RAC: 3,722
Message 71665 - Posted: 18 Oct 2024, 20:52:55 UTC - in response to Message 71638.  

If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS' but I suggest keeping to 4 or less as the parallel efficiency drops off markedly beyond 4 threads.


Hi Glenn - if the WU is using virtually all the memory on a machine, why would we worry about the efficiency dropping off? From my PoV, giving the WU all the cores gives the best overall performance in this case. The extra cores running at only (e.g.) 20% efficiency is still more work done per unit time. Or is the synchronisation required really that heavyweight?
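One way to put a number on this is the marginal efficiency of the extra cores: how much additional speed-up each added core contributes. The sketch below uses the idle-host averages posted earlier in this thread (108.47, 34.95 and 25.26 minutes for 1, 4 and 8 threads on a Ryzen 7840HS); other hosts may behave differently.

```python
def marginal_efficiency(t_single, t_few, n_few, t_many, n_many):
    """Extra speed-up gained per additional thread when going from n_few to n_many."""
    gain = t_single / t_many - t_single / t_few
    return gain / (n_many - n_few)


# Idle-host average runtimes in minutes, taken from the scaling table earlier in the thread.
T1, T4, T8 = 108.47, 34.95, 25.26

extra = marginal_efficiency(T1, T4, 4, T8, 8)
print(f"each of cores 5-8 adds about {extra:.0%} of a core's worth of speed-up")
print(f"runtime drops from {T4} to {T8} minutes ({1 - T8 / T4:.0%} shorter)")
```

On that data each of the four extra cores contributes roughly 30% of a core, so the task still finishes sooner with 8 threads when nothing else can use those cores.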
ID: 71665 · Report as offensive     Reply Quote