Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested
Author | Message |
---|---|
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,562,423 RAC: 89,234 |
"Anyone running Ubuntu 20.04 LTS or anything based on this or earlier, won't have proper glibc versions to run it. Probably also Ubuntu 22.04 LTS will have a problem, but I don't have an installation of that to test it." Ran fine on Ubuntu 22.04.4 LTS [5.15.0-119-generic|libc 2.35] |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"Anyone running Ubuntu 20.04 LTS or anything based on this or earlier, won't have proper glibc versions to run it. Probably also Ubuntu 22.04 LTS will have a problem, but I don't have an installation of that to test it." Good to know. Thanks. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I tried running it on a Linux Mint 20.3 PC, fully updated, and it gave the following errors about glibc versions" I run Red Hat Enterprise Linux 8.10, which has GLIBC 2.28 (glibc-2.28-251.el8_10.5.x86_64 and glibc-2.28-251.el8_10.5.i686), so I hope the tasks distributed by the BOINC server will be statically linked. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Thanks. Yes, I forgot about older systems. This is my test executable, built on Ubuntu 22.04. I'll make sure it runs on older systems when we release it. --- CPDN Visiting Scientist |
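For anyone wanting to check a host before trying the test executable, here is a minimal sketch of a glibc version comparison. The helper names `parse_version` and `glibc_is_new_enough` are made up for illustration, and the `2.35` threshold is only the version of the host reported as working earlier in the thread. It is an assumption, not a confirmed minimum for the binary.

```python
import platform


def parse_version(text):
    """Turn a dotted version such as '2.35' into a tuple of ints, e.g. (2, 35)."""
    return tuple(int(part) for part in text.split("."))


def glibc_is_new_enough(host_version, required_version):
    """Return True when the host glibc is at least the required version."""
    return parse_version(host_version) >= parse_version(required_version)


if __name__ == "__main__":
    # platform.libc_ver() reports the C library the Python interpreter is
    # linked against; on a glibc-based Linux host this is normally
    # ("glibc", "<version>").
    lib, version = platform.libc_ver()
    # ASSUMPTION: 2.35 is just the version of the host reported as working,
    # not a documented requirement of the test executable.
    required = "2.35"
    if lib == "glibc" and version:
        ok = glibc_is_new_enough(version, required)
        print(f"host glibc {version}, at least {required}: {ok}")
    else:
        print("could not determine a glibc version on this host")
```

Comparing tuples of integers rather than strings matters here, because a plain string comparison would rank "2.9" above "2.28".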
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,790,501 RAC: 64,282 |
I have a question for newer systems. :-P I disassembled the oifs_43r3_omp_model.exe test binary and it doesn't have avx2 or avx512 instructions. Meanwhile, the binaries for past jobs (e.g. oifs_43r3_bl_1.13_x86_64-pc-linux-gnu) had both. I suppose that's just different build flags, where the production release can do runtime detection of ISA support and pick the optimal path? It would be great to not get stuck on 2000s ISA features. :-) Not a big deal for the purpose of testing memory usage, but that could mean the runtimes we get would be conservative on newer machines. |
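As a rough illustration of what "runtime detection of ISA support" means, the sketch below reads the CPU feature flags that Linux exposes in `/proc/cpuinfo` and picks a label for a code path. The function names and the three path labels are hypothetical, and nothing here reflects how the OpenIFS binaries are actually built or dispatched.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def pick_code_path(flags):
    """Choose the widest vector path the CPU advertises (labels are hypothetical)."""
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "baseline"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print(pick_code_path(cpu_flags(handle.read())))
    except OSError:
        # Not a Linux host, or /proc is unavailable.
        print("no /proc/cpuinfo on this system")
```

Compiled codes usually do this inside the binary (for example via compiler-generated dispatch), so this only shows the decision logic, not a build recipe.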
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,790,501 RAC: 64,282 |
Finally got around to running it, and it's nice to be able to answer some of my own questions. The memory pattern looks exactly like other OpenIFS tasks, just bigger. 90% of the time it's below 20G, but the other 10% goes up to 25G. On my 32GB mini-PC with a Ryzen 7840HS, I was able to use the computer normally with multiple browser tabs open and videos playing. I feel a 32G system should be pretty solid to run one task, at least on native Linux. That's an idle host though. If one also runs other memory-heavy projects like ATLAS, it's probably game over. The BOINC client can still start the task, blissfully unaware of the looming 25G spike that's going to OOM the host. :-D Each run wrote 35-40GB of data to disk. It averages around 30GB per hour with 2 threads. Using 4 threads can get to ~50GB written per hour. On the same host, the previous OpenIFS task wrote ~50GB in 6 hours. While the disk load is heavier per task, if I account for the memory need, I can only run one instead of five concurrently. The overall disk IO is similar on average, so no concerns there. |
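The worry about the late 25G spike can be turned into a quick pre-flight check. The sketch below parses the `MemAvailable` field from `/proc/meminfo` and compares it with the reported peak. The helper names, the 2 GiB safety reserve, and treating "25G" as 25 GiB are all assumptions made for illustration. This is not what the BOINC client does.

```python
def mem_available_gib(meminfo_text):
    """Return MemAvailable from /proc/meminfo text in GiB, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / (1024 ** 2)
    return None


def has_headroom(available_gib, peak_gib=25.0, reserve_gib=2.0):
    """True when available memory covers the assumed peak plus a reserve."""
    return available_gib >= peak_gib + reserve_gib


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            available = mem_available_gib(handle.read())
    except OSError:
        available = None
    if available is None:
        print("MemAvailable not found on this system")
    else:
        print(f"{available:.1f} GiB available, enough headroom: "
              f"{has_headroom(available)}")
```

A check like this only reflects the moment it runs, so another memory-heavy task starting later can still push the host over the edge.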
Send message Joined: 9 Oct 04 Posts: 82 Credit: 69,923,532 RAC: 8,011 |
I do not have any problem running OIFS according to your specifications. I might bring on 4 real Linux computers with 32 GB of RAM or more and at least 10 GB of disk space, plus a virtual one. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,845,927 RAC: 19,699 |
"Glenn, have you had a chance to test this multi-core OIFS app on WSL2? I think the only multi-core BOINC app I've run on WSL2 is LHC Atlas and I've never been able to get it to run multi-core and had to resort to running it single core." "Did you set e.g. 'processors=4' in the .wslconfig ? I've not tried it. I'll let you know." Tried the demo app with 2 and 4 cores on WSL2 Ubuntu 22.04 and both worked. The 4 core (thread) setup took 1.5 hours; the rest of the threads were filled with Rosetta. Is it as important to dedicate whole cores for this app as it is with W@H? |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I disassembled oifs_43r3_omp_model.exe test binary and it doesn't have avx2 or avx512 instructions. Meanwhile, the binaries for past jobs (e.g. oifs_43r3_bl_1.13_x86_64-pc-linux-gnu) had both. I suppose that's just different build flags, where the production release can do runtime detection of ISA support and pick the optimal path?" That is interesting. I deliberately do not build with avx because I have not tested the impact on the model results. I've read articles saying that enabling it has a detrimental impact on results from atmospheric codes. I didn't build the BL variant - I will check that! We worry about reproducibility with these codes more than outright speed. --- CPDN Visiting Scientist |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I feel 32G system should be pretty solid to run one task, at least on native Linux. That's an idle host though. If one also runs other memory heavy projects like ATLAS, it's probably game over. Boinc client can still start the task blissfully unaware of the looming 25G spike that's going to OOM the host. :-D" This is a point I was going to get to. You might recall we had issues with the BOINC client starting more OpenIFS tasks than the machine had memory for. The problem was that the client effectively ignored the memory bound we set in the task's XML and instead relied on watching the early memory usage. We reported this issue and David Anderson has now fixed that bug. Andy@CPDN has tested that it works. It will be rolled out with client 8.0.4 and we'll send out a note encouraging people to upgrade. That will help with the lower resolution OpenIFS tasks and might(?) help with running mixed projects. The fix also needs server changes and enabling in the XML, so probably only the projects with large memory tasks will use it. And thanks for all the feedback; it makes me more confident this won't cause problems. --- CPDN Visiting Scientist |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Is it as important to dedicate whole cores for this app as it is with W@H?" Definitely. Any code with lots of floating point works better with a real core, not a thread. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
Some timings from my machines: Intel i5-14400, Linux Mint 22: 4 threads 36 minutes, 2 threads 68 minutes. Intel i5-9600KF, Linux Mint 21: 4 threads 59 minutes, 2 threads 140 minutes. Linux Mint 20: failed to run because of the GLIBC version mismatch problem. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Some timings from my machines:" As the demo model is configured for 2 days and the batches will be run for ~6 simulated months, multiply runtimes by 90, which gives: i5-14400: 4 cores, 54 hrs; 2 cores, 102 hrs (4.25 days). i5-9600: 4 cores, 88 hrs; 2 cores, 210 hrs (8.75 days). Make the default 4 cores then, instead of 2, for this configuration? Especially if we stick to 1 task per host. --- CPDN Visiting Scientist |
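The extrapolation above is simple enough to reproduce. This sketch assumes "~6 simulated months" means 180 days, which is what gives the factor of 90 against the 2-day demo, and that runtime scales linearly with simulated length. The function names are illustrative only.

```python
def scale_factor(demo_days, batch_days):
    """Ratio of production simulation length to demo simulation length."""
    return batch_days / demo_days


def extrapolate_hours(demo_minutes, factor):
    """Scale a demo runtime in minutes to an estimated batch runtime in hours."""
    return demo_minutes * factor / 60


if __name__ == "__main__":
    # ASSUMPTION: ~6 simulated months is taken as 180 days.
    factor = scale_factor(demo_days=2, batch_days=180)
    timings = {
        "i5-14400, 4 threads": 36,
        "i5-14400, 2 threads": 68,
        "i5-9600KF, 4 threads": 59,
        "i5-9600KF, 2 threads": 140,
    }
    for label, minutes in timings.items():
        hours = extrapolate_hours(minutes, factor)
        print(f"{label}: {hours:.1f} h ({hours / 24:.2f} days)")
```

The i5-9600KF 4-thread case works out to 88.5 hours, which matches the 88 hrs quoted above after rounding down.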
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,845,927 RAC: 19,699 |
"Make the default 4 cores then instead of 2 for this configuration? Especially if we stick to 1 task per host." I'd say at the very least make the core count user configurable from the get go, regardless of which default you'll choose. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I'd say at the very least make the core count user configurable from the get go, regardless of which default you'll choose." I'd prefer not to. From previous experience, it's easier to debug remote problems when the setup is the same everywhere. Options can come later once we are sure it all works satisfactorily. --- CPDN Visiting Scientist |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"We reported this issue and David Anderson has now fixed that bug. Andy@CPDN has tested that it works. It will be rolled out with client 8.0.4 and we'll send out a note encouraging people to upgrade." Well, my RHEL8 system is running the 7.20.2 BOINC client. I examined RHEL9 and it runs the same BOINC client. |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,790,501 RAC: 64,282 |
"Well, my RHEL8 system is running 7.20.2 Boinc Client. I examined RHEL9 and it runs the same Boinc Client." It's not even officially released yet, and it will likely take another major release for non-rolling distros to pick up the new version after the official release. RHEL might be further behind. If we're going to encourage people to upgrade, the opt-in page would be a nice place to put all this info. Hopefully 8.0.4 will have been officially released by then. Otherwise, advising people to install an alpha or beta release would be a bit weird, especially when CPDN isn't responsible for any of the potential bugs there. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,022,240 RAC: 20,762 |
"Otherwise advising people to install an alpha or beta release would be a bit weird, especially when CPDN isn't responsible for any of the potential bugs there." Agreed. Though my Linux installation is 8.1.0 currently. |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,790,501 RAC: 64,282 |
With this benchmark, I guess I can finally get cleaner data for my previous analysis about affinity and THP. TLDR:
* CPU affinity doesn't matter. The timing is the same, so I didn't bother to put it into a table.
* THP has major benefits, at least on Zen 4.
* This higher resolution model scales up to 4 cores pretty nicely.

Detailed data below. All timings are in minutes. No thread affinity was applied for these runs, so the OS did the scheduling.

AMD Ryzen 7 7840HS, 8C/16T, 45W TDP, 2x16G DDR5-5600, Ubuntu 24.04

4C/4T, THP vs no THP, idle host
Config    Run 1    Run 2    Run 3    Average    Speed Up
--------  -------  -------  -------  ---------  ----------
THP off   40.09    40.04    40.05    40.06      1.00
THP on    34.98    34.99    35.05    35.01      1.14

I also watched huge page stats a few times during the run. THP covers at least 90% of the OpenIFS memory usage, following its usage pattern. Looks pretty effective.

Thread scaling, idle host, THP on
Config    Run 1    Run 2    Run 3    Average    Speed Up
--------  -------  -------  -------  ---------  ----------
1C1T      109.00   108.50   107.90   108.47     1.00
2C2T      59.52    59.73    59.74    59.66      1.82
4C4T      34.83    35.00    35.01    34.95      3.10
8C8T      25.32    25.26    25.21    25.26      4.29

Thread scaling, busy host, THP on
Config    Run 1    Run 2    Run 3    Average    Speed Up
--------  -------  -------  -------  ---------  ----------
1C1T      133.10   132.80   132.73   132.88     1.00
2C2T      69.33    69.37    69.35    69.35      1.92
4C4T      38.23    38.33    38.26    38.27      3.47
8C8T      25.32    25.26    25.21    25.26      5.26

To keep the host busy, I used SiDock@Home to fill the remaining cores, one task per core. Thus the host always had 8 threads running at 100%. Compared to Glenn's previous data, the higher resolution model scales quite a bit better. If all cores are busy anyway, I only lose half a core's worth of compute at 4 threads. What's interesting is that in Glenn's data, a busy host scales worse, but mine scales better, even though the actual runtimes are all longer.
I suspect that my idle host scaling is worse because of the much higher boost frequency when fewer cores are active. The gap between single-core and all-core boost is usually much larger on power limited mobile chips, which could explain this. I didn't take frequency readings, though. |
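The speed-up column in the tables above is the single-thread average divided by the average at each thread count. Here is a minimal sketch of that calculation using the idle-host figures, with parallel efficiency (speed-up divided by thread count) added as an extra column that is not in the original post. The function name `speedup_table` is made up for illustration.

```python
from statistics import mean


def speedup_table(runs_by_threads):
    """Map thread count -> (average minutes, speed-up vs 1 thread, efficiency)."""
    averages = {threads: mean(runs) for threads, runs in runs_by_threads.items()}
    baseline = averages[1]
    return {
        threads: (avg, baseline / avg, baseline / avg / threads)
        for threads, avg in averages.items()
    }


if __name__ == "__main__":
    # Idle host, THP on; timings in minutes, copied from the table above.
    idle = {
        1: [109.00, 108.50, 107.90],
        2: [59.52, 59.73, 59.74],
        4: [34.83, 35.00, 35.01],
        8: [25.32, 25.26, 25.21],
    }
    for threads, (avg, speedup, eff) in speedup_table(idle).items():
        print(f"{threads} threads: avg {avg:.2f} min, "
              f"speed-up {speedup:.2f}, efficiency {eff:.0%}")
```

Swapping in the busy-host timings gives the second scaling table the same way, with the 1C1T busy-host run as the baseline.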
Send message Joined: 4 Oct 19 Posts: 15 Credit: 9,174,915 RAC: 3,722 |
"If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS' but I suggest keeping to 4 or less as the parallel efficiency drops off markedly beyond 4 threads." Hi Glenn - if the WU is using virtually all the memory on a machine, why would we worry about the efficiency dropping off? From my PoV, giving the WU all the cores gives the best overall performance in this case. Extra cores running at only (e.g.) 20% efficiency still mean more work done per unit time. Or is the synchronisation required really that heavyweight? |
©2024 cpdn.org