Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested

Message boards : Number crunching : Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested

Previous · 1 · 2 · 3 · 4 · 5 · Next

AuthorMessage
Profile PDW

Joined: 29 Nov 17
Posts: 82
Credit: 14,309,459
RAC: 92,276
Message 71645 - Posted: 16 Oct 2024, 19:10:45 UTC - in response to Message 71644.  

Anyone running Ubuntu 20.04 LTS, or anything based on it or earlier, won't have a recent enough glibc to run it. Ubuntu 22.04 LTS will probably also have a problem, but I don't have an installation of that to test with.

Ran fine on Ubuntu 22.04.4 LTS [5.15.0-119-generic|libc 2.35]
ID: 71645
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2185
Credit: 64,822,615
RAC: 5,275
Message 71646 - Posted: 16 Oct 2024, 20:33:30 UTC - in response to Message 71645.  

Anyone running Ubuntu 20.04 LTS, or anything based on it or earlier, won't have a recent enough glibc to run it. Ubuntu 22.04 LTS will probably also have a problem, but I don't have an installation of that to test with.

Ran fine on Ubuntu 22.04.4 LTS [5.15.0-119-generic|libc 2.35]

Good to know. Thanks.
ID: 71646
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71647 - Posted: 16 Oct 2024, 20:47:04 UTC - in response to Message 71644.  

I tried running it on a Linux Mint 20.3 PC, fully updated, and it gave the following error about the glibc version:

++ ./oifs_43r3_omp_model.exe
./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./oifs_43r3_omp_model.exe)


I run Red Hat Enterprise Linux 8.10 and it has GLIBC 2.28.

glibc-2.28-251.el8_10.5.x86_64
glibc-2.28-251.el8_10.5.i686

So I hope the tasks distributed by the Boinc server will be statically linked.
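A quick way to check whether a host will hit this, assuming the GLIBC_2.33 requirement from the error above (a sketch; `glibc_at_least` is my own helper name, and `ldd --version` output varies slightly between distros):

```shell
# Check whether the installed glibc meets the 2.33 minimum reported by the
# error above. glibc_at_least VERSION succeeds if the running libc is at
# least VERSION.
glibc_at_least() {
    have="$(ldd --version | head -n1 | grep -oE '[0-9]+\.[0-9]+$')"
    # sort -V orders version strings numerically; if the required version
    # sorts first (or ties), the installed glibc is new enough.
    [ "$(printf '%s\n%s\n' "$1" "$have" | sort -V | head -n1)" = "$1" ]
}

if glibc_at_least 2.33; then
    echo "glibc is new enough for the test binary"
else
    echo "glibc is too old - expect the GLIBC_2.33 error"
fi
```

On the RHEL 8 box above this should report "too old", since RHEL 8 ships glibc 2.28.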
ID: 71647
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71648 - Posted: 16 Oct 2024, 21:20:07 UTC - in response to Message 71646.  

Thanks. Yes, I forgot about older systems. This is my test executable built on Ubuntu 22.04. I'll make sure it runs on older systems when we release it.
---
CPDN Visiting Scientist
ID: 71648
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,541,636
RAC: 58,436
Message 71649 - Posted: 16 Oct 2024, 23:29:17 UTC - in response to Message 71648.  

I have a question for newer systems. :-P
I disassembled the oifs_43r3_omp_model.exe test binary and it doesn't have AVX2 or AVX-512 instructions. Meanwhile, the binaries for past jobs (e.g. oifs_43r3_bl_1.13_x86_64-pc-linux-gnu) had both. I suppose that's just different build flags, and the production release can do runtime detection of ISA support and pick the optimal path? It would be great not to get stuck on 2000s ISA features. :-) Not a big deal for the purpose of testing memory usage, but it could mean the runtimes we get are conservative on newer machines.
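A minimal sketch of the kind of check described above, using objdump (`count_simd` is my own helper name, and the mnemonic patterns are illustrative, far from exhaustive):

```shell
# Count SIMD mnemonics of a given flavour in a disassembly read from stdin.
count_simd() {
    # grep -c prints 0 and exits non-zero on no match; || true keeps that 0
    grep -cE "$1" || true
}

# usage (binary name from the thread):
#   objdump -d oifs_43r3_omp_model.exe | count_simd 'vfmadd|vperm2i128|%ymm'  # AVX2-era
#   objdump -d oifs_43r3_omp_model.exe | count_simd '%zmm'                    # AVX-512
```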
ID: 71649
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,541,636
RAC: 58,436
Message 71650 - Posted: 17 Oct 2024, 3:11:21 UTC - in response to Message 71638.  

Finally got around to running it, and it's nice to be able to answer some of my own questions.

The memory pattern looks exactly like other OpenIFS tasks, just bigger. 90% of the time it's below 20 GB, but the other 10% goes up to 25 GB. On my 32 GB mini-PC with a Ryzen 7840HS, I was able to use the computer normally with multiple browser tabs open and playing videos. I feel a 32 GB system should be pretty solid for running one task, at least on native Linux. That's an idle host, though. If one also runs other memory-heavy projects like ATLAS, it's probably game over. The Boinc client can still start the task blissfully unaware of the looming 25 GB spike that's going to OOM the host. :-D

Each run wrote 35-40 GB of data to disk, which averages to around 30 GB per hour with 2 threads; 4 threads gets to ~50 GB written per hour. On the same host, a previous OpenIFS task wrote ~50 GB in 6 hours. So while the disk load is heavier per task, once I account for the memory needed I can only run one task instead of five concurrently, and the overall disk I/O is similar on average. No concerns there.
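One way to confirm the peak memory figure on a live run is to read the kernel's high-water mark for the process; a sketch (the binary name comes from earlier in the thread, `peak_rss_kb` is my own helper name):

```shell
# Print a process's peak resident set size (VmHWM, in kB) from /proc -
# the high-water mark the kernel keeps, which captures short spikes even
# if you aren't watching at the moment they happen.
peak_rss_kb() {
    awk '/^VmHWM:/ {print $2}' "/proc/$1/status"
}

# usage: peak_rss_kb "$(pgrep -f oifs_43r3_omp_model.exe)"
```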
ID: 71650
klepel

Joined: 9 Oct 04
Posts: 82
Credit: 69,916,905
RAC: 9,226
Message 71651 - Posted: 17 Oct 2024, 6:55:42 UTC

I do not have any problem running OpenIFS according to your specifications. I can bring on 4 real Linux computers with 32 GB RAM or more and at least 10 GB of disc space, plus a virtual one.
ID: 71651
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,780,446
RAC: 19,423
Message 71653 - Posted: 17 Oct 2024, 9:09:25 UTC - in response to Message 71636.  

Glenn, have you had a chance to test this multi-core OIFS app on WSL2? I think the only multi-core BOINC app I've run on WSL2 is LHC ATLAS, and I've never been able to get it to run multi-core and had to resort to running it single-core.
Did you set e.g. 'processors=4' in the .wslconfig? I've not tried it. I'll let you know.

Tried the demo app with 2 and 4 cores on WSL2 Ubuntu 22.04 and both worked. The 4-core (thread) setup took 1.5 hours; the rest of the threads were filled with Rosetta.

Is it as important to dedicate whole cores for this app as it is with W@H?
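For reference, the 'processors' setting mentioned above lives in %UserProfile%\.wslconfig on the Windows side, and takes effect after a `wsl --shutdown`. Values here are illustrative, not a recommendation:

```ini
# %UserProfile%\.wslconfig - resource limits for the WSL2 VM (illustrative)
[wsl2]
processors=4
memory=28GB
```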
ID: 71653
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71654 - Posted: 17 Oct 2024, 9:57:13 UTC - in response to Message 71649.  

I dissembled oifs_43r3_omp_model.exe test binary and it doesn't have avx2 or avx512 instructions. Meanwhile, the binaries for past jobs (e.g. oifs_43r3_bl_1.13_x86_64-pc-linux-gnu) had both. I suppose that's just different build flags, where the production release can do runtime detection of ISA support and pick the optimal path?
That is interesting. I deliberately do not build with AVX because I have not tested its impact on the model results. I've read articles saying that enabling it can have a detrimental impact on results from atmospheric codes. I didn't build the BL variant - I will check that! We worry about reproducibility with these codes more than outright speed.
---
CPDN Visiting Scientist
ID: 71654
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71655 - Posted: 17 Oct 2024, 10:08:18 UTC - in response to Message 71650.  

I feel 32G system should be pretty solid to run one task, at least on native Linux. That's an idle host though. If one also runs other memory heavy projects like ATLAS, it's probably game over. Boinc client can still start the task blissfully unaware of the looming 25G spike that's going to OOM the host. :-D
This is a point I was going to get to. You might recall we had issues with the Boinc client starting more OpenIFS tasks than the machine had memory for. The problem was that the client effectively ignored the memory bound we set in the task's XML and instead relied on watching early memory usage.

We reported this issue and David Anderson has now fixed the bug. Andy@CPDN has tested that it works. It will be rolled out with client 8.0.4 and we'll send out a note encouraging people to upgrade. That will help with the lower-resolution OpenIFS tasks and might(?) help with running mixed projects. The fix also needs server-side changes and has to be enabled in the XML, so probably only projects with large-memory tasks will use it.
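For context, the per-task bound the client should honor is the standard `rsc_memory_bound` field (in bytes) of a BOINC workunit; a 25 GiB bound would look something like this (the value is illustrative, not what CPDN actually sets):

```xml
<rsc_memory_bound>26843545600</rsc_memory_bound>
```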

And thanks for all the feedback; it makes me more confident this won't cause problems.
---
CPDN Visiting Scientist
ID: 71655
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71656 - Posted: 17 Oct 2024, 10:11:12 UTC - in response to Message 71653.  

Is it as important to dedicate whole cores for this app as it is with W@H?
Definitely. Any code doing lots of floating-point work runs better on a real core, not an SMT thread.
ID: 71656
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,690,861
RAC: 10,559
Message 71657 - Posted: 17 Oct 2024, 14:25:05 UTC

Some timings from my machines:

Intel i5-14400, Linux Mint 22 - 4 threads 36 minutes, 2 threads 68 minutes
Intel i5-9600KF, Linux Mint 21 - 4 threads 59 minutes, 2 threads 140 minutes
Linux Mint 20 - failed to run because of the GLIBC version mismatch problem.
ID: 71657
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71658 - Posted: 17 Oct 2024, 15:38:20 UTC - in response to Message 71657.  
Last modified: 17 Oct 2024, 15:39:41 UTC

Some timings from my machines:
Intel i5-14400, Linux Mint 22 - 4 threads 36 minutes, 2 threads 68 minutes
Intel i5-9600KF, Linux Mint 21 - 4 threads 59 minutes, 2 threads 140 minutes
Linux Mint 20 - failed to run because of the GLIBC version mismatch problem.
As the demo model is configured for 2 simulated days and the batches will run for ~6 simulated months, multiply the runtimes by 90, which gives:

i5-14400 : 4 cores, 54 hrs ; 2 cores, 102 hrs (4.25 days)
i5-9600 : 4 cores, 88 hrs ; 2 cores, 210 hrs (8.75 days)
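The scaling is just the 2-day demo runtime in minutes × 90, converted to hours; as a sketch (`scale_to_batch` is my own helper name):

```shell
# Scale a 2-day demo runtime (minutes) up to a ~6-month batch (x90),
# reporting whole hours - reproducing the figures above.
scale_to_batch() {
    echo $(( $1 * 90 / 60 ))
}

scale_to_batch 36    # i5-14400, 4 threads -> 54
scale_to_batch 68    # i5-14400, 2 threads -> 102
scale_to_batch 59    # i5-9600KF, 4 threads -> 88 (88.5 truncated)
scale_to_batch 140   # i5-9600KF, 2 threads -> 210
```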

Make the default 4 cores instead of 2 for this configuration, then? Especially if we stick to 1 task per host.
---
CPDN Visiting Scientist
ID: 71658
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,780,446
RAC: 19,423
Message 71659 - Posted: 17 Oct 2024, 17:27:05 UTC - in response to Message 71658.  

Make the default 4 cores then instead of 2 for this configuration? Especially if we stick to 1 task per host.

I'd say at the very least make the core count user-configurable from the get-go, regardless of which default you choose.
ID: 71659
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 71660 - Posted: 17 Oct 2024, 18:13:50 UTC - in response to Message 71659.  

I'd say at the very least make the core count user configurable from the get go, regardless of which default you'll choose.
I'd prefer not to. From previous experience, it's easier to debug remote problems when the setup is the same everywhere. Options can come later once we are sure it all works satisfactorily.
---
CPDN Visiting Scientist
ID: 71660
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71661 - Posted: 17 Oct 2024, 19:08:39 UTC - in response to Message 71655.  

We reported this issue and David Anderson has now fixed that bug. Andy@CPDN has tested that it works. It will be rolled out with client 8.0.4 and we'll send out a note encouraging people to upgrade.


Well, my RHEL8 system is running the 7.20.2 Boinc client. I checked RHEL9 and it runs the same Boinc client.
ID: 71661
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,541,636
RAC: 58,436
Message 71662 - Posted: 18 Oct 2024, 3:06:25 UTC - in response to Message 71661.  

Well, my RHEL8 system is running 7.20.2 Boinc Client. I examined RHEL9 and it runs the same Boinc Client.

It's not even officially released yet, and it would likely take another major release for non-rolling distros to pick up the new version after the official release. RHEL might be further behind.

If we're encouraging people to upgrade, the opt-in page would be a nice place to put all this info. Hopefully 8.0.4 will have been officially released by then. Otherwise, advising people to install an alpha or beta release would be a bit weird, especially when CPDN isn't responsible for any potential bugs there.
ID: 71662
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 71663 - Posted: 18 Oct 2024, 4:58:57 UTC

Otherwise advising people to install an alpha or beta release would be a bit weird, especially when CPDN isn't responsible for any of the potential bugs there.


Agreed. Though my Linux installation is currently on 8.1.0.
ID: 71663
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,541,636
RAC: 58,436
Message 71664 - Posted: 18 Oct 2024, 19:36:57 UTC

With this benchmark, I guess I can finally get cleaner data for my previous analysis about affinity and THP.

TLDR:
* CPU affinity doesn't matter. The timings are the same, so I didn't bother putting them into a table.
* THP has major benefits, at least on Zen 4.
* This higher resolution model scales up to 4 cores pretty nicely.

Detailed data below. All timings are in minutes. No thread affinity was applied for these runs, so it's the OS doing the scheduling.

AMD Ryzen 7 7840HS, 8C/16T, 45W TDP, 2x16G DDR5-5600, Ubuntu 24.04

4C/4T, THP vs no THP, idle host

Config      Run 1    Run 2    Run 3    Average    Speed Up
--------  -------  -------  -------  ---------  ----------
THP off     40.09    40.04    40.05      40.06        1.00
THP on      34.98    34.99    35.05      35.01        1.14

I also watched huge-page stats a few times during the runs. THP covers at least 90% of the OpenIFS memory usage and follows its usage pattern, so it looks pretty effective.

Thread scaling, idle host, THP on

Config      Run 1    Run 2    Run 3    Average    Speed Up
--------  -------  -------  -------  ---------  ----------
1C1T       109.00   108.50   107.90     108.47        1.00
2C2T        59.52    59.73    59.74      59.66        1.82
4C4T        34.83    35.00    35.01      34.95        3.10
8C8T        25.32    25.26    25.21      25.26        4.29

Thread scaling, busy host, THP on

Config      Run 1    Run 2    Run 3    Average    Speed Up
--------  -------  -------  -------  ---------  ----------
1C1T       133.10   132.80   132.73     132.88        1.00
2C2T        69.33    69.37    69.35      69.35        1.92
4C4T        38.23    38.33    38.26      38.27        3.47
8C8T        25.32    25.26    25.21      25.26        5.26

To keep the host busy, I used SiDock@Home to fill the remaining cores, one task per core, so the host always had 8 threads running at 100%.

Compared to Glenn's previous data, the higher-resolution model scales quite a bit better. If all cores are busy anyway, I only lose half a core's worth of compute at 4 threads. What's interesting is that in Glenn's data a busy host scales worse, while mine scales better, even though the actual runtimes are all longer. I suspect my idle-host scaling looks worse because of the much higher boost frequency when fewer cores are active; the gap between single-core and all-core boost is usually much larger on power-limited mobile chips, which could explain it. I didn't take frequency readings, though.
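The average and speed-up columns above can be recomputed from the raw runs like so (a sketch; `speedup` is my own helper name, and the baseline is the THP-off or 1C1T average from the relevant table):

```shell
# Recompute an average of three runs (minutes) and the speed-up relative
# to a baseline average, matching the rounding used in the tables above.
speedup() {  # args: baseline_avg run1 run2 run3
    awk -v base="$1" -v a="$2" -v b="$3" -v c="$4" \
        'BEGIN { avg = (a + b + c) / 3; printf "%.2f %.2f\n", avg, base / avg }'
}

speedup 40.06 34.98 34.99 35.05    # THP on vs THP off -> 35.01 1.14
speedup 108.47 34.83 35.00 35.01   # 4C4T vs 1C1T, idle -> 34.95 3.10
```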
ID: 71664
Vato

Joined: 4 Oct 19
Posts: 15
Credit: 9,174,915
RAC: 3,722
Message 71665 - Posted: 18 Oct 2024, 20:52:55 UTC - in response to Message 71638.  

If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS' but I suggest keeping to 4 or less as the parallel efficiency drops off markedly beyond 4 threads.


Hi Glenn - if the WU is using virtually all the memory on a machine, why would we worry about the efficiency dropping off? From my PoV, giving the WU all the cores gives the best overall throughput in this case: the extra cores running at only (say) 20% efficiency still mean more work done per unit time. Or is the synchronisation required really that heavyweight?
ID: 71665


©2024 cpdn.org