Message boards : Number crunching : Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested
Author | Message |
---|---|
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,816,935 RAC: 19,934 |
Glenn, do you think these will come out before the upcoming Hadley ones or the other way around? Just curious. |
Send message Joined: 5 Aug 04 Posts: 126 Credit: 24,426,020 RAC: 23,705 |
4 GB checkpoint shouldn't be a problem, as long as it's not trying to re-write 4 GB every 10 seconds or something similarly ridiculous. As for 25 GB peak memory usage, first problem is, is it possible to virtualize Linux on top of Windows if you've only got 32 GB memory, or do you need at least 48 GB? Second problem is, for everyone like me that is not a sudo masochist, with the either confusing or outdated documentation just getting BOINC itself up-and-running is complicated. Add to the mix, is it still required to screw-around with installing 32-bit libraries? For running these large OpenIFS models it really boils down to: 1: Is 32 GB physical memory enough, or should I hope for some good Black Friday memory deals? 2: Easy to follow instructions to get enough virtualized Linux up-and-running. 3: Easy to follow instructions to get BOINC + CPDN up-and-running and a method to monitor progress. Point 3 does not include reading the ... 386 current posts in the "Running 32bit CPDN from 64bit Linux - Discussion" thread. |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
"Second problem is, for everyone like me that is not a sudo masochist, with the either confusing or outdated documentation just getting BOINC itself up-and-running is complicated. Add to the mix, is it still required to screw-around with installing 32-bit libraries?" Not for these tasks, but it is for the Hadley models such as HadAM4 etc. "3: Easy to follow instructions to get BOINC + CPDN up-and-running and a method to monitor progress." I have found that the latest instructions, at least for my distribution, worked well when copied and pasted into a terminal. I get, though, that it still isn't as simple as Windows if you want the most up-to-date version. Because the old Met Office based models are still in use, when I do a new installation of BOINC I still refer to the instructions in that thread, or the other one that just contains the commands for different distributions rather than any discussion. (Despite having written the post myself!) |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Thanks for the positive vibes and the discussion. Very helpful. Answering all the points raised: "Don't you create a new checkpoint file before deleting the old checkpoint file?" Correct. I'm glad someone is paying attention! Yes, for a brief time there are two sets of checkpoint files; the old ones are deleted once the new ones are written. So peak disk usage is ~8Gb not 4Gb as I originally said. "... as long as it's not trying to re-write 4 GB every 10 seconds or something similarly ridiculous." No it won't. It would slow the model down too much for a start. I'll do some testing but aim for around 1hr computing per checkpoint. "is it possible to virtualize Linux on top of Windows if you've only got 32 GB memory" 32Gb is enough for OpenIFS@60 on Linux. For Windows+Linux my vote would be to use WSL. I found it easier & faster than something like VBox. But managing the memory between Windows & Linux has to be done carefully. If the model is peaking at 24Gb that would mean a 26Gb limit for WSL which doesn't leave enough for Windows? "installing 32-bit libraries?" OpenIFS is 64bit and doesn't need them. "these will come out before the upcoming Hadley ones or the other way around?" The HadAM4 batch(es) will come first as I'm working on HadAM4 now, then OpenIFS. "I'd also like the ability to use more cores. In addition, perhaps allow users to choose to run 2-3 tasks at a time, if they have the RAM" Yes, me too. However, to start with we'll stick with 2 cores & 1 task per host to see how everyone finds it. Later on we can look at relaxing things; maybe using app_config.xml to override the default # cores. "The only problem I had with OIFS models in the past was the internet bandwidth for uploading." Agreed. This is part of the rationale for restricting the number of tasks per host. CPDN is well aware not everyone has high speed internet access. --- CPDN Visiting Scientist |
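The point about two checkpoint sets briefly coexisting can be turned into a quick free-space check. The following is only a sketch: the function names are made up for illustration, the 4Gb checkpoint size is the ball-park figure from the post, and the directory you pass in is whatever holds your BOINC data.

```shell
# Sketch only: helper names are hypothetical, not part of BOINC or OpenIFS.

# Print the free space, in whole GiB, of the filesystem holding directory $1.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

# Peak checkpoint footprint in GB: the old and new sets exist together
# for a short time, so it is twice the size of one checkpoint ($1).
peak_checkpoint_gb() {
    echo $(( 2 * $1 ))
}

# Succeed if directory $1 has room for the peak footprint of a
# checkpoint of $2 GB.
has_checkpoint_room() {
    [ "$(free_gb "$1")" -ge "$(peak_checkpoint_gb "$2")" ]
}
```

For example, `has_checkpoint_room /var/lib/boinc 4 || echo "less than 8 GB free"` would warn before a task starts. It ignores model output waiting to upload, so treat it as a lower bound.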
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Credit. What's the thinking on granting credit for these more resource hungry tasks? My thoughts are: (a) volunteers should be suitably rewarded for additional resources donated; (b) take the credit granted for the OpenIFS tasks run so far (the 125km grid resolution) as a base, work out a scaling factor which also takes into account extra memory & disk required, as well as the additional computation? Though it's not clear exactly how to do this in the 'spirit' of boinc credit. Thoughts? --- CPDN Visiting Scientist |
Send message Joined: 5 Aug 04 Posts: 126 Credit: 24,426,020 RAC: 23,705 |
"OpenIFS is 64bit and doesn't need them." "The HadAM4 batch(es) will come first as I'm working on HadAM4 now, then OpenIFS." Taking these two quotes together, and coupled with CPDN not having any preferences for blocking HadAM4 work, chances are you'll need the 32-bit libraries to avoid trashing any unexpected HadAM4 resends. |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,662,895 RAC: 61,039 |
Thanks for making it opt-in. It's actually very exciting to have some more demanding workloads IMO. My suggestion is that, whatever mechanism we use for the opt-in, the requirements are made clear at the point of opt-in. "However, to start with we'll stick with 2 cores & 1 task per host to see how everyone finds it. Later on we can look at relaxing things; maybe using app_config.xml to override the default # cores." I'm curious if you have an estimate of how many hosts would be eligible. Though it's not my problem to worry about, I feel a large portion of hosts would be excluded, which could make the research progress really slow. It would be helpful to allow big hosts to run more than one task if the number of eligible hosts is small. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I'm curious if you have an estimate of how many hosts would be eligible." Yes, we checked the database. There are ~600 Linux hosts with 32+ GB RAM. Enough to make it workable. --- CPDN Visiting Scientist |
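To see whether a particular Linux host falls into that 32+ GB group, the total memory can be read from `/proc/meminfo`. This is a minimal sketch with hypothetical function names. The default threshold of 31 GiB is my assumption, chosen because a machine sold as "32 GB" normally reports slightly less than 32 GiB as `MemTotal`.

```shell
# Sketch only: function names and the 31 GiB default are assumptions.

# Print MemTotal in whole GiB from a meminfo-style file
# (defaults to /proc/meminfo when no file is given).
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# Succeed if the host reports at least $1 GiB (default 31).
# $2 optionally names a meminfo-style file to read instead.
meets_ram_requirement() {
    [ "$(mem_total_gib "$2")" -ge "${1:-31}" ]
}
```

Running `meets_ram_requirement || echo "not enough RAM for OpenIFS@60"` gives a yes/no answer. It says nothing about how much of that memory other tasks are already using.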
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,662,895 RAC: 61,039 |
"Peak space requirements perhaps when the uploads are not working, then the number of model output starts increasing but there's still only ever 1 checkpoint. Back of envelope, let's say ~20 uploads of model output waiting to transfer, roughly 1.75Gb extra on top of the checkpoint 4Gb? It's not huge and if I add compression I should be able to get the checkpoint file down to less than 3Gb. Wear & tear on the storage might be a concern? (personally it's not but I want to raise it)." I'm curious if you have a similar calculation for previous OpenIFS tasks? Back then, I monitored on multiple hosts and they all turned out to be around 50GB host write to SSD per WU for previous OpenIFS tasks. If that doesn't match your calculation for previous tasks, we may have some unaccounted writes. Your calculation seems to indicate the overall writes would actually be smaller, likely due to the much longer checkpoint interval, even though each checkpoint is bigger. I doubt the SSD wear would be a concern unless we have a whole lot of WU to run for years. I'm more interested in the IO burst that might be a problem for HDD and cloud, especially if we start allowing multiple WUs per host. They'd better not time out the writes... |
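The back-of-envelope figure quoted here can be reproduced with a little arithmetic. The sketch below uses a hypothetical helper. The per-upload size of 0.0875Gb is not stated anywhere in the thread; it is simply 1.75Gb divided by 20 uploads.

```shell
# Sketch only: peak_disk_gb is a hypothetical helper for the estimate.
# $1 = size of one checkpoint set (Gb)
# $2 = number of checkpoint sets on disk at once
# $3 = number of model-output uploads waiting to transfer
# $4 = size of one upload (Gb); 0.0875 is derived from 1.75 / 20
peak_disk_gb() {
    awk -v c="$1" -v s="$2" -v n="$3" -v u="$4" \
        'BEGIN { printf "%.2f\n", c * s + n * u }'
}

peak_disk_gb 4 1 20 0.0875   # one checkpoint set plus 20 queued uploads
peak_disk_gb 4 2 20 0.0875   # while old and new checkpoint sets coexist
```

The second call reflects the later clarification in this thread that two checkpoint sets exist briefly while a new one is written.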
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I doubt the SSD wear would be a concern unless we have a whole lot of WU to run for years. I'm more interested in the IO burst that might be a problem for HDD and cloud," I have an HDD partition reserved for BOINC. The drive has two other partitions: one for videos and one for audio files. These are seldom used, so the main seeks on that drive are short, within the BOINC partition. At the moment, with no CPDN tasks running, but six WCG, four Rosetta, and two Einstein running, the partition shows 524 GB, 515 GB free (1.8% full). df output: Filesystem 1K-blocks Used Available Use% Mounted on; /dev/sda3 511750000 9097804 502652196 2% /var/lib/boinc. I do not remember noticing problems when running OpenIFS tasks in the past. And, IIRC, I sometimes ran four at a time while running other projects too. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"You will definitely notice the impact of this configuration running on your machine if you are using it." Let us say I am running one of those. And it is using 4 cores @ 98%. Why would that be any different from running 4 WCG tasks, for example? It would use only 4 cores, leaving me with 12 others for whatever. |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
"Let us say I am running one of those. And it is using 4 cores @ 98%. Why would that be any different from running 4 WCG tasks, for example? It would use only 4 cores, leaving me with 12 others for whatever." I doubt there will be any problem using most of the spare cores for WCG. However, should they get the ARP tasks up and running, the way the cache memory gets hammered means they might get slowed down a bit. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,816,935 RAC: 19,934 |
"is it possible to virtualize Linux on top of Windows if you've only got 32 GB memory" "32Gb is enough for OpenIFS@60 on Linux. For Windows+Linux my vote would be to use WSL. I found it easier & faster than something like VBox. But managing the memory between Windows & Linux has to be done carefully. If the model is peaking at 24Gb that would mean a 26Gb limit for WSL which doesn't leave enough for Windows?" Agreed. WSL2 is the least resource demanding way of running Linux on Windows. I believe recent versions of WSL2 have memory reclamation feature. So even if you configure WSL2 to run with 26Gb of RAM, it won't hold on to it the entire time it's on and will release unused RAM to Windows. So I don't think management has to be that careful. Glenn, have you had a chance to test this multi-core OIFS app on WSL2? I think the only multi-core BOINC app I've run on WSL2 is LHC Atlas and I've never been able to get it to run multi-core and had to resort to running it single core. I don't know if it's an Atlas issue or not but it's something that came to mind as I don't have a Linux PC and use WSL2 for all BOINC Linux work. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I believe recent versions of WSL2 have memory reclamation feature. So even if you configure WSL2 to run with 26Gb of RAM, it won't hold on to it the entire time it's on and will release unused RAM to Windows. So I don't think management has to be that careful." That's true, it does, but I've crashed things before by not paying attention to how much memory WSL2 & Windows apps were using, as I tend to oversubscribe the WSL2 memory. The model is going to hit peak memory every timestep so best to assume WSL2 needs 26+Gb of RAM continually if running this app. "Glenn, have you had a chance to test this multi-core OIFS app on WSL2? I think the only multi-core BOINC app I've run on WSL2 is LHC Atlas and I've never been able to get it to run multi-core and had to resort to running it single core." Did you set e.g. 'processors=4' in the .wslconfig? I've not tried it. I'll let you know. "Let us say I am running one of those. And it is using 4 cores @ 98%. Why would that be any different from running 4 WCG tasks, for example? It would use only 4 cores, leaving me with 12 others for whatever." The issue is not just cores, it's memory. WCG tasks use little memory. This configuration of OpenIFS will be actively using 26Gb, that's a lot of data to move around if other apps need it. It's also important that the option 'leave non-GPU tasks in memory while suspended' is selected for these tasks otherwise the model will be forced to restart from checkpoint files any time it gets suspended. "I'm curious if you have a similar calculation for previous OpenIFS tasks? Back then, I monitored on multiple hosts and they all turned out to be around 50GB host write to SSD per WU for previous OpenIFS tasks." No, not without running one to see. I can only give ball-park numbers at the moment as we haven't worked out how much data output the scientist wants and the optimum checkpoint frequency. "I'm more interested in the IO burst that might be a problem for HDD and cloud.." Me too. That depends on individual hardware so I can't comment much. --- CPDN Visiting Scientist |
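For anyone following the WSL2 suggestion, the memory cap and processor count both go in the `.wslconfig` file in the Windows user profile folder, under a `[wsl2]` section. The sketch below just prints such a file. The helper name is hypothetical, and the values 26 and 4 are the ones mentioned in this thread rather than tested recommendations.

```shell
# Sketch only: make_wslconfig is a hypothetical helper that prints the
# contents of a .wslconfig file.
# $1 = memory limit for WSL2 in GB, $2 = number of processors
make_wslconfig() {
    printf '[wsl2]\nmemory=%sGB\nprocessors=%s\n' "$1" "$2"
}

make_wslconfig 26 4
```

The printed text would be saved as `.wslconfig` on the Windows side. WSL only picks up changes after it has been fully stopped, for example with `wsl --shutdown` from a Windows prompt.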
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Standalone Linux demo to try OpenIFS@60. I've created a standalone demo of the 60km configuration of OpenIFS which anyone can download and run for themselves. It does not run under BOINC; it's just the model process part of what the client would normally download. I thought it might be useful if people want to see what it does. You need to be familiar with running shell scripts under Linux. Do not attempt to run this on a machine with less than 32Gb memory. This application will require a peak memory of ~25Gb. The link is: https://www.dropbox.com/scl/fi/xl7dcw0yqo1o159leemgk/oifs_demo.tgz?rlkey=xu6reo085bll8n2h0q1uwubvi&st=alr7zr65&dl=0 (please copy & paste into browser) 1/ Download to a folder where you intend to run it. 2/ Unpack the file with: tar xf oifs_demo.tgz (this will create a directory 'oifs319_demo'; the unpacked filesize is 650Mb). 3/ Change into the directory. 4/ The model is configured to run with 2 threads. To run the model just do: ./run_oifs (this will generate model output to the screen). You could use 'top' to verify the cpu percent usage is 200%. To kill the model use CTRL-C. If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS' but I suggest keeping to 4 or fewer as the parallel efficiency drops off markedly beyond 4 threads. Note the threaded application is not statically linked and requires some dynamically loaded libraries. The demo is configured to run for 2 model days and output checkpoint files every 12 model hours. To clean up the model output and go back to the initial files, run the script: ./clean If anyone has problems getting this to run, please let me know. Hope this is useful. I will keep this there for a week or so. The file is 330Mb. --- CPDN Visiting Scientist |
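Changing the thread count by hand is easy enough, but it can also be scripted. The following is a sketch under a big assumption: that 'run_oifs' contains a line of the form `OMP_NUM_THREADS=<number>` (for example `export OMP_NUM_THREADS=2`). I have not seen the actual script, so check it before relying on this. The function name is made up.

```shell
# Sketch only: assumes run_oifs sets the thread count on a line that
# looks like "OMP_NUM_THREADS=2"; the real script may differ.
# $1 = path to the script, $2 = new thread count
set_omp_threads() {
    script=$1
    threads=$2
    # Reject anything that is not a plain positive integer string.
    case $threads in
        ''|*[!0-9]*) echo "thread count must be a number" >&2; return 1 ;;
    esac
    if [ "$threads" -gt 4 ]; then
        echo "warning: parallel efficiency drops off beyond 4 threads" >&2
    fi
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temp file, then copy it back over the
    # original so the script keeps its executable permission.
    sed "s/OMP_NUM_THREADS=[0-9][0-9]*/OMP_NUM_THREADS=$threads/" "$script" > "$tmp" &&
        cat "$tmp" > "$script"
    status=$?
    rm -f "$tmp"
    return $status
}
```

From inside the unpacked directory, `set_omp_threads run_oifs 2` followed by `grep OMP_NUM_THREADS run_oifs` shows what the script will actually use before you start it.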
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
"If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS'" Top varies between 350 and 400 so far. I checked the line and it is 4 in the downloaded file. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS'" "Top varies between 350 and 400 so far. I checked the line and it is 4 in the downloaded file." Ok, my mistake. I forgot to reset it to '2'. |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
Takes about 35 minutes to complete on my Ryzen9. Edit: Ubuntu 24.04. No BOINC tasks running at the same time, 64GB RAM |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"The issue is not just cores, it's memory. WCG tasks use little memory. This configuration of OpenIFS will be actively using 26Gb, that's a lot of data to move around if other apps need it. It's also important that the option 'leave non-GPU tasks in memory while suspended' is selected for these tasks otherwise the model will be forced to restart from checkpoint files any time it gets sus" For me, on my Linux machine, my biggest tasks are the Rosetta ones: 2.2 to 2.5 GBytes of RAM required. I have no CPDN tasks running. The WCG ones are all MCM1. Output from top: top - 13:17:53 up 13 days, 1:27, 2 users, load average: 12.33, 12.56, 13.18 Tasks: 486 total, 13 running, 473 sleeping, 0 stopped, 0 zombie %Cpu(s): 1.1 us, 0.4 sy, 74.6 ni, 23.6 id, 0.0 wa, 0.2 hi, 0.1 si, 0.0 st MiB Mem : 128086.0 total, 2546.9 free, 15399.7 used, 110139.4 buff/cache MiB Swap: 15992.0 total, 15700.5 free, 291.5 used. 109343.6 avail Mem PID PPID USER PR NI S RES %MEM %CPU P TIME+ COMMAND 2516393 2086 boinc 39 19 R 2.5g 2.0 99.1 3 383:27.90 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+ 2527376 2086 boinc 39 19 R 2.4g 2.0 99.3 12 291:01.71 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+ 2530806 2086 boinc 39 19 R 2.4g 1.9 99.2 0 261:36.43 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+ 2582689 2086 boinc 39 19 R 2.2g 1.8 99.0 9 65:17.35 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+ ... 2584661 2086 boinc 39 19 R 39824 0.0 99.4 2 61:48.06 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ ... |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
On a 5800X3D running Ubuntu 24.04, the 4 core config ran in 40 minutes, and the 2 core config in 72 minutes. I tried running it on a Linux Mint 20.3 PC, fully updated, and it gave the following errors about glibc versions: ++ ./oifs_43r3_omp_model.exe ./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./oifs_43r3_omp_model.exe) ./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by ./oifs_43r3_omp_model.exe) ./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./oifs_43r3_omp_model.exe) The glibc version on this PC is 2.31. Anyone running Ubuntu 20.04 LTS, or anything based on it or earlier, won't have the proper glibc versions to run it. Ubuntu 22.04 LTS will probably also have a problem, but I don't have an installation of that to test it. |
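A quick way to check a machine before downloading the demo is to compare its glibc version against the highest one named in these error messages (2.34). This is a sketch with hypothetical function names. It assumes GNU `sort -V` is available and that the system uses glibc, since `getconf GNU_LIBC_VERSION` is glibc-specific.

```shell
# Sketch only: helper names are made up; 2.34 is taken from the error
# messages above and may not be the binary's full requirement.

# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the installed glibc version number, e.g. "2.31".
glibc_version() {
    getconf GNU_LIBC_VERSION | awk '{ print $2 }'
}
```

Usage would be something like `version_ge "$(glibc_version)" 2.34 || echo "glibc too old for the demo binary"`. For what it's worth, Ubuntu 22.04 ships glibc 2.35, so going by these messages alone it may well pass, though that is untested here.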
©2024 cpdn.org