Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested

Author	Message
AndreyOR Joined: 12 Apr 21 Posts: 317 Credit: 14,791,235 RAC: 19,552
Message 71616 - Posted: 15 Oct 2024, 7:58:32 UTC
Glenn, do you think these will come out before the upcoming Hadley ones or the other way around? Just curious.

Ingleside Joined: 5 Aug 04 Posts: 126 Credit: 24,398,685 RAC: 23,843
Message 71619 - Posted: 15 Oct 2024, 14:06:26 UTC
A 4 GB checkpoint shouldn't be a problem, as long as it's not trying to re-write 4 GB every 10 seconds or something similarly ridiculous.
As for 25 GB peak memory usage, the first problem is: is it possible to virtualize Linux on top of Windows if you've only got 32 GB memory, or do you need at least 48 GB?
Second problem is, for everyone like me that is not a sudo masochist, with the either confusing or outdated documentation just getting BOINC itself up and running is complicated. Add to the mix, is it still required to screw around with installing 32-bit libraries?
For running these large OpenIFS models it really boils down to:
1: Is 32 GB physical memory enough, or should I hope for some good Black Friday memory deals?
2: Easy-to-follow instructions to get enough virtualized Linux up and running.
3: Easy-to-follow instructions to get BOINC + CPDN up and running and a method to monitor progress.
Point 3 does not include reading the ... 386 current posts in the "Running 32bit CPDN from 64bit Linux - Discussion" thread.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,979,167 RAC: 21,830
Message 71621 - Posted: 15 Oct 2024, 16:44:53 UTC
"Second problem is, for everyone like me that is not a sudo masochist, with the either confusing or outdated documentation just getting BOINC itself up and running is complicated. Add to the mix, is it still required to screw around with installing 32-bit libraries?"
Not for these tasks, but it is for the Hadley models such as HadAM4 etc.
"3: Easy-to-follow instructions to get BOINC + CPDN up and running and a method to monitor progress."
I have found the latest instructions, at least for my distribution, worked well when copied and pasted into a terminal. I get, though, that it still isn't as simple as Windows if you want the most up-to-date version. Because the old Met Office based models are still in use, when I do a new installation of BOINC I still refer to the instructions in that thread, or the other one that just contains the commands for different distributions rather than any discussion. (Despite having written the post myself!)

Glenn Carver Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403
Message 71623 - Posted: 15 Oct 2024, 20:03:58 UTC
Thanks for the positive vibes and the discussion. Very helpful. Answering all the points raised:
"Don't you create a new checkpoint file before deleting the old checkpoint file?"
Correct. I'm glad someone is paying attention! Yes, for a brief time there are two sets of checkpoint files; the old ones are deleted once the new ones are written. So peak disk usage is ~8GB, not 4GB as I originally said.
"... as long as it's not trying to re-write 4 GB every 10 seconds or something similarly ridiculous."
No, it won't. It would slow the model down too much for a start. I'll do some testing but aim for around 1hr of computing per checkpoint.
"is it possible to virtualize Linux on top of Windows if you've only got 32 GB memory & Is 32 GB physical memory enough"
32GB is enough for OpenIFS@60 on Linux. For Windows+Linux my vote would be to use WSL. I found it easier & faster than something like VBox. But managing the memory between Windows & Linux has to be done carefully. If the model is peaking at 24GB, that would mean a 26GB limit for WSL, which doesn't leave enough for Windows?
"installing 32-bit libraries?"
OpenIFS is 64bit and doesn't need them.
"these will come out before the upcoming Hadley ones or the other way around?"
The HadAM4 batch(es) will come first as I'm working on HadAM4 now, then OpenIFS.
"I'd also like the ability to use more cores. In addition, perhaps allow users to choose to run 2-3 tasks at a time, if they have the RAM"
Yes, me too. However, to start with we'll stick with 2 cores & 1 task per host to see how everyone finds it. Later on we can look at relaxing things; maybe using app_config.xml to override the default # cores.
"The only problem I had with OIFS models in the past was the internet bandwidth for uploading."
Agreed. This is part of the rationale for restricting the number of tasks per host. CPDN is well aware not everyone has high speed internet access.
--- CPDN Visiting Scientist

Glenn Carver Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403
Message 71624 - Posted: 15 Oct 2024, 20:10:34 UTC
Credit. What's the thinking on granting credit for these more resource hungry tasks? My thoughts are:
(a) volunteers should be suitably rewarded for additional resources donated;
(b) take the credit granted for the OpenIFS tasks run so far (the 125km grid resolution) as a base, work out a scaling factor which also takes into account extra memory & disk required, as well as the additional computation?
Though it's not clear exactly how to do this in the 'spirit' of BOINC credit. Thoughts?
--- CPDN Visiting Scientist

Ingleside Joined: 5 Aug 04 Posts: 126 Credit: 24,398,685 RAC: 23,843
Message 71625 - Posted: 15 Oct 2024, 20:38:45 UTC - in response to Message 71623.
"OpenIFS is 64bit and doesn't need them."
"The HadAM4 batch(es) will come first as I'm working on HadAM4 now, then OpenIFS."
Taking these two quotes together, and given that CPDN has no preference for blocking HadAM4 work, chances are you'll still need the 32-bit libraries so as not to trash any unexpected HadAM4 resends.

wujj123456 Joined: 14 Sep 08 Posts: 127 Credit: 41,574,898 RAC: 59,036
Message 71626 - Posted: 15 Oct 2024, 21:28:03 UTC
Thanks for making it opt-in. It's actually very exciting to have some more demanding workloads IMO. My suggestion is that, whatever mechanism we use for the opt-in, the requirements are made clear at the point of opt-in.
"However, to start with we'll stick with 2 cores & 1 task per host to see how everyone finds it. Later on we can look at relaxing things; maybe using app_config.xml to override the default # cores."
I'm curious if you have an estimate of how many hosts would be eligible. Though it's not my problem to worry about, I feel a large portion of hosts would be excluded, which could make the research progress really slow. It would be helpful to allow big hosts to run more than one task if the number of eligible hosts is small.

Glenn Carver Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403
Message 71627 - Posted: 15 Oct 2024, 22:15:43 UTC - in response to Message 71626.
"I'm curious if you have estimate of how many hosts would be eligible."
Yes, we checked the database. There are ~600 Linux hosts with 32+ GB RAM. Enough to make it workable.
--- CPDN Visiting Scientist

wujj123456 Joined: 14 Sep 08 Posts: 127 Credit: 41,574,898 RAC: 59,036
Message 71628 - Posted: 15 Oct 2024, 23:45:11 UTC - in response to Message 71607. Last modified: 15 Oct 2024, 23:45:46 UTC
"Peak space requirements perhaps when the uploads are not working, then the number of model output starts increasing but there's still only ever 1 checkpoint. Back of envelope, let's say ~20 uploads of model output waiting to transfer, roughly 1.75GB extra on top of the checkpoint 4GB? It's not huge and if I add compression I should be able to get the checkpoint file down to less than 3GB. Wear & tear on the storage might be a concern? (personally it's not but I want to raise it)."
I'm curious if you have a similar calculation for previous OpenIFS tasks? Back then, I monitored multiple hosts and they all came out at around 50GB of host writes to SSD per WU. If that doesn't match your calculation for previous tasks, we may have some unaccounted writes.
Your calculation seems to indicate the overall writes would actually be smaller, likely because the checkpoint interval is much longer even though each checkpoint is bigger. I doubt SSD wear would be a concern unless we have a whole lot of WUs to run for years. I'm more interested in the IO burst that might be a problem for HDD and cloud, especially if we start allowing multiple WUs per host. They'd better not time out the writes...
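The back-of-envelope disk figures in this exchange can be combined into a rough peak estimate. The sketch below is illustrative only: it takes the ~4GB checkpoint and ~1.75GB upload backlog from the quoted estimate, and the brief two-checkpoint overlap described in Message 71623. The function and parameter names are made up for this example, the 3GB case treats "less than 3GB" as an upper bound, and the real numbers will depend on the final output and checkpoint settings.

```python
def peak_disk_gb(checkpoint_gb=4.0, checkpoint_sets=2, upload_backlog_gb=1.75):
    """Rough peak disk use in GB: overlapping checkpoint sets plus queued uploads.

    All defaults are the ballpark figures quoted in the thread, not measurements.
    """
    return checkpoint_gb * checkpoint_sets + upload_backlog_gb


if __name__ == "__main__":
    # Uncompressed ~4GB checkpoints, old and new set briefly on disk together.
    print(peak_disk_gb())                   # 9.75
    # If compression brings each checkpoint down to about 3GB.
    print(peak_disk_gb(checkpoint_gb=3.0))  # 7.75
```

This only covers peak space, not total bytes written per work unit, which is the quantity behind the ~50GB SSD write figure mentioned above.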

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
Message 71629 - Posted: 16 Oct 2024, 2:38:38 UTC - in response to Message 71628. Last modified: 16 Oct 2024, 2:40:34 UTC
"I doubt the SSD wear would be a concern unless we have a whole lot of WU to run for years. I'm more interested in the IO burst that might be a problem for HDD and cloud,"
I have an HDD partition reserved for BOINC. The drive has two other partitions: one for videos and one for audio files. These are seldom used, so the main seeks on that drive are short ones within the BOINC partition. At the moment, with no CPDN tasks running, but six WCG, four Rosetta, and two Einstein running:
524 GB - 515 GB free (1.8% full)
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda3      511750000 9097804 502652196   2% /var/lib/boinc
I do not remember noticing problems when running OpenIFS tasks in the past. And, IIRC, I sometimes ran four at a time while running other projects too.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
Message 71631 - Posted: 16 Oct 2024, 4:19:57 UTC - in response to Message 71607.
"You will definitely notice the impact of this configuration running on your machine if you are using it."
Let us say I am running one of those. And it is using 4 cores @ 98%. Why would that be any different from running 4 WCG tasks, for example? It would use only 4 cores, leaving me with 12 others for whatever.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,979,167 RAC: 21,830
Message 71633 - Posted: 16 Oct 2024, 6:50:29 UTC - in response to Message 71631.
"Let us say I am running one of those. And it is using 4 cores @ 98%. Why would that be any different from running 4 WCG tasks, for example? It would use only 4 cores, leaving me with 12 others for whatever."
I doubt there will be any problem using most of the spare cores for WCG. However, should they get the ARP tasks up and running, the way the cache memory gets hammered, they might get slowed down a bit.

AndreyOR Joined: 12 Apr 21 Posts: 317 Credit: 14,791,235 RAC: 19,552
Message 71634 - Posted: 16 Oct 2024, 7:04:29 UTC - in response to Message 71623.
"is it possible to virtualize Linux on top of Windows if you've only got 32 GB memory & Is 32 GB physical memory enough"
"32GB is enough for OpenIFS@60 on Linux. For Windows+Linux my vote would be to use WSL. I found it easier & faster than something like VBox. But managing the memory between Windows & Linux has to be done carefully. If the model is peaking at 24GB, that would mean a 26GB limit for WSL, which doesn't leave enough for Windows?"
Agreed. WSL2 is the least resource-demanding way of running Linux on Windows. I believe recent versions of WSL2 have a memory reclamation feature. So even if you configure WSL2 to run with 26GB of RAM, it won't hold on to it the entire time it's on and will release unused RAM to Windows. So I don't think management has to be that careful.
Glenn, have you had a chance to test this multi-core OIFS app on WSL2? I think the only multi-core BOINC app I've run on WSL2 is LHC Atlas, and I've never been able to get it to run multi-core and had to resort to running it single core. I don't know if it's an Atlas issue or not, but it's something that came to mind as I don't have a Linux PC and use WSL2 for all BOINC Linux work.

Glenn Carver Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403
Message 71636 - Posted: 16 Oct 2024, 10:39:13 UTC - in response to Message 71634.
"I believe recent versions of WSL2 have a memory reclamation feature. So even if you configure WSL2 to run with 26GB of RAM, it won't hold on to it the entire time it's on and will release unused RAM to Windows. So I don't think management has to be that careful."
That's true, it does, but I've crashed things before by not paying attention to how much memory WSL2 & Windows apps were using, as I tend to oversubscribe the WSL2 memory. The model is going to hit peak memory every timestep, so best to assume WSL2 needs 26+GB of RAM continually if running this app.
"Glenn, have you had a chance to test this multi-core OIFS app on WSL2? I think the only multi-core BOINC app I've run on WSL2 is LHC Atlas, and I've never been able to get it to run multi-core and had to resort to running it single core."
Did you set e.g. 'processors=4' in the .wslconfig? I've not tried it. I'll let you know.
"Let us say I am running one of those. And it is using 4 cores @ 98%. Why would that be any different from running 4 WCG tasks, for example? It would use only 4 cores, leaving me with 12 others for whatever."
The issue is not just cores, it's memory. WCG tasks use little memory. This configuration of OpenIFS will be actively using 26GB; that's a lot of data to move around if other apps need it. It's also important that the option 'leave non-GPU tasks in memory while suspended' is selected for these tasks, otherwise the model will be forced to restart from checkpoint files any time it gets suspended.
"I'm curious if you have a similar calculation for previous OpenIFS tasks? Back then, I monitored on multiple hosts and they all turned out to be around 50GB host write to SSD per WU for previous OpenIFS tasks."
No, not without running one to see. I can only give ball-park numbers at the moment as we haven't worked out how much data output the scientist wants and the optimum checkpoint frequency.
"I'm more interested in the IO burst that might be a problem for HDD and cloud.."
Me too. That depends on individual hardware so I can't comment much.
--- CPDN Visiting Scientist
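For anyone following the WSL2 suggestion, the memory and core limits discussed above go in the `[wsl2]` section of `.wslconfig` in the Windows user profile folder. The sketch below just renders such a file and the memory left for Windows. The 26GB and 4-processor values are the ones floated in this thread, not tested recommendations, and the helper names are hypothetical. WSL normally has to be restarted (for example with `wsl --shutdown`) before a changed `.wslconfig` takes effect.

```python
def render_wslconfig(memory_gb=26, processors=4):
    """Return the text of a .wslconfig that caps WSL2 memory and CPU count."""
    lines = [
        "[wsl2]",
        f"memory={memory_gb}GB",      # upper limit on RAM the WSL2 VM may use
        f"processors={processors}",   # number of logical processors given to WSL2
    ]
    return "\n".join(lines) + "\n"


def windows_headroom_gb(host_ram_gb, wsl_limit_gb):
    """RAM left for Windows itself if WSL2 uses its full limit."""
    return host_ram_gb - wsl_limit_gb


if __name__ == "__main__":
    print(render_wslconfig(), end="")
    print("Left for Windows on a 32GB host:", windows_headroom_gb(32, 26), "GB")
```

On a 32GB host this leaves about 6GB for Windows while the model sits at its peak, which is the squeeze described above; a 48GB host would leave 22GB.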

Glenn Carver Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403
Message 71638 - Posted: 16 Oct 2024, 13:31:30 UTC Last modified: 16 Oct 2024, 13:35:49 UTC
Standalone Linux demo to try OpenIFS@60
I've created a standalone demo of the 60km configuration of OpenIFS which anyone can download and run for themselves. It does not run under BOINC; it's just the model process part of what the client would normally download. I thought it might be useful if people want to see what it does. You need to be familiar with running shell scripts under Linux.
Do not attempt to run this on a machine with less than 32GB memory. This application will require a peak memory of ~25GB.
The link is: https://www.dropbox.com/scl/fi/xl7dcw0yqo1o159leemgk/oifs_demo.tgz?rlkey=xu6reo085bll8n2h0q1uwubvi&st=alr7zr65&dl=0 (please copy & paste into browser)
1/ Download to a folder where you intend to run it.
2/ Unpack the file with:
    tar xf oifs_demo.tgz
   This will create a directory 'oifs319_demo'. The unpacked filesize is 650MB.
3/ Change into directory.
4/ The model is configured to run with 2 threads. To run the model just do:
    ./run_oifs
   This will generate model output to the screen. You could use 'top' to verify the cpu percent usage is 200%. To kill the model use CTRL-C.
If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS', but I suggest keeping to 4 or less as the parallel efficiency drops off markedly beyond 4 threads. Note the threaded application is not statically linked and requires some dynamically loaded libraries.
The demo is configured to run for 2 model days and output checkpoint files every 12 model hours. To clean up the model output and go back to the initial files, run the script:
    ./clean
If anyone has problems getting this to run, please let me know. Hope this is useful. I will keep this there for a week or so. The file is 330MB.
--- CPDN Visiting Scientist
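Given the warning about machines with less than 32GB, a quick check of installed memory before running the demo may be worthwhile. The sketch below reads `MemTotal` from `/proc/meminfo` on Linux. The 30 GiB threshold is an assumption chosen here because a nominal 32GB machine reports slightly less than 32 GiB as `MemTotal`; it is not a figure from the demo itself, and passing the check does not guarantee the run will fit alongside whatever else is using memory.

```python
import os

# Assumed threshold: a nominal 32GB machine usually shows a bit under 32 GiB.
REQUIRED_GIB = 30.0


def mem_total_gib(meminfo_text):
    """Extract MemTotal from /proc/meminfo text and convert kB (KiB) to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


def enough_memory(meminfo_text, required_gib=REQUIRED_GIB):
    """True if the reported total memory meets the assumed threshold."""
    return mem_total_gib(meminfo_text) >= required_gib


if __name__ == "__main__":
    path = "/proc/meminfo"
    if os.path.exists(path):
        with open(path) as f:
            text = f.read()
        print(f"MemTotal: {mem_total_gib(text):.1f} GiB, "
              f"meets assumed threshold: {enough_memory(text)}")
    else:
        print("No /proc/meminfo here; this check only works on Linux.")
```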

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,979,167 RAC: 21,830
Message 71639 - Posted: 16 Oct 2024, 15:06:51 UTC
"If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS'"
Top varies between 350 and 400 so far. I checked the line and it is 4 in the downloaded file.

Glenn Carver Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403
Message 71640 - Posted: 16 Oct 2024, 15:20:39 UTC - in response to Message 71639. Last modified: 16 Oct 2024, 15:20:50 UTC
"If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS'"
"Top varies between 350 and 400 so far. I checked the line and it is 4 in the downloaded file."
Ok, my mistake. I forgot to reset it to '2'.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,979,167 RAC: 21,830
Message 71641 - Posted: 16 Oct 2024, 15:54:52 UTC - in response to Message 71640. Last modified: 16 Oct 2024, 20:02:30 UTC
Takes about 35 minutes to complete on my Ryzen9.
Edit: Ubuntu 24.04. No BOINC tasks running at the same time, 64GB RAM.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
Message 71643 - Posted: 16 Oct 2024, 17:34:48 UTC - in response to Message 71636.
"The issue is not just cores, it's memory. WCG tasks use little memory. This configuration of OpenIFS will be actively using 26Gb, that's a lot of data to move around if other apps need it. It's also important that the option 'leave non-GPU tasks in memory while suspended' is selected for these tasks otherwise the model will be forced to restart from checkpoint files any time it gets sus"
For me, on my Linux machine, my biggest tasks are the Rosetta ones: 2.2 to 2.5 GBytes of RAM required. I have no CPDN tasks running. The WCG ones are all MCM1.
top - 13:17:53 up 13 days, 1:27, 2 users, load average: 12.33, 12.56, 13.18
Tasks: 486 total, 13 running, 473 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.1 us, 0.4 sy, 74.6 ni, 23.6 id, 0.0 wa, 0.2 hi, 0.1 si, 0.0 st
MiB Mem : 128086.0 total, 2546.9 free, 15399.7 used, 110139.4 buff/cache
MiB Swap: 15992.0 total, 15700.5 free, 291.5 used. 109343.6 avail Mem
    PID  PPID USER  PR NI S   RES %MEM %CPU  P     TIME+ COMMAND
2516393  2086 boinc 39 19 R  2.5g  2.0 99.1  3 383:27.90 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+
2527376  2086 boinc 39 19 R  2.4g  2.0 99.3 12 291:01.71 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+
2530806  2086 boinc 39 19 R  2.4g  1.9 99.2  0 261:36.43 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+
2582689  2086 boinc 39 19 R  2.2g  1.8 99.0  9  65:17.35 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+
...
2584661  2086 boinc 39 19 R 39824  0.0 99.4  2  61:48.06 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
...

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275
Message 71644 - Posted: 16 Oct 2024, 19:00:37 UTC
On a 5800X3D running Ubuntu 24.04, the 4 core config ran in 40 minutes, and the 2 core config in 72 minutes.
I tried running it on a Linux Mint 20.3 PC, fully updated, and it gave the following errors about glibc versions:
++ ./oifs_43r3_omp_model.exe
./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./oifs_43r3_omp_model.exe)
./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by ./oifs_43r3_omp_model.exe)
./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./oifs_43r3_omp_model.exe)
The glibc version on this PC is 2.31.
Anyone running Ubuntu 20.04 LTS, or anything based on this or earlier, won't have the proper glibc versions to run it. Ubuntu 22.04 LTS will probably also have a problem, but I don't have an installation of that to test it.
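The errors above can be anticipated by checking the local glibc version before downloading the demo. The sketch below uses Python's `platform.libc_ver()` and compares against 2.34, the highest `GLIBC_` version named in the error output; whether the binary needs anything newer than that is not known from this thread. Versions are compared as integer tuples rather than as decimal numbers, since 2.4 would otherwise look newer than 2.34. The helper names are hypothetical.

```python
import platform

# Highest GLIBC_x.y symbol version named in the error messages above.
REQUIRED = (2, 34)


def parse_version(text):
    """Turn a version string such as '2.31' into a tuple like (2, 31)."""
    return tuple(int(part) for part in text.split(".")[:2])


def glibc_ok(installed, required=REQUIRED):
    """True if the installed glibc version is at least the required one."""
    return parse_version(installed) >= required


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(f"glibc {version}: new enough for the demo binary? {glibc_ok(version)}")
    else:
        print("Could not detect glibc on this system.")
```

With the 2.31 reported for the Linux Mint 20.3 machine, `glibc_ok("2.31")` is false, which matches the failure shown.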