Thread 'Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested'


Message boards : Number crunching : Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested

AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 71616 - Posted: 15 Oct 2024, 7:58:32 UTC

Glenn, do you think these will come out before the upcoming Hadley ones or the other way around? Just curious.
Ingleside

Joined: 5 Aug 04
Posts: 126
Credit: 24,426,020
RAC: 23,705
Message 71619 - Posted: 15 Oct 2024, 14:06:26 UTC

4 GB checkpoint shouldn't be a problem, as long as it's not trying to re-write 4 GB every 10 seconds or something similarly ridiculous.

As for the 25 GB peak memory usage, the first problem is: is it possible to virtualize Linux on top of Windows if you've only got 32 GB of memory, or do you need at least 48 GB?

The second problem is, for everyone like me who is not a sudo masochist, the confusing or outdated documentation makes just getting BOINC itself up and running complicated. Add to the mix: is it still required to screw around with installing 32-bit libraries?

For running these large OpenIFS models it really boils down to:
1: Is 32 GB of physical memory enough, or should I hope for some good Black Friday memory deals?
2: Easy to follow instructions to get enough virtualized Linux up-and-running.
3: Easy to follow instructions to get BOINC + CPDN up-and-running and a method to monitor progress (a rough sketch of this follows below).

Point 3 does not include reading the ... 386 current posts in the "Running 32bit CPDN from 64bit Linux - Discussion" thread.
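
For what it's worth, a minimal sketch of point 3 on a Debian/Ubuntu-style system. The package, service name and data directory are typical for those distributions; the project URL and account key placeholders have to come from your CPDN account page, so treat this as an outline rather than a recipe:

sudo apt install boinc-client                     # headless client plus the boinccmd tool
sudo systemctl enable --now boinc-client          # start it now and at every boot
cd /var/lib/boinc-client                          # data directory on Debian/Ubuntu (elsewhere it may be /var/lib/boinc); run boinccmd from here so it can authenticate
boinccmd --project_attach <CPDN_project_URL> <your_account_key>    # both values are shown on the CPDN website
boinccmd --get_tasks | grep -E 'name:|fraction done|state'         # quick progress check from the terminal
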
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 71621 - Posted: 15 Oct 2024, 16:44:53 UTC

The second problem is, for everyone like me who is not a sudo masochist, the confusing or outdated documentation makes just getting BOINC itself up and running complicated. Add to the mix: is it still required to screw around with installing 32-bit libraries?
Not for these tasks, but it is for the Hadley models such as HadAM4 etc.

3: Easy to follow instructions to get BOINC + CPDN up-and-running and a method to monitor progress.
I have found that the latest instructions, at least for my distribution, worked well when copied and pasted into a terminal. I get, though, that it still isn't as simple as Windows if you want the most up-to-date version.

Because the old Met Office-based models are still in use, when I do a new installation of BOINC I still refer to the instructions in that thread, or the other one that just contains the commands for different distributions without any discussion. (Despite having written the post myself!)
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71623 - Posted: 15 Oct 2024, 20:03:58 UTC

Thanks for the positive vibes and the discussion. Very helpful.
Answering all the points raised:

Don't you create a new checkpoint file before deleting the old checkpoint file ?
Correct. I'm glad someone is paying attention! Yes, for a brief time there are two sets of checkpoint files; the old ones are deleted once the new ones are written. So peak disk usage is ~8Gb not 4Gb as I originally said.

... as long as it's not trying to re-write 4 GB every 10 seconds or something similarly ridiculous.
No it won't. It would slow the model down too much for a start. I'll do some testing but aim for around 1hr computing per checkpoint.

is it possible to virtualize Linux on top of Windows if you've only got 32 GB memory &
Is 32 GB physical memory enough
32Gb is enough for OpenIFS@60 on Linux. For Windows+Linux my vote would be to use WSL. I found it easier & faster than something like VBox. But managing the memory between Windows & Linux has to be done carefully. If the model is peaking at 24Gb that would mean a 26Gb limit for WSL which doesn't leave enough for Windows?

installing 32-bit libraries?
OpenIFS is 64bit and doesn't need them.

these will come out before the upcoming Hadley ones or the other way around?
The HadAM4 batch(es) will come first as I'm working on HadAM4 now, then OpenIFS.

I'd also like the ability to use more cores. In addition, perhaps allow users to choose to run 2-3 tasks at a time, if they have the RAM
Yes, me too. However, to start with we'll stick with 2 cores & 1 task per host to see how everyone finds it. Later on we can look at relaxing things; maybe using app_config.xml to override the default # cores.
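
As a purely illustrative sketch of that last point (the application name here is made up; the real one is in client_state.xml, and the data-directory path varies by distribution), an app_config.xml dropped into the project directory could later look something like:

sudo tee /var/lib/boinc-client/projects/climateprediction.net/app_config.xml >/dev/null <<'EOF'
<app_config>
  <app>
    <name>oifs_60km</name>              <!-- hypothetical app name; copy the real one from client_state.xml -->
    <max_concurrent>1</max_concurrent>  <!-- never run more than one of these tasks at a time -->
  </app>
  <app_version>
    <app_name>oifs_60km</app_name>
    <avg_ncpus>4</avg_ncpus>            <!-- how many cores BOINC budgets for the task -->
  </app_version>
</app_config>
EOF
boinccmd --read_cc_config               # have the running client re-read its config files

Note that avg_ncpus only changes what the BOINC scheduler budgets; whether the model itself actually runs with more threads depends on how the app reads that setting.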

The only problem I had with OIFS models in the past was the internet bandwidth for uploading.
Agreed. This is part of the rationale for restricting the number of tasks per host. CPDN is well aware not everyone has high speed internet access.
---
CPDN Visiting Scientist
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71624 - Posted: 15 Oct 2024, 20:10:34 UTC

Credit.
What's the thinking on granting credit for these more resource-hungry tasks? My thoughts are: (a) volunteers should be suitably rewarded for the additional resources donated; (b) take the credit granted for the OpenIFS tasks run so far (the 125km grid resolution) as a base, and work out a scaling factor which also takes into account the extra memory & disk required, as well as the additional computation. Though it's not clear exactly how to do this in the 'spirit' of BOINC credit. Thoughts?
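
One illustrative way of writing down (b), with every weight still to be agreed rather than anything official:

credit(60km task) ≈ credit(125km task) × (core-hours of 60km task ÷ core-hours of 125km task) × w_mem × w_disk

where w_mem and w_disk would be modest factors (≥ 1) reflecting the extra ~25Gb of RAM and the larger checkpoints relative to the 125km runs. How defensible such factors are within the usual BOINC credit conventions is exactly the open question.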
---
CPDN Visiting Scientist
Ingleside

Joined: 5 Aug 04
Posts: 126
Credit: 24,426,020
RAC: 23,705
Message 71625 - Posted: 15 Oct 2024, 20:38:45 UTC - in response to Message 71623.  

OpenIFS is 64bit and doesn't need them.

The HadAM4 batch(es) will come first as I'm working on HadAM4 now, then OpenIFS.

Taking these two quotes together, and coupled with CPDN not having any preference setting for blocking HadAM4 work, chances are you'll still need the 32-bit libraries so as not to trash any unexpected HadAM4 resends.
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,662,895
RAC: 61,039
Message 71626 - Posted: 15 Oct 2024, 21:28:03 UTC

Thanks for making it opt-in. It's actually very exciting to have some more demanding workloads, IMO. My suggestion is that, whatever mechanism we use for the opt-in, the requirements are made clear at the point of opting in.

However, to start with we'll stick with 2 cores & 1 task per host to see how everyone finds it. Later on we can look at relaxing things; maybe using app_config.xml to override the default # cores.

I'm curious whether you have an estimate of how many hosts would be eligible. Though it's not my problem to worry about, I feel a large portion of hosts would be excluded, which could make the research progress really slow. It would be helpful to allow big hosts to run more than one task if the pool of eligible hosts is small.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71627 - Posted: 15 Oct 2024, 22:15:43 UTC - in response to Message 71626.  

I'm curious if you have estimate of how many hosts would be eligible.
Yes, we checked the database. There are ~600 linux hosts with 32+ GB RAM. Enough to make it workable.
---
CPDN Visiting Scientist
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,662,895
RAC: 61,039
Message 71628 - Posted: 15 Oct 2024, 23:45:11 UTC - in response to Message 71607.  
Last modified: 15 Oct 2024, 23:45:46 UTC

Peak space requirements perhaps when the uploads are not working, then the number of model output starts increasing but there's still only ever 1 checkpoint. Back of envelope, let's say ~20 uploads of model output waiting to transfer, roughly 1.75Gb extra on top of the checkpoint 4Gb? It's not huge and if I add compression I should be able to get the checkpoint file down to less than 3Gb. Wear & tear on the storage might be a concern? (personally it's not but I want to raise it).

I'm curious whether you have a similar calculation for the previous OpenIFS tasks? Back then, I monitored multiple hosts and they all turned out at around 50GB of host writes to SSD per WU. If that doesn't match your calculation for the previous tasks, we may have some unaccounted-for writes. Your calculation seems to indicate the overall writes would actually be smaller, likely due to the much longer checkpoint interval even though each checkpoint is bigger.
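
For comparison against measurements like that, the back-of-envelope for the new configuration (all figures provisional, taken from earlier in the thread) would be roughly:

total host writes per task ≈ (number of checkpoints × ~4 GB) + (number of uploads × ~90 MB)

using the ~4GB checkpoint size and the "~20 uploads ≈ 1.75Gb" estimate quoted above; the checkpoint count follows from the roughly one-hour-of-compute interval and the total run length, neither of which is fixed yet.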

I doubt SSD wear would be a concern unless we have a whole lot of WUs to run for years. I'm more interested in the IO bursts, which might be a problem for HDDs and cloud hosts, especially if we start allowing multiple WUs per host. They'd better not time out the writes...
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71629 - Posted: 16 Oct 2024, 2:38:38 UTC - in response to Message 71628.  
Last modified: 16 Oct 2024, 2:40:34 UTC

I doubt the SSD wear would be a concern unless we have a whole lot of WU to run for years. I'm more interested in the IO burst that might be a problem for HDD and cloud,


I have an HDD with a partition reserved for BOINC. The drive has two other partitions, one for videos and one for audio files, and these are seldom used. So the main seeks on that drive are short ones within the BOINC partition. At the moment, with no CPDN tasks running, but six WCG, four Rosetta, and two Einstein tasks running:

524 GB — 515 GB free (1.8% full)

Filesystem            1K-blocks      Used Available Use% Mounted on
/dev/sda3             511750000   9097804 502652196   2% /var/lib/boinc


I do not remember noticing problems when running OpenIFS tasks in the past. And, IIRC, I sometimes ran four at a time while running other projects too.
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71631 - Posted: 16 Oct 2024, 4:19:57 UTC - in response to Message 71607.  

You will definitely notice the impact of this configuration running on your machine if you are using it.


Let us say I am running one of those. And it is using 4 cores @ 98%. Why would that be any different from running 4 WCG tasks, for example? It would use only 4 cores, leaving me with 12 others for whatever.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 71633 - Posted: 16 Oct 2024, 6:50:29 UTC - in response to Message 71631.  

Let us say I am running one of those. And it is using 4 cores @ 98%. Why would that be any different from running 4 WCG tasks, for example? It would use only 4 cores, leaving me with 12 others for whatever.
I doubt there will be any problem using most of the spare cores for WCG. However, should they get the ARP tasks up and running, the way those hammer the cache memory means they might get slowed down a bit.
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 71634 - Posted: 16 Oct 2024, 7:04:29 UTC - in response to Message 71623.  

is it possible to virtualize Linux on top of Windows if you've only got 32 GB memory &
Is 32 GB physical memory enough
32Gb is enough for OpenIFS@60 on Linux. For Windows+Linux my vote would be to use WSL. I found it easier & faster than something like VBox. But managing the memory between Windows & Linux has to be done carefully. If the model is peaking at 24Gb that would mean a 26Gb limit for WSL which doesn't leave enough for Windows?

Agreed. WSL2 is the least resource demanding way of running Linux on Windows.

I believe recent versions of WSL2 have a memory reclamation feature. So even if you configure WSL2 to run with 26Gb of RAM, it won't hold on to it the entire time it's running and will release unused RAM back to Windows. So I don't think the management has to be that careful.

Glenn, have you had a chance to test this multi-core OIFS app on WSL2? I think the only multi-core BOINC app I've run on WSL2 is LHC Atlas and I've never been able to get it to run multi-core and had to resort to running it single core. I don't know if it's an Atlas issue or not but it's something that came to mind as I don't have a Linux PC and use WSL2 for all BOINC Linux work.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71636 - Posted: 16 Oct 2024, 10:39:13 UTC - in response to Message 71634.  

I believe recent versions of WSL2 have memory reclamation feature. So even if you configure WSL2 to run with 26Gb of RAM, it won't hold on to it the entire time it's on and will release unused RAM to Windows. So I don't think management has to be that careful.
That's true, it does, but I've crashed things before by not paying attention to how much memory WSL2 & Windows apps were using, as I tend to oversubscribe the WSL2 memory. The model is going to hit peak memory every timestep, so it's best to assume WSL2 needs 26+Gb of RAM continually if running this app.

Glenn, have you had a chance to test this multi-core OIFS app on WSL2? I think the only multi-core BOINC app I've run on WSL2 is LHC Atlas and I've never been able to get it to run multi-core and had to resort to running it single core.
Did you set e.g. 'processors=4' in the .wslconfig ? I've not tried it. I'll let you know.
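
For anyone wanting to try it, a minimal %UserProfile%\.wslconfig along those lines might look like the following (values purely illustrative, adjust to your own RAM; the experimental section only exists in recent WSL releases):

[wsl2]
memory=26GB        # cap for the WSL2 VM; has to cover the ~25GB model peak
processors=4       # logical processors visible inside WSL2
swap=8GB           # a little headroom so a brief spike doesn't kill the model

[experimental]
autoMemoryReclaim=gradual   # hand idle memory back to Windows (recent WSL builds only)

Changes only take effect after a 'wsl --shutdown' and a restart of the distribution.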

Let us say I am running one of those. And it is using 4 cores @ 98%. Why would that be any different from running 4 WCG tasks, for example? It would use only 4 cores, leaving me with 12 others for whatever.
The issue is not just cores, it's memory. WCG tasks use little memory. This configuration of OpenIFS will be actively using 26Gb; that's a lot of data to move around if other apps need it. It's also important that the option 'leave non-GPU tasks in memory while suspended' is selected for these tasks, otherwise the model will be forced to restart from its checkpoint files any time it gets suspended.
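
On a headless host without the Manager, that option corresponds to leave_apps_in_memory in global_prefs_override.xml in the BOINC data directory. A minimal sketch (path as on Debian/Ubuntu; merge with any overrides you already have rather than overwriting them):

sudo tee /var/lib/boinc-client/global_prefs_override.xml >/dev/null <<'EOF'
<global_preferences>
  <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
EOF
boinccmd --read_global_prefs_override    # apply without restarting the client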

I'm curious if you have a similar calculation for previous OpenIFS tasks? Back then, I monitored on multiple hosts and they all turned out to be around 50GB host write to SSD per WU for previous OpenIFS tasks.
No, not without running one to see. I can only give ball-park numbers at the moment as we haven't worked out how much data output the scientist wants and the optimum checkpoint frequency.

I'm more interested in the IO burst that might be a problem for HDD and cloud..
Me too. That depends on individual hardware so I can't comment much.
---
CPDN Visiting Scientist
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71638 - Posted: 16 Oct 2024, 13:31:30 UTC
Last modified: 16 Oct 2024, 13:35:49 UTC

Standalone Linux demo to try OpenIFS@60

I've created a standalone demo of the 60km configuration of OpenIFS which anyone can download and run for themselves. It does not run under BOINC, it's just the model process part of what the client would normally download. I thought it might be useful if people want to see what it does. You need to be familiar with running shell scripts under linux.

Do not attempt to run this on a machine with less than 32Gb memory. This application will require a peak memory of ~25Gb.

The link is: https://www.dropbox.com/scl/fi/xl7dcw0yqo1o159leemgk/oifs_demo.tgz?rlkey=xu6reo085bll8n2h0q1uwubvi&st=alr7zr65&dl=0 (please copy & paste into browser)

1/ Download to a folder where you intend to run it.
2/ Unpack the file with:
tar xf oifs_demo.tgz
This will create a directory 'oifs319_demo'. The unpacked filesize is 650Mb.

3/ Change into directory.
4/ The model is configured to run with 2 threads. To run the model just do:
./run_oifs

This will generate model output to the screen. You can use 'top' to verify that the CPU usage is 200%.
To kill the model use CTRL-C.

If you want to try running with more threads, edit 'run_oifs' and change the '2' on the 'OMP_NUM_THREADS' line, but I suggest keeping to 4 or fewer, as the parallel efficiency drops off markedly beyond 4 threads.
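
Assuming the script sets the variable on a line of the form OMP_NUM_THREADS=2 (check the file first), a one-liner such as the following switches it to 4 threads, or just edit the file by hand:

sed -i 's/OMP_NUM_THREADS=2/OMP_NUM_THREADS=4/' run_oifs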

Note the threaded application is not statically linked and requires some dynamically loaded libraries. The demo is configured to run for 2 model days and output checkpoint files every 12 model hours.
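
Given that, it may be worth confirming the loader can resolve everything before committing to a long run; any line reporting 'not found' points at a missing or too-old library (adjust the path if the executable sits in a subdirectory of oifs319_demo):

ldd ./oifs_43r3_omp_model.exe | grep 'not found'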

To clean up the model output and go back to the initial files, run the script:
./clean

If anyone has problems getting this to run, please let me know. Hope this is useful. I will keep it there for a week or so. The download itself is 330Mb.
---
CPDN Visiting Scientist
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 71639 - Posted: 16 Oct 2024, 15:06:51 UTC

If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS'

Top varies between 350% and 400% so far. I checked the line and it is set to 4 in the downloaded file.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71640 - Posted: 16 Oct 2024, 15:20:39 UTC - in response to Message 71639.  
Last modified: 16 Oct 2024, 15:20:50 UTC

If you want to try out running with more threads, edit 'run_oifs' and change the '2' on the line 'OMP_NUM_THREADS'
Top varies between 350 and 400 so far. I checked the line and it is 4 in the downloaded file.
Ok, my mistake. I forgot to reset it to '2'.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 71641 - Posted: 16 Oct 2024, 15:54:52 UTC - in response to Message 71640.  
Last modified: 16 Oct 2024, 20:02:30 UTC

Takes about 35 minutes to complete on my Ryzen9.
Edit: Ubuntu 24.04. No BOINC tasks running at the same time, 64GB RAM
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71643 - Posted: 16 Oct 2024, 17:34:48 UTC - in response to Message 71636.  

The issue is not just cores, it's memory. WCG tasks use little memory. This configuration of OpenIFS will be actively using 26Gb, that's a lot of data to move around if other apps need it. It's also important that the option 'leave non-GPU tasks in memory while suspended' is selected for these tasks otherwise the model will be forced to restart from checkpoint files any time it gets suspended.


For me, on my Linux machine, my biggest tasks are the Rosetta ones: 2.2 to 2.5 GBytes of RAM required. I have no CPDN tasks running. The WCG ones are all MCM1.

top - 13:17:53 up 13 days,  1:27,  2 users,  load average: 12.33, 12.56, 13.18
Tasks: 486 total,  13 running, 473 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.1 us,  0.4 sy, 74.6 ni, 23.6 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem : 128086.0 total,   2546.9 free,  15399.7 used, 110139.4 buff/cache
MiB Swap:  15992.0 total,  15700.5 free,    291.5 used. 109343.6 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
2516393    2086 boinc     39  19 R   2.5g   2.0  99.1  3 383:27.90 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+ 
2527376    2086 boinc     39  19 R   2.4g   2.0  99.3 12 291:01.71 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+ 
2530806    2086 boinc     39  19 R   2.4g   1.9  99.2  0 261:36.43 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+ 
2582689    2086 boinc     39  19 R   2.2g   1.8  99.0  9  65:17.35 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-li+
...
2584661    2086 boinc     39  19 R  39824   0.0  99.4  2  61:48.06 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+
...

geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 71644 - Posted: 16 Oct 2024, 19:00:37 UTC

On a 5800X3D running Ubuntu 24.04, the 4-core config ran in 40 minutes and the 2-core config in 72 minutes.

I tried running it on a Linux Mint 20.3 PC, fully updated, and it gave the following errors about glibc versions:

++ ./oifs_43r3_omp_model.exe
./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./oifs_43r3_omp_model.exe)
./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by ./oifs_43r3_omp_model.exe)
./oifs_43r3_omp_model.exe: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./oifs_43r3_omp_model.exe)

glibc version on this PC is 2.31

Anyone running Ubuntu 20.04 LTS, or anything based on it or earlier, won't have a recent enough glibc to run it. Probably Ubuntu 22.04 LTS will also have a problem, but I don't have an installation of that to test.
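
A quick way to check before downloading is to print the system glibc version; anything reporting 2.34 or newer should satisfy the symbols listed above:

ldd --version | head -n 1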


©2024 cpdn.org