Thread 'Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested'

Message boards : Number crunching : Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71595 - Posted: 14 Oct 2024, 13:50:55 UTC

People might recall that last year I was talking about running the OpenIFS model in a higher-resolution configuration (60km grid) instead of the 125km grid we normally run. We now have a project that wants to make use of this configuration. Before we decide whether to go ahead, CPDN would like some feedback from users, as this is a substantial increase on the resources normally used.

There are two key issues: memory required and the size of the checkpoint files.

OpenIFS@60km would have a peak memory requirement of roughly 25Gb, and the checkpoint (or restart) files, which are written periodically, would be approx 4Gb. This compares with 6Gb RAM & a 1Gb checkpoint file for the configurations we have run to date.

The question is: how do volunteers feel about this?

The model would run multi-core for faster completion, and credit multipliers would be applied to take into account the extra memory & disk required.

We also think this should be an 'opt-in' application via the project preferences, and the scheduler would only allow 1 task per host.
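(For anyone who wants to enforce that locally as well, the standard BOINC client mechanism would be an app_config.xml in the project directory. A minimal sketch; note the app name below is a placeholder, not necessarily what the final application will be called:

    <app_config>
      <app>
        <name>oifs_60km</name>              <!-- placeholder; use the real short app name -->
        <max_concurrent>1</max_concurrent>  <!-- never run more than one of these at once -->
      </app>
    </app_config>
)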

There are some technical steps I can look into to reduce memory & disk usage, but for now I'm interested in whether anyone would be prepared to run this, or whether it's a non-starter.

Thanks.
---
CPDN Visiting Scientist
ID: 71595
PDW

Joined: 29 Nov 17
Posts: 82
Credit: 14,442,057
RAC: 90,697
Message 71596 - Posted: 14 Oct 2024, 14:40:19 UTC - in response to Message 71595.  

How long would you guess they might take to complete, say at 4 cores or 32 cores?
Will they utilise all cores for the majority of the time?
I'd probably give it a serious go if the cores are kept busy, but duration would be a factor too.
ID: 71596
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71597 - Posted: 14 Oct 2024, 15:07:30 UTC - in response to Message 71595.  

There are some technical steps I can look into to reduce memory & disk usage, but for now I'm interested in whether anyone would be prepared to run this, or whether it's a non-starter.


I think it should definitely be a starter.

My main machine (Computer 1511241) runs Red Hat Enterprise Linux release 8.10 (Ootpa) with kernel 4.18.0-553.22.1.el8_10.x86_64.

While it has 16 cores (8 real, 8 hyperthreaded), I usually allow 12 for BOINC, except in the summer when it gets too hot and I reduce it to 8 or 10.
As for checkpoint files, I run BOINC in a partition all its own. Right now, running 12 BOINC tasks, it uses hardly any of this space.

Memory            125.08 GB
Cache             16896 KB
Swap space        15.62 GB
Total disk space  488.04 GB
Free disk space   479.37 GB

ID: 71597
LCB001

Joined: 5 May 13
Posts: 1
Credit: 5,795,262
RAC: 3,236
Message 71598 - Posted: 14 Oct 2024, 15:20:38 UTC

I'd be interested in running these, especially if the memory requirements could be dropped a bit.
ID: 71598
rob

Joined: 5 Jun 09
Posts: 97
Credit: 3,735,198
RAC: 4,113
Message 71599 - Posted: 14 Oct 2024, 15:47:43 UTC

Sounds interesting - if only I had enough memory...
ID: 71599
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71600 - Posted: 14 Oct 2024, 16:11:59 UTC - in response to Message 71596.  

How long would you guess they might take to complete, say at 4 cores or 32 cores?
Will they utilise all cores for the majority of the time?
I'd probably give it a serious go if the cores are kept busy, but duration would be a factor too.

The model is very efficient: the cores will be used >95% of the time, except when it's doing I/O, which runs on a single core. You'll also see lower efficiency if the machine is busy with other jobs.

As for how long: this OpenIFS@60km config runs 3x slower than the configurations used so far, but we would run it on 2 cores (to begin with). So if you have run an OpenIFS task before, multiply the time taken by roughly 1.5. YMMV.
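(To put illustrative numbers on that: a task that took 48 hours at 125km involves about 3x the computation at 60km, spread across 2 cores, so roughly 48 × 3 / 2 = 72 hours of wall time, i.e. the 1.5x above. The 48 hours is just an example figure, not a measured runtime.)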

I'm more concerned with how people feel about the large 4Gb checkpoint files being written to their drives periodically, particularly if anyone's running servers. I can reduce the frequency of the writes, but at the risk of repeating a lot of computation if a task is interrupted. The model results are highly compressed (Mb not Gb) and pose less of an issue, but they need to be added to the total data load. I'm investigating new compression algorithms to see if I can get the checkpoint filesize down.
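(To illustrate that trade-off, here is a toy model; every number in it is made up for illustration and none of them are CPDN settings:

    # Toy model of the checkpoint-interval trade-off: writing less often
    # means less disk traffic, but more work is lost when a task is
    # interrupted (on average, half an interval per interruption).
    def expected_overhead_h(interval_h, runtime_h, write_cost_h, interruptions):
        writing = (runtime_h / interval_h) * write_cost_h   # time spent writing
        lost = interruptions * interval_h / 2               # expected recomputation
        return writing + lost

    for interval_h in (1, 3, 6, 12):
        print(interval_h, "h interval ->",
              round(expected_overhead_h(interval_h, 72, 0.05, 2), 2), "h overhead")
)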
---
CPDN Visiting Scientist
ID: 71600
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 71601 - Posted: 14 Oct 2024, 16:51:09 UTC

You can include me in the group of users willing to give them a try. My older machines are still in test mode with 16 GB RAM, but I can restore them to 64 GB / 32 GB when I get a convenient moment. All the machines I use for CPDN have a dedicated 2 TB SSD for BOINC data, and I have a decent upload speed for returning the results.
ID: 71601
PDW

Joined: 29 Nov 17
Posts: 82
Credit: 14,442,057
RAC: 90,697
Message 71602 - Posted: 14 Oct 2024, 17:14:13 UTC - in response to Message 71600.  

I'd prefer the checkpointing to be user defined, in which case I'd probably turn it off!
If 2 cores works and you can quickly scale it up to more (user defined), so much the better.
ID: 71602
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71603 - Posted: 14 Oct 2024, 18:36:48 UTC - in response to Message 71602.  
Last modified: 14 Oct 2024, 18:38:54 UTC

I'd prefer the checkpointing to be user defined, in which case I'd probably turn it off!
If 2 cores works and you can quickly scale it up to more (user defined), so much the better.

I wondered about people being able to turn it off, but if the machine is rebooted the task would have to start all over again. If you're prepared to accept that, I can look into it at the user level. However, I'm not going to make the checkpoint frequency user definable.

I would not go above 4 cores for this configuration.
---
CPDN Visiting Scientist
ID: 71603
PDW

Joined: 29 Nov 17
Posts: 82
Credit: 14,442,057
RAC: 90,697
Message 71604 - Posted: 14 Oct 2024, 18:59:11 UTC - in response to Message 71603.  

I'm prepared to accept that; unplanned shutdowns caused by external factors are very rare here.
Failures due to the model computations, rather than the user environment, would be of higher concern.
4 cores is probably fine; they shouldn't be running that long from what you've said.

How many workunits would you expect there to be for this configuration?
ID: 71604
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71605 - Posted: 14 Oct 2024, 19:22:33 UTC - in response to Message 71604.  
Last modified: 14 Oct 2024, 19:22:40 UTC

I'm prepared to accept that; unplanned shutdowns caused by external factors are very rare here.
Ok, noted. That's useful feedback.

How many workunits would you expect there to be for this configuration?
The project in question is likely to release 4-5 batches of ~2000 workunits each, i.e. roughly 8,000-10,000 tasks in total. Ideally we'd like the first batch to go out at the end of this year or early next, but there's still some development work to be done.
---
CPDN Visiting Scientist
ID: 71605
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 71606 - Posted: 14 Oct 2024, 19:35:45 UTC

I'd definitely be up for running it. What would you say the peak disk space requirement per task would be? The data files, plus multiples of the 4GB checkpoints, presumably. I'm wondering if I'd have to increase the disk space allowed for BOINC, which is not a problem. I'll think about other questions/suggestions and post later.
ID: 71606
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71607 - Posted: 14 Oct 2024, 21:22:43 UTC - in response to Message 71606.  

I'd definitely be up for running it. What would you say the peak disk space requirement per task would be? The data files, plus multiples of the 4GB checkpoints, presumably. I'm wondering if I'd have to increase the disk space allowed for BOINC, which is not a problem. I'll think about other questions/suggestions and post later.
There's only ever 1 checkpoint file (4Gb for the new config). A day's worth of model output becomes 90Mb, instead of the 22Mb for the batches so far, so it's the checkpoint file that dominates.

Peak space requirements would probably occur when the uploads are not working: the number of model output files waiting starts increasing, but there's still only ever 1 checkpoint. Back of envelope: say ~20 uploads of model output waiting to transfer, that's roughly 1.75Gb extra on top of the 4Gb checkpoint. It's not huge, and if I add compression I should be able to get the checkpoint file down to less than 3Gb. Wear & tear on the storage might be a concern? (Personally it isn't for me, but I want to raise it.)
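(The same back-of-envelope sum as a sketch; the 20 pending uploads is just the assumption made above:

    # Rough peak disk use for one OpenIFS@60km task, using the figures above.
    checkpoint_gb = 4.0             # only one restart file is kept on disk
    output_mb_per_upload = 90       # one day's compressed model output
    pending_uploads = 20            # assumed number of stalled transfers
    peak_gb = checkpoint_gb + pending_uploads * output_mb_per_upload / 1024
    print(f"peak ~= {peak_gb:.2f} Gb")   # ~5.76 Gb, i.e. ~1.76 Gb extra
)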

You will definitely notice the impact of this configuration running on your machine if you are using it.
---
CPDN Visiting Scientist
ID: 71607
PDW

Joined: 29 Nov 17
Posts: 82
Credit: 14,442,057
RAC: 90,697
Message 71608 - Posted: 14 Oct 2024, 21:36:00 UTC - in response to Message 71607.  

Don't you create a new checkpoint file before deleting the old checkpoint file?
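(I.e. the usual write-the-new-file-then-swap pattern, so a crash mid-write never destroys the only good restart file. A minimal sketch of what I mean, in Python purely for illustration; this is not the actual OpenIFS code:

    import os

    def replace_checkpoint(path: str, data: bytes) -> None:
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:      # write the new checkpoint beside the old one
            f.write(data)
            f.flush()
            os.fsync(f.fileno())        # make sure it has really reached the disk
        os.replace(tmp, path)           # then atomically swap it into place
)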
ID: 71608
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 30,970,928
RAC: 14,160
Message 71609 - Posted: 14 Oct 2024, 22:27:34 UTC - in response to Message 71595.  

A 4-core CPU, 32Gb RAM and a 2Tb HDD on a machine that is only used for CPDN - I see no real problem for me.
ID: 71609
rfbrooks

Joined: 31 Aug 04
Posts: 10
Credit: 6,824,482
RAC: 15,454
Message 71610 - Posted: 14 Oct 2024, 22:51:19 UTC

I would be open to it; I upgraded to 32 GB RAM specifically to handle this type of task.
ID: 71610
cetus

Joined: 7 Aug 04
Posts: 10
Credit: 148,018,746
RAC: 39,880
Message 71611 - Posted: 15 Oct 2024, 2:01:44 UTC

I'll definitely run these. I too have been adding lots of RAM over the last few years in order to run large models, so I'll be happy to see them show up. The large checkpoint files shouldn't be an issue for me either. The only problem I had with OIFS models in the past was the internet bandwidth for uploading. Since you can't run very many of these at a time, hopefully that won't be as big an issue for these models.
ID: 71611
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71612 - Posted: 15 Oct 2024, 4:20:27 UTC - in response to Message 71611.  
Last modified: 15 Oct 2024, 4:22:43 UTC

The only problem I had with OIFS models in the past was the internet bandwidth for uploading.


I do not expect that to be a problem for me. I have a Verizon FiOS fiber-optic-to-the-house Internet connection that is about 1 gigabit per second up and down, so it does not get choked up at my end, only possibly up at the servers. I notice it is a little slow tonight; I was watching a video at the same time I ran this speed test.

Timestamp           Download     Upload       Latency  Jitter  Quality Score  Test Server
10/15/2024 0:1:23   878.87 Mbps  830.28 Mbps  6 ms     1 ms    Excellent      nyc.mega.host.speedtest.net.prod.hosts.ooklaserver.net
10/3/2024 23:41:31  860.61 Mbps  860.97 Mbps  4 ms     2 ms    Excellent      nyc.mega.host.speedtest.net.prod.hosts.ooklaserver.net
6/25/2024 13:57:52  819.96 Mbps  925.68 Mbps  5 ms     1 ms    Excellent      nyc.mega.host.speedtest.net.prod.hosts.ooklaserver.net

ID: 71612
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 71613 - Posted: 15 Oct 2024, 7:22:50 UTC
Last modified: 15 Oct 2024, 7:31:56 UTC

You won't be surprised that I am up for running these, having 64GB of RAM and 32 cores now. I'm looking forward to them on the testing site.
Running just one at a time should prevent a large backlog of data uploads when 2 cores are being used; it might not do so with 4, though I am guessing I just wouldn't get a new task of this type before the first one had uploaded.
ID: 71613
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 71614 - Posted: 15 Oct 2024, 7:34:26 UTC - in response to Message 71607.  

OK, a peak disk requirement of a few Gb per task is not a problem for me. Disk wear is also not a concern; relatively modern SSDs last a long time even with heavy usage.

I'd also like the ability to use more cores. In addition, perhaps allow users to choose to run 2-3 tasks at a time, if they have the RAM.
ID: 71614