Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested
Author | Message |
---|---|
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
People might recall that last year I was talking about running the OpenIFS model at higher resolution configurations (60km grids) instead of the normal size we run (125km grids). We now have a project that wants to make use of this configuration. Before we decide whether to go ahead, CPDN would like some feedback from users, as this is a substantial increase on the normal resources used. There are two key issues: the memory required and the size of the checkpoint files. OpenIFS@60km would have a peak memory requirement of roughly 25Gb. The checkpoint (or restart) files, which are normally written periodically, would be approx 4Gb. This compares to 6Gb RAM & 1Gb checkpoint file size for the resolution configurations we have run to date. The question is: how do volunteers feel about this? The model would run multi-core for faster completion. Credit multipliers would also be used to take into account the extra memory & disk required. We also think this should be an 'opt-in' application via the project preferences, and the scheduler would only allow 1 task per host. There are some technical steps I can look into to make reductions in memory & disk usage but for now I'm interested if anyone would be prepared to run this or whether it's a non-starter? Thanks. --- CPDN Visiting Scientist |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,442,057 RAC: 90,697 |
How long would you guess they might take to complete, say at 4 core or 32 core ? Will they utilise all cores for the majority of the time ? Would probably give it a serious go if cores are kept busy but duration would be a factor too. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"There are some technical steps I can look into to make reductions in memory & disk usage but for now I'm interested if anyone would be prepared to run this or whether it's a non-starter?" I think it should definitely be a starter. My main machine (Computer 1511241) runs Red Hat Enterprise Linux release 8.10 (Ootpa) with kernel 4.18.0-553.22.1.el8_10.x86_64. While it has 16 cores (8 real, 8 hyperthreaded), I usually allow 12 for Boinc, except in the summer when it gets too hot and I reduce it to 8 or 10. As for checkpoint files, I run Boinc in a partition all its own. Right now, running 12 Boinc tasks, it uses hardly any of this space. Memory: 125.08 GB; Cache: 16896 KB; Swap space: 15.62 GB; Total disk space: 488.04 GB; Free disk space: 479.37 GB |
Send message Joined: 5 May 13 Posts: 1 Credit: 5,795,262 RAC: 3,236 |
Would be interested in running these especially if the mem requirements could be dropped a bit. |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,735,198 RAC: 4,113 |
Sounds interesting - if only I had enough memory... |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"How long would you guess they might take to complete, say at 4 core or 32 core ?" The model is very efficient; the cores will be used >95% of the time except when it's doing I/O, which is on a single core. You'll also see lower efficiency if the machine is busy with other jobs. As for how long, this OpenIFS@60 config runs 3x slower than the configurations used so far. But we would run with 2 cores (to begin with), so if you had an OpenIFS task before, multiply the time taken by roughly 1.5. YMMV. I'm more concerned with how people feel about the large 4Gb checkpoint files being written to their drives periodically, particularly if anyone's running servers. I can adjust the frequency of the writes, but at the risk of repeating a lot of computation. The model results are highly compressed (Mb not Gb) and pose less of an issue, but need to be added to the total data load. I'm investigating new compression algorithms to see if I can get the checkpoint file size down. --- CPDN Visiting Scientist |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
You can include me in the group of users willing to give them a try. My older machines are still in test mode with 16 GB RAM, but I can restore them back up to 64 GB / 32 GB when I get a convenient moment. All the machines I use for CPDN have a dedicated 2 TB SSD for BOINC data, and I have a decent upload speed for returning the results. |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,442,057 RAC: 90,697 |
I'd prefer the checkpointing to be user defined, in which case I'd probably turn it off ! If 2 cores works and you can quickly scale it up to more (user defined) so much the better. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I'd prefer the checkpointing to be user defined, in which case I'd probably turn it off !" I wondered about people being able to turn it off. But if the machine is rebooted the task would have to start all over again. If you're prepared to accept that, I can look into it at the user level. But I'm not going to make the checkpoint frequency user definable. I would not go above 4 cores for this configuration. --- CPDN Visiting Scientist |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,442,057 RAC: 90,697 |
I'm prepared to accept that, I have very rare unplanned shutdowns caused by external factors. Failures due to the model computations rather than the user environment would be of higher concern. 4 cores is probably fine, they shouldn't be running that long from what you've said. How many workunits would you expect there to be for this configuration ? |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I'm prepared to accept that, I have very rare unplanned shutdowns caused by external factors." Ok, noted. That's useful feedback. "How many workunits would you expect there to be for this configuration ?" The project in question is likely to release 4-5 batches of ~2000 workunits each. Ideally we'd like the first batch to go out end of this year or early next but there's still some development work to be done. --- CPDN Visiting Scientist |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,816,935 RAC: 19,934 |
I'd definitely be up for running it. What would you say be the peak disk space requirement per task? The data files, plus multiples of the 4GB checkpoints. I'm wondering if I'd have to increase disk space allowed for BOINC, which is not a problem. I'll think about other questions/suggestions and post later. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I'd definitely be up for running it. What would you say be the peak disk space requirement per task? The data files, plus multiples of the 4GB checkpoints. I'm wondering if I'd have to increase disk space allowed for BOINC, which is not a problem. I'll think about other questions/suggestions and post later." There's only ever 1 checkpoint file (4Gb for the new config). Plus a day's worth of model output becomes 90Mb instead of 22Mb for the batches so far. So it's the checkpoint file that dominates. Peak space requirements would perhaps occur when the uploads are not working: the number of model output files then starts increasing, but there's still only ever 1 checkpoint. Back of envelope, let's say ~20 uploads of model output waiting to transfer, roughly 1.75Gb extra on top of the 4Gb checkpoint? It's not huge, and if I add compression I should be able to get the checkpoint file down to less than 3Gb. Wear & tear on the storage might be a concern? (Personally it's not, but I want to raise it.) You will definitely notice the impact of this configuration running on your machine if you are using it. --- CPDN Visiting Scientist |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,442,057 RAC: 90,697 |
Don't you create a new checkpoint file before deleting the old checkpoint file ? |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,970,928 RAC: 14,160 |
4 core CPU, 32Gb RAM and 2Tb hdd on a machine that is only used for CPDN - I see no real problem for me. |
Send message Joined: 31 Aug 04 Posts: 10 Credit: 6,824,482 RAC: 15,454 |
I would be open to it. Upgraded to 32 GB RAM specifically to handle this type of task. |
Send message Joined: 7 Aug 04 Posts: 10 Credit: 148,018,746 RAC: 39,880 |
I'll definitely run these. I too have been getting lots of ram for the last few years in order to run large models - I'll be happy to see them show up. The large checkpoint files shouldn't be an issue for me either. The only problem I had with OIFS models in the past was the internet bandwidth for uploading. Since you can't run very many of these at a time, hopefully that won't be as big of an issue for these models. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"The only problem I had with OIFS models in the past was the internet bandwidth for uploading." I do not expect that to be a problem for me. I have a Verizon FiOS fiber-optic to the house Internet connection that is about 1 Gigabit per second up and down. So it does not get choked up at my end; it is up to the servers. I notice it is a little slow tonight. I was listening to video at the same time as I ran this speed test. Speed test results (all rated Excellent, all against test server nyc.mega.host.speedtest.net.prod.hosts.ooklaserver.net): (1) 10/15/2024 0:1:23: download 878.87 Mbps, upload 830.28 Mbps, latency 6 ms, jitter 1 ms; (2) 10/3/2024 23:41:31: download 860.61 Mbps, upload 860.97 Mbps, latency 4 ms, jitter 2 ms; (3) 6/25/2024 13:57:52: download 819.96 Mbps, upload 925.68 Mbps, latency 5 ms, jitter 1 ms |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
You won't be surprised that I am up for running these. Having 64GB of RAM now and 32 cores. Looking forward to them on the testing site. Running just one at a time will prevent a large backlog of data uploads when 2 cores are being used. It might not do so with 4. Though I am guessing I just wouldn't get a new task of this type before the first one uploaded. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,816,935 RAC: 19,934 |
Ok, a few Gb per task peak disk requirement is not a problem for me. Disk usage is also not a concern, relatively modern SSDs last a long time even with heavy usage. I'd also like the ability to use more cores. In addition, perhaps allow users to choose to run 2-3 tasks at a time, if they have the RAM. |
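
The back-of-envelope figures quoted in this thread (a model that runs 3x slower but on 2 cores; one 4Gb checkpoint plus ~20 queued uploads of 90Mb each) can be reproduced with a short Python sketch. The function names, the assumption of perfect scaling across cores, the 1024 Mb-per-Gb conversion, and the `checkpoints_on_disk` parameter are all illustrative assumptions and are not taken from CPDN or OpenIFS code.

```python
def runtime_factor(slowdown: float, cores: int) -> float:
    """Wall-clock time relative to an earlier OpenIFS task.

    Assumption: the earlier task ran on one core and the new
    configuration scales perfectly across cores.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return slowdown / cores


def peak_disk_gb(checkpoint_gb: float, output_mb: float,
                 queued_uploads: int, checkpoints_on_disk: int = 1) -> float:
    """Rough peak disk use per task, in Gb.

    checkpoints_on_disk is hypothetical: the thread states there is
    only ever 1 checkpoint file, so it defaults to 1.
    """
    queued_gb = queued_uploads * output_mb / 1024  # assumes 1024 Mb per Gb
    return checkpoints_on_disk * checkpoint_gb + queued_gb


if __name__ == "__main__":
    # 3x slower configuration run on 2 cores
    print(runtime_factor(3, 2))               # 1.5
    # One 4Gb checkpoint plus 20 stalled uploads of 90Mb each
    print(round(peak_disk_gb(4, 90, 20), 2))  # 5.76
```

Under these assumptions the queued uploads come to about 1.76Gb, in line with the "roughly 1.75Gb" quoted above, and a 4-core run would give a factor of 0.75 rather than 1.5.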
©2024 cpdn.org