Message boards : Number crunching : Best Swap file size for CPDN?
Author | Message |
---|---|
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
Server: 128c/256t. Currently running 512GB of memory with a default OS swap file of 2.1GB (which sits at 75% - 100% usage). What size should I set it to for the best results with CPDN? |
Send message Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228 |
Server: 128c/256t Zero. Adjust the number of tasks you run so that you don’t swap out - more efficient and less likely to crash the tasks. |
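One way to cap the number of tasks, sketched here rather than prescribed: BOINC reads an optional app_config.xml from the project directory, and its project_max_concurrent element limits how many tasks of that project run at once. The helper name build_app_config and the limit of 80 are assumptions for illustration only; pick the limit from your own memory figures.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks):
    """Return app_config.xml text that caps a project's concurrent tasks."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    root = ET.Element("app_config")
    # project_max_concurrent applies to every app of the project.
    ET.SubElement(root, "project_max_concurrent").text = str(max_tasks)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 80 is a placeholder value, not a recommendation from this thread.
    print(build_app_config(80))
```

The output would be saved as app_config.xml in the project's directory, and the client then needs to re-read its config files (or be restarted) before the limit applies.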
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
In terms of running less tasks, I actually built this server in order to try and do the opposite; but I hear what you are saying. What is the max amount of memory you would reserve for each task? Thinking about what I have seen in usage, I would say each needs probably around 3-5GB |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
If I am right in thinking that it only uses the Swap file when there is not enough space in Memory, EG: L1 Cache->L2 Cache->L3 Cache->Memory->Swap File Then why would it use the Swap file for less than 2GB when there is 300GB+ of free memory? (as reported by Ubuntu). |
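A small sketch for checking what the system itself reports, assuming a Linux host where /proc/meminfo is available. A nearly full swap file and plenty of available RAM can coexist, because pages that were swapped out earlier stay in swap until they are touched again, so the percentage alone does not prove tasks are swapping right now. The sample figures below are hypothetical, not taken from the 256t server.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_percent(info):
    """Percentage of swap in use; 0.0 when no swap is configured."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - info.get("SwapFree", 0)) / total


# Hypothetical sample; on a real host read open("/proc/meminfo").read().
SAMPLE = """MemTotal:       527000000 kB
MemAvailable:   320000000 kB
SwapTotal:        2097148 kB
SwapFree:          400000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(f"available RAM: {info['MemAvailable'] / 2**20:.0f} GiB")
    print(f"swap in use:   {swap_used_percent(info):.1f}%")
```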
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
These OIFS PS tasks have a high water memory of ~5.5Gb which it hits during every timestep. However, that's just the model itself, the controlling wrapper will also take a small amount of RAM for zipping the upload files. So take a look at the rsc_memory_bound in the client_state.xml for the task and use that figure. Do not allocate less. Do not let the model start swapping, it won't kill the task but it will slow it considerably. I tried this myself as part of testing. Bear in mind modern linuxes also use spare RAM as a disk cache, and that's a good thing for maintaining good throughput because the model log & status files are read by the oifs controlling wrapper so it's ideal to have these in RAM. RAM is your friend for CPDN, swap is your enemy. I hope these current tasks, and the next batches of the BL OIFS app, will be the last of the lowest resolution version of OpenIFS. The next batches I aim to test in 2023 will be higher resolutions for which the high-water memory will be 8Gb, 14Gb & 22Gb respectively. However, in order to get decent throughput I'm working on multi-core, though that's proving tricky to get stable across different platforms, so I may be forced to use a VM. In terms of running less tasks, I actually built this server in order to try and do the opposite; but I hear what you are saying. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
Thank you Glenn that's really well explained and very helpful. Looking at: /var/lib/boinc-client/projects/climateprediction.net/client_state.xml <rsc_memory_bound>6010000000.000000</rsc_memory_bound> So that's just over 6GB. So if I reserved 6GB per task would that be enough? Thank you also for your heads up on future developments, that's very interesting and helps when thinking about future memory purchases to run CPDN. Hopefully future apps won't use Virtualbox (if that's what you mean by VM), as for many of us it's not worth installing/using due to all the issues it causes. This is why we don't run LHC or Rosetta (Python) even though we would love to. |
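The arithmetic behind "how many tasks fit" can be sketched as below. The rsc_memory_bound value, the 512GB of RAM and the 256 threads come from this thread; treating 512GB as 512 GiB and holding back 10% of RAM for the OS and disk cache are assumptions, and max_tasks is a hypothetical helper name.

```python
GIB = 2**30


def max_tasks(total_ram_bytes, per_task_bytes, reserve_fraction=0.10):
    """Number of tasks that fit in RAM after reserving a fraction for the OS."""
    usable = total_ram_bytes * (1.0 - reserve_fraction)
    return int(usable // per_task_bytes)


RSC_MEMORY_BOUND = 6_010_000_000  # bytes, from client_state.xml in this thread
TOTAL_RAM = 512 * GIB             # assumption: "512GB" means 512 GiB
THREADS = 256

if __name__ == "__main__":
    n = max_tasks(TOTAL_RAM, RSC_MEMORY_BOUND)
    print(f"tasks that fit with 10% reserved: {n}")
    print(f"share of threads in use: {100 * n / THREADS:.0f}%")
```

Because the same app can ship in batches with different memory bounds, the per-task figure should be re-read from the client state for each new batch rather than hard-coded.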
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
These OIFS PS tasks have a high water memory of ~5.5Gb which it hits during every timestep. However, that's just the model itself, the controlling wrapper will also take a small amount of RAM for zipping the upload files. So take a look at the rsc_memory_bound in the client_state.xml for the task and use that figure. Do not allocate less. Do not let the model start swapping, it won't kill the task but it will slow it considerably. I tried this myself as part of testing. Bear in mind modern linuxes also use spare RAM as a disk cache, and that's a good thing for maintaining good throughput because the model log & status files are read by the oifs controlling wrapper so it's ideal to have these in RAM. RAM is your friend for CPDN, swap is your enemy. I hope these current tasks, and the next batches of the BL OIFS app, will be the last of the lowest resolution version of OpenIFS. The next batches I aim to test in 2023 will be higher resolutions for which the high-water memory will be 8Gb, 14Gb & 22Gb respectively. However, in order to get decent throughput I'm working on multi-core, though that's proving tricky to get stable across different platforms, so I may be forced to use a VM. Hopefully you are okay if I share this advice? I think many people will find it interesting and/or essential; I'll assume you are. I'll also add the e-Research Group CPDN donation link when sharing this. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Hopefully you are okay if I share this advice? I think many people will find it interesting and/or essential; I'll assume you are. Not sure I understand - my reply is on a public forum, so just point people to the message URL? I've already said similar on these forums many times, it's nothing new. I forgot to add that if you want to check a running task go into the slot directory and inspect the contents of the file boinc_task_state.xml which will give you the boinc client measured values for peak swap size, disk etc for the 'task' (not just the model): $ cd slots/0 $ cat boinc_task_state.xml <active_task> <project_master_url>https://climateprediction.net/</project_master_url> <result_name>oifs_43r3_ps_0928_2007050100_123_976_12193572_0</result_name> <checkpoint_cpu_time>5150.980000</checkpoint_cpu_time> <checkpoint_elapsed_time>5241.193137</checkpoint_elapsed_time> <fraction_done>0.109151</fraction_done> <peak_working_set_size>5009731584</peak_working_set_size> <peak_swap_size>5215879168</peak_swap_size> <peak_disk_usage>1446309687</peak_disk_usage> </active_task> |
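The same check can be scripted. This is a minimal sketch that assumes boinc_task_state.xml parses as a single <active_task> element, as in the listing above; the sample is a shortened copy of that listing and peak_usage_gib is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

GIB = 2**30

# Shortened copy of the boinc_task_state.xml contents quoted in the post.
SAMPLE = """<active_task>
    <result_name>oifs_43r3_ps_0928_2007050100_123_976_12193572_0</result_name>
    <fraction_done>0.109151</fraction_done>
    <peak_working_set_size>5009731584</peak_working_set_size>
    <peak_swap_size>5215879168</peak_swap_size>
    <peak_disk_usage>1446309687</peak_disk_usage>
</active_task>"""


def peak_usage_gib(xml_text):
    """Return the peak_* values from a task state file, converted to GiB."""
    root = ET.fromstring(xml_text)
    fields = ("peak_working_set_size", "peak_swap_size", "peak_disk_usage")
    usage = {}
    for name in fields:
        raw = root.findtext(name)
        if raw is not None:  # skip fields the client did not write
            usage[name] = float(raw) / GIB
    return usage


if __name__ == "__main__":
    for name, gib in peak_usage_gib(SAMPLE).items():
        print(f"{name}: {gib:.2f} GiB")
```

For a live task, the text would come from the file in the slot directory instead of SAMPLE, for example open("slots/0/boinc_task_state.xml").read().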
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Looking at: Yes for these batches. But the same app can be used in batches with different memory requirements - bear that in mind. Hopefully future apps won't use Virtualbox (if that's what you mean by VM), as for many of us it's not worth installing/using due to all the issues it causes. This is why we don't run LHC or Rosetta (Python) even though we would love to. May not have a choice. It's the only route to a Windows capable app and may well turn out to be the only route to a multicore app. 'Fraid I don't understand what you mean by 'all the issues VBox causes'. I have it running on my machines without a problem. The only issue I did have was boinc had stuffed up some fsys permissions in the systemd config for the client which prevented it working, but that was the fault of boinc not VBox. Your choice of course if you don't want to run it. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Hopefully future apps won't use Virtualbox (if that's what you mean by VM), as for many of us it's not worth installing/using due to all the issues it causes. This is why we don't run LHC or Rosetta (Python) even though we would love to. I hope so too. I have 64 GigaBytes of RAM (and could double it if necessary) on my "16-core" machine (though I am allowing Boinc to use only 12 at the moment). And I did run some multi-process Milkyway tasks. But I will not be running Virtualbox. I had no trouble running 5 Oifs_ps tasks at the same time as 7 other Boinc tasks for other projects were running. I think at one point I was running 5 Oifs_ps and 1 Oifs_bl at one time and 6 other Boinc tasks for other projects, also with no trouble. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
Okay I have removed the postings. Something to keep in mind though, participation in BOINC is dropping and so if the only project engagement is within the forums then you are not really going to increase awareness and participation in BOINC and/or CPDN. If you are versed in Linux, and BOINC, then you can probably solve most of the issues when using VirtualBox. But to those from a Windows background (like myself and many others) Linux can be a headache when it's working, but when you have issues it's a nightmare. Thanks for all your help. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
If you are versed in Linux, and BOINC, then you can probably solve most of the issues when using VirtualBox. But to those from a Windows background (like myself and many others) Linux can be a headache when it's working, but when you have issues it's a nightmare. What I meant to say is if you are versed in BOINC then you can probably work out all the issues with VirtualBox, but most people are not. In terms of the other stuff, if you are versed in Linux then yes perhaps it is all common stuff and easy, but to those from a Windows background it's not. Anyway just thought I would clarify. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Something to keep in mind though, participation in BOINC is dropping and so if the only project engagement is within the forums then you are not really going to increase awareness and participation in BOINC and/or CPDN. I have been running BOINC since 1999-06-10 or thereabouts, and ran seti@home for a little while before that. In those days there was always work available. But seti@home is no more. I started running ClimatePrediction whenever I discovered it and it joined up with Boinc and there was always work available for it too. Sometimes a task took over a month to run. Machines were slower then and the tasks covered very long time intervals. Now it is difficult to get work units, and in the last week or two, it has become impossible to return results. Similarly for Rosetta. It goes for long intervals with no work at all, and lately, it became impossible to download such work as they tried to send. Now they are not even trying. And World Community Grid! They were down for about eight months! Hardly encouraging for possible new participants. Even now that they are supposedly up, they support only four or five different projects and really only three. And I often run out of work from them. My view is that while some former computer users may have dropped out of distributed computing and just play with FaceBook on their cell phones, what affects me most is that those who could benefit from it are just not supplying work. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,845,927 RAC: 19,699 |
May not have a choice. It's the only route to a Windows capable app and may well turn out to be the only route to a multicore app. I don't think you're saying that a multicore app might only be a Windows app, correct? I'd vote for a Linux native app and a VBox app, like what LHC has with ATLAS and Theory. What is it about the different platforms that's making it difficult to get multicore stable: the different hardware configurations or the OSs? |
Send message Joined: 4 Oct 19 Posts: 15 Credit: 9,174,915 RAC: 3,722 |
If vbox is the way to a windows app, so be it. But given we have a functioning native linux app, i would love to see a multi-threaded native linux app. I would run much of those. Whereas linux vbox is a total non-starter for me (many attempts, zero successes), and my few windows boxes won't have enough memory. Fingers crossed. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
If you are versed in Linux, and BOINC, then you can probably solve most of the issues when using VirtualBox. But to those from a Windows background (like myself and many others) Linux can be a headache when it's working, but when you have issues it's a nightmare. I'm afraid I don't understand what you mean by 'issues with VBox'. Maybe you can PM me directly and explain exactly what you mean. I would be interested to know what we might have to deal with & provide support for. I've used virtualization for many yrs. To get Vbox working on linux was nothing more than apt install, reboot & make sure bios virtualization support is enabled, and off boinc went. I can't tell from your messages whether you hit showstoppers or it's just the added technical load that's the issue (I appreciate these concerns). Maybe it's worth a new thread for discussion on this? I agree linux is an issue for Windows users which is why CPDN would love to have a windows app for OpenIFS. But the only way we can do that is with Vbox I'm afraid. We had a 3 month project to build OpenIFS on Windows but we were not able to get the key library it needs working well enough let alone the model. Unfortunately ECMWF does not support Windows for its inhouse software libraries so there's really no other option (unless someone volunteers to do the work). I would love to have a native multicore linux app. I have already done some testing with one but it started failing on some test machines in the pthread library for reasons I don't yet understand. I need to do more work on this but a VBox app looks straightforward, so I'm tempted to go for the low-hanging fruit first which would open up more Windows users & multicore, and then get back to a native multicore linux app. We still have the single core app of course but with higher resolution will come much longer runtimes. But for the immediate future I will be working on getting the model more stable for the current setup. These latest batches have exposed a few more issues that I would like to resolve first before I move on. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
I agree let's split the VirtualBox issues into its own thread. https://www.cpdn.org/forum_thread.php?id=9171#67174 In terms of Windows/Linux etc, I'll reply back in the New year. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
The 256t server is now being told by CPDN that: "You have reached your quota of 1 task per day". Yesterday I spent £200 on a new U.2 drive in order to try and continue running CPDN as the old drive is currently full of completed tasks. Can anything be done to remove this quota? The server is now running at 30% eg 6+ GB /task, so there should be no reason why any further errors should happen (from my end). |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,022,240 RAC: 20,762 |
The 256t server is now being told by CPDN that: "You have reached your quota of 1 task per day". I am also getting that message in the log and am only able to download and run one task per day. For me, it isn't a massive issue as my slow bored band means it will take a while to clear the backlog of completed tasks. I have been able to get more tasks by running BOINC in VB as well as in my host OS but even so, it will take 2-3 days of saturating my upload link to clear it all. I have only been running one task at a time in the VM but could change the VM settings to allow more. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
The 256t server is now being told by CPDN that: "You have reached your quota of 1 task per day". 'Quota' is a concept bound up with successful completion of tasks. Looking at the task list for your 256t machine, the vast majority are ending with "Error while computing". You're wasting time and energy. You need to sort that out before your quota will start to heal. It's nothing to do with the network problems. Edit - you've made me check my own traffic jam, just to check that nothing on the server is causing it to mark delayed tasks as errors. All are still shown as "in progress", which is as it should be for delayed uploads. Phew. |
©2024 cpdn.org