Message boards : Number crunching : How to Prevent OpenIFS Download
Author | Message |
---|---|
Send message Joined: 3 Nov 05 Posts: 26 Credit: 687,388 RAC: 529 |
Is it possible to stop my host from downloading OpenIFS tasks, other than by setting No New Tasks for CPDN? The virtual memory use (disk thrashing) brings my host almost to a standstill, and even if I let it run the trickles don't upload. |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,730,030 RAC: 12,757 |
If you are seeing disk thrashing because the machine is swapping, the machine is running too many OpenIFS tasks. Unfortunately, there's an issue with the BOINC client: it will start as many tasks as there are free cores available to BOINC. It does not respect the memory limit of the task, leaving it to volunteers like yourself to fix it. The problem with the client was unexpected and we're looking into workarounds we can put in place on the server to deal with this.
You can either adjust your percentage of CPUs in boincmgr to reduce the available CPUs, or use an app_config.xml file in the project directory (example below). We've since found that LHC also hit this problem, so we'll probably follow their approach of limiting the tasks downloaded to machines.
An app_config is a nice way of controlling exactly how many tasks the client is allowed to run at any time (irrespective of how many tasks are downloaded). The file is specific to a project and should be placed in the /var/lib/boinc/projects/climateprediction.net directory (or wherever your BOINC software is installed).
Mine looks like this. I set a maximum of 6 tasks in total across all CPDN apps and, for each OpenIFS app variant, no more than 6 tasks at a time. Each task takes ~5 GB of memory, so make sure you have enough free RAM and adjust the values below to fit.
<app_config>
  <project_max_concurrent>6</project_max_concurrent>
  <report_results_immediately/>
  <app>
    <name>oifs_43r3</name>
    <max_concurrent>6</max_concurrent>
  </app>
  <app>
    <name>oifs_43r3_ps</name>
    <max_concurrent>6</max_concurrent>
  </app>
  <app>
    <name>oifs_43r3_bl</name>
    <max_concurrent>6</max_concurrent>
  </app>
</app_config>
Hope that helps. There's plenty of info about app_config.xml files online if you want to find out more, or ask on the forums, as plenty of people know about them.
Edit: sorry, forgot to answer your question about stopping downloads. You can pause the project and that will stop them. However, CPDN have paused the batch server, so no one will be getting any more tasks for the time being, as there's a backlog of data to be dealt with. |
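For anyone who prefers to generate the file rather than edit it by hand, here is a minimal Python sketch that writes the same structure as the example in the post above. The app names come from that post; the function name `build_app_config` and the limits used in the example are hypothetical, and the optional `<report_results_immediately/>` element is left out for brevity.

```python
import xml.etree.ElementTree as ET

# App names are taken from the example in the post above.
OIFS_APPS = ["oifs_43r3", "oifs_43r3_ps", "oifs_43r3_bl"]


def build_app_config(project_max, per_app_max, app_names=OIFS_APPS):
    """Return app_config.xml text limiting how many tasks run at once."""
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = str(project_max)
    for name in app_names:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        ET.SubElement(app, "max_concurrent").text = str(per_app_max)
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical host with room for two ~5 GB OpenIFS tasks.
    print(build_app_config(project_max=2, per_app_max=2))
```

Save the printed text as app_config.xml in the project directory mentioned in the post; the client has to re-read its config files (or be restarted) before the new limits apply.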
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Short answer: no. Like some other projects, CPDN used to let you choose on the website which model types run on a particular host, but it no longer does. You are right at the minimum RAM for these tasks. If you are running more than two at once you will be using swap a lot. I would suggest lowering the number of cores in use to one if you have anything above minimal non-BOINC usage. The "Uploads are stuck" thread contains details of the saga of lack of disk space, transfer speed to backup storage, and tape drive failures. Also, OIFS is likely to make up the majority of tasks in the short and medium term, with others appearing only very occasionally. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,768,004 RAC: 3,189 |
Hope that helps. There's plenty of info about app_config.xml files online if you want to find out more, or ask on the forums as plenty of people know about them.
The official manual is on the BOINC website, at https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration |
Send message Joined: 9 Mar 22 Posts: 30 Credit: 1,065,239 RAC: 556 |
You may also need to upgrade your BOINC client to at least 7.20.x since older versions suffer from a bug related to the 'max_concurrent' options. See: https://github.com/BOINC/boinc/pull/4592 |
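A quick way to check a host against that threshold is to compare version numbers as tuples. This is only a sketch: the 7.20 minimum is the figure given in the post above, and the helper names `parse_version` and `needs_upgrade` are made up for illustration.

```python
def parse_version(text):
    """Turn '7.16.6' into (7, 16, 6); anything after the first token is ignored."""
    first = text.strip().split()[0]
    return tuple(int(part) for part in first.split("."))


def needs_upgrade(client_version, minimum=(7, 20)):
    """True if the client is older than the minimum mentioned in the post."""
    return parse_version(client_version) < minimum


if __name__ == "__main__":
    for version in ("7.16.6", "7.20.2"):
        print(version, "needs upgrade:", needs_upgrade(version))
```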
Send message Joined: 3 Nov 05 Posts: 26 Credit: 687,388 RAC: 529 |
Thank you for your quick replies. Dave Jackson spotted the problem. At one point, whilst I was using the computer, BOINC quietly downloaded and ran four OpenIFS tasks simultaneously, plus the two hadam4 tasks it was already running, which was hilarious. At the moment the host is still crunching on one hadam4 (along with other, much less memory-intensive non CPDN tasks), so one OpenIFS is way too many. Is it valid to set <max_concurrent> for each of the OpenIFS apps to 0 (I can set it to 1 and try again once the hadam4 has completed)? The official manual doesn't say. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
Huh. I was wondering why that didn't seem to be working right. Looks like 7.16 is what's in the 20.04 repos. I suppose I should upgrade my boxes, may as well, not like they're doing much work... |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 16,097,824 RAC: 62,157 |
Is it valid to set <max_concurrent> for each of the OpenIFS apps to 0 (I can set it to 1 and try again once the hadam4 has completed)? The official manual doesn't say.
0 is used to indicate no limit, so the client will try to run as many as <project_max_concurrent> allows, or as many as the client thinks it can run if that isn't set (or is also set to 0). |
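The rule described in that reply can be written down as a small model. This is a sketch of the behaviour as the post describes it, not code taken from the BOINC client; `effective_limit` and `client_limit` are hypothetical names, with `client_limit` standing for however many tasks the client itself would be willing to start.

```python
def effective_limit(app_max, project_max, client_limit):
    """Tasks of one app that may run at once, per the description above.

    A value of 0 for app_max or project_max means 'no limit', so only
    positive settings can reduce what the client would otherwise run.
    """
    limits = [client_limit]
    if project_max > 0:
        limits.append(project_max)
    if app_max > 0:
        limits.append(app_max)
    return min(limits)


if __name__ == "__main__":
    # max_concurrent = 0 does not block the app; the project limit applies.
    print(effective_limit(app_max=0, project_max=6, client_limit=12))
```

Under this model, setting `<max_concurrent>` to 0 cannot be used to switch an app off, which is why 1 is the lowest useful value.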
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
[...] OpenIFS apps to 0 (I can set it to 1 and try again once the hadam4 has completed)? The official manual doesn't say.
Richard probably knows. There are a number of places where BOINC treats "0" as meaning no restriction, so my choice in your situation would be to set it to 1. Edit: The server uses -1 to indicate a blacklisted computer that will not get any tasks. (CPDN used to use this to stop machines without the 32-bit libraries, which crashed everything, from getting work, but hasn't recently.) So that may work to indicate not running any tasks of a particular type. |
Send message Joined: 3 Nov 05 Posts: 26 Credit: 687,388 RAC: 529 |
OK, looks like I've no choice but to opt for No New Tasks (Edit: Or try -1 :)). And yes, it looks like the repo for Ubuntu 20.04 LTS needs updating. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
And yes, it looks like the repo for Ubuntu 20.04 LTS needs updating.
Richard has recently posted instructions for using Gianfranco's repository which, while not official, is in general pretty reliable. It is a much simpler option than compiling your own, which is what I do. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,768,004 RAC: 3,189 |
Richard has recently posted instructions...
That was message 67761, and the person I was advising seemed happy with the instructions on the page I suggested. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
Richard has recently posted instructions...
That was message 67761, and the person I was advising seemed happy with the instructions on the page I suggested.
Oh, great! Yeah, that's easy enough to toss in. I should probably update them to 22.04 anyway, though. Now's as good a time as any, they're just chewing on WCG tasks when they get any. |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,322,658 RAC: 1,085 |
Glenn Carver wrote: [...] there's an issue with the boinc client that it will start up as many tasks as free cores available to boinc. It does not respect the memory limit of the task, leaving it to volunteers like yourself to fix it. The problem with the client was unexpected and we're looking into workarounds we can put in place on the server to deal with this. [...] We've since found that LHC also hit this problem so we'll probably follow their approach to limiting tasks downloaded to machines.
I haven't been at lhc@home for a while, so I don't know what their approach looks like. But a limit on tasks in progress is not a good replacement for the desired limit on tasks which are executing.
Stages of a "task in progress": (ready to send) – assigned to host – downloading – ready to run – executing – uploading – ready to report (reported)
Each of the stages can take unpredictably long for a variety of reasons. Hence it's clear that # in progress cannot control # executing very well, to put it mildly.
Also, oifs_43r3_ps concurrency is only part of the equation. The other part is what else is going on on the host. It makes a big difference whether the host is running a desktop environment or is a dedicated cruncher. |
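The point about "in progress" versus "executing" can be illustrated with a toy model. The stage names follow the list in the post above, treating the two parenthesised stages as outside "in progress" (an assumption); the `Stage` enum and `summarize` helper are hypothetical and not part of BOINC.

```python
from collections import Counter
from enum import Enum


class Stage(Enum):
    ASSIGNED = "assigned to host"
    DOWNLOADING = "downloading"
    READY_TO_RUN = "ready to run"
    EXECUTING = "executing"
    UPLOADING = "uploading"
    READY_TO_REPORT = "ready to report"


def summarize(stages):
    """Return (tasks in progress, tasks executing) for one host."""
    counts = Counter(stages)
    return len(stages), counts[Stage.EXECUTING]


if __name__ == "__main__":
    # Two hypothetical hosts, each with four tasks in progress.
    busy = [Stage.EXECUTING] * 4
    stalled = [Stage.DOWNLOADING, Stage.READY_TO_RUN,
               Stage.UPLOADING, Stage.READY_TO_REPORT]
    print(summarize(busy), summarize(stalled))
```

Both hosts look identical to a server-side cap of four tasks in progress, yet one is running four memory-hungry tasks and the other none.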
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,730,030 RAC: 12,757 |
Unfortunately we have to work with what we have. It's a workaround to have these controls in place, but at least there is something we can do. It's also clear that if we 'do nothing' we end up with chaos on the machines of volunteers who do not (and why should they?) have app_config files in place. Even then I see people getting it wrong and trying to over-subscribe memory. The deficiency is in the boinc_client. No criticism of the client code; it was probably never designed for the kinds of tasks we need to run. Even if it gets addressed, it would take time to roll that new client out. OpenIFS, like most computational fluid dynamics codes, is limited by memory bandwidth (less so by single-core speed). Starting as many tasks as there are available cores is not the way to maximise production of credit with OIFS tasks. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Why do you have the <report_results_immediately/> line in there? |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
Even then I see people getting it wrong and trying to over-subscribe memory.
Yeah... I gave a couple OOM reapers some exercise early on. My guideline is simple: If I have 5GB per task, it's fine. 4GB per task is not sufficient, even with a lot of swap. |
Send message Joined: 12 Apr 21 Posts: 318 Credit: 15,011,722 RAC: 7,015 |
Even then I see people getting it wrong and trying to over-subscribe memory.
I agree with the 5GB guideline. I'd add that ~10GB RAM should be left for overhead. I'd argue that the following is an excellent starting point, and I suspect that most users may not be able to do more without going over the desired failure rate of less than 5%. Assuming the PC has enough cores/threads, isn't used heavily (especially RAM) for other things, and BOINC is allowed to use all of the system RAM, no more than the following number of concurrent tasks should be run for each amount of RAM (applies only to current OIFS tasks):
8GB RAM - 0 tasks
16GB RAM - 1 task
32GB RAM - 4 tasks
64GB RAM - 10 tasks
*128GB RAM - 23 tasks
*256GB RAM - 49 tasks
*512GB RAM - 100 tasks
* I have no experience with really high RAM systems but would try the same principle and adjust to stay under the 5% failure rate. |
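The figures in that list are consistent with reserving ~10GB for overhead and allowing one task per remaining 5GB, rounded down. The formula is inferred from the numbers rather than stated by the poster, so treat the sketch below as an assumption; `max_oifs_tasks` is a hypothetical helper name.

```python
def max_oifs_tasks(ram_gb, per_task_gb=5, overhead_gb=10):
    """Suggested ceiling on concurrent OpenIFS tasks for a given amount of RAM.

    Assumes ~5 GB per task and ~10 GB kept free, as in the post above.
    """
    usable = ram_gb - overhead_gb
    return max(0, usable // per_task_gb)


if __name__ == "__main__":
    for ram in (8, 16, 32, 64, 128, 256, 512):
        print(f"{ram}GB RAM - {max_oifs_tasks(ram)} tasks")
```

The result is a conservative starting value for `<max_concurrent>`; a dedicated cruncher with little else running may be able to lower `overhead_gb`.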
Send message Joined: 4 Oct 19 Posts: 15 Credit: 9,174,915 RAC: 3,722 |
I have an 8GB machine that runs a single task at a time with a 100% success rate. I have a 16GB machine that runs 2 tasks concurrently with a 100% success rate. I believe any issue is not about the absolute memory available, but about how those machines are run. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
i have a 8GB machine that runs a single task at a time with 100% success rate.
I have a 64 GB machine with a 16-core Intel processor. I currently have 12 cores allocated to BOINC. I allow CPDN to run a maximum of 6 processes, but I limit
oifs_43r3_bl tasks to only one at a time
oifs_43r3_ps tasks to only five at a time
oifs_43r3 tasks to only five at a time
I run five other projects: WCG (4), Einstein (1), Rosetta (3), MilkyWay (2), and Universe (2). The numbers in parentheses are the maximum number of tasks I allow each to run at a time (if they are all supplying work). These almost always run with a 100% success rate. The OIFS tasks have never failed me. Once in a while the legacy CPDN tasks fail, but usually with problems like negative theta and such. I notice my machine often runs successfully on tasks that have had several failures before they get assigned to me. For those who care, my machine is ID: 1511241 |
©2024 cpdn.org