Message boards : Number crunching : How to Prevent OpenIFS Download
Brummig

Joined: 3 Nov 05
Posts: 26
Credit: 687,388
RAC: 529
Message 67987 - Posted: 23 Jan 2023, 14:56:22 UTC

Is it possible to stop my host from downloading OpenIFS tasks, other than by setting No New Tasks for CPDN? The virtual memory use (disk thrashing) brings my host almost to a standstill, and even if I let it run the trickles don't upload.
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,730,030
RAC: 12,757
Message 67990 - Posted: 23 Jan 2023, 15:33:03 UTC - in response to Message 67987.  
Last modified: 23 Jan 2023, 15:37:50 UTC

If you are seeing disk thrashing because the machine is swapping, the machine is running too many OpenIFS tasks. Unfortunately, there's an issue with the BOINC client: it will start as many tasks as there are cores available to BOINC. It does not respect the memory limit of the task, leaving it to volunteers like yourself to fix. The problem with the client was unexpected, and we're looking into workarounds we can put in place on the server to deal with it. You can either reduce the percentage of CPUs in boincmgr, or use an app_config.xml file in the project directory (example below). We've since found that LHC also hit this problem, so we'll probably follow their approach of limiting the number of tasks downloaded to each machine.

An app_config.xml file is a good way of controlling exactly how many tasks the client is allowed to run at any one time (irrespective of how many tasks are downloaded). The file is specific to a project and should be placed in the /var/lib/boinc/projects/climateprediction.net directory (or wherever your BOINC software is installed). Mine looks like this: I set a maximum of 6 tasks in total across all CPDN apps, and for each OpenIFS app variant, no more than 6 tasks at a time. Each task takes ~5 GB of memory, so make sure you have enough free RAM and adjust the values below to fit.

<app_config>
   <project_max_concurrent>6</project_max_concurrent>
   <report_results_immediately/>
   <app>
      <name>oifs_43r3</name>
      <max_concurrent>6</max_concurrent>
   </app>
   <app>
      <name>oifs_43r3_ps</name>
      <max_concurrent>6</max_concurrent>
   </app>
   <app>
      <name>oifs_43r3_bl</name>
      <max_concurrent>6</max_concurrent>
   </app>
</app_config>


Hope that helps. There's plenty of info about app_config.xml files online if you want to find out more, or ask on the forums, as plenty of people know about them.

Edit: sorry, forgot to answer your question about stopping downloads. You can suspend the project and that will stop downloads. However, CPDN has paused the batch server, so no one will be getting any more tasks for the time being, as there's a backlog of data to be dealt with.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 67992 - Posted: 23 Jan 2023, 15:39:10 UTC - in response to Message 67987.  

Short answer: no. CPDN, like some other projects, used to let you choose on the website which model types to run on a particular host, but it no longer does. You are right at the minimum RAM for these tasks; if you are running more than two at once you will be using swap a lot. I would suggest lowering the number of cores in use to one if you have anything above minimal non-BOINC usage. The "Uploads are stuck" thread contains details of the saga of lack of disk space, transfer speed to backup storage, and tape drive failures. Also, OpenIFS is likely to be the majority of tasks for the short and medium term, with others appearing only occasionally.
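As a sketch of one way to lower the core count without touching the web preferences: BOINC reads a global_prefs_override.xml file in its data directory (e.g. /var/lib/boinc), and its max_ncpus_pct setting caps the share of CPUs BOINC will use. The 25% value below is illustrative and assumes a 4-core host, where it limits BOINC to one core; adjust for your own machine.

```xml
<global_preferences>
   <max_ncpus_pct>25</max_ncpus_pct>
</global_preferences>
```

The client picks this up after you tell it to re-read local preferences (or after a restart).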
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,768,004
RAC: 3,189
Message 67993 - Posted: 23 Jan 2023, 15:39:40 UTC - in response to Message 67990.  

Hope that helps. There's plenty of info about app_config.xml files online if you want to find out more, or ask on the forums as plenty of people know about them.
The official manual is on the BOINC website, at https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration
computezrmle

Joined: 9 Mar 22
Posts: 30
Credit: 1,065,239
RAC: 556
Message 67994 - Posted: 23 Jan 2023, 15:56:01 UTC - in response to Message 67987.  

You may also need to upgrade your BOINC client to at least 7.20.x, since older versions suffer from a bug related to the 'max_concurrent' options.
See:
https://github.com/BOINC/boinc/pull/4592
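A quick way to check whether a given client predates the fix, sketched as a small helper (the 7.20.0 threshold comes from the post above; the function names are made up for illustration):

```python
def version_tuple(version):
    """Parse a dotted version string like '7.16.6' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

def needs_upgrade(client_version, minimum="7.20.0"):
    """True if the BOINC client version predates the max_concurrent fix."""
    return version_tuple(client_version) < version_tuple(minimum)

# A 7.16 client (as shipped in the Ubuntu 20.04 repos) is too old:
print(needs_upgrade("7.16.6"))  # → True
print(needs_upgrade("7.20.2"))  # → False
```

The running client's version is shown in boincmgr, or by `boinccmd --client_version`.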
Brummig

Joined: 3 Nov 05
Posts: 26
Credit: 687,388
RAC: 529
Message 67996 - Posted: 23 Jan 2023, 16:34:52 UTC

Thank you for your quick replies. Dave Jackson spotted the problem. At one point, whilst I was using the computer, BOINC quietly downloaded and ran four OpenIFS tasks simultaneously, on top of the two hadam4 tasks it was already running, which was hilarious. At the moment the host is still crunching one hadam4 (along with other, much less memory-intensive non-CPDN tasks), so even one OpenIFS is too many. Is it valid to set <max_concurrent> for each of the OpenIFS apps to 0 (I can set it to 1 and try again once the hadam4 has completed)? The official manual doesn't say.
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 67997 - Posted: 23 Jan 2023, 16:35:52 UTC

Huh. I was wondering why that didn't seem to be working right.

Looks like 7.16 is what's in the Ubuntu 20.04 repos. I suppose I should upgrade my boxes; may as well, it's not like they're doing much work...
PDW

Joined: 29 Nov 17
Posts: 82
Credit: 16,097,824
RAC: 62,157
Message 67999 - Posted: 23 Jan 2023, 16:44:38 UTC - in response to Message 67996.  

Is it valid to set <max_concurrent> for each of the OpenIFS apps to 0 (I can set it to 1 and try again once the hadam4 has completed)? The official manual doesn't say.
0 is used to indicate no limit, so the client will try to run as many tasks as <project_max_concurrent> allows, or as many as it thinks it can run if that isn't set (or is also set to 0).
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 68000 - Posted: 23 Jan 2023, 16:46:41 UTC
Last modified: 23 Jan 2023, 16:50:55 UTC

Is it valid to set <max_concurrent> for each of the OpenIFS apps to 0 (I can set it to 1 and try again once the hadam4 has completed)? The official manual doesn't say.
Richard probably knows. There are a number of places where BOINC treats "0" as meaning no restriction, which is why I would set it to 1 in your situation.

Edit: The server uses -1 to indicate a blacklisted computer that will not get any tasks. (CPDN used to use this to stop machines without the 32-bit libraries, which crashed everything, from getting work, but hasn't done so recently.) So -1 may also work to indicate not running any tasks of a particular type.
Brummig

Joined: 3 Nov 05
Posts: 26
Credit: 687,388
RAC: 529
Message 68002 - Posted: 23 Jan 2023, 16:55:01 UTC - in response to Message 68000.  
Last modified: 23 Jan 2023, 16:56:36 UTC

OK, looks like I've no choice but to opt for No New Tasks (Edit: Or try -1 :)).

And yes, it looks like the repo for Ubuntu 20.04 LTS needs updating.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 68005 - Posted: 23 Jan 2023, 18:36:54 UTC - in response to Message 68002.  

And yes, it looks like the repo for Ubuntu 20.04 LTS needs updating.

Richard has recently posted instructions for using Gianfranco's repository, which, while not official, is in general pretty reliable. It is a much simpler option than compiling your own, which is what I do.
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,768,004
RAC: 3,189
Message 68007 - Posted: 23 Jan 2023, 18:55:10 UTC - in response to Message 68005.  

Richard has recently posted instructions...
That was message 67761, and the person I was advising seemed happy with the instructions on the page I suggested.
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 68008 - Posted: 23 Jan 2023, 19:09:43 UTC - in response to Message 68007.  

Richard has recently posted instructions...
That was message 67761, and the person I was advising seemed happy with the instructions on the page I suggested.


Oh, great! Yeah, that's easy enough to toss in. I should probably update them to 22.04 anyway, though. Now's as good a time as any, they're just chewing on WCG tasks when they get any.
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 68009 - Posted: 23 Jan 2023, 20:54:09 UTC - in response to Message 67990.  
Last modified: 23 Jan 2023, 21:07:21 UTC

Glenn Carver wrote:
[...] there's an issue with the boinc client that it will start up as many tasks as free cores available to boinc. It does not respect the memory limit of the task, leaving it to volunteers like yourself to fix it. The problem with the client was unexpected and we're looking into workarounds we can put in place on the server to deal with this. [...] We've since found that LHC also hit this problem so we'll probably follow their approach to limiting tasks downloaded to machines.
I haven't been at LHC@home for a while, so I don't know what their approach looks like. But a limit on tasks in progress is not a good replacement for the desired limit on tasks that are executing.

Stages of a "task in progress":
(ready to send)
– assigned to host
– downloading
– ready to run
– executing
– uploading
– ready to report
(reported)

Each of the stages can take unpredictably long for a variety of reasons, so it's clear that the number of tasks in progress cannot control the number executing very well, to put it mildly.

Also, oifs_43r3_ps concurrency is only part of the equation; the other part is whatever else is going on on the host. It makes a big difference whether the host is running a desktop environment or is a dedicated cruncher.
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,730,030
RAC: 12,757
Message 68046 - Posted: 25 Jan 2023, 14:52:27 UTC - in response to Message 68009.  

Unfortunately we have to work with what we have. It's a workaround to have these controls in place, but at least there is something we can do. It's also clear that if we do nothing, we end up with chaos on the machines of volunteers who do not (and why should they?) have app_config files in place. Even then I see people getting it wrong and trying to over-subscribe memory. The deficiency is in the BOINC client; that's no criticism of the client code, which was probably never designed for the kinds of tasks we need to run. Even if it gets addressed, it would take time to roll a new client out.

OpenIFS, like most computational fluid dynamics codes, is limited by memory bandwidth more than by single-core speed. Starting as many tasks as there are available cores is not the way to maximise credit production with OIFS tasks.

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68047 - Posted: 25 Jan 2023, 17:10:14 UTC - in response to Message 67990.  

Why do you have the
<report_results_immediately/>
line in there?
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 68053 - Posted: 25 Jan 2023, 23:05:53 UTC - in response to Message 68046.  

Even then I see people getting it wrong and trying to over-subscribe memory.


Yeah... I gave a couple OOM reapers some exercise early on.

My guideline is simple: If I have 5GB per task, it's fine. 4GB per task is not sufficient, even with a lot of swap.
AndreyOR

Joined: 12 Apr 21
Posts: 318
Credit: 15,011,722
RAC: 7,015
Message 68054 - Posted: 26 Jan 2023, 7:09:46 UTC - in response to Message 68053.  
Last modified: 26 Jan 2023, 7:12:21 UTC

Yeah... I gave a couple OOM reapers some exercise early on.

My guideline is simple: If I have 5GB per task, it's fine. 4GB per task is not sufficient, even with a lot of swap.

I agree with the 5GB guideline. I'd add that ~10GB of RAM should be left for overhead. The following is an excellent starting point, and I suspect most users won't be able to do more without going over the desired failure rate of less than 5%. Assuming the PC has enough cores/threads, isn't used heavily for other things (especially RAM), and BOINC is allowed to use all of the system RAM, the maximum number of concurrent tasks per amount of RAM should be (applies only to current OIFS tasks):

8GB RAM system - 0 tasks
16GB RAM - 1 task
32GB RAM - 4 tasks
64GB RAM - 10 tasks

*128GB RAM - 23 tasks
*256GB RAM - 49 tasks
*512GB RAM - 100 tasks

* I have no experience with really high RAM systems but would try the same principle and adjust to stay under the 5% failure rate.
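The numbers above are consistent with a simple rule of thumb (my reading of the table, not an official CPDN formula): reserve ~10GB of RAM for overhead, then allow ~5GB per OIFS task.

```python
def max_oifs_tasks(ram_gb, overhead_gb=10, per_task_gb=5):
    """Rule-of-thumb concurrent task limit: reserve overhead RAM, then 5GB per task."""
    return max(0, (ram_gb - overhead_gb) // per_task_gb)

# Reproduces the table: 8→0, 16→1, 32→4, 64→10, 128→23, 256→49, 512→100
for ram in (8, 16, 32, 64, 128, 256, 512):
    print(ram, max_oifs_tasks(ram))
```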
Vato

Joined: 4 Oct 19
Posts: 15
Credit: 9,174,915
RAC: 3,722
Message 68068 - Posted: 27 Jan 2023, 0:18:28 UTC - in response to Message 68054.  

I have an 8GB machine that runs a single task at a time with a 100% success rate.
I have a 16GB machine that runs 2 tasks concurrently with a 100% success rate.
I believe any issue is not about the absolute memory available, but about how those machines are run.
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68069 - Posted: 27 Jan 2023, 1:36:01 UTC - in response to Message 68068.  

I have an 8GB machine that runs a single task at a time with a 100% success rate.
I have a 16GB machine that runs 2 tasks concurrently with a 100% success rate.
I believe any issue is not about the absolute memory available, but about how those machines are run.


I have a 64 GB machine with a 16-core Intel processor.

I currently have 12 cores allocated to BOINC.

I allow CPDN to run a maximum of 6 processes, but I limit
oifs_43r3_bl tasks to only one at a time,
oifs_43r3_ps tasks to only five at a time,
oifs_43r3 tasks to only five at a time.

I run five other projects: WCG (4), Einstein (1), Rosetta (3), MilkyWay (2), and Universe (2). The numbers in parentheses are the maximum number of each I allow to run at a time (if they are all supplying work).
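The limits described above could be expressed in an app_config.xml along these lines (a sketch reconstructed from the description, not a copy of the actual file):

```xml
<app_config>
   <project_max_concurrent>6</project_max_concurrent>
   <app>
      <name>oifs_43r3_bl</name>
      <max_concurrent>1</max_concurrent>
   </app>
   <app>
      <name>oifs_43r3_ps</name>
      <max_concurrent>5</max_concurrent>
   </app>
   <app>
      <name>oifs_43r3</name>
      <max_concurrent>5</max_concurrent>
   </app>
</app_config>
```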

These almost always run with a 100% success rate. The OIFS tasks have never failed me. Once in a while the legacy CPDN tasks fail, but usually with problems like negative theta and such. I notice my machine often runs successfully tasks that had several failures before they were assigned to me.

For those who care, my machine is ID: 1511241
©2024 cpdn.org