climateprediction.net (CPDN)

Message boards : Number crunching : Multithread - why not?
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70203 - Posted: 26 Jan 2024, 19:09:03 UTC

Why can't we have multicore tasks? The work is obviously splittable into sections, as we all get a bit each. Presumably I'm doing one 25km square, and you're doing the neighbouring one. Can't we then split those 25km squares into smaller sections and put one on each core? I can't see the difference between splitting between users as we currently do and splitting between cores. In fact splitting between cores should be better, because if necessary they can feed each other information as they go along.
ID: 70203
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,490,541
RAC: 15,784
Message 70205 - Posted: 26 Jan 2024, 21:03:40 UTC - in response to Message 70203.  
Last modified: 26 Jan 2024, 21:04:52 UTC

We have a multi core version of OpenIFS in testing.

It doesn't work the way you suggest. Everyone gets the same region, just different starting data.

Why can't we have multicore tasks? The work is obviously splittable into sections, as we all get a bit each. Presumably I'm doing one 25km square, and you're doing the neighbouring one. Can't we then split those 25km squares into smaller sections and put one on each core? I can't see the difference between splitting between users as we currently do and splitting between cores. In fact splitting between cores should be better, because if necessary they can feed each other information as they go along.

---
CPDN Visiting Scientist
ID: 70205
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70207 - Posted: 26 Jan 2024, 21:31:26 UTC - in response to Message 70205.  

Is the multicore using many different starting datas, so working the same as lots of single core tasks, or is it doing something cleverer? I read a long time ago the processing for this project is extremely linear.
ID: 70207
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70208 - Posted: 26 Jan 2024, 22:10:14 UTC - in response to Message 70207.  

Using many different starting dates is not really very different from the current system of using cores to run several tasks at once. The multi-core tasks will be OIFS and will complete single tasks much faster than single cores could manage. I see this as potentially very useful when researchers want results in a hurry.
ID: 70208
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70209 - Posted: 26 Jan 2024, 22:46:45 UTC - in response to Message 70208.  

I wrote datas not dates, although datas may not be a word!

So what I thought I'd read a long time ago about these tasks being completely linear is incorrect?
ID: 70209
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70210 - Posted: 26 Jan 2024, 23:12:57 UTC - in response to Message 70209.  

So what I thought I'd read a long time ago about these tasks being completely linear is incorrect?
It was true for the Met Office code. The rules have changed with OIFS. Or at least that is my understanding.
ID: 70210
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70211 - Posted: 26 Jan 2024, 23:14:22 UTC

I wrote datas not dates, although datas may not be a word!


What comes of looking at the forum on my phone!

Data is plural; the singular is datum.
ID: 70211
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70212 - Posted: 26 Jan 2024, 23:53:03 UTC - in response to Message 70211.  
Last modified: 26 Jan 2024, 23:53:39 UTC

I wrote datas not dates, although datas may not be a word!
What comes of looking at the forum on my phone!

Data is plural; the singular is datum.
Allegedly yes, although nobody uses it that way, like we all say "1 dice". Anyway, in this case "data" is referring to a set of data feeding the task. So I guess I should have said "sets of data", like fields of cattle. I needed a plural of a plural.

Why is zero plural?
ID: 70212
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,490,541
RAC: 15,784
Message 70215 - Posted: 27 Jan 2024, 16:46:12 UTC - in response to Message 70207.  
Last modified: 27 Jan 2024, 16:47:55 UTC

Hi Peter, I'm not quite sure what you mean by 'linear' here? Anyway, a multicore app will still use a single set of starting data, but the computation is split over multiple cores (as you suggested in your first message). For example, the code is full of loops over the "total_gridpoints". In the multicore version, CPU 1 gets points '1 -> total_gridpoints/2' and CPU 2 gets 'total_gridpoints/2 + 1 -> total_gridpoints'. It's always the same starting data, but the model completes in less time.
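
Schematically, the loop splitting looks something like this (illustrative C/OpenMP only, not the actual OpenIFS code; the names are made up):

    #include <omp.h>

    /* Hypothetical per-gridpoint update, standing in for the real model physics. */
    static void update_gridpoint(double *state, int i) {
        state[i] = state[i] * 0.99 + 0.01;
    }

    /* One model step: the loop over grid points is shared between the OpenMP
       threads, so with 2 threads, thread 0 handles roughly the first half of
       the points and thread 1 the second half. */
    void model_step(double *state, int total_gridpoints) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < total_gridpoints; i++)
            update_gridpoint(state, i);
    }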

The issue with multicore is how much parallelism can be extracted (I think this is what you mean by 'linear'). Any code will always have a fraction of its execution that can't be done in parallel (see Amdahl's law if interested in more detail). File I/O is an obvious one, but there are other sections outside the main loops which are not parallelized. On a home PC I can't get the model to parallelize as well as it would do on a high-performance computer (which it was designed for). OpenIFS on 2 cores gives a speedup close to 2, but it gradually drops as the number of cores is increased.

I don't see any value going beyond 2 cores for OpenIFS, because beyond that we're not using cores as efficiently as running multiple serial tasks. For example, if running on 4 cores means that 10% of the time only 1 core can execute, that's 3 cores wasted for that 10%. We will probably use the multicore app for the longer running tasks, to get them completed quicker and reduce the chance that they are aborted or fail.
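
To put rough numbers on that: with a parallel fraction p, Amdahl's law gives a speedup on N cores of

    S(N) = 1 / ((1 - p) + p/N)

With a 10% serial fraction (p = 0.9), S(2) = 1/0.55 ≈ 1.8 but S(4) = 1/0.325 ≈ 3.1 — about 91% efficiency per core on 2 cores versus roughly 77% on 4.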

From a technical point of view, OpenIFS (like the Hadley models) uses two methods of parallelization: OpenMP and MPI. OpenMP is what we use for CPDN as it's restricted to shared-memory computers. MPI is for use with distributed-memory computers. It also works with shared memory, but I remove it for CPDN to lower the memory overhead.

Is the multicore using many different starting datas, so working the same as lots of single core tasks, or is it doing something cleverer? I read a long time ago the processing for this project is extremely linear.

---
CPDN Visiting Scientist
ID: 70215
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70216 - Posted: 27 Jan 2024, 17:08:47 UTC - in response to Message 70215.  

Thanks for the thorough explanation, I understand now.
ID: 70216
wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 42,025,219
RAC: 69,000
Message 70217 - Posted: 27 Jan 2024, 20:35:41 UTC - in response to Message 70215.  
Last modified: 27 Jan 2024, 20:38:04 UTC

I don't see any value going beyond 2 cores for OpenIFS, because beyond that we're not using cores as efficiently as running multiple serial tasks. For example, if running on 4 cores means that 10% of the time only 1 core can execute, that's 3 cores wasted for that 10%. We will probably use the multicore app for the longer running tasks, to get them completed quicker and reduce the chance that they are aborted or fail.

There is actually a reason for OpenIFS to go beyond 2 cores: the memory limit. Given the 5GB per WU requirement in previous workloads and the desire to increase resolution, a common client system with 16G/32G/64G memory can only run very few WUs. However, such systems easily come with 8/16 cores these days, not counting SMT. The loss of efficiency from scaling up (e.g. 4-8 threads) is moot if all those additional CPU cores/threads would be idling due to the memory limit anyway.

Based on my own development experience, going from ST to MT is major work, but scaling up the number of threads afterwards is usually easier functionally, even though it may not be efficient. If it's not a lot of work to support more than two threads, it would be better to validate higher thread counts anyway. Ideally the server side could support customizing the number of threads, but IMO it would be totally fine for CPDN to default to single/dual-thread WUs and leave higher thread counts to app_config. That way the workload operates efficiently by default, but folks who are bottlenecked on memory, or who prioritize the CPDN project, can raise their thread count as necessary.
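
To put rough numbers on it: at ~5GB per WU, a 32G machine can hold about 6 WUs at once. On a 16-core box that is only 6 busy cores with single-threaded tasks, but 12 busy cores if each task runs 2 threads, and the remaining cores can still go to other projects.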
ID: 70217
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70218 - Posted: 27 Jan 2024, 21:19:35 UTC
Last modified: 27 Jan 2024, 21:22:24 UTC

I think most people run other projects as well, so the easiest solution is, as Glenn described, 2 threads. When the machine is low on RAM, it will run other projects on the other threads.

I don't know if this can be done server side, but what I do in app_config is: if a project hands out 8-thread tasks which only average 5 threads in use, I say so in app_config, so the boinc scheduler runs more tasks at once. For example:

    <app_config>
        <app_version>
            <app_name>whatever</app_name>
            <plan_class>mt</plan_class>
            <!-- the task itself still runs with 8 threads -->
            <cmdline>--nthreads 8</cmdline>
            <!-- but the scheduler only budgets 5 CPUs per task, so it runs more of them -->
            <avg_ncpus>5</avg_ncpus>
        </app_version>
    </app_config>
ID: 70218
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,490,541
RAC: 15,784
Message 70219 - Posted: 28 Jan 2024, 21:30:22 UTC - in response to Message 70218.  
Last modified: 28 Jan 2024, 21:30:50 UTC

Yes exactly, I assumed most people might run 1 or 2 of the higher memory multicore tasks and fill up the rest of the cores (if they wanted) with other projects which take minimal memory. We will configure the server to only allow 1 or 2 tasks in progress per host. We have to do this as unfortunately there is a bug in the boinc client which doesn't take the memory requirement of a task into account when deciding whether to start another task or not. This means the client could start more tasks than the computer can handle. LHC have the same restriction.

Note that we don't intend to allow users to modify the number of threads via the app_config file for multicore OpenIFS.
---
CPDN Visiting Scientist
ID: 70219
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70220 - Posted: 28 Jan 2024, 21:42:37 UTC - in response to Message 70219.  
Last modified: 28 Jan 2024, 21:43:53 UTC

Yes exactly, I assumed most people might run 1 or 2 of the higher memory multicore tasks and fill up the rest of the cores (if they wanted) with other projects which take minimal memory. We will configure the server to only allow 1 or 2 tasks in progress per host. We have to do this as unfortunately there is a bug in the boinc client which doesn't take the memory requirement of a task into account when deciding whether to start another task or not. This means the client could start more tasks than the computer can handle. LHC have the same restriction.
No they don't. LHC tasks only refuse to load if there's not enough RAM, and Boinc works it out correctly. On my machines with 128GB RAM, they fill the threads completely. In fact it will not even download more tasks if the RAM is overloaded. The server knows what the RAM use will be.

Note that we don't intend to allow users to modify the number of threads via the app_config file for multicore OpenIFS.
I don't often do that; it's mainly the <avg_ncpus> I use, to tell the scheduler how many threads are actually in use on average so it can load the right number of tasks. I only command the task to use fewer threads in special cases, for example Milkyway Nbody, which defaults to 16 threads if the computer has that many; but on a 24-thread machine it's best to tell two of them to use 12 each.
ID: 70220
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,490,541
RAC: 15,784
Message 70221 - Posted: 28 Jan 2024, 23:25:33 UTC - in response to Message 70220.  
Last modified: 28 Jan 2024, 23:26:40 UTC

Yes exactly, I assumed most people might run 1 or 2 of the higher memory multicore tasks and fill up the rest of the cores (if they wanted) with other projects which take minimal memory. We will configure the server to only allow 1 or 2 tasks in progress per host. We have to do this as unfortunately there is a bug in the boinc client which doesn't take the memory requirement of a task into account when deciding whether to start another task or not. This means the client could start more tasks than the computer can handle. LHC have the same restriction.
No they don't. LHC tasks only refuse to load if there's not enough RAM, and Boinc works it out correctly. On my machines with 128GB RAM, they fill the threads completely. In fact it will not even download more tasks if the RAM is overloaded. The server knows what the RAM use will be.

I'm referring to the client with regard to memory handling once tasks have been delivered to the client. We've had conversations with the LHC team and they confirmed they restrict the number of 'in progress' tasks on clients to work around the boinc client memory issue.

The boinc client only checks memory limits once the tasks are running. It does not check the memory requirement set in the task XML before starting the task. We discovered this early on with OpenIFS, where multiple tasks would start up and exceed the available RAM, even though we correctly specified the minimum memory needed in the task XML. It's a known issue with the boinc client; we've had conversations with LHC and David Anderson about it. If you don't believe me, I can show you the relevant boinc code.

Boinc was not designed with large memory tasks in mind.
---
CPDN Visiting Scientist
ID: 70221
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70222 - Posted: 28 Jan 2024, 23:43:59 UTC - in response to Message 70221.  
Last modified: 28 Jan 2024, 23:44:34 UTC

I'm referring to the client with regard to memory handling once tasks have been delivered to the client. We've had conversations with the LHC team and they confirmed they restrict the number of 'in progress' tasks on clients to work around the boinc client memory issue.
I don't know who gave you that misinformation, but LHC place no such restriction. If I run a computer on LHC, it will download and start LHC on every thread until it's low on RAM, then the client chooses a lower weighted project for the remainder of the threads. It's automatic and works perfectly.

The boinc client only checks memory limits once the tasks are running. It does not check the memory requirement set in the task XML before starting the task.
I often get tasks saying "waiting for memory". The client knows not to start them. Perhaps it just goes by current tasks using 80% of it, but it works, so there is nothing to be concerned about. Boinc will not run too many of your tasks.
ID: 70222
wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 42,025,219
RAC: 69,000
Message 70228 - Posted: 29 Jan 2024, 22:49:21 UTC
Last modified: 29 Jan 2024, 23:00:22 UTC

Here we go again... Based on my previous experiment, both of you are partly correct and partly wrong. It got a bit contentious back then, but perhaps I can phrase it better this time. This is based on both the code and my local experiments monitoring the task and boinccmd output at a per-second interval.

1. The boinc client does respect rsc_memory_bound set in the task spec, but only when deciding whether it can start that specific task. This can be trivially verified by setting the allowed boinc memory below rsc_memory_bound: the task will never start.
2. However, once a task has started, the boinc client monitors its RSS, averaged over a fixed interval (30 seconds from my observation), as the memory usage of the task. This average is printed out in "boinccmd --get_tasks" as "working set size". The average usage is then subtracted from the total allowed memory to decide how much free memory is left for the client to schedule the next task. The scheduling follows the first rule above and only uses the rsc_memory_bound of the next task it intends to start; the rsc_memory_bound values of running tasks are ignored.
3. Only when the sum of average usage exceeds the allowed memory will the boinc client preempt tasks for lack of memory. However, when "leave task in memory when suspended" is set, such preemption does nothing to reduce memory usage.

Here is a concrete example. Say we allow 12G to be used by BOINC on a 16GB host and set rsc_memory_bound to 8G for each task. Let's also say an OpenIFS task's memory usage never averages below 5GB over any 30-second window. The second task will not be able to start, waiting for memory forever (12 - 5 = 7G left, below the 8G bound). Tweaking the allowed memory to 13G+, though, will suddenly allow the second task to start at some point. Or if the running task's memory average temporarily dips to 4GB over some window, the boinc client will start the next task too. Depending on people's settings and the magnitude of memory fluctuation over the monitoring window, one can observe either behavior on the same host, or even with the same boinc configuration.
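
A minimal sketch of that decision rule (illustrative C only, not the actual boinc client code):

    /* Sketch of the scheduling rule described above. avg_wss[] is the
       30-second averaged RSS of each running task (the "working set size"
       shown by boinccmd --get_tasks); a new task is started only if its
       declared rsc_memory_bound fits into what is left of the allowed memory. */
    int can_start_next_task(double allowed_mem,
                            const double *avg_wss, int n_running,
                            double next_rsc_memory_bound) {
        double used = 0.0;
        for (int i = 0; i < n_running; i++)
            used += avg_wss[i];   /* measured usage, not rsc_memory_bound */
        return next_rsc_memory_bound <= allowed_mem - used;
    }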

This behavior is problematic for OpenIFS due to its wide swings in memory usage. It allows the boinc client to start too many tasks when the running tasks happen to be low on memory usage over a past monitoring window. Then, when these tasks all peak in their memory usage together later, they quickly overshoot the allowed memory. Combined with the recommended setting of "leave task in memory", even preempting tasks won't bring the memory usage down. If there isn't enough buffer on the host above the allowed memory to absorb the spike, the host OOMs and the kernel kills some tasks.

In summary, the boinc client can handle large-memory tasks, but it can't handle large fluctuations in task memory usage. LHC, especially ATLAS native, also has this problem because there is a roughly 20-minute window at the beginning before memory is fully allocated and the calculation ramps up. Hosts can quickly get screwed there, as the boinc client will happily start the next tasks.

On the bright side, this behavior can be exploited to improve throughput for OpenIFS. The more tasks you run concurrently, the less likely the worst-case scenario of all tasks peaking in memory usage at the same time becomes, especially if their start times are splayed. I played with Oracle cloud last time, where VM memory can be configured to the single GB. For 16 tasks, I got away with 4.5GB per task. For 8 tasks, I needed 5GB per task. For 4 tasks, I needed 5.5GB per task.

I heard LHC maintains the client code now? IMO, one potential quick fix on the client side is to change the monitoring algorithm from an average over 30 seconds to a maximum over 5-10 minutes. It would likely fix the problem for OpenIFS, but ironically won't be enough to help ATLAS. Most other projects shouldn't be affected either way because they have constant memory usage.
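
Roughly the change I have in mind (again illustrative C, not the actual client code):

    /* Current behaviour: a short-window average lets brief dips in RSS look
       like freed memory. Proposed: track the peak over a longer window instead. */
    double window_average(const double *rss, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += rss[i];
        return n > 0 ? sum / n : 0.0;
    }

    double window_max(const double *rss, int n) {
        double max = 0.0;
        for (int i = 0; i < n; i++) if (rss[i] > max) max = rss[i];
        return max;
    }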
ID: 70228
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70229 - Posted: 29 Jan 2024, 23:28:23 UTC - in response to Message 70228.  

Thanks for the clarification. However:

Combined with the recommended setting of "leave task in memory", even preempting tasks won't bring the memory usage down. If there isn't enough buffer on the host above the allowed memory to absorb the spike, the host OOMs and the kernel kills some tasks.
But those are just paged to disk by the OS.

I heard LHC maintains the client code now? IMO, one potential quick fix on the client side is to change the monitoring algorithm from an average over 30 seconds to a maximum over 5-10 minutes. It would likely fix the problem for OpenIFS, but ironically won't be enough to help ATLAS. Most other projects shouldn't be affected either way because they have constant memory usage.
I don't seem to have a problem with ATLAS. BOINC easily pauses some if they get too big.

And I stand by what I said, LHC do not prohibit you running as many tasks as you like.
ID: 70229
wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 42,025,219
RAC: 69,000
Message 70230 - Posted: 30 Jan 2024, 3:14:53 UTC - in response to Message 70229.  

But those are just paged to disk by the OS.

If the active working set gets paged to disk, performance will suffer greatly. OpenIFS doesn't seem to have a major memory leak, so whatever gets paged to disk is going to be fetched back in soon, and that's all wasted work compared to keeping it in memory. This also assumes you have a big enough swap file/partition in the first place, which is not the case for most Linux distros by default; they usually set up a very small swap relative to total memory. This is unlike Windows, which automatically scales up the page file by default, and it can grow to several times the actual memory.

The main reason for this is a different memory management philosophy. On Windows, when an application asks for X GB, Windows guarantees it will get X GB, so it needs a huge page file to back that up even if the memory is never used. On Linux, the kernel will happily say yes but guarantees nothing until the application starts writing to the pages. Therefore, Linux generally doesn't need a huge swap file: whatever is actually backed by memory has been written to at least once. So long as the application is not leaking left and right, the allocated memory should be fairly close to the active working set.
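
A tiny illustration of the Linux behaviour (assuming the default heuristic overcommit; vm.overcommit_memory changes this):

    /* A large malloc() normally succeeds immediately: it only reserves
       address space. Physical pages are committed as each page is first
       written, so RSS tracks what has actually been touched. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        size_t total = (size_t)8 << 30;        /* ask for 8 GB of address space */
        char *p = malloc(total);
        if (!p) { perror("malloc"); return 1; }
        /* RSS is still tiny here: nothing has been written yet. */
        memset(p, 1, (size_t)1 << 30);         /* touch 1 GB -> roughly 1 GB of RSS */
        free(p);
        return 0;
    }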

While it may sound very stupid for an application to allocate memory that it never uses, this usually happens implicitly through fork. A forked process inherits the entire address space, and without explicit clean-up much of that space will never be touched again by the new process. On Windows that space goes to swap and you'd better have a big enough swap; on Linux it never gets allocated at all because it is never written. My pick for top offender of this pattern is Python's multiprocessing library, which by definition has to keep the entire address space while giving the user no way to clean it up even if they want to. :-)

LHC do not prohibit you running as many tasks as you like.

LHC has multiple applications and I wonder if we are talking about the same thing. For ATLAS, IIRC, the limit is 2x or 4x the Max # CPUs set in the website preferences. Even if you set that to unlimited, you won't get more than 40 tasks in flight per client. Theory doesn't seem to have any limit and will happily fill your job cache.
ID: 70230
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70231 - Posted: 30 Jan 2024, 3:33:50 UTC - in response to Message 70230.  

But those are just paged to disk by the OS.
If the active working set gets paged to disk, performance will suffer greatly.
But the OS will page out the inactive tasks in preference.

OpenIFS doesn't seem to have a major memory leak, so whatever gets paged to disk is going to be fetched back in soon, and that's all wasted work compared to keeping it in memory. This also assumes you have a big enough swap file/partition in the first place, which is not the case for most Linux distros by default.
Presumably you're told when it's not big enough?

LHC do not prohibit you running as many tasks as you like.
LHC has multiple applications and I wonder if we are talking about the same thing. For ATLAS, IIRC, it's 2x or 4x of Max # CPUs set in website preference. Even if you set that to unlimited, you won't get more than 40 tasks in flight per client whatsoever. Theory doesn't seem to have any limit and will happily fill your job cache.
I will have set everything to max on the website. I run ATLAS, CMS, and Theory. I don't have a machine with more than 24 threads. The 40 limit will never apply, since ATLAS can use 8 threads, so 40x8 is a lot more than any machine has with current technology.
ID: 70231