Message boards : Number crunching : New work discussion - 2
Author | Message |
---|---|
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Please be patient. The new models will get here when they're ready. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,706,621 RAC: 9,524 |
A couple of unfinished points, continuing from the previous thread - specifically in reply to Glenn Carver's message 66054.
1) I don't think MilkyWay has a separate app version for each core count. Something like that would normally be handled by the plan_class mechanism, and the MilkyWay applications page only shows two application versions for the N-Body simulation - one for Windows, and the other for Linux. Both have the same simple [mt] plan_class.
2) I've set up a basic machine to run MilkyWay nbody tasks, and tracked the messages passing between the machine, the server, and the running science app. I think I've got a possible explanation of how they've done it.
a) The machine is a small 4-core Intel, no hyperthreading, running Windows 10. I've set it, via local preferences, to use 80% of the available CPUs. That calculation is done in integer maths, so the machine has three cores available.
b) The request file from the machine to the server contains these lines:
<working_global_preferences>
<global_preferences>
<max_ncpus_pct>80.000000</max_ncpus_pct>
</global_preferences>
</working_global_preferences>
<host_info>
<p_ncpus>4</p_ncpus>
</host_info>
- so the local settings are reported to the server: "use 80% of 4 CPUs".
c) The reply from the server, when new work is allocated, contains these lines:
<app_version>
<app_name>milkyway_nbody</app_name>
<avg_ncpus>3.000000</avg_ncpus>
</app_version>
d) The allocated tasks are shown in BOINC Manager, and marked (3 CPUs).
3) When the BOINC client starts a new task, it populates an empty slot directory with the required files, and also creates its own file called "init_data.xml". That contains the lines:
<ncpus>3.000000</ncpus>
<host_info>
<p_ncpus>4</p_ncpus>
</host_info>
"init_data.xml" can be read by the BOINC API library linked into a project app at compile time. I think that's how the app must be getting its threading instructions: "Although the processor has 4 cores (p_ncpus), only use 3 of them (ncpus)".
I can't find any other way it can be passed, and I've eyeballed every single occurrence of the digit '3' in these files. And still it reports "Using OpenMP 3 max threads on a system with 4 processors". |
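The arithmetic in point 2a can be sketched in a few lines of Python. This only illustrates the truncating calculation described in the post; it is not the BOINC client's actual code. The function name `usable_cpus` and the floor of one CPU are assumptions.

```python
def usable_cpus(p_ncpus: int, max_ncpus_pct: float) -> int:
    """Return the whole number of CPUs left after applying a percentage limit.

    Hypothetical helper: the result is truncated rather than rounded, as the
    post describes ("integer maths"). The minimum of 1 is an assumption and
    is not taken from the thread.
    """
    limited = int(p_ncpus * max_ncpus_pct / 100.0)  # truncate, do not round
    return max(1, limited)


if __name__ == "__main__":
    # 80% of 4 CPUs is 3.2, which truncates to 3
    print(usable_cpus(4, 80.0))
```

With the values from the request file (`p_ncpus` of 4 and `max_ncpus_pct` of 80) this gives 3, which matches the `avg_ncpus` value in the server reply.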
Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318 |
Will the new work have user-friendly checkpointing? I sure would love to run climate & weather models. I searched for "checkpoint" and found nothing about it. As I recall checkpoints were 4 hours or so apart. That makes it very difficult to deal with heatwaves and TOU metering. Also looks like CPDN wants to use every CPU thread on your computer. Hopefully they'll fix that bug too. Edit: Found some info but not sure how old it is: https://www.climateprediction.net/getting-started/support/technical-faq/#no_tasks_available How long does a Timestep take in real time? "A Timestep represents a 1/2 hour of model time (not realtime)." "Climateprediction.net checkpoints every 144 Timesteps..." How do we make backups of a WU in-progress? "More worrying is that a computation error loses more work. What is the appropriate reaction to this? Complaining is unlikely to be useful as trying to make the Work Unit smaller has been considered and rejected as not practical. A better reaction would be to decide to make a backup from time to time so if you do suffer an error, you can recover without losing too much work." |
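To put the FAQ figures quoted above into numbers: a checkpoint every 144 timesteps of half an hour each is 72 hours of model time. How long that takes in real time depends on the host, so the seconds-per-timestep value in this sketch is a made-up placeholder, not a measurement, and the function names are hypothetical.

```python
TIMESTEP_MODEL_HOURS = 0.5       # from the FAQ: one timestep is 1/2 hour of model time
TIMESTEPS_PER_CHECKPOINT = 144   # from the FAQ: checkpoint every 144 timesteps


def model_hours_per_checkpoint() -> float:
    """Model time covered between two checkpoints."""
    return TIMESTEP_MODEL_HOURS * TIMESTEPS_PER_CHECKPOINT


def real_hours_per_checkpoint(seconds_per_timestep: float) -> float:
    """Wall-clock hours between checkpoints for a host-specific timestep cost."""
    return TIMESTEPS_PER_CHECKPOINT * seconds_per_timestep / 3600.0


if __name__ == "__main__":
    print(model_hours_per_checkpoint())      # 72.0 model hours, i.e. 3 model days
    # Hypothetical host that needs 100 seconds per timestep
    print(real_hours_per_checkpoint(100.0))  # 4.0 wall-clock hours
```

The FAQ may be out of date, as the post notes, so the constants should be treated as illustrative.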
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Aurum, these are things that we'll learn about when they arrive. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,017,270 RAC: 20,902 |
"How do we make backups of a WU in-progress?"
Making backups is a hangover from when tasks often took 9 months to complete. The time taken to back up individual tasks really isn't worth it these days. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Also looks like CPDN wants to use every CPU thread on your computer. Hopefully they'll fix that bug too."
I am running Red Hat Enterprise Linux release 8.6 (Ootpa) on my Linux box:
Computer 1511241
CPU type GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 16
Operating System Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.6 (Ootpa) [4.18.0-372.19.1.el8_6.x86_64|libc 2.28 (GNU libc)]
BOINC version 7.16.11
Memory 62.28 GB
Cache 16896 KB
Swap space 15.62 GB
Total disk space 488.04 GB
Free Disk Space 482.7 GB
Measured floating point speed 6.58 billion ops/sec
Measured integer speed 30.58 billion ops/sec
Average upload rate 738.83 KB/sec
Average download rate 25591.7 KB/sec
Average turnaround time 2.47 days
With the following app_config.xml in place, it does not use all the processors for CPDN, but only the 4 specified.
[/var/lib/boinc/projects/climateprediction.net]$ cat app_config.xml
<app_config>
<project_max_concurrent>4</project_max_concurrent>
</app_config> |
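For anyone who prefers to generate that file rather than type it, here is a minimal sketch using Python's standard library. The element names are the ones shown in the post; the helper name `build_app_config` is hypothetical, and writing the result into the project directory is left to the reader.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent: int) -> str:
    """Return app_config.xml text that caps concurrent tasks for one project."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Same limit as in the post: at most 4 tasks from this project at once
    print(build_app_config(4))
```

The output would go into a file named app_config.xml in the project's directory, which in the post is /var/lib/boinc/projects/climateprediction.net.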
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"'init_data.xml' can be read by the BOINC API library linked into a project app at compile time. I think that's how the app must be getting its threading instructions: 'Although the processor has 4 cores (p_ncpus), only use 3 of them (ncpus)'. I can't find any other way it can be passed, and I've eyeballed every single occurrence of the digit '3' in these files. And still it reports 'Using OpenMP 3 max threads on a system with 4 processors'."
Richard, thanks. That suggests MilkyWay will always run an app that fits into the available CPUs (which I think I've seen it do on my machine). For OpenIFS, that approach may not work. We will have 1-4 core versions available. If the init_data.xml tells the client I'm making 8 cores out of 16 total on my machine available, then the client will give the OpenIFS wrapper code the wrong number. We'll probably have to use a different approach then to encode the correct number of threads to use. There's also the project preferences to consider when CPDN add in the ability for the user to restrict apps to below a certain core count. I am not sure how that mechanism works. Quite a few BOINC issues to deal with before we can get the multicore work out to everyone. I don't want to populate this thread with a technical discussion; perhaps we can take this offline if need be. Many thanks for digging into MilkyWay's setup - that's useful. Cheers, Glenn |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"How do we make backups of a WU in-progress?"
"Making backups is a hangover from when tasks often took 9 months to complete. The time taken to back up individual tasks really isn't worth it these days."
The CPDN models create restart dumps at frequent intervals (which we configure on the server side). If the machine is powered down or BOINC is shut down, the model restarts from these dumps when the client is restarted. There's absolutely no need to create your own backups of the work units. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Planned OpenIFS configurations and memory
Some info on memory requirements for upcoming OpenIFS forecasts. As mentioned previously, we're aiming to increase the model resolution to be more scientifically valuable. These resolutions come with higher memory requirements:
O96 grid, 100km. Peak RAM = 10GB
N128 grid, 78km. Peak RAM = 19GB
O160 grid, 61km. Peak RAM = 24GB
All the above use 91 model levels. Previously CPDN has only used the 125km version with 60 model levels. Obviously these will be significantly more demanding than seen previously (I mentioned there will be additional credit for these and we'll use multicore for the higher resolutions). I would hope the first two to fit in 16GB machines, the others will need 32GB minimum (assuming of course people want to run these). Only machines which specify enough resource will get workunits. Timescale for testing is the next couple of months.
--- CPDN Visiting Scientist |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
"As I recall checkpoints were 4 hours or so apart. That makes it very difficult to deal with heatwaves and TOU metering."
Task suspend/resume, with the task remaining in memory, seems to work just fine. I've not had issues with this. Suspending the entire machine also works fine. My compute nodes are solar powered in my office, so they all sleep every night, and I power them back on every morning. This doesn't cause any problems either - machine suspend/resume is invisible to tasks. The downside is that my stuff takes longer to complete than if it were running 24/7, but it's run entirely on surplus generation from an off-grid system. I just try very hard not to crash the machines or tasks... "Suspend from the last checkpoint" has about a 50-75% success rate in my experience. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Here is some more data on my Linux machine running an nbody milkyway task. This is part of the associated init_data.xml file in the slots directory.
<app_init_data>
<ncpus>4.000000</ncpus> <---<<< This is the number of CPUs the task may use.
<host_info>
<p_ncpus>16</p_ncpus> <----<<< This is the number of cores the machine has.
</host_info>
<app_file>milkyway_nbody_1.82_x86_64-pc-linux-gnu__mt</app_file>
</app_init_data>
For a more usual single processor milkyway task, it says
<app_init_data>
<ncpus>1.000000</ncpus>
<host_info>
<p_ncpus>16</p_ncpus>
</host_info>
<app_file>milkyway_1.46_x86_64-pc-linux-gnu</app_file>
</app_init_data>
So that is how they do it. |
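As a rough illustration of how a program could pull those two numbers out of init_data.xml, here is a Python sketch. Real project apps read the file through the BOINC API library, as noted earlier in the thread, so this is not how MilkyWay does it. The function name `thread_budget` is hypothetical, and the sketch assumes the file parses as well-formed XML with `app_init_data` as the root element.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample containing only the elements quoted in the post
SAMPLE = """<app_init_data>
<ncpus>4.000000</ncpus>
<host_info>
<p_ncpus>16</p_ncpus>
</host_info>
</app_init_data>"""


def thread_budget(init_data_xml: str) -> tuple[int, int]:
    """Return (threads the task may use, processors on the host)."""
    root = ET.fromstring(init_data_xml)
    # ncpus is written as a float ("4.000000"), so convert via float first
    ncpus = int(float(root.findtext("ncpus", default="1")))
    p_ncpus = int(root.findtext("host_info/p_ncpus", default="1"))
    return max(1, ncpus), p_ncpus


if __name__ == "__main__":
    threads, processors = thread_budget(SAMPLE)
    print(f"Using {threads} threads on a system with {processors} processors")
```

A multithreaded app would then pass the first value to its threading runtime instead of using every core it can see.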
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Yep, thanks. From Richard's earlier message, the server sends the app_version data with <avg_ncpus> set to > 1 (that's the key part) and the client creates the init_data.xml from this information when it starts the task, where the value from <avg_ncpus> is copied into <ncpus> (would be nice if the naming was consistent). I also need to understand how credit is worked out with multithreaded apps, i.e. is it just 4x 1 core credit, or does it take the scaling efficiency into account? For example, if 4 threads gives a 3.5 speedup, is credit then 3.5x 1 core credit or still 4x 1 core? |
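The two possibilities in that question can be written down side by side. This sketch only makes the arithmetic concrete; neither scheme is confirmed anywhere in this thread as what BOINC or CPDN actually does, and the function names and the base credit rate of 100 are hypothetical.

```python
def credit_by_core_count(single_core_credit: float, threads: int) -> float:
    """Scheme A: pay for every core reserved, regardless of efficiency."""
    return single_core_credit * threads


def credit_by_speedup(single_core_credit: float, speedup: float) -> float:
    """Scheme B: pay only for the speedup actually achieved."""
    return single_core_credit * speedup


def parallel_efficiency(speedup: float, threads: int) -> float:
    """Fraction of ideal scaling achieved (1.0 means perfect scaling)."""
    return speedup / threads


if __name__ == "__main__":
    base = 100.0  # hypothetical credit rate for the single-core run
    print(credit_by_core_count(base, 4))  # 400.0
    print(credit_by_speedup(base, 3.5))   # 350.0
    print(parallel_efficiency(3.5, 4))    # 0.875
```

The gap between the two schemes is exactly the parallel efficiency, so the choice matters more as thread counts rise and scaling falls off.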
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,706,621 RAC: 9,524 |
According to ye ancient scrolls of yore, a BOINC credit is also known as a cobblestone, defined as:
By definition, 200 cobblestones are awarded for one day of work on a computer that can meet either of two benchmarks:
That's all. Nothing else. Pure CPU grunt. No brownie points for complexity, cleverness, memory usage, disk usage, efficiency of execution, artistic merit, ..., ...
In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone. |
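Read literally, the original definition is a simple proportion, which can be sketched as below. The 200-per-day constant comes from the definition quoted above. The linear scaling with benchmark score is an assumption, and the reference benchmark is left as a parameter because its value is not given in this thread. The function name is hypothetical.

```python
COBBLESTONES_PER_REFERENCE_DAY = 200.0  # from the definition quoted above


def cobblestones(days: float, host_benchmark: float, reference_benchmark: float) -> float:
    """Credit for `days` of work on a host, relative to a reference machine.

    Assumes credit scales linearly with the host's benchmark score; both
    benchmark arguments must be in the same units.
    """
    if reference_benchmark <= 0:
        raise ValueError("reference_benchmark must be positive")
    return COBBLESTONES_PER_REFERENCE_DAY * days * (host_benchmark / reference_benchmark)


if __name__ == "__main__":
    # A host that exactly matches the reference machine, running for one day
    print(cobblestones(1.0, 1.0, 1.0))  # 200.0
    # A host twice as fast as the reference, running for half a day
    print(cobblestones(0.5, 2.0, 1.0))  # 200.0
```

As the post says, projects no longer follow this strictly, so this describes the historical intent rather than current practice.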
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,017,270 RAC: 20,902 |
"In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone."
You found it more quickly than I did, Richard. These days, I think Andy normally estimates the credit on the testing site here. If the credits awarded are significantly high or low for the amount of crunching time on the same computer, George, or less often one of the rest of us, lets Andy know and he adjusts accordingly. There was a time when the testing branch on CPDN gave double credits, but that was before I joined the testing side of things. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Hmm. I'm used to a supercomputer environment where I would pay for how many compute nodes (CPU & memory), storage & archive. I see no reason why the same shouldn't apply to a BOINC project using my machines. If it wants the faster CPU it should 'pay' more (i.e. give more credit). If it wants multiple cores & a lot more memory, it should 'pay' by awarding more credit. My 2p worth.
"In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone."
"You found it more quickly than I did Richard. These days, I think here Andy normally estimates the credit on the testing site and if credits awarded are significantly high or low for the amount of crunching time on the same computer George or less often one of the rest of us lets Andy know and he adjusts accordingly. There was a time when testing branch on CPDN gave double credits but that was before I joined the testing side of things."
I know credit always gives Andy headaches. I'll have a chat to him. Whatever we do, it should be broadly consistent with the credits awarded for the Hadley Centre models (OpenIFS is a faster model). |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone."
I must be an old crusty. I do not care what a credit means, but Universe and MilkyWay award way too much credit for the work done. My three other projects (CPDN, Rosetta, WCG) award a somewhat "reasonable" amount of credit for each work unit. I think that MilkyWay awards credits for the time * number of cores effectively used. Since my machine is set up to run the multiprocessor tasks with four cores, it credits about 3.65 cores for each work unit. |
Send message Joined: 16 Feb 12 Posts: 2 Credit: 520,461 RAC: 1,386 |
When will we see more work for our computers? |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"When will we see more work for our computers?"
Hi Daniel, my Windows machine has just picked up a Weather@Home task (I thought they had all gone out but maybe some are being sent out again). If you have Linux (bare metal or VirtualBox), then I will be sending out some OpenIFS project work in October. Just waiting for some code updates on the BOINC side and tests. There are also two other projects I know of with OpenIFS that will be submitting work later in the year. Hard to give more exact dates because it's a small team who have other project commitments.
--- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,017,270 RAC: 20,902 |
"Hi Daniel, my Windows machine has just picked up a Weather@Home task (I thought they had all gone out but maybe some are being sent out again)."
The Windows task you got will be a resend, with _1 or _2 at the end of the task name, meaning it is on its second or third try after failing on one or two machines, or possibly being aborted. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"If you have Linux (bare metal or VirtualBox), then I will be sending out some OpenIFS project work in October."
Oh goody? Can you let us know a day ahead of time so I can tell my BOINC client to allow new tasks from ClimatePrediction? As it is, I have new tasks refused because otherwise my BOINC client will get no tasks from my other projects, so determined it is to get something from ClimatePrediction. |
©2024 cpdn.org