New work discussion

Author	Message
Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 66038 - Posted: 2 Sep 2022, 22:54:19 UTC Last modified: 6 Sep 2022, 5:27:49 UTC Please be patient. The new models will get here when they're ready. ID: 66038 ·

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,691,690 RAC: 10,582	Message 66068 - Posted: 7 Sep 2022, 17:30:52 UTC

A couple of unfinished points, continuing from the previous thread - specifically in reply to Glenn Carver's message 66054.

1) I don't think MilkyWay has a separate app version for each core count. Something like that would normally be handled by the plan_class mechanism, and the MilkyWay applications page only shows two application versions for the N-Body simulation - one for Windows, and the other for Linux. Both have the same simple [mt] plan_class.

2) I've set up a basic machine to run MilkyWay nbody tasks, and tracked the messages passing between the machine, the server, and the running science app. I think I've got a possible explanation of how they've done it.

a) The machine is a small 4-core Intel, no hyperthreading, running Windows 10. I've set it, via local preferences, to use 80% of the available CPUs. That calculation is done in integer maths, so the machine has three cores available.

b) The request file from the machine to the server contains these lines:

<working_global_preferences>
    <global_preferences>
        <max_ncpus_pct>80.000000</max_ncpus_pct>
    </global_preferences>
</working_global_preferences>
<host_info>
    <p_ncpus>4</p_ncpus>
</host_info>

- so the local settings are reported to the server: "use 80% of 4 CPUs".

c) The reply from the server, when new work is allocated, contains these lines:

<app_version>
    <app_name>milkyway_nbody</app_name>
    <avg_ncpus>3.000000</avg_ncpus>
</app_version>

d) The allocated tasks are shown in BOINC Manager, and marked (3 CPUs).

3) When the BOINC client starts a new task, it populates an empty slot directory with the required files, and also creates its own file called "init_data.xml". That contains the lines:

<ncpus>3.000000</ncpus>
<host_info>
    <p_ncpus>4</p_ncpus>
</host_info>

"init_data.xml" can be read by the BOINC API library linked into a project app at compile time.

I think that's how the app must be getting its threading instructions: "Although the processor has 4 cores (p_ncpus), only use 3 of them (ncpus)." I can't find any other way it can be passed, and I've eyeballed every single occurrence of the digit '3' in these files. And still it reports "Using OpenMP 3 max threads on a system with 4 processors".

ID: 66068 ·
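The "integer maths" step in point 2a can be sketched as below. This is only an illustration of the arithmetic described in the post, not the BOINC client's actual code; the function name `usable_cpus` and the lower bound of one CPU are assumptions made for the example.

```python
def usable_cpus(p_ncpus: int, max_ncpus_pct: float) -> int:
    """Apply a 'use at most N% of the CPUs' preference with truncation.

    p_ncpus and max_ncpus_pct mirror the XML fields quoted in the post.
    The floor of 1 CPU is an assumption for this sketch.
    """
    truncated = int(p_ncpus * max_ncpus_pct / 100.0)  # 3.2 becomes 3
    return max(1, truncated)


if __name__ == "__main__":
    # The 4-core machine set to 80%: 4 * 0.8 = 3.2, truncated to 3.
    print(usable_cpus(4, 80.0))
```

With these inputs the result matches the `<avg_ncpus>3.000000</avg_ncpus>` value the server sent back in point 2c.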

Aurum Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318	Message 66069 - Posted: 7 Sep 2022, 17:41:38 UTC Last modified: 7 Sep 2022, 18:00:59 UTC Will the new work have user-friendly checkpointing? I sure would love to run climate & weather models. I searched for "checkpoint" and found nothing about it. As I recall checkpoints were 4 hours or so apart. That makes it very difficult to deal with heatwaves and TOU metering. Also looks like CPDN wants to use every CPU thread on your computer. Hopefully they'll fix that bug too. Edit: Found some info but not sure how old it is: https://www.climateprediction.net/getting-started/support/technical-faq/#no_tasks_available How long does a Timestep take in real time? "A Timestep represents a 1/2 hour of model time (not realtime)." "Climateprediction.net checkpoints every 144 Timesteps..." How do we make backups of a WU in-progress? "More worrying is that a computation error loses more work. What is the appropriate reaction to this? Complaining is unlikely to be useful as trying to make the Work Unit smaller has been considered and rejected as not practical. A better reaction would be to decide to make a backup from time to time so if you do suffer an error, you can recover without losing too much work." ID: 66069 ·

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 66070 - Posted: 7 Sep 2022, 20:48:43 UTC - in response to Message 66069. Aurum These are things that we'll learn about when they arrive. ID: 66070 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 66072 - Posted: 7 Sep 2022, 20:58:39 UTC How do we make backups of a WU in-progress? Making backups is a hangover from when tasks often took 9 months to complete. The time taken to back up individual tasks really isn't worth it these days. ID: 66072 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66073 - Posted: 8 Sep 2022, 1:33:32 UTC - in response to Message 66069.

Also looks like CPDN wants to use every CPU thread on your computer. Hopefully they'll fix that bug too.

I am running Red Hat Enterprise Linux release 8.6 (Ootpa) on my Linux box:

Computer: 1511241
CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors: 16
Operating System: Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.6 (Ootpa) [4.18.0-372.19.1.el8_6.x86_64|libc 2.28 (GNU libc)]
BOINC version: 7.16.11
Memory: 62.28 GB
Cache: 16896 KB
Swap space: 15.62 GB
Total disk space: 488.04 GB
Free Disk Space: 482.7 GB
Measured floating point speed: 6.58 billion ops/sec
Measured integer speed: 30.58 billion ops/sec
Average upload rate: 738.83 KB/sec
Average download rate: 25591.7 KB/sec
Average turnaround time: 2.47 days

With this in there, it does not use all the processors for CPDN, but only the 4 specified.

[/var/lib/boinc/projects/climateprediction.net]$ cat app_config.xml
<app_config>
    <project_max_concurrent>4</project_max_concurrent>
</app_config>

ID: 66073 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403	Message 66075 - Posted: 8 Sep 2022, 9:56:37 UTC - in response to Message 66068. Last modified: 8 Sep 2022, 9:56:48 UTC "init_data.xml" can be read by the BOINC API library linked into a project app at compile time. I think that's how the app must be getting its threading instructions: "Although the processor has 4 cores (p_ncpus), only use 3 of them (ncpus). I can't find any other way it can be passed, and I've eyeballed every single occurrence of the digit '3' in these files. And still it reports "Using OpenMP 3 max threads on a system with 4 processors". Richard, thanks. That suggests MilkyWay will always run an app that fits into the available CPUs (which I think I've seen it do on my machine). For OpenIFS, that approach may not work. We will have 1-4 core versions available. If the init_data.xml tells the client I'm making 8 cores out of 16 total on my machine available, then the client will give OpenIFS wrapper code the wrong number. We'll probably have to use a different approach then to encode the correct number of threads to use. There's also the project preferences to consider when CPDN add in the ability for the user to restrict apps to below a certain core count. I am not sure how that mechanism works. Quite a few boinc issues to deal with before we can get the multicore work out to everyone. I don't want to populate this thread with a technical discussion, perhaps we can take this offline if need be. Many thanks for digging into MilkyWay's setup - that's useful. Cheers, Glenn ID: 66075 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403	Message 66076 - Posted: 8 Sep 2022, 9:59:55 UTC - in response to Message 66072. Last modified: 8 Sep 2022, 10:00:02 UTC How do we make backups of a WU in-progress? Making backups is a hangover from when tasks often took 9 months to complete. The time taken to back up individual tasks really isn't worth it these days. The CPDN models create restart dumps at frequent intervals (which we configure on the server side). If the machine is powered down or boinc shutdown, the model restarts from these dumps when the client is restarted. There's absolutely no need to create your own backups of the work units. ID: 66076 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403	Message 66077 - Posted: 8 Sep 2022, 15:50:42 UTC

Planned OpenIFS configurations and memory

Some info on memory requirements for upcoming OpenIFS forecasts. As mentioned previously, we're aiming to increase the model resolution to be more scientifically valuable. These resolutions come with higher memory requirements:

N80 grid, 125km spacing. Peak RAM = 8GB
O96 grid, 100km spacing. Peak RAM = 10GB
N128 grid, 78km spacing. Peak RAM = 19GB
O160 grid, 61km spacing. Peak RAM = 24GB

All the above use 91 model levels. Previously CPDN has only used the 125km version with 60 model levels.

Obviously these will be significantly more demanding than seen previously (I mentioned there will be additional credit for these and we'll use multicore for the higher resolutions). I would hope the first two will fit in 16GB machines; the others will need 32GB minimum (assuming of course people want to run these). Only machines which specify enough resource will get workunits. Timescale for testing is the next couple of months.

This only refers to OpenIFS, which runs globally at these resolutions, and not the Hadley Centre models.

Explanation of resolutions

OpenIFS has 3 resolutions at play: the vertical resolution - number of discrete model levels; the grid spacing between discrete points on the globe; and the number of retained waves in 'spectral space'. We can alter these numbers (within some limits) to achieve a balance between model efficiency and scientific performance.

The N number. The globe is cut into squares, with a rectangular grid. The number of points around a latitude is always double the number of points from pole to pole. The N number refers to the number of points between a pole and the equator. So an N80 grid will have 160 latitude points and 320 longitude points on the grid. The grid spacing is then just the circumference of the Earth divided by 4N.

The O number. Also specifies the number of points between pole & equator but in this case it implies what's called a 'cubic octahedral' grid. This is a different arrangement of the grid cells on the globe and it also implies fewer retained spectral waves in the model.

Spectral resolution. This refers to the number of retained waves. It's analogous to a Fourier wave transform for, say, a sound spectrum, except on a globe. By modelling the wind & temperature as waves on the globe rather than at discrete grid points we get a more accurate solution. You might also see the resolution expressed as T159L91. The T number '159' refers to the number of waves solved by the model, the 91 is the levels. 'O' grids are written as Tco159.

Hope that's a useful reference.

--- CPDN Visiting Scientist

ID: 66077 ·
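The grid arithmetic in the post above can be checked with a short sketch. The N-grid formulas (2N latitudes, 4N longitudes, spacing = circumference / 4N) come straight from the post. Two things are assumptions added for the example: the equatorial circumference of 40,075 km, and the rule that an octahedral O grid has 4N + 16 points around the equator, which is taken from ECMWF's description of that grid and is not stated in the post. Function names are illustrative only.

```python
EARTH_CIRCUMFERENCE_KM = 40075.0  # assumed equatorial circumference


def n_grid_points(n: int) -> tuple[int, int]:
    """Return (latitude points, longitude points) for a regular N grid."""
    return 2 * n, 4 * n


def n_grid_spacing_km(n: int) -> float:
    """Grid spacing for an N grid: circumference divided by 4N."""
    return EARTH_CIRCUMFERENCE_KM / (4 * n)


def o_grid_spacing_km(n: int) -> float:
    """Equatorial spacing for an octahedral grid, assuming 4N + 16 points."""
    return EARTH_CIRCUMFERENCE_KM / (4 * n + 16)


if __name__ == "__main__":
    print(n_grid_points(80))                 # latitude and longitude counts
    for n in (80, 128):
        print(f"N{n}: {n_grid_spacing_km(n):.1f} km")
    for n in (96, 160):
        print(f"O{n}: {o_grid_spacing_km(n):.1f} km")
```

Rounded to the nearest kilometre, these reproduce the 125, 78, 100 and 61 km figures quoted for N80, N128, O96 and O160.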

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 66079 - Posted: 8 Sep 2022, 18:50:33 UTC - in response to Message 66069. As I recall checkpoints were 4 hours or so apart. That makes it very difficult to deal with heatwaves and TOU metering. Task suspend/resume, with the task remaining in memory, seems to work just fine. I've not had issues with this. Suspending the entire machine also works fine. My compute nodes are solar powered in my office, so they all sleep, every night, and I power them back on every morning. This doesn't cause any problems either - machine suspend/resume is invisible to tasks. The downside is that my stuff takes longer to complete than if it were running 24/7, but it's run entirely on surplus generation from an off grid system. I just try very hard not to crash the machines or tasks... "Suspend from the last checkpoint" has about a 50-75% success rate in my experience. ID: 66079 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66080 - Posted: 9 Sep 2022, 6:04:25 UTC - in response to Message 66075.

Here is some more data on my Linux machine running an nbody milkyway task. This is part of the associated init_data.xml file in the slots directory.

<app_init_data>
    <ncpus>4.000000</ncpus>      <---<<< This is the number of CPUs a task may use.
    <host_info>
        <p_ncpus>16</p_ncpus>    <----<<< This is the number of cores the machine has.
    </host_info>
    <app_file>milkyway_nbody_1.82_x86_64-pc-linux-gnu__mt</app_file>
</app_init_data>

For a more usual single processor milkyway task, it says

<app_init_data>
    <ncpus>1.000000</ncpus>
    <host_info>
        <p_ncpus>16</p_ncpus>
    </host_info>
    <app_file>milkyway_1.46_x86_64-pc-linux-gnu</app_file>
</app_init_data>

So that is how they do it.

ID: 66080 ·
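As a rough illustration of the mechanism discussed here, the two values can be pulled out of an init_data.xml fragment like this. Real science apps get them through the BOINC API library rather than by parsing the file themselves, so this is only a sketch for inspecting a slot directory by hand; the function name `thread_budget` and the trimmed sample text are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the fragment quoted in the post (annotations removed).
SAMPLE_INIT_DATA = """<app_init_data>
<ncpus>4.000000</ncpus>
<host_info>
<p_ncpus>16</p_ncpus>
</host_info>
</app_init_data>"""


def thread_budget(xml_text: str) -> tuple[int, int]:
    """Return (CPUs the task may use, CPUs the host has)."""
    root = ET.fromstring(xml_text)
    # <ncpus> is written as a float string such as "4.000000".
    ncpus = int(float(root.findtext("ncpus", default="1")))
    p_ncpus = int(root.findtext("host_info/p_ncpus", default="1"))
    return ncpus, p_ncpus


if __name__ == "__main__":
    allowed, total = thread_budget(SAMPLE_INIT_DATA)
    print(f"Use {allowed} threads on a system with {total} processors")
```

A multithreaded app would then cap its thread pool at the first value instead of using every core it can see.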

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403	Message 66081 - Posted: 9 Sep 2022, 14:33:02 UTC - in response to Message 66080. Yep, thanks. From Richard's earlier message, the server sends the app_version data with <avg_ncpus> set to > 1 (that's the key part) and the client creates the init_data.xml from this information when it starts the task, where the value from <avg_ncpus> is copied into <ncpus> (would be nice if the naming was consistent). I also need to understand how credit is worked out with multithreaded apps, i.e. is it just 4x 1 core credit or does it take the scaling efficiency into account? For example, if 4 threads gives a 3.5 speedup, is credit then 3.5x 1 core credit or still 4x 1 core? ID: 66081 ·

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,691,690 RAC: 10,582	Message 66082 - Posted: 9 Sep 2022, 15:18:41 UTC - in response to Message 66081.

According to ye ancient scrolls of yore, a BOINC credit is also known as a cobblestone, defined as:

By definition, 200 cobblestones are awarded for one day of work on a computer that can meet either of two benchmarks:
- 1,000 double-precision MFLOPS based on the Whetstone benchmark
- 1,000 VAX MIPS based on the Dhrystone benchmark

That's all. Nothing else. Pure CPU grunt. No brownie points for complexity, cleverness, memory usage, disk usage, efficiency of execution, artistic merit, ..., ...

In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone.

ID: 66082 ·
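The original cobblestone definition quoted above works out as follows. This sketch assumes credit scales linearly with both elapsed time and benchmark speed, which is the natural reading of the definition but is not spelled out in the post; as Richard notes, projects no longer follow it, so the function (whose name is made up for the example) describes the historical definition only.

```python
def cobblestones(days: float, whetstone_mflops: float) -> float:
    """Credit under the original definition: 200 per day at 1,000 MFLOPS.

    Linear scaling in time and benchmark speed is an assumption here.
    """
    return 200.0 * days * (whetstone_mflops / 1000.0)


if __name__ == "__main__":
    # One day on the reference machine, then half a day on one twice as fast.
    print(cobblestones(1.0, 1000.0))
    print(cobblestones(0.5, 2000.0))
```

Nothing in this formula accounts for memory use or thread count, which is the point made in the post.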

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 66083 - Posted: 9 Sep 2022, 15:31:59 UTC - in response to Message 66082. In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone. You found it more quickly than I did, Richard. These days, I think Andy normally estimates the credit on the testing site here, and if the credits awarded are significantly high or low for the amount of crunching time on the same computer, George (or, less often, one of the rest of us) lets Andy know and he adjusts accordingly. There was a time when the testing branch on CPDN gave double credits, but that was before I joined the testing side of things. ID: 66083 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403	Message 66085 - Posted: 9 Sep 2022, 19:11:39 UTC - in response to Message 66083. Last modified: 9 Sep 2022, 19:11:47 UTC In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone. You found it more quickly than I did, Richard. These days, I think Andy normally estimates the credit on the testing site here, and if the credits awarded are significantly high or low for the amount of crunching time on the same computer, George (or, less often, one of the rest of us) lets Andy know and he adjusts accordingly. There was a time when the testing branch on CPDN gave double credits, but that was before I joined the testing side of things. Hmm. I'm used to a supercomputer environment where I would pay for however many compute nodes (CPU & memory), storage & archive I used. I see no reason why the same shouldn't apply to a boinc project using my machines. If it wants the faster cpu it should 'pay' more (i.e. give more credit). If it wants multiple cores & a lot more memory, it should 'pay' by awarding more credit. My 2p worth. I know credit always gives Andy headaches. I'll have a chat with him. Whatever we do, it should be broadly consistent with the credits awarded for the Hadley Centre models (OpenIFS is a faster model). ID: 66085 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66086 - Posted: 9 Sep 2022, 20:29:40 UTC - in response to Message 66082. In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone. I must be an old crusty. I do not care what a credit means, but universe and milky-way award way too much credit for the work done. My three other projects (CPDN, Rosetta, WCG) award a somewhat "reasonable" amount of credit for each work unit. I think that milky-way awards credits for the time * number of cores effectively used. Since my machine is set up to run the multiprocessor tasks with four cores, it credits about 3.65 cores for each work unit. ID: 66086 ·

Daniel Send message Joined: 16 Feb 12 Posts: 2 Credit: 518,804 RAC: 1,360	Message 66109 - Posted: 16 Sep 2022, 18:12:01 UTC When will we see more work for our computers? ID: 66109 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403	Message 66110 - Posted: 16 Sep 2022, 21:13:51 UTC - in response to Message 66109. Last modified: 16 Sep 2022, 21:14:53 UTC When will we see more work for our computers? Hi Daniel, my windows machine has just picked up a Weather@Home task (I thought they had all gone out but maybe some are being sent out again). If you have linux (bare metal or virtual box), then I will be sending out some OpenIFS project work in October. Just waiting for some code updates on the boinc side and tests. There are also two other projects I know of with OpenIFS that will be submitting work later in the year. Hard to give more exact dates because it's a small team who have other project commitments. --- CPDN Visiting Scientist ID: 66110 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 66112 - Posted: 17 Sep 2022, 7:07:33 UTC Hi Daniel, my windows machine has just picked up a Weather@Home task (I thought they had all gone out but maybe some are being sent out again). The Windows task you got will be a resend with _1 or _2 at the end of the task name meaning it is on its second or third try after failing on one or two machines, or possibly being aborted. ID: 66112 ·
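The naming convention Dave describes (a trailing _1 or _2 meaning a second or third try) can be read off a task name with a few lines. The task name used here is entirely made up, and the assumption that a first issue ends in _0 is inferred from the post rather than stated in it.

```python
def attempt_number(task_name: str) -> int:
    """Return which try a task is on, from its trailing _N suffix.

    Assumes the suffix counts from 0, so _1 is the second try.
    """
    _, _, suffix = task_name.rpartition("_")
    return int(suffix) + 1


if __name__ == "__main__":
    # Hypothetical task name, for illustration only.
    print(attempt_number("hypothetical_workunit_name_1"))
```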

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66113 - Posted: 17 Sep 2022, 13:35:01 UTC - in response to Message 66110. If you have linux (bare metal or virtual box), then I will be sending out some OpenIFS project work in October. Oh goody? Can you let us know a day ahead of time so I can tell my boinc client to allow new tasks from ClimatePrediction? As it is, I have new tasks refused because otherwise my boinc client will get no tasks from my other projects, so determined it is to get something from ClimatePrediction. ID: 66113 ·
