Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"New for me having been only Windows for many years have now got a VM running Ubuntu 18.04 (one core out of 4 on a 3.3GHz i5, allocated 3Gb RAM) have now got one of these to go with an N144. Will see how it goes." 3GBytes of RAM for a N144 task? That seems like a lot. I am running four N216 tasks on Red Hat Enterprise Linux Server release 6.10 (Santiago) ... CPU type: GenuineIntel Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]; number of processors: 4; operating system: Linux 2.6.32-754.23.1.el6.x86_64; memory: 15.5 GB. The tasks are taking 1.35 GB of virtual memory each, and the working set size is 1.33 GB. I do not have any N144 tasks at the moment. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
"The project are looking at measures to deal with the large number of computers missing the 32bit libraries or otherwise crashing everything on these and other Linux tasks." I just saw hadam4h tasks from batches 842 and 843, with _3 and _4 tails, download on one of my machines. Glad to take them. <edit> Executive summary: these hadam4h N216 models can run in about a week even on old machines. The question is how many cores to load before throughput drops off fast. <edit> Unfortunately I have a 2-to-3-week backlog of these WUs, mostly on account of early underestimated completion times and optimism about how much help the extra cores on recent Intel and AMD CPUs might give. There seems to be a "sweet spot" for CPUs of various ages and designs, a minimum that depends on the generation, microarchitecture, and load. I can't generalize yet, but here are 3 examples. (1) AMD Phenom II 6-core: 1 model completes in 5.7 days with no other load; 3 models at once on the 6 "cores" take 19.6 days to finish, or 6.5 days per model. (2) Intel i7-7700: 4 models on all 4 cores (no threading) take 8.8 days, or 2.2 days each; 2 hadam4h on this box take 6.2 days, or 3.1 days each; 3 at once take (estimated) 7.1 days, or 2.4 CPU days each. In this case it pays well to run 4 models (compare Sandy Bridge through Haswell). (3) Intel i7-8700K: 6 models, loading all 6 "cores", take 12 days to finish, or 2 days per model. That is barely better than the 6.2 days to complete 3 models with only 3 at once, and certainly better throughput than running only 2 models on the 6-core box, where the estimate is 5.3 days for 2 to complete, or 2.6 days per model. Doh. I hope to report more on speed vs. core count on the newer Ryzen 2700X and 3900X. First observation: using all 8 or 12 cores gets no more throughput than using 4 or 6, right around a week per model. (I don't know how all that 64 MB of L3 on the 3900X is shared; many other questions.) |
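
The per-model figures in the post above are simply elapsed days divided by the number of models run at once. Here is a minimal Python sketch of that arithmetic, using only the numbers quoted in the post (two of which the poster marks as estimates). The dictionary keys and function names are illustrative labels of my own, not anything from BOINC or CPDN.

```python
# Reported runs from the post: (models run at once, elapsed days until all finished).
# The keys are hypothetical labels for the three machines described.
REPORTED_RUNS = {
    "phenom_ii_6core": [(1, 5.7), (3, 19.6)],
    "intel_4core": [(2, 6.2), (3, 7.1), (4, 8.8)],
    "intel_6core": [(2, 5.3), (3, 6.2), (6, 12.0)],
}


def days_per_model(concurrent, elapsed_days):
    """Elapsed days divided by the number of models that ran side by side."""
    return elapsed_days / concurrent


def best_concurrency(runs):
    """Return the task count with the lowest days-per-model among the measured runs."""
    return min(runs, key=lambda run: days_per_model(*run))[0]


if __name__ == "__main__":
    for host, runs in REPORTED_RUNS.items():
        for concurrent, elapsed in runs:
            rate = days_per_model(concurrent, elapsed)
            print(f"{host}: {concurrent} at once -> {rate:.2f} days/model")
        print(f"{host}: best measured setting = {best_concurrency(runs)} at once")
```

On these numbers the Phenom II does best with one model at a time, while both Intel boxes do best fully loaded. That only ranks the settings that were actually measured; it says nothing about the task counts in between.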
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,000,748 RAC: 14,638 |
"3GBytes of RAM for a N144 task? That seems like a lot." That's what is allocated to the virtual machine. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
One just failed with a seg fault, shortly before first zip so no credit from that one. :( Not that I have looked at my credit for some time anyway. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,000,748 RAC: 14,638 |
The N144 task has just completed in about 160 hours run time. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"One just failed with a seg fault, shortly before first zip so no credit from that one. :(" I have something a bit more unusual. All the ones shown as "completed" on my Ryzen 3700X actually have this in the stderr: "Signal 15 received: Software termination signal from kill", "Signal 15 received: Abnormal termination triggered by abort call", "Signal 15 received, exiting...", "SIGSEGV: segmentation violation". See https://www.cpdn.org/results.php?hostid=1493935 That is interesting. EDIT: I see it occasionally on my other machines too. Maybe you have to fail in order to succeed. |
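
For anyone wanting to check their own results for the same pattern, here is a rough Python sketch that flags stderr text containing the messages quoted above. It assumes you have already saved each task's stderr as a string by some means of your own. The function names are hypothetical and nothing here talks to the CPDN site.

```python
import re

# Patterns taken from the stderr lines quoted in the post above.
SIGTERM_LINE = re.compile(r"Signal 15 received")
SEGFAULT_LINE = re.compile(r"SIGSEGV: segmentation violation")


def classify_stderr(stderr_text):
    """Report which of the two quoted messages appear in one task's stderr."""
    return {
        "sigterm": bool(SIGTERM_LINE.search(stderr_text)),
        "segfault": bool(SEGFAULT_LINE.search(stderr_text)),
    }


def count_segfaults(stderr_texts):
    """Count how many stderr outputs contain the SIGSEGV message."""
    return sum(1 for text in stderr_texts if classify_stderr(text)["segfault"])


# Sample built from the four messages quoted in the post.
SAMPLE = "\n".join([
    "Signal 15 received: Software termination signal from kill",
    "Signal 15 received: Abnormal termination triggered by abort call",
    "Signal 15 received, exiting...",
    "SIGSEGV: segmentation violation",
])

if __name__ == "__main__":
    print(classify_stderr(SAMPLE))
```

A count like this only shows how often the message appears. It does not say whether the segfault affected the result.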
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
It might be caused by travelling at such high speed. Mine travel at a more sedate speed, and stop by easing down on the brake. :) Task 21759776 Well, when I say sedate speed, I have been getting complaints from snails wanting to get past. :) |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Well, now that you put me up to it, I checked the last twelve successful completions on all machines. Nine of them show the segfaults, and three don't. Some of each are from each machine, both Intel and Ryzen. I won't try to do a statistical analysis. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,000,748 RAC: 14,638 |
My N216 task has reached 25% after about 4.5 days. I've also got 2 batch 848 tasks, one of which is at 15% after 1 day. |
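
Progress figures like these can be turned into a rough completion estimate. This is a small Python sketch that assumes progress is linear in elapsed time, which the thread does not confirm. The function name is made up for illustration.

```python
def estimate_total_days(elapsed_days, percent_done):
    """Linear extrapolation: total run time if progress keeps its average pace."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_days * 100.0 / percent_done


if __name__ == "__main__":
    # Figures from the post above.
    print(f"N216 task: about {estimate_total_days(4.5, 25):.1f} days in total")
    print(f"Batch 848 task: about {estimate_total_days(1.0, 15):.1f} days in total")
```

Treat the result as a ballpark only, since the time per timestep can vary during a run.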
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,000,748 RAC: 14,638 |
This has now reached over 50% with time steps between 43 and 45 sec/ts. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
Looking at the failures for these and the N144 tasks, most of the failures that are not down to missing 32bit libs are from machines running large numbers of these tasks concurrently. (Often 16, 24 or even more in some cases.) Some of these machines are failing nearly everything they get. If this applies to you, please try under computing preferences, reducing the number of cores in use. A lot of computer time is being wasted by this issue. |
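
If you want to cap the number of tasks through the "use at most N% of the CPUs" computing preference, the percentage has to be worked out from your thread count. The Python sketch below assumes the client truncates `ncpus * percent / 100` to a whole number. That rounding rule is my assumption, so check it against what your client actually starts. The function name is hypothetical.

```python
def cpu_percent_for(tasks, ncpus):
    """Smallest whole percentage that yields `tasks` usable CPUs out of `ncpus`.

    Assumes the client computes floor(ncpus * percent / 100); this is an
    assumption about BOINC's behaviour, not something stated in the thread.
    """
    if not 1 <= tasks <= ncpus:
        raise ValueError("tasks must be between 1 and ncpus")
    percent = -(-100 * tasks // ncpus)  # ceiling division
    if (ncpus * percent) // 100 != tasks:
        raise ValueError("no whole percentage gives exactly that many CPUs")
    return percent


if __name__ == "__main__":
    for tasks, ncpus in [(6, 12), (4, 16), (3, 8)]:
        print(f"{tasks} tasks on {ncpus} threads: set {cpu_percent_for(tasks, ncpus)}%")
```

Note that this preference limits all BOINC work on the machine, not just the HadAM4 tasks.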
Send message Joined: 3 Sep 04 Posts: 126 Credit: 26,610,380 RAC: 3,377 |
It's a matter of statistics. There's simply more chance that you encounter a machine running many tasks concurrently, so it doesn't need to be the cause. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
"Well, now that you put me up to it, I checked the last twelve successful completions on all machines." For me, these "N216" have never failed. They take almost two weeks, but unlike the "144" things, no fails here. Give me more of the "N216". |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
"Looking at the failures for these and the N144 tasks, most of the failures that are not down to missing 32bit libs are from machines running large numbers of these tasks concurrently. (Often 16, 24 or even more in some cases.) Some of these machines are failing nearly everything they get. If this applies to you, please try under computing preferences, reducing the number of cores in use. A lot of computer time is being wasted by this issue." I say never load an "Ncore-2Nthread" CPU with more than "Ncore" tasks, AND for the many-core (6-16 core) CPUs, allow fewer than the number of cores to run these N216 models. Preliminary stats from my fastest boxes show that the "biggie manycore boxes" I'm running are most productive at about half the core count. I don't have any "Threadripper or i9-10", BUT the Ryzen 9 39xx is twice as fast on these L2 and L3 hogs as any other "not-quite-bleeding-edge" box I'm running. You can look at my public stats: near zero fails on the N216 batches, random fails on the N144 batches. I have no clue why. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"For me, these 'N216' have never failed. They take almost two weeks but unlike the '144' things -- no fails here" It looks like you are running the Linux 5.2.0 or 5.3.0 kernel. That may help. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
Looking at the batch statistics for the most recent 144 and 216 batches, the 216 is at 7% hard fails and the 144 at 13%, and not all of them have gone out yet. To do a completely fair comparison I should probably wait until more have finished, but it does suggest to me that, at least between #852 and #853, the 144 batch tasks are more likely to fail. On the 216 batch, of the first five hard fails listed (so all failed 3 times), only two of the fifteen fails are not down to missing libraries. On the 144 batch, six of the 15 fails are not down to missing libraries. I probably need to look at more than just ten tasks to confirm the reasons, but the 144s seem more prone to invalid theta errors, which means they are pushing the limits of the models and producing impossible climates such as negative air pressure, which is one I have seen in the past. This may well end up being different for some N144 batches; it can't be assumed that these figures apply to all, and putting different numbers into the initial conditions of batches may change things a lot. Some of the tasks involved are not really hard fails yet, as the number of attempts for Linux tasks has been upped to four. |
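
To put the two counts above side by side, here is a tiny Python sketch that turns them into percentages. The dictionary labels are my own shorthand. With only fifteen fails per batch, and each workunit contributing three of them, the difference should be read as a hint, not a result.

```python
# (fails not caused by missing libraries, fails examined), as quoted in the post.
NON_LIBRARY_FAILS = {
    "N216 batch": (2, 15),
    "N144 batch": (6, 15),
}


def share(part, total):
    """Percentage of `total` made up by `part`."""
    return 100.0 * part / total


if __name__ == "__main__":
    for batch, (non_library, examined) in NON_LIBRARY_FAILS.items():
        pct = share(non_library, examined)
        print(f"{batch}: {non_library}/{examined} = {pct:.1f}% not down to missing libraries")
```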
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
This one failed for me due to negative pressure. I do have the required 32-bit libraries, as evidenced by other 144 work having completed successfully, as have the 216 ones. Task 21866633; name hadam4_a06t_209110_6_856_011961960_3; workunit 11961960. My failed one says: workunit 11961960; name hadam4_a06t_209110_6_856_011961960; application UK Met Office HadAM4 at N144 resolution; created 9 Dec 2019, 1:30:09 UTC; minimum quorum 1; initial replication 1; max # of error/total/success tasks 5, 5, 1; errors: Too many total results. No need to do anything about this, I suppose: just another data point. |
Send message Joined: 18 Feb 17 Posts: 81 Credit: 14,024,464 RAC: 5,225 |
Making sure I understand this correctly: I have a dual Xeon 2670v2 with 25 MB of L3 cache per processor. Will BOINC and the workunits see this as 50 MB total, so that I should be able to run 10-12 concurrently? (I'm leaning toward fewer at this point.) |
Send message Joined: 11 Dec 19 Posts: 108 Credit: 3,012,142 RAC: 0 |
It's not really about how BOINC sees it. It is complicated to answer because it has to do with how the kernel's process scheduler and memory management move data into and out of RAM, to and from the CPUs. The kernel can only get data in and out of RAM at the speed of the RAM, it can only move data between CPU cores at the speed of the CPU die interconnect, and it can only move data between CPU sockets at the speed of the motherboard's North Bridge controller. My advice is to start with fewer tasks and work up to more until you find a sweet spot. |
Send message Joined: 18 Feb 17 Posts: 81 Credit: 14,024,464 RAC: 5,225 |
"It's not really about how BOINC sees it. It is complicated to answer because it has to do with how the kernel's process scheduler and memory management move data into and out of RAM to and from the CPUs. The kernel can only get data in/out of RAM at the speed of the RAM and it can only move data between CPU cores at the speed of the CPU die interconnect and it can only move data between CPU sockets at the speed of the motherboard's North Bridge controller." That makes complete sense and clarifies a lot. Thanks. This probably goes for a lot of projects. |
©2024 cpdn.org