Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61512 - Posted: 10 Nov 2019, 1:33:21 UTC - in response to Message 61511.  

New for me: having been Windows-only for many years, I now have a VM running Ubuntu 18.04 (one core out of 4 on a 3.3 GHz i5, allocated 3 GB of RAM) and have got one of these to go with an N144. Will see how it goes.


3 GB of RAM for an N144 task? That seems like a lot.
I am running four N216 tasks on Red Hat Enterprise Linux Server release 6.10 (Santiago) ...
CPU type              GenuineIntel Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
Number of processors  4
Operating System      Linux 2.6.32-754.23.1.el6.x86_64
Memory                15.5 GB

and each is taking 1.35 GB of virtual memory, with a working set size of 1.33 GB. I do not have any N144 tasks at the moment.
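
For anyone who wants to check the same numbers on their own box, a small sketch (it assumes the model processes have "hadam4" somewhere in their command line; VmSize and VmRSS are the standard Linux /proc fields for virtual size and resident working set):

    # Rough check of per-task memory use for running hadam4 models (Linux /proc).
    import glob

    for status_path in glob.glob("/proc/[0-9]*/status"):
        pid = status_path.split("/")[2]
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if "hadam4" not in cmdline:
                continue
            fields = {}
            with open(status_path) as f:
                for line in f:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
            # VmSize = virtual size, VmRSS = resident working set (both in kB)
            print(pid, fields.get("VmSize"), fields.get("VmRSS"))
        except OSError:
            pass  # process exited between listing and reading it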
ID: 61512
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 61513 - Posted: 10 Nov 2019, 1:34:19 UTC - in response to Message 61460.  
Last modified: 10 Nov 2019, 2:13:08 UTC

The project is looking at measures to deal with the large number of computers that are missing the 32-bit libraries or otherwise crashing everything on these and other Linux tasks.

Each task will now get a maximum of five attempts rather than the three we have seen in the past. This may mean a small increase in the number of tasks being received that fail on all computers. They are also looking at blocking computers from getting tasks until they sort things out. If the latter is effective, I guess the former may only be a temporary measure.

One particularly egregious host was found that had crashed over a thousand of these!


I just saw hadam4h tasks from batches 842 and 843, with a _3 and a _4 tail, download to one of my machines. Glad to take them.

<edit>
Exec summary:
These hadam4h N216 models can run in about a week even on old machines.
How many cores to load before throughput drops off fast -- that's the question.
</edit>


Unfortunately I've a 2-to-3-week backlog of these WUs, mostly on account of early underestimated completion times and optimism about how much help the extra cores on recent Intel and AMD CPUs might give.
It seems there is a "sweet spot" for CPUs of various ages and designs -- a minimum that depends on the generation, microarchitecture, and load. I can't generalize yet, but here are 3 examples.

AMD Phenom II 6-core: completes 1 model in 5.7 days with no other load. Running 3 models at once on the 6 "cores", all 3 take 19.6 days to finish, or 6.5 days per model.

Intel i7-7700: running 4 models on all 4 cores (no hyper-threading), 8.8 days for 4 models, 2.2 days each. 2 hadam4h on this box take 6.2 days, 3.1 days each. Running 3 at once takes (estimated) 7.1 days, or 2.4 days each. In this case it pays well to run 4 models (compare Sandy Bridge through Haswell).

Intel i7-8700K: running 6 models, loading all 6 "cores", 12 days to finish 6 models, 2 days per model. Barely better than the 6.2 days to complete 3 models with only 3 at once (about 2.1 days per model), and surely better throughput than running only 2 models on the 6-core box, where the estimate is 5.3 days for 2 to complete -- 2.6 days per model.

Doh.

Hope to report more on the speed vs. core-count question on the newer Ryzen 2700X and 3900X. First observation is that using all 8 or 12 cores gets no more throughput than using 4 or 6 -- right around a week per model either way. (Don't know how all that 64 MB of L3 on the 3900X is shared - many other questions.)
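
To make the arithmetic above explicit: wall-clock days for a batch of concurrent models divided by the number run at once gives the effective days per model. A quick sketch using just the numbers quoted above (no new measurements):

    # Effective throughput = wall-clock days for the batch / models run at once.
    runs = [
        ("Phenom II x6, 1 at once", 5.7, 1),
        ("Phenom II x6, 3 at once", 19.6, 3),
        ("i7-7700, 2 at once", 6.2, 2),
        ("i7-7700, 3 at once (est.)", 7.1, 3),
        ("i7-7700, 4 at once", 8.8, 4),
        ("i7-8700K, 2 at once (est.)", 5.3, 2),
        ("i7-8700K, 3 at once", 6.2, 3),
        ("i7-8700K, 6 at once", 12.0, 6),
    ]
    for label, wall_days, n_models in runs:
        print(f"{label}: {wall_days / n_models:.1f} days per model")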
ID: 61513
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 61516 - Posted: 10 Nov 2019, 23:03:40 UTC - in response to Message 61512.  

"3GBytes of RAM for a N144 task? That seems like a lot."

That's what is allocated to the virtual machine.
ID: 61516
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61522 - Posted: 11 Nov 2019, 10:10:28 UTC

One just failed with a seg fault shortly before the first zip, so no credit from that one. :(
Not that I have looked at my credit for some time anyway.
ID: 61522
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 61529 - Posted: 13 Nov 2019, 23:21:26 UTC - in response to Message 61516.  

The N144 task has just completed in about 160 hours run time.
ID: 61529
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61541 - Posted: 15 Nov 2019, 19:58:41 UTC - in response to Message 61522.  
Last modified: 15 Nov 2019, 20:01:59 UTC

One just failed with a seg fault shortly before the first zip, so no credit from that one. :(

I have something a bit more unusual. All the ones shown as "completed" on my Ryzen 3700X actually have this in the stderr:

Signal 15 received: Software termination signal from kill 
Signal 15 received: Abnormal termination triggered by abort call
Signal 15 received, exiting...
SIGSEGV: segmentation violation

https://www.cpdn.org/results.php?hostid=1493935

That is interesting.

EDIT: I see it occasionally on my other machines too. Maybe you have to fail in order to succeed.
ID: 61541
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61542 - Posted: 15 Nov 2019, 20:32:47 UTC - in response to Message 61541.  

It might be caused by travelling at such high speed.
Mine travel at a more sedate speed, and stop by easing down on the brake. :)

Task 21759776

Well, when I say sedate speed, I have been getting complaints from snails wanting to get past. :)
ID: 61542
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61543 - Posted: 15 Nov 2019, 20:45:37 UTC - in response to Message 61542.  

Well, now that you put me up to it, I checked the last twelve successful completions on all machines.
Nine of them show the segfaults, and three don't. Some of each are from each machine, both Intel and Ryzen.

I won't try to do a statistical analysis.
ID: 61543
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 61574 - Posted: 18 Nov 2019, 23:21:29 UTC - in response to Message 61529.  

My N216 task has reached 25% after about 4.5 days. I've also got 2 from batch 848, one of which is at 15% after 1 day.
ID: 61574
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 61622 - Posted: 28 Nov 2019, 23:09:44 UTC - in response to Message 61574.  

This has now reached over 50% with time steps between 43 and 45 sec/ts.
ID: 61622
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61676 - Posted: 12 Dec 2019, 19:45:14 UTC

Looking at the failures for these and the N144 tasks, most of the failures that are not down to missing 32-bit libs are from machines running large numbers of these tasks concurrently (often 16, 24 or even more in some cases). Some of these machines are failing nearly everything they get. If this applies to you, please try reducing the number of cores in use under computing preferences. A lot of computer time is being wasted by this issue.
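
Another way to cap these without throttling the whole machine is BOINC's app_config.xml in the climateprediction.net project directory; project_max_concurrent is a standard client option, and the value 4 below is only an example (the path assumes a default BOINC data directory layout):

    <!-- projects/climateprediction.net/app_config.xml -->
    <app_config>
        <!-- run at most 4 climateprediction.net tasks at a time -->
        <project_max_concurrent>4</project_max_concurrent>
    </app_config>

After saving the file, re-read the config files from the BOINC Manager (or restart the client) for it to take effect.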
ID: 61676
Alex Plantema

Joined: 3 Sep 04
Posts: 126
Credit: 26,610,380
RAC: 3,377
Message 61677 - Posted: 12 Dec 2019, 21:50:52 UTC - in response to Message 61676.  

It's a matter of statistics. A machine running many tasks concurrently simply accounts for a larger share of the tasks you encounter, so the high concurrency doesn't have to be the cause of the failures.
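
A toy illustration of the point (made-up numbers, only to show the selection effect): even with an identical per-task failure chance everywhere, a host running many tasks at once racks up far more failures in absolute terms, so only the per-task failure rate says whether it is actually worse.

    import random

    random.seed(1)
    FAIL_PROB = 0.10        # identical per-task failure chance on every host
    TASKS_PER_SLOT = 20     # tasks each running slot churns through in a period

    for slots in (4, 24):   # a 4-task host vs. a 24-task host
        n_tasks = slots * TASKS_PER_SLOT
        fails = sum(random.random() < FAIL_PROB for _ in range(n_tasks))
        print(f"{slots:2d} concurrent tasks: {fails:3d} failures out of "
              f"{n_tasks} ({fails / n_tasks:.0%} per-task rate)")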
ID: 61677
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 61861 - Posted: 31 Dec 2019, 4:31:06 UTC - in response to Message 61543.  

Well, now that you put me up to it, I checked the last twelve successful completions on all machines.
Nine of them show the segfaults, and three don't. Some of each are from each machine, both Intel and Ryzen.

I won't try to do a statistical analysis.


For me, these "N216" have never failed. They take almost two weeks, but unlike the "144" ones -- no fails here.
Give me more of the "N216".
ID: 61861
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 61862 - Posted: 31 Dec 2019, 4:47:48 UTC - in response to Message 61676.  

Looking at the failures for these and the N144 tasks, most of the failures that are not down to missing 32-bit libs are from machines running large numbers of these tasks concurrently (often 16, 24 or even more in some cases). Some of these machines are failing nearly everything they get. If this applies to you, please try reducing the number of cores in use under computing preferences. A lot of computer time is being wasted by this issue.

Me, I say: never load an "Ncore-2Nthread" CPU with more than Ncore of these, AND for the many-core (6-16 core) CPUs, allow fewer than the number of cores to run these N216 models.
Preliminary stats from my fastest boxes show that the big many-core boxes I'm running are most productive at about half the core count -- I haven't got any "Threadripper or i9-10".
BUT - the Ryzen 9 39xx is twice as fast on these L2 and L3 cache hogs as any other "not-quite-bleeding-edge" box I'm running.

You can look at my public stats.
Near zero fails on the N216 batches.
Random fails on the N144 batches.

I've no clue why
ID: 61862
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61863 - Posted: 31 Dec 2019, 7:44:19 UTC - in response to Message 61861.  

For me, these "N216" have never failed. They take almost two weeks but unlike the "144" things -- no fails here

It looks like you are running the Linux 5.2.0 or 5.3.0 kernel. That may help.
ID: 61863
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61864 - Posted: 31 Dec 2019, 11:16:15 UTC

Looking at the batch statistics for the most recent 144 and 216 batches, the 216 is at 7% hard fails and the 144 at 13%, and not all of the tasks have gone out yet. To do a completely fair comparison I should probably wait till more have finished, but it does suggest to me that, at least between #852 and #853, the 144 batch ones are more likely to fail.

On the 216 batch, of the first five hard fails listed (each workunit failed 3 times, so fifteen task failures in all), only two of the fifteen are not down to missing libraries.

On the 144 batch, six of the 15 failures are not down to missing libraries.

Probably need to look at more than just these ten to confirm the reasons, but the 144s seem more prone to invalid theta errors, which means they are pushing the limits of the models and producing impossible climates such as negative air pressure, which is one I have seen in the past. This may well turn out differently for other N144 batches: it can't be assumed that these figures apply to all of them, and putting different numbers into a batch's initial conditions may change things a lot.

Some of the tasks involved are not really hard fails yet as the number of attempts for Linux tasks has been upped to four.
ID: 61864
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61865 - Posted: 31 Dec 2019, 14:14:34 UTC - in response to Message 61864.  

This one failed for me due to negative pressure.
I do have the required 32-bit libraries, as evidenced by other 144 work having completed successfully, as have the 216 ones.

Task 21866633
Name hadam4_a06t_209110_6_856_011961960_3
Workunit 11961960

My failed one says:

Workunit 11961960
name hadam4_a06t_209110_6_856_011961960
application UK Met Office HadAM4 at N144 resolution
created 9 Dec 2019, 1:30:09 UTC
minimum quorum 1
initial replication 1
max # of error/total/success tasks 5, 5, 1
errors Too many total results

No need to do anything about this, I suppose: just another data point.
ID: 61865
wolfman1360

Joined: 18 Feb 17
Posts: 81
Credit: 14,024,464
RAC: 5,225
Message 61967 - Posted: 11 Jan 2020, 1:04:03 UTC

Making sure I understand this correctly.
I have a dual Xeon E5-2670 v2 with 25 MB of L3 cache per processor.
Will BOINC and the workunits see this as 50 MB total - and thus should I be able to run 10-12 concurrently? (leaning toward fewer at this point)
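
As a rough feel for the cache-per-task arithmetic, a small sketch (best case only -- it ignores that each socket's 25 MB is local to that socket, so a task running on one socket never sees the other socket's L3):

    L3_PER_SOCKET_MB = 25
    SOCKETS = 2
    total_l3 = L3_PER_SOCKET_MB * SOCKETS   # 50 MB across the whole box

    for n_tasks in (8, 10, 12, 16, 20):
        print(f"{n_tasks:2d} concurrent tasks -> "
              f"{total_l3 / n_tasks:.1f} MB of L3 per task (best case)")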
ID: 61967
lazlo_vii

Joined: 11 Dec 19
Posts: 108
Credit: 3,012,142
RAC: 0
Message 61968 - Posted: 11 Jan 2020, 1:19:42 UTC - in response to Message 61967.  

It's not really about how BOINC sees it. It is complicated to answer because it depends on how the kernel's process scheduler and memory management move data into and out of RAM, to and from the CPUs. The kernel can only get data in and out of RAM at the speed of the RAM; it can only move data between CPU cores at the speed of the CPU die interconnect; and it can only move data between CPU sockets at the speed of the motherboard's socket-to-socket interconnect.

My advice is to start with fewer and work up to more until you find a sweet spot.
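
If it helps with the experimenting, the CPU-to-NUMA-node layout the kernel is working with can be read straight from sysfs; a minimal sketch (Linux only, same information as lscpu or numactl --hardware):

    # Show which CPUs belong to each NUMA node (Linux sysfs).
    import glob, os

    for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        with open(os.path.join(node_dir, "cpulist")) as f:
            print(os.path.basename(node_dir), "-> CPUs", f.read().strip())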
ID: 61968
wolfman1360

Joined: 18 Feb 17
Posts: 81
Credit: 14,024,464
RAC: 5,225
Message 61973 - Posted: 12 Jan 2020, 4:38:36 UTC - in response to Message 61968.  

It's not really about how BOINC sees it. It is complicated to answer because it depends on how the kernel's process scheduler and memory management move data into and out of RAM, to and from the CPUs. The kernel can only get data in and out of RAM at the speed of the RAM; it can only move data between CPU cores at the speed of the CPU die interconnect; and it can only move data between CPU sockets at the speed of the motherboard's socket-to-socket interconnect.

My advice is to start with fewer and work up to more until you find a sweet spot.

That makes complete sense and clarifies a lot. Thanks. This probably goes for a lot of projects.
ID: 61973