Thread 'UK Met Office HadAM4 at N216 resolution'

Author	Message
Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 61411 - Posted: 27 Oct 2019, 19:28:19 UTC
When we were running the big production lot of OpenIFS to prove that it could be done, my machine was getting around 1 hour and 15 minutes for 4. When I increased this to 8, it slowed to 2 hours and 30 minutes. When I mentioned this to the main researcher, he said that was normal, and that he didn't run any on hyperthreading.
ID: 61411
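A quick back-of-envelope check of those figures, as a sketch only. It assumes the quoted times are the wall-clock time per task when 4 or 8 run side by side, which the post implies but does not state outright; the function name is made up for illustration.

```python
def tasks_per_hour(concurrent_tasks: int, hours_per_task: float) -> float:
    """Steady-state throughput when `concurrent_tasks` run side by side."""
    return concurrent_tasks / hours_per_task


# Figures from the post: 4 at once take 1 h 15 min each, 8 at once take 2 h 30 min each.
four_at_once = tasks_per_hour(4, 1.25)
eight_at_once = tasks_per_hour(8, 2.5)

print(f"4 at once: {four_at_once:.2f} tasks/hour")
print(f"8 at once: {eight_at_once:.2f} tasks/hour")
```

Under that assumption the two rates come out the same, which fits the researcher's remark that running on the hyperthreads gains nothing for this model.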

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 61412 - Posted: 27 Oct 2019, 19:53:45 UTC - in response to Message 61411.
[Les Bayliss wrote:] When we were running the big production lot of OpenIFS to prove that it could be done, my machine was getting around 1 hour and 15 minutes for 4. When I increased this to 8, it slowed to 2 hours and 30 minutes. When I mentioned this to the main researcher, he said that was normal, and that he didn't run any on hyperthreading.
And the slowdown on my laptop with only 2GB/core (4 cores, no hyperthreading) was an even bigger percentage if I ran all four. Best throughput was running two at once. Lack of memory meant a lot of swapping out to disk. All reasons why I am looking forward to a new, faster machine!
ID: 61412

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 61433 - Posted: 31 Oct 2019, 11:04:27 UTC - in response to Message 61382.
[Quoting Message 61382:] What the long time between checkpoints does mean for these tasks is that computers that get switched off several times a day will never finish them, because if they have not reached the first checkpoint they will restart from the beginning.
I just started
Name: hadam4h_a18g_201111_4_842_011906056_2
Workunit: 11906056
Created: 30 Oct 2019, 12:56:26 UTC
Sent: 30 Oct 2019, 16:46:39 UTC
CPU time at last checkpoint: 12:45:28
CPU time: 12:51:34
Elapsed time: 14:21:29
So they certainly checkpoint more often than they trickle. Remember, when looking at these times, that my machine has a 1.8 GHz 4-core 64-bit Xeon processor that runs at about half the speed of current processors.
ID: 61433

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 61434 - Posted: 31 Oct 2019, 11:30:07 UTC - in response to Message 61433.
CPU time: 19:09:58
CPU time since checkpoint: 02:08:53
Elapsed time: 19:45:39
Estimated time remaining: 5d 15:14:14
Fraction done: 2.806%
I also have an old computer. The fastest machines are finishing at least three times as fast as this one.
Pentium(R) Dual-Core CPU E5400 @ 2.70GHz [Family 6 Model 23 Stepping 10]
ID: 61434

Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 61443 - Posted: 1 Nov 2019, 14:36:49 UTC
After running fine for about a week on my Ryzen 3700x, I needed to reboot it. There did not appear to be any problems, and the work units I saw running thereafter were proceeding normally, with no errors shown at my end. But when I checked the Tasks page, I saw that six of them show as "aborted". The message is: "203 (0x000000CB) EXIT_ABORTED_VIA_GUI".
https://www.cpdn.org/results.php?hostid=1493935
It is possible that these are the ones that had not started yet, and were waiting in the queue. They all show as 0.00 seconds run time. And they each show as "Error while computing" (not "aborted") by two other users, also with short run times. It is not a problem for me, but I thought the developers might be interested in how it happened.
ID: 61443

alanb1951 Joined: 31 Aug 04 Posts: 38 Credit: 9,581,380 RAC: 3,853	Message 61444 - Posted: 2 Nov 2019, 5:52:10 UTC
Jim1348 - re your Ryzen tasks...
I followed the link in your post above to see how you might be getting on with tasks that didn't abort(!) and was intrigued to see tasks taking well over 20 seconds per time step. So far I've finished two (each took about 7 days, 7 hours) and in both cases the average per time step is under 18 seconds... So I wondered how many you are running at a time and, perhaps, what your overall workload is on that system. On mine I only allow 2 CPDN at a time (and I also only allow 2 WCG MIP1 (cache-killers!)) - I also only let BOINC have 14 out of 16 "CPUs".
Fun machines, aren't they! I could write an essay about the machine turning the CPU fan off with 14 tasks running, but I won't...
Cheers - Al.
ID: 61444
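The post doesn't say how the two-task limit was set. One way that should work, assuming a BOINC client recent enough to honour a `project_max_concurrent` element in an `app_config.xml` file placed in the project's folder, is sketched below. The helper name `build_app_config` is made up for illustration; it only produces the file's text and does not write it anywhere.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks: int) -> str:
    """Return app_config.xml text capping how many tasks of one project run at once."""
    root = ET.Element("app_config")
    # Assumption: the client understands <project_max_concurrent>; older clients may not.
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_tasks)
    return ET.tostring(root, encoding="unicode")


# Two tasks at a time, matching the limit described in the post.
config_text = build_app_config(2)
print(config_text)
```

The printed text would be saved as `app_config.xml` in the project's directory, after which the client needs to re-read its config files or be restarted. Check the client version first, since an older client may simply ignore the element.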

Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 61446 - Posted: 2 Nov 2019, 12:04:39 UTC - in response to Message 61444. Last modified: 2 Nov 2019, 12:08:45 UTC
[alanb1951 wrote:] I followed the link in your post above to see how you might be getting on with tasks that didn't abort(!) and was intrigued to see tasks taking well over 20 seconds per time step. So far I've finished two (each took about 7 days, 7 hours) and in both cases the average per time step is under 18 seconds... So I wondered how many you are running at a time and, perhaps, what your overall workload is on that system. On mine I only allow 2 CPDN at a time (and I also only allow 2 WCG MIP1 (cache-killers!)) - I also only let BOINC have 14 out of 16 "CPUs".
That is a long (long) story that I have posted on at some length. Yes, limiting the cores helps. I have found what works, more or less, and am just finishing up some now. But in a fit of hope over experience, I am putting together a Ryzen 3600 tomorrow, and will see what it does. It has only six full cores (with HT disabled), with lots of cache. It might work. Thanks for the input.
ID: 61446

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 61449 - Posted: 4 Nov 2019, 5:55:53 UTC
I now have four UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu work units running on my four cores.
21785482
21761296
21784271
21760249
One has completed three trickles, but the other three have only been running a short while and have not produced a trickle. Those three trickles were running 52.3026 seconds/TS for the 25% complete one, 52.2617 seconds/TS for the 50% complete one, and 52.3267 seconds/TS for the 75% complete one. Each is getting over 97% of the CPU time of the processor it runs on. The processor is not a fast one by today's standards, but it seems to have an unusually large cache for what it is.
CPU type: GenuineIntel Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
Operating System: Linux 2.6.32-754.23.1.el6.x86_64
BOINC version: 7.2.33
Memory: 15.5 GB
Cache: 10240 KB
Swap space: 3.91 GB
Total disk space: 117.21 GB
Free Disk Space: 98.49 GB
Measured floating point speed: 1.27 billion ops/sec
Measured integer speed: 3.53 billion ops/sec
Average upload rate: 3009.79 KB/sec
Average download rate: 9768.13 KB/sec
ID: 61449

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 61456 - Posted: 5 Nov 2019, 5:09:02 UTC Last modified: 21 Feb 2021, 3:24:43 UTC
I made some modifications to my previous post on the comparison of CPUs, their L3 cache, the model speed, and how much the models slow down as more are added. I added a Ryzen 2600X, and compared 1 vs. 2 vs. 4 models and their speeds.
CPU             L3 cache   1 model       2 models (% slower than 1)   4 models (% slower than 1)
Ryzen 5600X     32 MB      9.5 sec/TS    9.9 sec/TS ( 4%)             11.8 sec/TS (24%)
Ryzen 3600X     32 MB      11.2 sec/TS   11.4 sec/TS ( 2%)            13.6 sec/TS (21%)
Ryzen 2600X     16 MB      13.0 sec/TS   13.7 sec/TS ( 5%)            17.8 sec/TS (37%)
Haswell 4790K    8 MB      13.9 sec/TS   15.9 sec/TS (14%)            22.0 sec/TS (58%)
ID: 61456
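The percentages quoted above can be reproduced from the sec/TS figures, and the same numbers give a rough idea of total throughput. This is a minimal sketch: the function names are made up for illustration, and "relative throughput" assumes every concurrent model runs at the quoted loaded speed.

```python
# sec/TS figures copied from the post: (1 model, 2 models, 4 models)
timings = {
    "Ryzen 5600X": (9.5, 9.9, 11.8),
    "Ryzen 3600X": (11.2, 11.4, 13.6),
    "Ryzen 2600X": (13.0, 13.7, 17.8),
    "Haswell 4790K": (13.9, 15.9, 22.0),
}


def percent_slower(single: float, loaded: float) -> int:
    """How much longer a time step takes under load, as a whole percentage."""
    return round(100 * (loaded - single) / single)


def relative_throughput(n_models: int, single: float, loaded: float) -> float:
    """Total work rate with n models running, relative to one model running alone."""
    return n_models * single / loaded


for cpu, (one, two, four) in timings.items():
    print(
        f"{cpu}: 2 models {percent_slower(one, two)}% slower, "
        f"4 models {percent_slower(one, four)}% slower, "
        f"4-model throughput {relative_throughput(4, one, four):.2f}x"
    )
```

Even on the chip with the smallest cache, four models still get through more total work than one, but the gain per extra model shrinks as the slowdown grows.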

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 61458 - Posted: 5 Nov 2019, 6:59:40 UTC
Interestingly, the Rainfall Africa tasks over at World Community Grid seem to be similarly affected. Running one alongside an N216 task doesn't make my ageing desktop as sluggish as running two N216s, but it does at times slow down its responsiveness. Just another thing to be aware of.
ID: 61458

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 61460 - Posted: 5 Nov 2019, 12:15:55 UTC
The project are looking at measures to deal with the large number of computers missing the 32bit libraries or otherwise crashing everything on these and other Linux tasks. Each task will now get a maximum of 5 attempts rather than the three we have seen in the past. This may mean a small increase in the number of tasks that fail on all computers being received. They are also looking at blocking computers from getting tasks till they sort things out. If the latter is effective I guess the former may only be a temporary measure. One particularly egregious host was found that had crashed over a thousand of these!
ID: 61460

Bryn Mawr Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228	Message 61477 - Posted: 6 Nov 2019, 14:05:10 UTC
Before I spend lots of time trying to optimise my set up (Ryzen 2600 running Ubuntu 18.04, currently running 4 HADAM4/N216, 4 WCG and 4 Rosetta) I’d like to understand how the credits are allocated. I’m sure I saw in the FAQ that credits follow the trickle and will be up to 12 hours later. My first set of trickles saw credits about 3 days later but since then nothing; I’ve had 7 more trickles, the earliest of them over 4 days ago. This is looking at my account on cpdn.org so does not take account of any delay posting the credits to BOINC stats.
ID: 61477

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1084 Credit: 7,944,701 RAC: 2,164	Message 61478 - Posted: 6 Nov 2019, 14:23:25 UTC - in response to Message 61477.
[Bryn Mawr wrote:]... I’d like to understand how the credits are allocated. ...
All being well, the project runs a credit allocation script at weekly intervals, so the allocated credits (such as in the Statistics tab in BOINC Manager) will show jumps as the accumulated trickles are processed. Sometimes the credit script fails and no credits are allocated, at which point someone here will inevitably prompt the project to re-run the script, if they haven't spotted it themselves. Welcome to the message board!
ID: 61478

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 61479 - Posted: 6 Nov 2019, 16:13:51 UTC Last modified: 6 Nov 2019, 19:33:51 UTC
[Bryn Mawr wrote:] I’m sure I saw in the FAQ that credits follow the trickle and will be up to 12 hours later. My first set of trickles saw credits about 3 days later but since then nothing; I’ve had 7 more trickles, the earliest of them over 4 days ago.
It is many years since I looked at the FAQ. I will do so later this evening and if it still says that I will prompt the project to change it. Pretty sure that course will be effective but it may take a week or so.
Edit: looking at the FAQ page, I didn't read it all but it does state that the credit script runs only once per day rather than once per week. I have emailed the project suggesting this is updated.
ID: 61479

Bryn Mawr Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228	Message 61480 - Posted: 6 Nov 2019, 17:53:26 UTC - in response to Message 61478.
[Iain Inglis wrote:] [Bryn Mawr wrote:]... I’d like to understand how the credits are allocated. ... All being well, the project runs a credit allocation script at weekly intervals, so the allocated credits (such as in the Statistics tab in BOINC Manager) will show jumps as the accumulated trickles are processed. Sometimes the credit script fails and no credits are allocated, at which point someone here will inevitably prompt the project to re-run the script, if they haven't spotted it themselves. Welcome to the message board!
Many thanks, I’ll stop worrying about the trivia and start looking at the optimization. It looks to be a knowledgeable and friendly group here :-)
ID: 61480

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 61483 - Posted: 7 Nov 2019, 7:35:11 UTC Last modified: 7 Nov 2019, 12:47:48 UTC
The long time between checkpoints on this batch seems to have one useful side effect. (Following an update to Ermine my desktop started suffering from a black screen after I had gone away, so I would have to reboot to get back in to it, meaning I had lots of reboots before I finally resolved the issue.) I guess a few more machines might be needed to prove it, but to my mind at least the fact that neither task has crashed yet confirms that it is stopping BOINC while a checkpoint is being written that causes the tasks to crash. Over an hour between checkpoints on this box means the chances of doing something during one are greatly reduced, making the tasks more robust. Further evidence that this might be the case is that the only reasons I have seen on the crashed tasks for this batch are missing 32bit libs and one machine with 40 cores that has too little memory and so is crashing everything because of that.
Edit: However, return rate seems to be much lower than expected so checkpoint interval will be decreased. This is because tasks can do two hours or even more crunching, then if BOINC is stopped, they start again at the beginning. So again, please suspend rather than switching off computers if running these tasks on machines that are switched off regularly.
ID: 61483

Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 61485 - Posted: 7 Nov 2019, 14:08:46 UTC
I can now post progress in finding a CPU that works really well, after somewhat disappointing results on other machines. My new Ryzen 3600 allows me to run six work units (50% of the cores), with each now estimating a completion time of 7 days 18 hours (48% complete).
https://www.cpdn.org/results.php?hostid=1494480
The trickles are at 19 sec/TS. Curiously, it is better than a Ryzen 3700x, which could run only about four cores efficiently, even though it is the same architecture as the 3600. It seems that they make use of the cache differently. It is also better than an i7-9700, where I could run only four cores effectively. The 3600 comes at a nice discount compared to the other members of the family, so would make a great machine for new builders.
ID: 61485

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 61488 - Posted: 7 Nov 2019, 14:43:16 UTC - in response to Message 61485.
[Jim1348 wrote:] The 3600 comes at a nice discount compared to the other members of the family, so would make a great machine for new builders.
Thank you, I am looking at upgrading my machine along the lines of the axe that has been in the family for generations. In this case (sic) the blade will be the motherboard, memory and CPU; the handle will be the hard disks.
ID: 61488

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 61510 - Posted: 9 Nov 2019, 22:13:51 UTC - in response to Message 61303.
Someone over at WCG seemed to think 5MB cache was what a MIP1 job would like. The user offered no justification for that number but 4MB probably isn't enough for near-optimum performance. I wonder what that is really about. I ran three MIP1 jobs today (one at a time) with three CPDN N216 jobs on my other three cores. The MIP1 job used about 2% of my RAM whereas the N216 jobs take 8.5% each. I have 16 GBytes RAM. 64-bit processor. Were they talking about disk cache? That would not make much sense. Or processor cache? My processor has Cache 10240 KB.
ID: 61510
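Converting those percentages into absolute figures makes the comparison easier to read. This is a minimal sketch that takes the stated 16 GB as the base for the percentages, which is an assumption (the host details earlier in the thread report 15.5 GB usable); the names are made up for illustration.

```python
TOTAL_RAM_GB = 16.0       # as stated in the post
CPU_CACHE_KB = 10240      # processor cache quoted in the post


def ram_used_gb(percent_of_ram: float) -> float:
    """Convert a share of total RAM into gigabytes."""
    return TOTAL_RAM_GB * percent_of_ram / 100


n216_gb = ram_used_gb(8.5)   # each N216 task
mip1_gb = ram_used_gb(2.0)   # the MIP1 task
cache_gb = CPU_CACHE_KB / (1024 * 1024)

print(f"N216 task: {n216_gb:.2f} GB, MIP1 task: {mip1_gb:.2f} GB")
print(f"Processor cache: {cache_gb:.4f} GB")
```

Either footprint is far larger than the processor cache, so the RAM share on its own says little about how much cache a job would like to have.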

Alan K Joined: 22 Feb 06 Posts: 493 Credit: 31,669,049 RAC: 10,904	Message 61511 - Posted: 9 Nov 2019, 23:56:55 UTC
New for me: having been Windows-only for many years, I have now got a VM running Ubuntu 18.04 (one core out of 4 on a 3.3GHz i5, allocated 3 GB RAM) and have now got one of these to go with an N144. Will see how it goes.
ID: 61511