climateprediction.net (CPDN) home page
Thread 'UK Met Office HadAM4 at N216 resolution'

Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61411 - Posted: 27 Oct 2019, 19:28:19 UTC

When we were running the big production lot of OpenIFS to prove that it could be done, my machine was getting around 1 hour and 15 minutes for 4.
When I increased this to 8, it slowed to 2 hours and 30 minutes.

When I mentioned this to the main researcher, he said that was normal, and that he didn't run any on hyperthreading.
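
A quick back-of-envelope check of those figures, as a minimal Python sketch. It assumes the quoted times are the wall-clock time each task needs when that many run at once, which the post does not state outright; the function name is purely illustrative.

```python
def tasks_per_hour(concurrent_tasks, hours_per_task):
    """Throughput when `concurrent_tasks` run side by side and each one
    takes `hours_per_task` of wall-clock time (assumed interpretation)."""
    return concurrent_tasks / hours_per_task

four = tasks_per_hour(4, 1.25)   # 4 at once, 1 h 15 min each
eight = tasks_per_hour(8, 2.5)   # 8 at once, 2 h 30 min each
print(f"4 at once: {four:.2f} tasks/hour")
print(f"8 at once: {eight:.2f} tasks/hour")
```

Under that assumption both settings give the same throughput, so running on the hyperthreads gained nothing.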
ID: 61411

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61412 - Posted: 27 Oct 2019, 19:53:45 UTC - in response to Message 61411.  

[Les Bayliss wrote:] When we were running the big production lot of OpenIFS to prove that it could be done, my machine was getting around 1 hour and 15 minutes for 4.
When I increased this to 8, it slowed to 2 hours and 30 minutes.

When I mentioned this to the main researcher, he said that was normal, and that he didn't run any on hyperthreading.


And the slowdown on my laptop with only 2GB/core (4 cores no hyperthreading) was an even bigger percentage if I ran all four. Best throughput was running two at once. Lack of memory meant a lot of swapping out to disk. All reasons why I am looking forward to a new faster machine!
ID: 61412

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61433 - Posted: 31 Oct 2019, 11:04:27 UTC - in response to Message 61382.  

[Quote:] What the long time between checkpoints does mean for these tasks is that computers that get switched off several times a day will never finish them, because if they have not reached the first checkpoint they will restart from the beginning.


I just started
Name                            hadam4h_a18g_201111_4_842_011906056_2
Workunit                        11906056
Created                         30 Oct 2019, 12:56:26 UTC
Sent                            30 Oct 2019, 16:46:39 UTC
CPU time at last checkpoint     12:45:28
CPU time                        12:51:34
Elapsed time                    14:21:29


So they certainly checkpoint more often than they trickle.
Remember, when looking at these times, that my machine has a 1.8 GHz 4-core 64-bit Xeon processor that runs at about half the speed of current processors.
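
For anyone wanting to read the same thing off their own task properties, here is a small Python sketch that subtracts the two CPU-time figures above to get the work done since the last checkpoint. The helper name is made up for illustration.

```python
from datetime import timedelta

def parse_hms(text):
    """Convert an 'HH:MM:SS' string, as shown in the task properties, to a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

cpu_at_checkpoint = parse_hms("12:45:28")
cpu_now = parse_hms("12:51:34")
since_checkpoint = cpu_now - cpu_at_checkpoint
print(f"CPU time since last checkpoint: {since_checkpoint}")  # 0:06:06
```

That is about six minutes of CPU time that would be lost if the task were stopped at that moment.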
ID: 61433

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61434 - Posted: 31 Oct 2019, 11:30:07 UTC - in response to Message 61433.  

CPU time 19:09:58
CPU time since checkpoint 02:08:53
Elapsed time 19:45:39
Estimated time remaining 5d 15:14:14
Fraction done 2.806%


I also have an old computer. The fastest machines are finishing at least three times as fast as this one.

Pentium(R) Dual-Core CPU E5400 @ 2.70GHz [Family 6 Model 23 Stepping 10]
ID: 61434

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61443 - Posted: 1 Nov 2019, 14:36:49 UTC

After running fine for about a week on my Ryzen 3700x, I needed to reboot it. There did not appear to be any problems, and the work units I saw running thereafter were proceeding normally, with no errors shown at my end.

But when I checked the Tasks page, I see that six of them show as "aborted".
The message is: "203 (0x000000CB) EXIT_ABORTED_VIA_GUI".
https://www.cpdn.org/results.php?hostid=1493935

It is possible that these are the ones that had not started yet, and were waiting in the queue. They all show as 0.00 seconds run time.
And they each show as "Error while computing" (not "aborted") by two other users, also with short run times.

It is not a problem for me, but I thought the developers might be interested in how it happened.
ID: 61443

alanb1951
Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 61444 - Posted: 2 Nov 2019, 5:52:10 UTC

Jim1348 - re your Ryzen tasks...

I followed the link in your post above to see how you might be getting on with tasks that didn't abort(!) and was intrigued to see tasks taking well over 20 seconds per time step. So far I've finished two (each took about 7 days, 7 hours) and in both cases the average per time step is under 18 seconds...

So I wondered how many you are running at a time and, perhaps, what your overall workload is on that system. On mine I only allow 2 CPDN at a time (and I also only allow 2 WCG MIP1 (cache-killers!)) - I also only let BOINC have 14 out of 16 "CPUs"
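
The post does not say how the limit of two CPDN tasks was set. One way, to the best of my knowledge, is an app_config.xml file in the project's folder inside the BOINC data directory, using the project_max_concurrent option. The Python sketch below only builds the text of such a file; treat the option name and its behaviour as something to confirm against the BOINC client configuration documentation for your client version, and note that the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_app_config(max_concurrent):
    """Return app_config.xml text that caps how many tasks of one project
    run at the same time (assumes the project_max_concurrent option)."""
    root = ET.Element("app_config")
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")

# A cap of 2 matches the "only allow 2 CPDN at a time" setting described above.
config_text = build_app_config(2)
print(config_text)
```

The printed text would then be saved as app_config.xml in the project folder and the client asked to re-read its config files.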

Fun machines, aren't they! I could write an essay about the machine turning the CPU fan off with 14 tasks running, but I won't...

Cheers - Al.
ID: 61444

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61446 - Posted: 2 Nov 2019, 12:04:39 UTC - in response to Message 61444.  
Last modified: 2 Nov 2019, 12:08:45 UTC

[alanb1951 wrote:] I followed the link in your post above to see how you might be getting on with tasks that didn't abort(!) and was intrigued to see tasks taking well over 20 seconds per time step. So far I've finished two (each took about 7 days, 7 hours) and in both cases the average per time step is under 18 seconds...

So I wondered how many you are running at a time and, perhaps, what your overall workload is on that system. On mine I only allow 2 CPDN at a time (and I also only allow 2 WCG MIP1 (cache-killers!)) - I also only let BOINC have 14 out of 16 "CPUs".

That is a long (long) story, that I have posted on at some length. Yes, limiting the cores helps. I have found what works, more or less, and am just finishing up some now.
But in a fit of hope over experience, I am putting together a Ryzen 3600 tomorrow, and will see what it does. It has only six full cores (with HT disabled), with lots of cache. It might work.
Thanks for the input.
ID: 61446

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61449 - Posted: 4 Nov 2019, 5:55:53 UTC

I now have four UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu work units running on my four cores.

21785482 21761296 21784271 21760249

One has completed three trickles, but the other three have only been running a short while and have not produced a trickle.

Those three trickles were running at 52.3026 seconds/TS for the 25% complete one,
52.2617 seconds/TS for the 50% complete one and
52.3267 seconds/TS for the 75% complete one.

Each is getting over 97% of the cpu time of the processor it runs on. The processor is not a fast one by today's standards, but it seems to have an unusually large cache for what it is.
CPU type                        GenuineIntel
                                Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
Operating System                Linux 2.6.32-754.23.1.el6.x86_64
BOINC version                   7.2.33
Memory                          15.5 GB
Cache                           10240 KB
Swap space                      3.91 GB
Total disk space                117.21 GB
Free Disk Space                 98.49 GB
Measured floating point speed   1.27 billion ops/sec
Measured integer speed          3.53 billion ops/sec
Average upload rate             3009.79 KB/sec
Average download rate           9768.13 KB/sec

ID: 61449

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 61456 - Posted: 5 Nov 2019, 5:09:02 UTC
Last modified: 21 Feb 2021, 3:24:43 UTC

I made some modifications to my previous post on the comparison of CPUs, their L3 cache, the model speed, and how much the models slow down as more are added.

I added a Ryzen 2600X, and compared 1 vs. 2 vs. 4 models and their speeds.

CPU            L3 cache   1 model       2 models (% slower than 1)   4 models (% slower than 1)
Ryzen   5600X  (32 MB)     9.5 sec/TS    9.9 sec/TS ( 4%)            11.8 sec/TS (24%)
Ryzen   3600X  (32 MB)    11.2 sec/TS   11.4 sec/TS ( 2%)            13.6 sec/TS (21%)
Ryzen   2600X  (16 MB)    13.0 sec/TS   13.7 sec/TS ( 5%)            17.8 sec/TS (37%)
Haswell 4790K  ( 8 MB)    13.9 sec/TS   15.9 sec/TS (14%)            22.0 sec/TS (58%)
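
The percentages in the table can be reproduced from the sec/TS figures, and the same figures also give a rough idea of total throughput. The Python sketch below does both; the "relative throughput" measure (time steps per second across all running models, relative to one model running alone) is my own illustration, not something from the post.

```python
# sec/TS per model from the table, keyed by number of models running at once.
timings = {
    "Ryzen 5600X": {1: 9.5, 2: 9.9, 4: 11.8},
    "Ryzen 3600X": {1: 11.2, 2: 11.4, 4: 13.6},
    "Ryzen 2600X": {1: 13.0, 2: 13.7, 4: 17.8},
    "Haswell 4790K": {1: 13.9, 2: 15.9, 4: 22.0},
}

def percent_slower(sec_per_ts, baseline):
    """How much slower each model runs than a single model, in whole percent."""
    return round(100 * (sec_per_ts / baseline - 1))

def relative_throughput(models, sec_per_ts, baseline):
    """Total time steps per second, relative to one model running alone."""
    return models * baseline / sec_per_ts

for cpu, by_count in timings.items():
    base = by_count[1]
    for models in (2, 4):
        slower = percent_slower(by_count[models], base)
        gain = relative_throughput(models, by_count[models], base)
        print(f"{cpu}: {models} models, {slower}% slower each, "
              f"{gain:.2f}x the work of one model")
```

By this measure even the 4790K, at 58% slower per model, still gets through roughly 2.5 times as many time steps with four models as with one; the slowdown reduces the benefit of adding models rather than removing it.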
ID: 61456

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61458 - Posted: 5 Nov 2019, 6:59:40 UTC

Interestingly, the Rainfall Africa tasks over at World Community Grid seem to be similarly affected. Running one alongside an N216 task doesn't make my ageing desktop as sluggish as running 2 N216s, but it does at times slow down its responsiveness. Just another thing to be aware of.
ID: 61458

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61460 - Posted: 5 Nov 2019, 12:15:55 UTC

The project are looking at measures to deal with the large number of computers missing the 32bit libraries or otherwise crashing everything on these and other Linux tasks.

Each task will now get a maximum of 5 attempts rather than the three we have seen in the past. This may mean a small increase in the number of tasks that fail on all computers being received. They are also looking at blocking computers from getting tasks till they sort things out. If the latter is effective I guess the former may only be a temporary measure.

One particularly egregious host was found that had crashed over a thousand of these!
ID: 61460

Bryn Mawr
Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 61477 - Posted: 6 Nov 2019, 14:05:10 UTC

Before I spend lots of time trying to optimise my set up (Ryzen 2600 running Ubuntu 18.04 currently running 4 HADAM4/N216, 4 WCG and 4 Rosetta) I’d like to understand how the credits are allocated.

I’m sure I saw in the FAQ that credits follow the trickle and will be up to 12 hours later. My first set of trickles saw credits about 3 days later, but since then nothing; I’ve had 7 more trickles, the earliest of them over 4 days ago.

This is looking at my account on cpdn.org so does not take account of any delay posting the credits to BOINC stats.
ID: 61477

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,815,352
RAC: 5,242
Message 61478 - Posted: 6 Nov 2019, 14:23:25 UTC - in response to Message 61477.  

[Bryn Mawr wrote:]... I’d like to understand how the credits are allocated. ...

All being well, the project runs a credit allocation script at weekly intervals, so the allocated credits (such as in the Statistics tab in BOINC Manager) will show jumps as the accumulated trickles are processed. Sometimes the credit script fails and no credits are allocated, at which point someone here will inevitably prompt the project to re-run the script, if they haven't spotted it themselves.

Welcome to the message board!
ID: 61478

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61479 - Posted: 6 Nov 2019, 16:13:51 UTC
Last modified: 6 Nov 2019, 19:33:51 UTC

[Bryn Mawr wrote:] I’m sure I saw in the FAQ that credits follow the trickle and will be up to 12 hours later. My first set of trickles saw credits about 3 days later, but since then nothing; I’ve had 7 more trickles, the earliest of them over 4 days ago.


It is many years since I looked at the faq. I will do so later this evening and if it still says that I will prompt the project to change it. Pretty sure that course will be effective but it may take a week or so.

Edit: looking at the FAQ page, I didn't read it all but it does state that the credit script runs only once per day rather than once per week. I have emailed the project suggesting this is updated.
ID: 61479

Bryn Mawr
Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 61480 - Posted: 6 Nov 2019, 17:53:26 UTC - in response to Message 61478.  

[Iain Inglis wrote:] [Bryn Mawr wrote:]... I’d like to understand how the credits are allocated. ...

All being well, the project runs a credit allocation script at weekly intervals, so the allocated credits (such as in the Statistics tab in BOINC Manager) will show jumps as the accumulated trickles are processed. Sometimes the credit script fails and no credits are allocated, at which point someone here will inevitably prompt the project to re-run the script, if they haven't spotted it themselves.

Welcome to the message board!


Many thanks, I’ll stop worrying about the trivia and start looking at the optimization.

It looks to be a knowledgeable and friendly group here :-)
ID: 61480

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61483 - Posted: 7 Nov 2019, 7:35:11 UTC
Last modified: 7 Nov 2019, 12:47:48 UTC

The long time between checkpoints on this batch seems to have one useful side effect. (Following an update to Ermine, my desktop started suffering from a black screen after I had gone away, so I would have to reboot to get back into it, meaning I had lots of reboots before I finally resolved the issue.)

I guess a few more machines might be needed to prove it, but to my mind at least, the fact that neither task has crashed yet confirms that it is stopping BOINC while a checkpoint is being written that causes the tasks to crash. Over an hour between checkpoints on this box means the chance of doing something during one is greatly reduced, making the tasks more robust. Further evidence that this might be the case is that the only reasons I have seen on the crashed tasks for this batch are missing 32-bit libs and one machine with 40 cores that has too little memory and so is crashing everything because of that.

Edit: However, the return rate seems to be much lower than expected, so the checkpoint interval will be decreased. This is because tasks can do two hours or even more of crunching and then, if BOINC is stopped, they start again at the beginning. So again, please suspend rather than switching off computers if running these tasks on machines that are switched off regularly.
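
To illustrate why the checkpoint interval matters so much on machines that are switched off regularly, here is a minimal Python sketch. It makes simplifying assumptions that are mine, not the project's: checkpoints are written at a fixed interval of run time, anything done since the last checkpoint is lost at power-off, and every session is the same length. The 1.5-hour session and the 30 sessions are made-up numbers.

```python
def cpu_hours_kept(session_hours, checkpoint_interval_hours, sessions):
    """CPU hours that survive when the machine is switched off after every session.

    Simplifying assumptions: checkpoints are written at a fixed interval of
    run time, and work done since the last checkpoint is lost at power-off.
    """
    checkpoints_per_session = int(session_hours // checkpoint_interval_hours)
    return sessions * checkpoints_per_session * checkpoint_interval_hours

# Hypothetical machine that runs for 1.5 hours at a time, 30 times over.
print(cpu_hours_kept(1.5, 2.0, sessions=30))  # 2-hour checkpoint interval
print(cpu_hours_kept(1.5, 0.5, sessions=30))  # 30-minute checkpoint interval
```

With a two-hour interval the hypothetical machine never reaches a checkpoint and keeps nothing, whereas with a 30-minute interval it keeps all 45 hours of work. Real tasks will not checkpoint at such exact intervals, so this is only a rough picture.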
ID: 61483

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61485 - Posted: 7 Nov 2019, 14:08:46 UTC

I can now post progress in finding a CPU that works really well, after somewhat disappointing results on other machines.

My new Ryzen 3600 allows me to run six work units (50% of the cores), with each now estimating a completion time of 7 days 18 hours (48% complete).
https://www.cpdn.org/results.php?hostid=1494480
The trickles are at 19 sec/TS.

Curiously, it is better than a Ryzen 3700x, which could run only about four cores efficiently, even though it is the same architecture as the 3600.
It seems that they make use of the cache differently. It is also better than an i7-9700, where I could run only four cores effectively.

The 3600 comes at a nice discount compared to the other members of the family, so would make a great machine for new builders.
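
As a rough cross-check of these figures against the ones Al posted earlier in the thread (about 7 days, 7 hours per task at just under 18 sec/TS), the Python sketch below works out how many time steps each pair of numbers implies. It assumes the estimated completion time is the total run time of one task and that the sec/TS rate holds for the whole run; neither is stated in the posts, and the function name is only illustrative.

```python
def implied_time_steps(days, hours, sec_per_ts):
    """Number of model time steps implied by a run time and a sec/TS rate."""
    total_seconds = (days * 24 + hours) * 3600
    return total_seconds / sec_per_ts

ryzen_3600 = implied_time_steps(7, 18, 19)   # 7 days 18 hours at 19 sec/TS
earlier_host = implied_time_steps(7, 7, 18)  # 7 days 7 hours at (just under) 18 sec/TS
print(f"Ryzen 3600: about {ryzen_3600:,.0f} time steps")
print(f"Earlier host: about {earlier_host:,.0f} time steps")
```

Both come out at roughly 35,000 time steps, so on those assumptions the two sets of figures are consistent with each other.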
ID: 61485

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61488 - Posted: 7 Nov 2019, 14:43:16 UTC - in response to Message 61485.  

[Jim1348 wrote:] The 3600 comes at a nice discount compared to the other members of the family, so would make a great machine for new builders.


Thank you, I am looking at upgrading my machine along the lines of the axe that has been in the family for generations. In this case (sic) the blade will be the motherboard, memory and cpu, the handle will be the hard disks.
ID: 61488

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61510 - Posted: 9 Nov 2019, 22:13:51 UTC - in response to Message 61303.  

[Quote:] Someone over at WCG seemed to think 5MB cache was what a MIP1 job would like. The user offered no justification for that number but 4MB probably isn't enough for near-optimum performance.


I wonder what that is really about. I ran three MIP1 jobs today (one at a time) with three CPDN N216 jobs on my other three cores.
The MIP1 job used about 2% of my RAM whereas the N216 jobs take 8.5% each. I have 16 GBytes RAM. 64-bit processor.

Were they talking about disk cache? That would not make much sense.
Or processor cache? My processor has Cache 10240 KB.
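
To put the RAM percentages above into absolute terms, a short Python sketch. It takes the 16 GB figure from the post at face value (the host details elsewhere in the thread report 15.5 GB, so the results are approximate).

```python
TOTAL_RAM_GB = 16  # as stated in the post; the host page reports 15.5 GB

def ram_used_gb(percent_of_ram, total_gb=TOTAL_RAM_GB):
    """Convert a percentage of total RAM into gigabytes."""
    return total_gb * percent_of_ram / 100

n216 = ram_used_gb(8.5)  # each N216 task
mip1 = ram_used_gb(2)    # the MIP1 task
print(f"N216: {n216:.2f} GB each, {3 * n216:.2f} GB for three")
print(f"MIP1: {mip1:.2f} GB")
```

Both footprints are hundreds of megabytes or more, far above the 5 MB figure in the quote, which suggests (though it does not prove) that the quoted remark was about processor cache rather than RAM.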
ID: 61510

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 61511 - Posted: 9 Nov 2019, 23:56:55 UTC

New for me: having been Windows-only for many years, I have now got a VM running Ubuntu 18.04 (one core out of 4 on a 3.3 GHz i5, allocated 3 GB RAM) and have now got one of these to go with an N144. Will see how it goes.
ID: 61511

©2024 cpdn.org