Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61512 - Posted: 10 Nov 2019, 1:33:21 UTC - in response to Message 61511.  

New for me: having been Windows-only for many years, I now have a VM running Ubuntu 18.04 (one core out of 4 on a 3.3 GHz i5, allocated 3 GB of RAM) and have got one of these to go with an N144. Will see how it goes.


3 GB of RAM for an N144 task? That seems like a lot.
I am running four N216 tasks on Red Hat Enterprise Linux Server release 6.10 (Santiago) ...
CPU type              GenuineIntel Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
Number of processors  4
Operating System      Linux 2.6.32-754.23.1.el6.x86_64
Memory                15.5 GB

and each is taking 1.35 GB of virtual memory, with a working set size of 1.33 GB. I do not have any N144 tasks at the moment.
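
For anyone who wants to check the same numbers on their own box, a small sketch (it assumes the model processes have "hadam4" somewhere in their command line; VmSize and VmRSS are the standard Linux /proc fields for virtual size and resident working set):

    # Rough check of per-task memory use for running hadam4 models (Linux /proc).
    import glob

    for status_path in glob.glob("/proc/[0-9]*/status"):
        pid = status_path.split("/")[2]
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if "hadam4" not in cmdline:
                continue
            fields = {}
            with open(status_path) as f:
                for line in f:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
            # VmSize = virtual size, VmRSS = resident working set (both in kB)
            print(pid, fields.get("VmSize"), fields.get("VmRSS"))
        except OSError:
            pass  # process exited between listing and reading it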
ID: 61512
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 61513 - Posted: 10 Nov 2019, 1:34:19 UTC - in response to Message 61460.  
Last modified: 10 Nov 2019, 2:13:08 UTC

The project is looking at measures to deal with the large number of computers that are missing the 32-bit libraries or otherwise crashing everything on these and other Linux tasks.

Each task will now get a maximum of five attempts rather than the three we have seen in the past. This may mean a small increase in the number of tasks being received that fail on all computers. They are also looking at blocking computers from getting tasks until they sort things out. If the latter is effective, I guess the former may only be a temporary measure.

One particularly egregious host was found that had crashed over a thousand of these!


I just saw hadam4h tasks from batches 842 and 843, with a _3 and a _4 tail, download to one of my machines. Glad to take them.

<edit>
Exec summary:
These hadam4h N216 models can run in about a week even on old machines.
How many cores to load before throughput drops off fast -- that's the question.
</edit>


Unfortunately I've a 2-to-3-week backlog of these WUs, mostly on account of early underestimated completion times and optimism about how much help the extra cores on recent Intel and AMD CPUs might give.
It seems there is a "sweet spot" for CPUs of various ages and designs -- a minimum that depends on the generation, microarchitecture, and load. I can't generalize yet, but here are 3 examples.

AMD Phenom II 6-core: completes 1 model in 5.7 days with no other load. Running 3 models at once on the 6 "cores", all 3 take 19.6 days to finish, or 6.5 days per model.

Intel i7-7700: running 4 models on all 4 cores (no hyper-threading), 8.8 days for 4 models, 2.2 days each. 2 hadam4h on this box take 6.2 days, 3.1 days each. Running 3 at once takes (estimated) 7.1 days, or 2.4 days each. In this case it pays well to run 4 models (compare Sandy Bridge through Haswell).

Intel i7-8700K: running 6 models, loading all 6 "cores", 12 days to finish 6 models, 2 days per model. Barely better than the 6.2 days to complete 3 models with only 3 at once (about 2.1 days per model), and surely better throughput than running only 2 models on the 6-core box, where the estimate is 5.3 days for 2 to complete -- 2.6 days per model.

Doh.

Hope to report more on the speed vs. core-count question on the newer Ryzen 2700X and 3900X. First observation is that using all 8 or 12 cores gets no more throughput than using 4 or 6 -- right around a week per model either way. (Don't know how all that 64 MB of L3 on the 3900X is shared - many other questions.)
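
To make the arithmetic above explicit: wall-clock days for a batch of concurrent models divided by the number run at once gives the effective days per model. A quick sketch using just the numbers quoted above (no new measurements):

    # Effective throughput = wall-clock days for the batch / models run at once.
    runs = [
        ("Phenom II x6, 1 at once", 5.7, 1),
        ("Phenom II x6, 3 at once", 19.6, 3),
        ("i7-7700, 2 at once", 6.2, 2),
        ("i7-7700, 3 at once (est.)", 7.1, 3),
        ("i7-7700, 4 at once", 8.8, 4),
        ("i7-8700K, 2 at once (est.)", 5.3, 2),
        ("i7-8700K, 3 at once", 6.2, 3),
        ("i7-8700K, 6 at once", 12.0, 6),
    ]
    for label, wall_days, n_models in runs:
        print(f"{label}: {wall_days / n_models:.1f} days per model")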
ID: 61513
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 61516 - Posted: 10 Nov 2019, 23:03:40 UTC - in response to Message 61512.  

"3GBytes of RAM for a N144 task? That seems like a lot."

That's what is allocated to the virtual machine.
ID: 61516
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61522 - Posted: 11 Nov 2019, 10:10:28 UTC

One just failed with a seg fault shortly before the first zip, so no credit from that one. :(
Not that I have looked at my credit for some time anyway.
ID: 61522
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 61529 - Posted: 13 Nov 2019, 23:21:26 UTC - in response to Message 61516.  

The N144 task has just completed in about 160 hours run time.
ID: 61529
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61541 - Posted: 15 Nov 2019, 19:58:41 UTC - in response to Message 61522.  
Last modified: 15 Nov 2019, 20:01:59 UTC

One just failed with a seg fault shortly before the first zip, so no credit from that one. :(

I have something a bit more unusual. All the ones shown as "completed" on my Ryzen 3700X actually have this in the stderr:

Signal 15 received: Software termination signal from kill 
Signal 15 received: Abnormal termination triggered by abort call
Signal 15 received, exiting...
SIGSEGV: segmentation violation

https://www.cpdn.org/results.php?hostid=1493935

That is interesting.

EDIT: I see it occasionally on my other machines too. Maybe you have to fail in order to succeed.
ID: 61541
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61542 - Posted: 15 Nov 2019, 20:32:47 UTC - in response to Message 61541.  

It might be caused by travelling at such high speed.
Mine travel at a more sedate speed, and stop by easing down on the brake. :)

Task 21759776

Well, when I say sedate speed, I have been getting complaints from snails wanting to get past. :)
ID: 61542
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61543 - Posted: 15 Nov 2019, 20:45:37 UTC - in response to Message 61542.  

Well, now that you put me up to it, I checked the last twelve successful completions on all machines.
Nine of them show the segfaults, and three don't. Some of each are from each machine, both Intel and Ryzen.

I won't try to do a statistical analysis.
ID: 61543
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 61574 - Posted: 18 Nov 2019, 23:21:29 UTC - in response to Message 61529.  

My N216 task has reached 25% after about 4.5 days. I've also got 2 from batch 848, one of which is at 15% after 1 day.
ID: 61574
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 61622 - Posted: 28 Nov 2019, 23:09:44 UTC - in response to Message 61574.  

This has now reached over 50% with time steps between 43 and 45 sec/ts.
ID: 61622
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61676 - Posted: 12 Dec 2019, 19:45:14 UTC

Looking at the failures for these and the N144 tasks, most of the failures that are not down to missing 32-bit libs are from machines running large numbers of these tasks concurrently (often 16, 24 or even more in some cases). Some of these machines are failing nearly everything they get. If this applies to you, please try reducing the number of cores in use under computing preferences. A lot of computer time is being wasted by this issue.
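
Another way to cap these without throttling the whole machine is BOINC's app_config.xml in the climateprediction.net project directory; project_max_concurrent is a standard client option, and the value 4 below is only an example (the path assumes a default BOINC data directory layout):

    <!-- projects/climateprediction.net/app_config.xml -->
    <app_config>
        <!-- run at most 4 climateprediction.net tasks at a time -->
        <project_max_concurrent>4</project_max_concurrent>
    </app_config>

After saving the file, re-read the config files from the BOINC Manager (or restart the client) for it to take effect.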
ID: 61676
Alex Plantema

Joined: 3 Sep 04
Posts: 126
Credit: 26,610,380
RAC: 3,377
Message 61677 - Posted: 12 Dec 2019, 21:50:52 UTC - in response to Message 61676.  

It's a matter of statistics. A machine running many tasks concurrently simply accounts for a larger share of the tasks you encounter, so the high concurrency doesn't have to be the cause of the failures.
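
A toy illustration of the point (made-up numbers, only to show the selection effect): even with an identical per-task failure chance everywhere, a host running many tasks at once racks up far more failures in absolute terms, so only the per-task failure rate says whether it is actually worse.

    import random

    random.seed(1)
    FAIL_PROB = 0.10        # identical per-task failure chance on every host
    TASKS_PER_SLOT = 20     # tasks each running slot churns through in a period

    for slots in (4, 24):   # a 4-task host vs. a 24-task host
        n_tasks = slots * TASKS_PER_SLOT
        fails = sum(random.random() < FAIL_PROB for _ in range(n_tasks))
        print(f"{slots:2d} concurrent tasks: {fails:3d} failures out of "
              f"{n_tasks} ({fails / n_tasks:.0%} per-task rate)")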
ID: 61677
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 61861 - Posted: 31 Dec 2019, 4:31:06 UTC - in response to Message 61543.  

Well, now that you put me up to it, I checked the last twelve successful completions on all machines.
Nine of them show the segfaults, and three don't. Some of each are from each machine, both Intel and Ryzen.

I won't try to do a statistical analysis.


For me, these "N216" have never failed. They take almost two weeks, but unlike the "144" ones -- no fails here.
Give me more of the "N216".
ID: 61861
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 61862 - Posted: 31 Dec 2019, 4:47:48 UTC - in response to Message 61676.  

Looking at the failures for these and the N144 tasks, most of the failures that are not down to missing 32-bit libs are from machines running large numbers of these tasks concurrently (often 16, 24 or even more in some cases). Some of these machines are failing nearly everything they get. If this applies to you, please try reducing the number of cores in use under computing preferences. A lot of computer time is being wasted by this issue.

Me, I say: never load an "Ncore-2Nthread" CPU with more than Ncore of these, AND for the many-core (6-16 core) CPUs, allow fewer than the number of cores to run these N216 models.
Preliminary stats from my fastest boxes show that the big many-core boxes I'm running are most productive at about half the core count -- I haven't got any "Threadripper or i9-10".
BUT - the Ryzen 9 39xx is twice as fast on these L2 and L3 cache hogs as any other "not-quite-bleeding-edge" box I'm running.

You can look at my public stats.
Near zero fails on the N216 batches.
Random fails on the N144 batches.

I've no clue why
ID: 61862
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61863 - Posted: 31 Dec 2019, 7:44:19 UTC - in response to Message 61861.  

For me, these "N216" have never failed. They take almost two weeks but unlike the "144" things -- no fails here

It looks like you are running the Linux 5.2.0 or 5.3.0 kernel. That may help.
ID: 61863
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,026,382
RAC: 20,431
Message 61864 - Posted: 31 Dec 2019, 11:16:15 UTC

Looking at the batch statistics for the most recent 144 and 216 batches, the 216 is at 7% hard fails and the 144 at 13%, and not all of the tasks have gone out yet. To do a completely fair comparison I should probably wait till more have finished, but it does suggest to me that, at least between #852 and #853, the 144 batch ones are more likely to fail.

On the 216 batch, of the first five hard fails listed (each workunit failed 3 times, so fifteen task failures in all), only two of the fifteen are not down to missing libraries.

On the 144 batch, six of the 15 failures are not down to missing libraries.

Probably need to look at more than just these ten to confirm the reasons, but the 144s seem more prone to invalid theta errors, which means they are pushing the limits of the models and producing impossible climates such as negative air pressure, which is one I have seen in the past. This may well turn out differently for other N144 batches: it can't be assumed that these figures apply to all of them, and putting different numbers into a batch's initial conditions may change things a lot.

Some of the tasks involved are not really hard fails yet as the number of attempts for Linux tasks has been upped to four.
ID: 61864
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61865 - Posted: 31 Dec 2019, 14:14:34 UTC - in response to Message 61864.  

This one failed for me due to negative pressure.
I do have the required 32-bit libraries, as evidenced by other 144 work having completed successfully, as have the 216 ones.

Task 21866633
Name hadam4_a06t_209110_6_856_011961960_3
Workunit 11961960

My failed one says:

Workunit 11961960
name hadam4_a06t_209110_6_856_011961960
application UK Met Office HadAM4 at N144 resolution
created 9 Dec 2019, 1:30:09 UTC
minimum quorum 1
initial replication 1
max # of error/total/success tasks 5, 5, 1
errors Too many total results

No need to do anything about this, I suppose: just another data point.
ID: 61865
wolfman1360

Joined: 18 Feb 17
Posts: 81
Credit: 14,024,464
RAC: 5,225
Message 61967 - Posted: 11 Jan 2020, 1:04:03 UTC

Making sure I understand this correctly.
I have a dual Xeon E5-2670 v2 with 25 MB of L3 cache per processor.
Will BOINC and the workunits see this as 50 MB total - and thus should I be able to run 10-12 concurrently? (leaning toward fewer at this point)
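
As a rough feel for the cache-per-task arithmetic, a small sketch (best case only -- it ignores that each socket's 25 MB is local to that socket, so a task running on one socket never sees the other socket's L3):

    L3_PER_SOCKET_MB = 25
    SOCKETS = 2
    total_l3 = L3_PER_SOCKET_MB * SOCKETS   # 50 MB across the whole box

    for n_tasks in (8, 10, 12, 16, 20):
        print(f"{n_tasks:2d} concurrent tasks -> "
              f"{total_l3 / n_tasks:.1f} MB of L3 per task (best case)")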
ID: 61967
lazlo_vii

Joined: 11 Dec 19
Posts: 108
Credit: 3,012,142
RAC: 0
Message 61968 - Posted: 11 Jan 2020, 1:19:42 UTC - in response to Message 61967.  

It's not really about how BOINC sees it. It is complicated to answer because it depends on how the kernel's process scheduler and memory management move data into and out of RAM, to and from the CPUs. The kernel can only get data in and out of RAM at the speed of the RAM; it can only move data between CPU cores at the speed of the CPU die interconnect; and it can only move data between CPU sockets at the speed of the motherboard's socket-to-socket interconnect.

My advice is to start with fewer and work up to more until you find a sweet spot.
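
If it helps with the experimenting, the CPU-to-NUMA-node layout the kernel is working with can be read straight from sysfs; a minimal sketch (Linux only, same information as lscpu or numactl --hardware):

    # Show which CPUs belong to each NUMA node (Linux sysfs).
    import glob, os

    for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        with open(os.path.join(node_dir, "cpulist")) as f:
            print(os.path.basename(node_dir), "-> CPUs", f.read().strip())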
ID: 61968
wolfman1360

Joined: 18 Feb 17
Posts: 81
Credit: 14,024,464
RAC: 5,225
Message 61973 - Posted: 12 Jan 2020, 4:38:36 UTC - in response to Message 61968.  

It's not really about how BOINC sees it. It is complicated to answer because it depends on how the kernel's process scheduler and memory management move data into and out of RAM, to and from the CPUs. The kernel can only get data in and out of RAM at the speed of the RAM; it can only move data between CPU cores at the speed of the CPU die interconnect; and it can only move data between CPU sockets at the speed of the motherboard's socket-to-socket interconnect.

My advice is to start with fewer and work up to more until you find a sweet spot.

That makes complete sense and clarifies a lot. Thanks. This probably goes for a lot of projects.
ID: 61973