climateprediction.net (CPDN) home page
Thread 'Hyperthreading'

Message boards : Number crunching : Hyperthreading
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46239 - Posted: 16 May 2013, 16:37:37 UTC

Just experimenting with hyperthreading & modifying the number of simultaneously running HadCM3 models on my PC.

It's an Intel i7-3770K with 4 real cores and 2 threads per core (showing up as 8 processors in the task manager). Windows 7 64-bit.


  • 0.63s/ts for 4 models, 4 cores = 181 hours per model, 7.5d, 1 model per 45 hours
  • 0.90s/ts for 6 models, 6 cores = 259 hours per model, 10.8d, 1 model per 43 hours
  • 1.16s/ts for 8 models, 8 cores = 334 hours per model, 13.9d, 1 model per 42 hours



Looking at the throughput figures (45 hours with 4 models running versus 42 hours with 8 models running) I was surprised how little advantage there seems to be when running with all 8 processors available. I was expecting much better throughput with 8 running. The numbers may of course change when I have collected more data.

Is this consistent with other people's experience of hyperthreading?

The HadCM3s are 25,920 ts/trickle, 40 trickles (1,036,800 ts) = 1 model.
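
A minimal Python sketch of the arithmetic behind the figures above, assuming the timestep counts stated in this post. The function names are illustrative and not part of BOINC or CPDN.

```python
# Timestep counts as stated in the post above.
TIMESTEPS_PER_TRICKLE = 25_920
TRICKLES_PER_MODEL = 40
TIMESTEPS_PER_MODEL = TIMESTEPS_PER_TRICKLE * TRICKLES_PER_MODEL  # 1,036,800


def hours_per_model(sec_per_ts: float) -> float:
    """Wall-clock hours for one model to finish at a given seconds/timestep."""
    return sec_per_ts * TIMESTEPS_PER_MODEL / 3600


def hours_per_completed_model(sec_per_ts: float, n_models: int) -> float:
    """Average hours between completed models when n_models run side by side."""
    return hours_per_model(sec_per_ts) / n_models


if __name__ == "__main__":
    # (models running, measured seconds per timestep) from the list above
    for n_models, sec_per_ts in [(4, 0.63), (6, 0.90), (8, 1.16)]:
        per_model = hours_per_model(sec_per_ts)
        per_completed = hours_per_completed_model(sec_per_ts, n_models)
        print(f"{n_models} models: {per_model:.0f} h/model "
              f"({per_model / 24:.1f} d), 1 model per {per_completed:.0f} h")
```

The second function is the throughput figure: the lower the hours per completed model, the more work the machine gets through overall.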


I'm a volunteer and my views are my own.
ID: 46239
skgiven
Joined: 5 Jun 06
Posts: 28
Credit: 2,790,048
RAC: 0
Message 46240 - Posted: 16 May 2013, 17:51:15 UTC - in response to Message 46239.  
Last modified: 16 May 2013, 18:17:58 UTC

Interesting results. It's a bit less than I would have hoped for but I wouldn't have expected much more. The more threads you use the less efficient the extra threads become and this project is quite heavy.

You would only do 8.6% more work running 8 models than running 4 models, and running 8 is only 3.4% more productive than 6.
You might actually find that running 7 climate models is about the same as 8, possibly better (as has been documented at other projects).
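
The percentages above can be reproduced from the unrounded s/ts figures in the first post (the rounded 45 and 42 hours would give a slightly lower number). A small sketch, assuming throughput is proportional to models running divided by seconds per timestep; the function name is made up for illustration.

```python
def throughput_gain_pct(base: tuple[int, float], new: tuple[int, float]) -> float:
    """Percentage throughput gain of `new` over `base`.

    Each argument is (models running, seconds per timestep). Throughput is
    taken to be proportional to models / seconds-per-timestep.
    """
    base_models, base_sec = base
    new_models, new_sec = new
    base_rate = base_models / base_sec
    new_rate = new_models / new_sec
    return (new_rate / base_rate - 1) * 100


if __name__ == "__main__":
    four, six, eight = (4, 0.63), (6, 0.90), (8, 1.16)
    print(f"8 vs 4 models: {throughput_gain_pct(four, eight):.1f}% more work")
    print(f"8 vs 6 models: {throughput_gain_pct(six, eight):.1f}% more work")
```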

A few considerations:
I don't know what your clocks are at, but they drop as you run more WUs, from 3.9GHz to 3.5GHz at stock clocks, which is why I fix mine at 4.2GHz.
Memory and drive I/O contention would mean competition for resources.
The system always uses some resources, even in Best Performance mode and with Aero off... Memory speed probably affects the results here, and more so with higher numbers of WUs running. You really need to compensate for being on dual channel by getting fast memory modules. Fortunately 2400 is fairly inexpensive these days.

What frequency are your CPU and memory modules running at, and are you using an SSD?

Of note is that climate models actually use more electricity than many projects:

i7-3770K @ 4.2GHz, 8GB 2133 RAM, SATA6 SSD drive
System usage 77W (includes 2 GPU's idle power of 7W each)
With no GPU tasks running,
1 CPU BoincSimap WU's - System usage 91W
2 CPU BoincSimap WU's - System usage 104W
1 CPU Ibercivis WU's - System usage 93W
2 CPU Ibercivis WU's - System usage 106W
1 CPU Climate WU's - System usage 95W
2 CPU Climate WU's - System usage 112W
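
To make the comparison easier to read, here is a rough Python sketch that turns the readings above into extra watts per added task. It assumes the 77W figure is the baseline with no CPU tasks running; the dictionary layout and function name are only for illustration.

```python
# Baseline system draw with no CPU tasks running (assumed from the 77W line).
BASELINE_W = 77

# Wall readings from the list above: (1 task running, 2 tasks running).
READINGS_W = {
    "BoincSimap": (91, 104),
    "Ibercivis": (93, 106),
    "Climate": (95, 112),
}


def marginal_watts(baseline: int, readings: tuple[int, ...]) -> list[int]:
    """Extra watts drawn by each additional task, in the order tasks were added."""
    steps = []
    previous = baseline
    for reading in readings:
        steps.append(reading - previous)
        previous = reading
    return steps


if __name__ == "__main__":
    for project, readings in READINGS_W.items():
        print(project, marginal_watts(BASELINE_W, readings))
```

On these readings the climate tasks add 18W and 17W, against 14W and 13W for BoincSimap and 16W and 13W for Ibercivis.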

This has always been the case.

  • 142,947.26 142,396.37 2,386.39 2,386.39 UK Met Office HADAM3P European Region v6.09 - 39.7h while running 2 climate models, 3 other CPU WU's and 2 GPU WU's



In my opinion it's generally better to leave one CPU thread free, as there is little or nothing to be gained from using the last thread. Doing so improves GPU app performance and system responsiveness, and reduces errors.


ID: 46240
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 46241 - Posted: 16 May 2013, 18:21:21 UTC

Hi Mike,

It varies, but on my i7 920 in Linux, running 8 models with HT on results in about 10-17% greater throughput than turning HT off and running on 4 cores. It's not all that different than the old P4 with HT in terms of % increased throughput. This was with triple channel DDR3 1333 memory.
ID: 46241
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46243 - Posted: 16 May 2013, 18:42:42 UTC - in response to Message 46240.  
Last modified: 16 May 2013, 20:37:38 UTC


skgiven wrote:
...I don't know what your clocks are at, but they drop as you run more WUs, from 3.9GHz to 3.5GHz at stock clocks, which is why I fix mine at 4.2GHz.
Memory and drive I/O contention would mean competition for resources.
The system always uses some resources, even in Best Performance mode and with Aero off... Memory speed probably affects the results here, and more so with higher numbers of WUs running. You really need to compensate for being on dual channel by getting fast memory modules. Fortunately 2400 is fairly inexpensive these days.

What frequency are your CPU and memory modules running at, and are you using an SSD? ...


Mine is also 4.2GHz (I started with a stable O/C of 4.5, and then dropped the frequency to add a safety margin). Water cooled, but temperatures are hovering around the 65°C mark (with 6 models running), which seems high to me... I don't yet know the temperatures with 8 models running because I am operating the PC remotely. I will take a look when I am back tonight. (EDIT - upper 60s).

Memory is 16GB (2 sticks of Corsair Vengeance DDR3 1600, 10-10-10-27, running stock).

I *do* have an SSD but CPDN is not using it (deliberately). The setup is: the system drive is a 3TB disk accelerated with a 64GB Intel 525 mSATA SSD using ISRT, and the BOINC drive is a 1TB disk (all 3 drives on SATA6). I did not want to use the SSD with CPDN because of the vast amount of data written*, which I was afraid would burn through the write endurance of the MLC SSD in no time.

* Although it has been years since I last actually looked at the amount of data written, perhaps it has changed since 2007 LOL.
EDIT - 3.8GB written per hour per model, and that's with 8 models running. Probably about 7GB/hour/model if running 4. In total around a terabyte of writes every day and a half. That's more than 30 times higher than the recommended duty cycle for the 525 SSD. An SLC SSD would be fine of course.
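
A quick Python sketch of the write-volume arithmetic in the EDIT above. The 20GB/day rated figure is an assumption used purely for illustration and should be checked against the drive's datasheet; the function names are likewise made up.

```python
def writes_gb(gb_per_hour_per_model: float, n_models: int, hours: float) -> float:
    """Total GB written by n_models over the given number of hours."""
    return gb_per_hour_per_model * n_models * hours


def duty_cycle_ratio(daily_writes_gb: float, rated_gb_per_day: float) -> float:
    """How many times the rated daily write volume is being exceeded."""
    return daily_writes_gb / rated_gb_per_day


if __name__ == "__main__":
    rate, models = 3.8, 8  # GB/hour/model with 8 models running (from the post)
    per_day = writes_gb(rate, models, 24)
    per_day_and_half = writes_gb(rate, models, 36)
    print(f"{per_day:.0f} GB/day, {per_day_and_half:.0f} GB per day and a half")

    ASSUMED_RATED_GB_PER_DAY = 20.0  # hypothetical value, not taken from the post
    print(f"{duty_cycle_ratio(per_day, ASSUMED_RATED_GB_PER_DAY):.0f}x rated writes")
```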

PS Setting up ISRT on a 3TB bootable drive was a complete nightmare. Not sure I would try that again...


...
In my opinion it's generally better to leave one CPU thread free, as there is little or nothing to be gained from using the last thread. Doing so improves GPU app performance and system responsiveness, and reduces errors.


I was finding that the system was working well with 6 cores, didn't try 7 although I will give that a go.




Geophi wrote:
Hi Mike,

It varies, but on my i7 920 in Linux, running 8 models with HT on results in about 10-17% greater throughput than turning HT off and running on 4 cores. It's not all that different than the old P4 with HT in terms of % increased throughput. This was with triple channel DDR3 1333 memory.


Sounds like you're getting better results from threading than me, but it might just be due to me not having enough data yet. I think it would be a good idea to run for a bit longer to get better averages...


I'm a volunteer and my views are my own.
ID: 46243
Ingleside

Joined: 5 Aug 04
Posts: 127
Credit: 24,517,986
RAC: 17,587
Message 46244 - Posted: 16 May 2013, 20:42:43 UTC

Haven't benchmarked recently, but I did benchmark my i7-920 when it was new. It's running at stock speed, with 6 GB memory in triple-channel mode, probably at 1066 MHz memory speed. If I remember correctly it was benchmarked on a HADAM3P model, since AFAIK this was before the various regional models were released. In any case, the results are:

1 instance to 1st. trickle: 6237 seconds/trickle.
4 instances to 3rd. trickle and averaging: 8456 seconds/trickle.
8 instances to 3rd. trickle and averaging: 14002 seconds/trickle.

Meaning, running 4 instances it's performing as a 3-core-computer, while running 8 instances it's performing as a 3.5-core-computer. The advantage of running 8 over 4 instances was 21%.
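
The "3-core" and "3.5-core" figures follow from comparing each run against the single-instance time. A minimal Python sketch of that calculation, with an illustrative function name:

```python
# Seconds per trickle with a single instance running (from the results above).
SINGLE_INSTANCE_SEC = 6237


def effective_cores(n_instances: int, sec_per_trickle: float,
                    single_sec: float = SINGLE_INSTANCE_SEC) -> float:
    """Throughput expressed as a multiple of one instance running alone."""
    return n_instances * single_sec / sec_per_trickle


if __name__ == "__main__":
    four = effective_cores(4, 8456)
    eight = effective_cores(8, 14002)
    print(f"4 instances behave like {four:.2f} cores")
    print(f"8 instances behave like {eight:.2f} cores")
    print(f"HT advantage: {(eight / four - 1) * 100:.0f}%")
```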

The benchmarking was done without any turbo-boost (don't even remember if it has turbo at all...), and the same model was re-run from beginning to remove any effects from variable speed during a model and between models.


As for the i7-3770K not showing the same HT effect, my first suggestion would be turbo boost, but if you've disabled this... Another possibility is that the i7-920 is triple-channel, so even at 1066 MHz it has the same total bandwidth as the i7-3770K running at the default dual-channel 1600 MHz memory speed. Since the 3770K is also the faster CPU, it's possible its memory bandwidth is saturated earlier than on the slower i7-920.
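
The bandwidth comparison can be checked with the usual peak-rate estimate, sketched below in Python. It assumes 64-bit (8-byte) memory channels and gives theoretical peak figures only, not measured throughput; the function name is illustrative.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: float,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1000 MB).

    Assumes each channel is 64 bits wide, i.e. 8 bytes per transfer.
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


if __name__ == "__main__":
    i7_920 = peak_bandwidth_gb_s(channels=3, mega_transfers_per_s=1066)
    i7_3770k = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=1600)
    print(f"i7-920, triple-channel DDR3-1066: {i7_920:.1f} GB/s")
    print(f"i7-3770K, dual-channel DDR3-1600: {i7_3770k:.1f} GB/s")
```

Both come out at roughly 25.6 GB/s, so the faster CPU has no more memory bandwidth to feed its 8 threads than the older one.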
ID: 46244


©2024 cpdn.org