Questions and Answers : Windows : Optimise PC build for CPDN
Author | Message |
---|---|
Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Problems with Models Crashing
Computer ID 1290283
So far all my tasks have crashed and I'm suspending calculations for the moment.
* 2 early ones were model errors, as noted in another thread
* 2 caused by me as below
* 2 at Time Step 259,200 (Tasks 15966049, 15998767)
* 3 at Time Step 518,400 (Tasks 15940327, 15942872, 15965966)
I assume these timesteps are the 25% and 50% marks for the hadcm3n models. Remembering other threads about this, I assume this is a problem with the hard drives not getting the data out quickly enough, as highlighted in Greg's post here. It seems possible that the PC is generating too much data for the older drive BOINC sits on. Am I correct in this, and if so, what's the best approach?
I'm quite happy to put a faster drive in, and that includes an SSD if necessary. Yes, I know the SSD life could be short, but an Intel 520 should be good for 2 years? By that time something else will be along anyway. A 120GB SSD is around $240, a 1TB Seagate HDD around $110. I would need the 1TB drive to get the higher data speeds.
Or are there any ways in which Windows can be manipulated to speed things up? I do have a largish number of drives; I don't know if that makes any difference. Any thoughts, anyone?
BTW, thanks for the comments all and the link Greg. I'm sure there was something more related to CPDN somewhere, but perhaps that was the old BB.
Martin |
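The 25%/50% assumption above can be sanity-checked with a little arithmetic. This is only a minimal sketch: the two timestep values come from the post, the fractions are Martin's assumption, and the implied run length is just what those numbers suggest, not a documented hadcm3n figure.

```python
# Crash timesteps from the post, paired with the ASSUMED progress fraction.
crash_points = {259_200: 0.25, 518_400: 0.50}


def implied_total_steps(step: int, fraction: float) -> int:
    """Total timesteps a run would have if `step` sits at `fraction` of it."""
    return round(step / fraction)


totals = {implied_total_steps(step, frac) for step, frac in crash_points.items()}

# A single value in the set means both crash points are consistent
# with one and the same run length.
print(totals)
```

If the set held two different values, the 25%/50% reading of the timesteps would not hold together.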
Joined: 19 Apr 08 Posts: 179 Credit: 4,306,992 RAC: 0 |
Two minutes waiting for shutdown tasks to write sounds pretty out-of-whack. Could it have something to do with that ISRT for SSDs? |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I would recommend against an SSD for CPDN. I calculated my 64mb Intel 520 would only survive around 6 weeks in theory (each model generates something like a terabyte of writes over its life), so I moved the Boinc data directory onto a physical disk. A bigger disk, or single-level-cell flash, would last longer.
I see both 'signal 11' and 'code 193' in the status of those jobs. Does the time of the crashes correspond to anything particular?
As a starting point:
* Change your settings to 'Leave tasks in memory when suspended' = Y, 'suspend if CPU usage is above %' to 0%, 'Use at most ... % of CPU' to 100.00. This will prevent the model being swapped out of memory.
* Make sure that the Boinc data directories are excluded from any antivirus scans.
If the crash always happens at the moment that the zip files are generated (25%, 50%, 75%) then I would be looking at the antivirus software on that PC first; it may be interfering. Obviously you shouldn't turn off antivirus, but you can exclude the appropriate directories (which may include c:/temp/ if the files are generated there).
Here is a good post regarding error numbers: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7592&nowrap=true#46161
If neither of these helps, try running a 'stress test' for 24 or 48 hours on the PC. I use prime95 (one copy per thread, to test the CPU), and memtest86+ (to test the memory).
I'm a volunteer and my views are my own. News and Announcements and FAQ |
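The kind of endurance estimate described above can be sketched as follows. This is a rough illustration, not the calculation Mike actually did: only the "about a terabyte of writes per model" figure comes from the post, while the rated endurance, the number of parallel models and the days per model are hypothetical placeholders.

```python
def ssd_lifetime_days(endurance_tb: float, tb_per_model: float,
                      models_in_parallel: int, days_per_model: float) -> float:
    """Days until the drive's rated write endurance would be used up."""
    if min(endurance_tb, tb_per_model, models_in_parallel, days_per_model) <= 0:
        raise ValueError("all inputs must be positive")
    tb_written_per_day = tb_per_model * models_in_parallel / days_per_model
    return endurance_tb / tb_written_per_day


# tb_per_model is the rough figure from the post; every other value here
# is a HYPOTHETICAL placeholder, not taken from the thread or a datasheet.
days = ssd_lifetime_days(endurance_tb=36.0, tb_per_model=1.0,
                         models_in_parallel=8, days_per_model=10.0)
print(f"{days:.0f} days (about {days / 7:.1f} weeks)")
```

Substituting the drive's real endurance rating and the observed model run time would give a more meaningful number.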
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
> Two minutes waiting for shutdown tasks to write sounds pretty out-of-whack. ...
Well, he has 16 threads, so there are an awful lot of models to shut down. |
Joined: 19 Apr 08 Posts: 179 Credit: 4,306,992 RAC: 0 |
If the ISRT cache is large enough, Windows may decide to copy the BOINC data directory there (explaining the long shutdown time). Then you could perform the trick of hotplugging the BOINC HDD without crashing. Of course with this kind of caching you're still subjecting your SSD to a lot of writes. |
Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Thanks Guys,
From Mike's post:
> Does the time of the crashes correspond to anything particular?
Not as far as I'm aware. In fact the last one happened when I was sitting here looking at the CPDN results web pages & Std Err to figure out what was going wrong. The task in BOINC Manager was sitting at 49.xx% and the next time I looked back it was at 100% and crashed. A few browser pages open, email etc, but no power surges that I noticed.
These were the entries in the Event Log at the time (Task 15965966):
5/09/2013 7:44:03 p.m. | climateprediction.net | Not requesting tasks: don't need
5/09/2013 7:44:08 p.m. | climateprediction.net | Computation for task hadcm3n_o4fj_1980_40_008408049_1 finished
5/09/2013 7:44:08 p.m. | climateprediction.net | Output file hadcm3n_o4fj_1980_40_008408049_1_2.zip for task hadcm3n_o4fj_1980_40_008408049_1 absent
5/09/2013 7:44:09 p.m. | climateprediction.net | Scheduler request completed
|
Joined: 19 Apr 08 Posts: 179 Credit: 4,306,992 RAC: 0 |
> ...,but digging around it seems to only work when configured as RAID.
No, all it requires is an SSD and an HDD. It sounds like it will be active by default on Intel 68 and up even without the configuration program installed (the driver is part of Win7/8).
Ed: actually you're right. For it to work a single HDD must be put in RAID mode--pretty weird. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
> ... It sounds like it will be active by default on Intel 68 and up even without the configuration program installed (the driver is part of Win7/8).
It won't use ISRT by default unless it is set up in both the Bios & Windows... it took me several days and much swearing before I could get ISRT working with my PC! (I set it up on my main disk, and have a secondary disk for Boinc which is not cached).
> ... How do we know for certain this is where BOINC builds the zipped files? I notice a file in c:/windows/temp/ called DMI3FCD.tmp of 0KB and timed at 17:37 (NZ time), which was the time task 15998767 crashed. Could be it? ...
Well, it is probably associated with the model, but whether it was actually the starting point for a .Zip or not I cannot tell. That's a typical name for a temporary file when requested by something using the Windows API. But try excluding it, and see if that helps.
> ... PC had a 40 hour burn before it left the shop, and a 30 hour one after I had installed all the extra gear and software, but could run it again in the weekend if it was going to be beneficial. ...
Well, if it's already had a stress-test done, there is little point in doing it again.
> ... the drive that houses BOINC ...
That sort of suggests that you have other drives available. Just as an experiment, it may be worth moving the Boinc Data folder over to a different drive (as long as it isn't an SSD), to see if you can see a difference. Speaking of which, in the (very) old days it used to be possible to set up a RAM disk at system startup, and store it at shutdown. While that wouldn't be a good idea for the models (since you would risk losing progress if the system shuts down unexpectedly), it may also be worth experimenting with.
The overall impression I am getting from this is that the problem is not stability (otherwise the crashes would be along the lines of NEGATIVE THETA etc), but something to do directly or indirectly with disk access (which is why I mentioned antivirus software earlier). |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
One more thing:
This is the model running on your machine: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15998767
This is the same model running on someone else's machine: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15816061
Note that they both crashed at the same point. It might simply be coincidence (the risky 50% point), or perhaps the model itself was doomed to die then anyway. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
When I run several models on one machine, I stagger them a bit by suspending some for different intervals. Hopefully, this means that they're after the same resources at different times. It certainly means that they're at the 25% points at different times. But with 16 at once, maybe the best advice is: Good luck. |
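The staggering idea above can be put into numbers. This is a minimal sketch under stated assumptions: Les suspends models by hand "for different intervals", so the even spacing, the function name `stagger_offsets` and the 40-hour figure are all hypothetical choices for illustration.

```python
def stagger_offsets(n_models: int, interval_hours: float) -> list[float]:
    """Start delays (in hours) that spread n models evenly over one interval."""
    if n_models < 1 or interval_hours <= 0:
        raise ValueError("need at least one model and a positive interval")
    return [i * interval_hours / n_models for i in range(n_models)]


# HYPOTHETICAL: 8 models, each taking 40 hours to go from one 25% point
# to the next. Neither number comes from the thread.
offsets = stagger_offsets(n_models=8, interval_hours=40.0)
print(offsets)
```

Holding each model back by its offset before letting it run would keep the 25% points from landing at the same moment, which is the effect described in the post.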
Joined: 3 Sep 04 Posts: 126 Credit: 26,610,380 RAC: 3,377 |
MartinNZ wrote:
> Mike's starting points were already in place, except for excluding c:/temp/ or similar from the AV. Not sure that is a good directory to exempt from AV scanning, but if needs must. How do we know for certain this is where BOINC builds the zipped files? I notice a file in c:/windows/temp/ called DMI3FCD.tmp of 0KB and timed at 17:37 (NZ time), which was the time task 15998767 crashed. Could be it?
There's no C:\Temp folder on my computer, and C:\Windows\Temp is only accessible with administrative permissions, which Boinc doesn't have on my computer. So Boinc doesn't use these folders.
Climateprediction isn't a real-time program, so a slow drive cannot be the cause of a crashing task.
My advice is not to run any other projects together with Climateprediction on the same computer simultaneously, to keep it from being interrupted periodically. And exclude the Boinc data folder from scanning for viruses, as mentioned by others. Starting from Boinc version 6, viruses in this folder cannot access programs or non-Boinc data on your computer. |
Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Phew, thanks for all the helpful feedback.
Les mentioned:
> When I run several models on one machine, I stagger them a bit by suspending some for different intervals.
I wonder if this is a key issue? Early on when the program was released to get more tasks, 4 arrived at once - big job when it comes to reporting. But then surely this happens with other PCs?
Things I've done.
|
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
> ... currently set at 8 tasks. When Mike first mentioned 16 threads, I took that as meaning CPU threads as hyperthreading is on. ...
Yes, I was looking at the 'processor' count on your computer page (= actually the number of CPU threads). 8 models is easier on the machine than 16 :-)
I actually run 6 models on 4 cores / 8 threads; any more than that and a) it makes my machine struggle, and b) throughput did not increase anyway. The best individual processing speed comes from running one model per core.
Let us know if you have any failures after the above changes. (Fingers crossed...) |
Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Oh dear... And it seemed to be going so well. Another crash 16002819 but this time not me? (INVALID THETA DETECTED - error type added in edit) I take it from Les's post 45480 in another thread that this is a model error? Nothing abnormal happening at the time. If this is indeed a model error, do you reckon it's OK to get some more tasks? Things seem to have settled down in the last few days, with quite a few models getting past 25/50/75% points. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Yes, the INVALID THETA errors are different. They can be caused by either the model's initial parameters resulting in an implausible climate, or they can be due to floating point errors creeping in. They will also be at the 25%/50%/75% boundaries because that is when the model validation takes place.
You've only had the one of these, and your machine has passed long stability checks, so in your case I think the model itself is to blame. If you were getting lots of INVALID THETAs, while other people running the same models were not, then there would be a cause for concern, but that isn't the case. Even if you saw a number of THETAs turning up, they may be related to a particular batch of models, so one of the things to look at would be if they were generated at similar times or different times.
I would suggest ramping up the number of models & seeing what happens. |
Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Final(?) mods underway.
2. CPU Water Cooler - Corsair H100i. CPU temps are reaching 73C (max for the CPU is 85C) and we are only in very early spring. Standard air cooling will struggle in summer and if more than 8 tasks run. Water coolers, while rare in desktops, are not unusual in workstations, as all the major suppliers (e.g. HP, Dell) offer water cooling as a standard option.
|
Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
Hi Martin, I don't think you said what brand and model of power supply is in the machine? I found that the power supply makes a difference. Also, sizing. The rule of thumb is to aim for full-on processing, including graphics card, to be no more than 2/3 of the rated power of the supply. (And no less than 1/2, for optimum efficiency.) I'd estimate a 550W-650W class supply for your machine. FYI on my Sandy Bridge Core i7, i7z (a Linux CPU reporting tool) reports temperatures of 83 - 85 degrees Celsius with 8 models running, and the machine's been stable for the last couple of years (... touch wood). (I do need to vacuum out the CPU heatsink fins six-monthly.) Ivy Bridge CPUs may be more touchy, of course. If you still have problems even after your UPS and water(!) cooling, the last resort (before a different motherboard) is to underclock a few percent and see if that helps. I feel for you. This must be frustrating. |
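The sizing rule of thumb above (full load no more than 2/3 and no less than 1/2 of the rated power) turns into a simple range. A minimal sketch follows; the function name `psu_rating_range` and the 300 W peak load are hypothetical and are not a measurement of Martin's machine.

```python
def psu_rating_range(peak_load_watts: float) -> tuple[float, float]:
    """Rated-power window that keeps peak load between 1/2 and 2/3 of the rating."""
    if peak_load_watts <= 0:
        raise ValueError("peak load must be positive")
    smallest = peak_load_watts * 3 / 2  # peak load equals 2/3 of the rating
    largest = peak_load_watts * 2       # peak load equals 1/2 of the rating
    return smallest, largest


# HYPOTHETICAL peak draw (CPU plus graphics card flat out), for illustration.
low, high = psu_rating_range(300.0)
print(f"Look for a supply rated between {low:.0f} W and {high:.0f} W")
```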
Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Hi Greg,
I gave up being frustrated years ago - the hair is grey enough as it is.
99.9% of power supplies are oversized, including mine. I have a Corsair HX650 (supposedly a badged Seasonic G-650, 650W, 80 Plus Gold). My metered wattages, assuming they are correct, are in my post 46956 below. When running with 8 tasks I pull around 155W. Peak efficiency is at 50%, but according to the 80 Plus test result, the efficiency for 115VAC is 88.6% @ 20% load, 91% @ 50% and 88.9% @ 100%. From memory HardwareSecrets had similar numbers in a test report. So I could have got a smaller power supply, but that does not allow for peak loads. I couldn't find a load calculator that included the Xeon E5-2670, but the Thermaltake one came the closest for the majority of my components and it calculated 537W for the power supply. These estimates are always over, but what else can you do?
As for temperatures, I'm erring on the side of extreme caution. Temperatures always do my head in as they are never straightforward and I'm no expert. Intel give a Tcase of 85C max for my processor. However, NONE of the monitoring software reports this correctly for my motherboard. What they show as Tcase/CPU temp stays constant when core temps have increased 40C. The other key and related temp is Tjunction max (core temp), but Intel do not give this that I can find. Core Temp (Windows) reports this as 102C, which seems about correct from what I've read. The CPU will throttle about 5-10C before this, and the recommendations I've read say stay around 20C below TjMax for stability and long life. I can easily see my Tjunction getting to 85-90C in the middle of summer, so I decided to get in early and add more cooling. I want this rig to last 4-5 years. My old i7 ran in the low 60s C, but I had a massive air cooler on it.
Your temp seems high, but if it's running OK, fine I suppose. I also clean mine out every 6-9 months, but take it into the garage and use a compressor - with due care of course, and no fan spinning.
Vacuum cleaners do run a static risk. |
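The power and temperature figures in this post can be tied together with some arithmetic. A hedged sketch: all constants are the numbers quoted above, but treating the metered 155 W as the supply's load and joining the three efficiency points with straight lines are simplifying assumptions, not how 80 Plus curves are published.

```python
# Figures quoted in the post.
PSU_RATED_W = 650.0
MEASURED_LOAD_W = 155.0  # metered draw with 8 tasks (ASSUMED to be the PSU load)
EFFICIENCY_POINTS = [(0.20, 88.6), (0.50, 91.0), (1.00, 88.9)]  # (load fraction, %)
TJ_MAX_C = 102.0         # as reported by Core Temp
MARGIN_C = 20.0          # "around 20C below TjMax"


def interpolate_efficiency(load_fraction: float) -> float:
    """Linearly interpolate efficiency (%) between the quoted test points."""
    for (x0, y0), (x1, y1) in zip(EFFICIENCY_POINTS, EFFICIENCY_POINTS[1:]):
        if x0 <= load_fraction <= x1:
            return y0 + (load_fraction - x0) * (y1 - y0) / (x1 - x0)
    raise ValueError("load fraction is outside the 20%-100% test range")


load_fraction = MEASURED_LOAD_W / PSU_RATED_W
efficiency = interpolate_efficiency(load_fraction)
target_core_temp = TJ_MAX_C - MARGIN_C

print(f"Load: {load_fraction:.1%} of rating, efficiency about {efficiency:.1f}%")
print(f"Core temperature target: {target_core_temp:.0f}C or below")
```

On these assumptions the supply sits well below its 50% efficiency peak at 8 tasks, and the expected summer core temperatures of 85-90C would be above the 20C safety margin, which matches the reasoning for adding cooling.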
Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
> the recommendations I've read say stay around 20C below TjMax for stability and long life.
This fits with what I read when I was building audio amps. Audio amps we like to keep running for decades. But what's long life for a computer? Hmmm... maybe I should get a better cooler, too.
(Audio is all different now--that was class B bipolar transistors; now hi-fi amps are mostly class D, and the things barely get warm at all, while providing much better sound. And with high-frequency switching power supplies, like those in computers, they barely weigh anything either. Hurray for modern power MOSFETs.)
Your power supply sounds good, so that's blown that theory. Your earlier post sounded like you live a bit out of town, so the voltage at your house may vary quite a bit. The UPS should help considerably with that. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I do have a UPS also - an APC SmartUPS-2200 which can run my PC for 20-30 minutes. It was second-hand, and very cheap from ebay because it was so heavy (I had to collect it). But if you get an ebay one you will need to replace the batteries. At the time I was getting 10 powercuts / month. I'm not currently running it because the power supply has been much improved and I no longer get powercuts. |