Message boards : Number crunching : Rig Behaving Badly...
Author | Message |
---|---|
Send message Joined: 27 Aug 04 Posts: 9 Credit: 524,774 RAC: 0 |
Greetings all: In the hope of furthering CPDN (and, of course, boosting my credit), I decided to build a new dual Xeon (Nocona-based) system. For some reason she consistently freezes when I leave her to crunch. I am hoping for some help with troubleshooting so I can get her up and running as obediently as her cohorts. First off, this is the system configuration: 2 x 2.8 GHz 800 MHz FSB Nocona Xeon processors with 1024 KB cache each; Iwill DH800 dual Xeon motherboard; 1 x 36 GB WDC Raptor SATA drive; 2 x 512 MB Corsair PC3200 DDR RAM modules; GeForce 6800 with 128 MB RAM (61.77 driver); Win XP Pro, clean install, nothing running alongside BOINC. Here is the situation so far. I left her running on SP1 with BOINC 4.09 and only CPDN, running 4 concurrent models; within 2 hrs she froze completely (hard reset required). I then dug up the old BOINC 4.05: same thing. Upgraded to SP2: same thing. Went back to 4.09: same thing. At that point I decided to see whether the machine itself was causing the error, so I let her run without anything demanding CPU cycles, and she ran fine for 2 days. So I ran BOINC again: same crash. The computer runs nice and cool (50 C for each CPU and 63 C for the GeForce 6800), so it's not an overheating issue, but I am at a total loss as to why she is misbehaving. Any suggestions would be greatly appreciated. Update 10/12/04 @ 23:00 EST: I have just installed BOINC 4.12 and downloaded new models. I HOPE that this will solve my problem; will update again in the AM. |
Send message Joined: 5 Aug 04 Posts: 39 Credit: 14,887 RAC: 0 |
You can't test stability by letting it run idle: nothing will happen then unless your system is REALLY screwed up. It sounds like the memory or something else can't keep up when being pushed at 110%. Have you tried running memtest86 to see if there is a problem? You can burn it to a CD and boot from that. If it freezes during memtest without printing any errors, then something is not right. Is the system overclocked? What frequency and timings do you run the RAM at? |
Send message Joined: 5 Aug 04 Posts: 426 Credit: 2,426,069 RAC: 0 |
There is a known bug in CCv4.09 and lower that will cause the system to hang if there is not enough work for all processors. I don't know whether it has been fixed in CCv4.12. Could this be what you are seeing? Or have you always had the same 4 workunits? John Keck -- BOINCing since 2002/12/08 |
Send message Joined: 7 Aug 04 Posts: 57 Credit: 4,168 RAC: 0 |
But the tests don't really help. I have done it and all worked fine. Nevertheless, CPDN crashed on my overclocked machine, so I had to go back to standard. Greetings from Germany, Basti |
Send message Joined: 27 Aug 04 Posts: 9 Credit: 524,774 RAC: 0 |
BOINC 4.12 did not solve the issue. After it was downloaded and installed, CPDN downloaded 4 new models and began again; two of them suffered computational errors (Visual Fortran) and subsequently terminated. Later that hour the machine froze yet again. Janus: both CPUs have locked FSB multipliers, so I could not O/C them unless I wanted to start capping pins on the CPU (not going to happen). The memory is also at its stock rating; I haven't even bothered trying to set it lower than its 3-3-3-8 timings. When I get the chance I will attempt to find a copy of memtest and play with that. In case anyone cares, this is the link to the host in question: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=56399 More suggestions would be appreciated. |
Send message Joined: 7 Aug 04 Posts: 187 Credit: 44,163 RAC: 0 |
> When I get the chance I will attempt to find a copy of memtest and play with that. http://memtest86.com/ |
Send message Joined: 27 Aug 04 Posts: 9 Credit: 524,774 RAC: 0 |
Thank you for the URL, Heffed. It is currently running test 5; I will report on its findings later this evening. BTW, it has been determined that this is not a BOINC-related issue: the computer also froze with SETI Driver in a similar time frame, which strongly points to a faulty hardware component. The chief suspect is the memory DIMMs, or perhaps one of the CPUs if nothing is found wrong with the DIMMs. Thank you all once again. |
Send message Joined: 5 Aug 04 Posts: 39 Credit: 14,887 RAC: 0 |
> But the tests don't really help. I have done it and all worked fine. Nevertheless, CPDN crashed on my overclocked machine, so I had to go back to standard. Oh, the tests work. They just don't test the stability of your whole computer as you might expect; they test the stability of the RAM. For instance, moving blocks around in RAM doesn't load the CPU much. To check whether it is a CPU issue, a different test is used: something that does a lot of calculations but doesn't address memory or the hard drive at all. The idea is to test everything in isolation and see when the error occurs. About the FSB multiplier lock: you could try changing the FSB speed instead. That would under/overclock the thing. As a matter of fact, my system is very unstable at the standard speed it came with; it has to run slightly overclocked to run perfectly (weird, I know...). I have tracked it down to a problem with sync to the HDD controller when running at certain speeds. |
Send message Joined: 27 Aug 04 Posts: 9 Credit: 524,774 RAC: 0 |
Troubleshooting for this rig has now concluded. Through www.7byte.com's Hot CPU Tester, it has been determined that the CPU in Socket 0 is inept when it comes to deriving Pi. Should the individual who sold me the pair of CPUs be unable to furnish me with a replacement, the Pi-fumbling CPU may turn into a rear-view mirror ornament ;) |
Send message Joined: 5 Aug 04 Posts: 39 Credit: 14,887 RAC: 0 |
> Through www.7byte.com's Hot CPU Tester, it has been determined that the CPU in Socket 0 is inept when it comes to deriving Pi. Great that you found the error =) Please swap the CPUs first to see whether it is the motherboard socket that is at fault; it has happened, although very seldom. |
Send message Joined: 27 Aug 04 Posts: 9 Credit: 524,774 RAC: 0 |
It seems today is my lucky day. I have contacted my source and he does indeed have another identical CPU, which should arrive within a couple of business days. That should remedy the difficulty the rig has been having. I will post again once I have tested the rig with the replacement chip; if all goes according to plan, this will be the end of the stability issues and will allow this rig to join the ranks of trickling machines :) |
Send message Joined: 4 Sep 04 Posts: 61 Credit: 80,585 RAC: 0 |
By now, I'll assume that the CPU was, in fact, the issue. Nice work. This is an excellent example of intelligent troubleshooting. |
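For anyone wanting to try the "test the CPU in isolation" idea from this thread, here is a minimal Python sketch. It is not how Hot CPU Tester works internally (that is not described here); it only illustrates the same principle: repeat a deterministic, calculation-heavy job (computing digits of pi with integer arithmetic) on every logical CPU and flag any run whose answer differs. The names `pi_digits`, `check_worker` and `stress`, and the default digit and round counts, are made up for illustration. A mismatch would hint at flaky hardware, but without pinning processes to a socket it cannot say which CPU is at fault, and a clean pass proves nothing about the RAM.

```python
from concurrent.futures import ProcessPoolExecutor
import os


def pi_digits(n: int) -> str:
    """Return pi truncated to n decimal places (Machin's formula, integers only)."""
    guard = 10  # extra digits to absorb rounding in the series
    scale = 10 ** (n + guard)

    def arctan_inv(x: int) -> int:
        # scale * arctan(1/x) via the alternating Taylor series
        total = term = scale // x
        x2 = x * x
        k = 1
        sign = -1
        while term:
            term //= x2
            k += 2
            total += sign * (term // k)
            sign = -sign
        return total

    # Machin: pi = 4 * (4*arctan(1/5) - arctan(1/239))
    pi_scaled = 4 * (4 * arctan_inv(5) - arctan_inv(239))
    s = str(pi_scaled // 10 ** guard)
    return s[0] + "." + s[1:]


def check_worker(args):
    """Compute pi once as a reference, then recompute and count disagreements."""
    digits, rounds = args
    reference = pi_digits(digits)
    mismatches = sum(1 for _ in range(rounds) if pi_digits(digits) != reference)
    return reference, mismatches


def stress(digits: int = 2000, rounds: int = 20, workers=None) -> bool:
    """Run check_worker on every logical CPU; True means all results agreed."""
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check_worker, [(digits, rounds)] * workers))
    references = {ref for ref, _ in results}
    return len(references) == 1 and all(m == 0 for _, m in results)
```

To use it, call `stress()` from under an `if __name__ == "__main__":` guard (needed on Windows, where worker processes re-import the script) and raise `rounds` until the run lasts long enough to reproduce the freeze window reported above, roughly two hours in this case.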
©2024 cpdn.org