Thread 'Rig Behaving Badly...'

old_user1940

Joined: 27 Aug 04
Posts: 9
Credit: 524,774
RAC: 0
Message 5288 - Posted: 13 Oct 2004, 2:17:25 UTC
Last modified: 13 Oct 2004, 2:35:33 UTC

Greetings All:

In the hope of furthering CPDN (and, of course, to boost my credit) I decided to build a new dual Xeon (Nocona-based) system. But for some reason she freezes consistently whenever I leave her to crunch. I am hoping for some help with troubleshooting this problem, so that I can get her up and running as obediently as her cohorts.

First off, here is the system configuration:

2 x 2.8 GHz Nocona Xeon processors (800 MHz FSB) with 1024 KB cache each

Iwill DH800 dual Xeon motherboard

1 x 36 GB WDC Raptor SATA drive

2 x 512 MB Corsair PC3200 DDR RAM modules

GeForce 6800 with 128 MB RAM (61.77 driver)

Win XP Pro, clean install, nothing running alongside BOINC.

Here is the situation so far. I first left her running under SP1 with BOINC 4.09 and only CPDN, crunching 4 concurrent models. Within 2 hours she froze completely (hard reset required).

I then dug up the old BOINC 4.05: same thing...

Upgraded to SP2: same thing. Went back to 4.09: same thing.

At that point I decided to see whether the machine itself was causing the error, so I let her run without anything demanding CPU cycles. She ran fine for 2 days... so I decided to run BOINC again: same crash...

The computer runs nice and cool (50 C for each CPU and 63 C for the GeForce 6800), so it's not an overheating issue. But I am at a total loss as to why she is misbehaving so... any suggestions would be greatly appreciated...


- update 10/12/04 @23.00 EST

I have just installed BOINC 4.12 and downloaded new models. I HOPE this will solve my problem; will update again in the AM...
old_user73

Joined: 5 Aug 04
Posts: 39
Credit: 14,887
RAC: 0
Message 5297 - Posted: 13 Oct 2004, 6:46:33 UTC - in response to Message 5288.  

You can't test the stability by letting it run idle - nothing will happen then unless your system is REALLY screwed up.

It sounds like the memory (or some other component) can't keep up when pushed to 110% - have you tried running memtest86 to see if there is a problem? You can burn it to a CD and boot from that...
If it freezes during memtest without printing any errors, then something is still not right.
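
For anyone curious what a memory pattern test does in principle, here is a minimal Python sketch. It is not how memtest86 is implemented and is no substitute for it: it runs under the operating system and only exercises whatever buffer it happens to be given, so a clean result proves very little. The function name `pattern_check`, the four bit patterns and the buffer size are all assumptions made for illustration.

```python
def pattern_check(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each bit pattern, read it back, and count bad bytes.

    Hypothetical helper for illustration only; not taken from memtest86.
    """
    buf = bytearray(size_bytes)
    bad_bytes = 0
    for pattern in patterns:
        # Write phase: fill the whole buffer with the pattern byte.
        buf[:] = bytes([pattern]) * size_bytes
        # Verify phase: every byte should still hold the pattern.
        bad_bytes += size_bytes - buf.count(bytes([pattern]))
    return bad_bytes


if __name__ == "__main__":
    # 16 MiB is an arbitrary size chosen for the example.
    print("bad bytes:", pattern_check(16 * 1024 * 1024))
```

Any nonzero count would be a red flag, but a zero count here does not clear the RAM; booting memtest86 from CD remains the proper test.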

Is the system overclocked? What frequency and timings do you run the RAM at?
Keck_Komputers

Joined: 5 Aug 04
Posts: 426
Credit: 2,426,069
RAC: 0
Message 5298 - Posted: 13 Oct 2004, 7:06:22 UTC

There is a known bug with CCv4.09 and lower that will cause the system to hang if there is not enough work for all processors. I don't know if it has been fixed in CCv4.12 or not. Could this be what you are seeing? Or have you always had the same 4 workunits?
John Keck -- BOINCing since 2002/12/08
old_user392

Joined: 7 Aug 04
Posts: 57
Credit: 4,168
RAC: 0
Message 5300 - Posted: 13 Oct 2004, 7:41:37 UTC

But the tests don't really help. I have done it and all worked fine. Nevertheless, CPDN crashed on my overclocked machine, so I had to go back to standard settings.


Greetings from Germany
Basti
old_user1940

Joined: 27 Aug 04
Posts: 9
Credit: 524,774
RAC: 0
Message 5314 - Posted: 13 Oct 2004, 14:09:06 UTC - in response to Message 5300.  

BOINC 4.12 did not solve the issue. After I downloaded and installed it, CPDN fetched 4 new models and began again; two of them suffered computational errors (Visual Fortran) and terminated. Later that hour the machine froze yet again.

-Janus:

Both CPUs have locked FSB multipliers, so I could not O/C them unless I wanted to start capping pins on the CPU (not going to happen). The memory is also running at its stock rating; I haven't even bothered trying to tighten it below its 3-3-3-8 timings. When I get the chance I will find a copy of memtest and play with that.

In case anyone cares, this is the link to the host in question:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=56399

More suggestions would be appreciated...
old_user355

Joined: 7 Aug 04
Posts: 187
Credit: 44,163
RAC: 0
Message 5318 - Posted: 13 Oct 2004, 16:18:09 UTC - in response to Message 5314.  

> When I get the chance I will attempt to find a copy of
> memtest and play with that.

http://memtest86.com/

old_user1940

Joined: 27 Aug 04
Posts: 9
Credit: 524,774
RAC: 0
Message 5324 - Posted: 13 Oct 2004, 20:16:29 UTC - in response to Message 5318.  

Thank you for the URL, Heffed. It is currently running test 5; I will report its findings later this evening.

BTW, I have determined that this is not a BOINC-related issue: the computer also froze with SETI Driver in a similar time frame, which strongly points to a faulty hardware component. The chief suspect is the memory DIMMs, or perhaps one of the CPUs if nothing turns out to be wrong with the DIMMs...

Thank you all once again.
old_user73

Joined: 5 Aug 04
Posts: 39
Credit: 14,887
RAC: 0
Message 5338 - Posted: 14 Oct 2004, 7:13:27 UTC - in response to Message 5300.  
Last modified: 14 Oct 2004, 7:20:04 UTC

> But the tests don't really help. I have done it and all worked
> fine. Nevertheless CPDN crashed on my overclocked machine. So I had to go back
> to standard.

Oh, the tests work - they just don't test the stability of your whole computer as you might expect; they test the stability of the RAM. For instance, moving blocks around in RAM doesn't load the CPU much...

To check whether it is a CPU issue, a different test is used: something that does a lot of calculations but hardly touches memory or the hard drive.

The idea is to test everything in isolation and see when the error occurs.
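
As a rough illustration of such a CPU-only check, here is a minimal Python sketch that computes pi repeatedly with integer arithmetic (Machin's formula) and flags any run that disagrees with a known reference or with the first run. It is only a sketch of the general idea, not the method used by any particular stress-testing tool; the names `arctan_inverse`, `pi_digits` and `count_mismatches`, the digit counts and the round count are assumptions chosen for the example.

```python
# pi to 50 decimal places (51 digits including the leading 3), a fixed reference.
KNOWN_PI_50 = "314159265358979323846264338327950288419716939937510"


def arctan_inverse(x, unity):
    """Return arctan(1/x) scaled by `unity`, using the alternating Taylor series."""
    total = term = unity // x
    x_squared = x * x
    divisor = 3
    sign = -1
    while term:
        term //= x_squared
        total += sign * (term // divisor)
        sign = -sign
        divisor += 2
    return total


def pi_digits(decimals):
    """Return pi truncated to `decimals` decimal places, as one big integer."""
    guard = 10  # extra digits that absorb rounding error in the series
    unity = 10 ** (decimals + guard)
    # Machin's formula: pi = 4 * (4*arctan(1/5) - arctan(1/239))
    pi_scaled = 4 * (4 * arctan_inverse(5, unity) - arctan_inverse(239, unity))
    return pi_scaled // 10 ** guard


def count_mismatches(rounds, decimals=2000):
    """Recompute pi `rounds` times; count runs that look wrong (needs decimals >= 50)."""
    reference = None
    mismatches = 0
    for _ in range(rounds):
        result = pi_digits(decimals)
        if reference is None:
            reference = result
        if result != reference or not str(result).startswith(KNOWN_PI_50):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatching runs:", count_mismatches(rounds=20))
```

To test CPUs in isolation you would run one copy per CPU, pinned with whatever processor-affinity setting your OS provides. Mismatches that show up on only one CPU would point at that CPU (or its socket) rather than at the RAM or the software.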

About the locked multiplier - you could try changing the FSB speed instead. That would under/overclock the thing. As a matter of fact, my own system is very unstable at the stock speed it came with; it has to run slightly overclocked to run perfectly (weird, I know...). I have tracked it down to a sync problem with the HDD controller at certain speeds.
old_user1940

Joined: 27 Aug 04
Posts: 9
Credit: 524,774
RAC: 0
Message 5379 - Posted: 15 Oct 2004, 2:12:54 UTC - in response to Message 5338.  
Last modified: 15 Oct 2004, 6:01:18 UTC

Troubleshooting for this rig has now concluded.

Using www.7byte.com's Hot CPU Tester, I have determined that the CPU in Socket 0 is inept when it comes to deriving Pi.

Should the individual who sold me the pair of CPUs be unable to furnish me with a replacement, the Pi-fumbling CPU may turn into a rear-view mirror ornament ;)

old_user73

Joined: 5 Aug 04
Posts: 39
Credit: 14,887
RAC: 0
Message 5385 - Posted: 15 Oct 2004, 7:00:54 UTC - in response to Message 5379.  
Last modified: 15 Oct 2004, 7:01:20 UTC

> Through www.7byte.com 's Hot CPU Tester, It has been determined that the CPU
> in Socket 0 is inept when it comes to deriving Pi.

Great that you found the error =)
Before anything else, please swap the CPUs between the sockets to check whether it is the motherboard socket that is faulty - it does happen, although very seldom...
old_user1940

Joined: 27 Aug 04
Posts: 9
Credit: 524,774
RAC: 0
Message 5410 - Posted: 15 Oct 2004, 19:23:11 UTC

It seems today is my lucky day: I have contacted my source, and he does indeed have another identical CPU, which should arrive within a couple of business days. That should remedy the difficulty the rig has been having.

I will post again once I have tested the rig with the replacement chip. If all goes according to plan, this will be the end of the stability issues and will allow this rig to join the ranks of trickling machines :)
old_user11965

Joined: 4 Sep 04
Posts: 61
Credit: 80,585
RAC: 0
Message 5665 - Posted: 26 Oct 2004, 8:14:06 UTC

By now, I'll assume that the CPU was, in fact, the issue. Nice work. This is an excellent example of intelligent troubleshooting.
