climateprediction.net (CPDN)
Posts by Ant B

1) Message boards : Number crunching : Optimised I/O 5.44 problem?
Message 31079
Posted 23 Oct 2007 by Ant B
Success, it seems!

Much rejoicing here since I uninstalled the controller drivers and restored all the BIOS settings to their previous values. Things were immediately better. There was one further BSoD the next day, but the reboot was uneventful, and there was only one dump (IRQL_NOT_LESS_OR_EQUAL), pointing to IRQ14 again. So I rolled back the IDE controller driver as well (previously I had done only the SATA driver) and, hey presto, everything has been stable for a week now and is happily crunching. There doesn't appear to be any noticeable drop in performance either.

The lesson is that the most up-to-date drivers aren't always the best.

So, apart from restoring my faith in CPDN and my hardware, it saved me having to learn how to use debugging tools; that would have wasted a few evenings...

Thanks everyone.

Anthony
2) Message boards : Number crunching : Optimised I/O 5.44 problem?
Message 30980
Posted 16 Oct 2007 by Ant B
A quick update on progress so far.

Firstly, BOINC and the model are out of the frame; it does look like hardware. Memory checks still run fine, and swapping out the chips doesn't help: the failures occur with each chip on its own and with both together. In fact, things seem to be getting worse, and I have begun getting BSoDs during normal operation now. The memory dumps vary (the problem is that you get the last one, never the first), but I think they point to the disk controller. Fiddling with the BIOS (turning on memory execute protection and memory compatibility, turning off SpeedStep and various other bits) just made it worse. I think one model crashed as a result the other day.

So I am uninstalling the newly updated drivers for the I/O and hard drive controllers and installing the versions from two releases back. We will see what happens.

Wish me luck

Anthony
3) Message boards : Number crunching : Optimised I/O 5.44 problem?
Message 30825
Posted 5 Oct 2007 by Ant B
Ant B:

A few important things that everyone has seemed to miss so far:
1) CPU temperatures - should be below 60 °C for stability;
2) HD temperatures - high temps can cause corruption; I've experienced it!
3) Pagefile usage - high pagefile usage causes excessive disk access during shutdown. Do you have other apps running? Did you check for memory leaks (memory usage growing all the time)?
4) Any other changes in the system? Windows Update, etc.? Are you running Service Pack 4?
5) As a last resort, disable write-behind caching on your hard drive. See the disk's properties in Device Manager.

If this is too technical, let me know and I'll try to simplify or explain it better.


Thanks, everyone, for the helpful comments.
As for the hardware issues:

CPU temperatures: does anyone know a utility that works with my ASRock Conroe 1333 DVI board? I had a utility on my old machine that ran smoothly as an MMC plug-in. I am not sure this one is running hot, but it's stable for long periods at full load, and has been since I put it together in June.

HD temperatures: same problem, as I know of no hardware monitor for it. This hard drive is a year or two old and ran flawlessly until a week after I started the new model. I haven't pursued this line of enquiry because the models generate so little disk activity compared with other times and apps, and those cause no symptoms.

Pagefile use I know nothing about; where can I check this? I do know that memory use is stable, with no leaks according to Task Manager. The models use only 100 MB each, and I have nearly 1.5 GB of unused physical memory even with two models running. Swapfile size is left on automatic. Other apps are running, but only domestic stuff I have run for ages: IE6, Office, etc. Firewall and antispyware scans, but no antivirus.

The system is up to date and has not changed recently, apart from another 1 GB memory chip added after the first crash (which didn't help). SP4 and all the Windows updates installed automatically. Drivers are up to date according to DriverAgent.

I thought of hard drive caching, and found myself frustrated. The option is not visible in the hard drive's properties. I thought I was going nuts, but it is present on the spare drive I plugged in to recover from crashes. Is this down to the drive itself, or perhaps because I am running it on a SATA port through an adapter plug? I'd love an answer to this one.

I'll keep plugging away at it, though I have a very strong hunch that if I shut down BOINC before shutting down Windows, I will never see the error again.

Keep the suggestions coming though.
4) Message boards : Number crunching : Optimised I/O 5.44 problem?
Message 30807
Posted 4 Oct 2007 by Ant B
Thanks for the comments. I am still slowly working things out. It definitely remains a problem of disk corruption at shutdown. I do not doubt that memory faults can cause this. However, it runs faultlessly for days if I don't shut it down. I don't know enough to say whether this is meaningful.
So far, Memtest86 has run overnight, seven full passes, with no indication of a fault. Maybe I should run it for longer.
Another memory stress tool within Windows has run for a day or so, using up all the spare memory while I let one model run. All OK; no memory faults reported.
I haven\'t changed the physical configuration of memory. There was only one chip on board when it first happened - which I used as an excuse to get another. Evidently this did not fix the error, and I have left well alone.
Before I posted earlier, I did try putting the hard drive on a different controller while it was doing its intermittent crashing, because I suspected something odd like a loose plug. It's an IDE drive on an adapter to a SATA port in IDE emulation mode (according to the BIOS). I put it straight on an IDE cable: no better. Changing the controller driver just seemed to make things worse.
So now everything is back as it was, and I am exiting BOINC before every shutdown. So far so good. I am wondering about changing how often BOINC writes to the hard drive in my preferences; any ideas on whether this may make a difference?
5) Message boards : Number crunching : Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion
Message 30806
Posted 4 Oct 2007 by Ant B
I had one of these, I think. I hope this link works:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6732317
At that time I was running two slab models, and this one just stopped. Not quite stopped, but it went from two trickles a day to not managing to get out of June 2075 or wherever it was. I was frustrated because it was about 95% complete. Killing it and restoring from a backup made no difference: it got to the same place and just 'stopped' again. It was still using all the CPU, and did move between timesteps, but so slowly that it never even got to the next checkpoint. It wasn't looping as far as I could see. The PC is standard issue, an Intel Core 2 6320, not overclocked, running on 1 GB RAM at that stage. It's the first one I have aborted :(
6) Message boards : Number crunching : Optimised I/O 5.44 problem?
Message 30777
Posted 1 Oct 2007 by Ant B
So far my crunching has been hassle-free: the BBC coupled model and a few slab models (except for one that froze at 95% or so). Now, since downloading two 5.44 models, I have been seeing the 'blue screen of death' on my W2K machine, the first time I have seen it in over 4 years. This machine has been running steadily since July 07. Messages (on screen, in the memory dump file and in Event Viewer) mentioned boot sector corruption, ntoskrnl corruption, hard drive failure, I/O conflicts and the like. At one point it even said the BIOS was corrupt. This happens only after shutdown; the machine runs without a hitch otherwise. Am I right in suspecting the new model, perhaps? I have suspended the models and, if nothing untoward happens in the next week, I will post an update, but I would be glad to hear comments or whether anyone else has had similar experiences.
7) Message boards : Number crunching : Where is my bottleneck?
Message 29754
Posted 28 Jul 2007 by Ant B
I think that there is something mucking up the efficient running of my model on my machine.

To start with, my trickle results show a consistent "Avg sec/TS" of 4.6. However, I see that "2.5 Seconds/Timestep computational average" is shown as a benchmark in the "Application Preferences" section of the CPDN preferences.

So, if the "2.5 CPU seconds per timestep" benchmark is reasonable, can anyone see where I may be losing nearly 50% efficiency?

Thanks,
Ed


Ed, your PC seems just fine: good specs, benchmarks all OK, etc. I think it's simply that you are only allowing CPDN 50% of the CPU resources, so your sec/TS is about double what it would be otherwise. If the other 50% is doing nothing, you may as well let CPDN use it. The programme runs so efficiently at low priority that you won't notice it at all in the background.
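A rough way to check this explanation, as a sketch only: assume the trickle's "Avg sec/TS" is wall-clock time and that limiting CPDN to a fraction of the CPU simply stretches wall-clock time by 1 / fraction (a simplification). Multiplying the measured figure by that fraction then estimates what you would see with the whole CPU. The function name below is hypothetical, for illustration.

```python
def unthrottled_sec_per_ts(measured_sec_per_ts: float, cpu_fraction: float) -> float:
    """Estimate seconds per timestep if CPDN could use 100% of the CPU.

    Hypothetical model: a CPU limit of `cpu_fraction` stretches wall-clock
    time by 1 / cpu_fraction, so undoing it means multiplying by the fraction.
    """
    if not 0 < cpu_fraction <= 1:
        raise ValueError("cpu_fraction must be in (0, 1]")
    return measured_sec_per_ts * cpu_fraction


# Ed's figures: 4.6 s/TS reported while CPDN gets 50% of the CPU.
estimate = unthrottled_sec_per_ts(4.6, 0.5)
print(f"{estimate:.1f} s/TS")  # 2.3 s/TS, close to the 2.5 s/TS benchmark
```

An estimate near the benchmark points to the CPU share as the whole story; one still far above it would suggest something else is slowing the model down.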

Cheers

Anthony
8) Message boards : Cafe CPDN : can a task really take this long?
Message 27350
Posted 15 Mar 2007 by Ant B
I have just started on CPDN and, as far as I can see, there are over 7000 hours to complete. As the experiment appears to have a report deadline next November, surely my experiment is doomed not to finish?

Bit depressing, that!

Hope I have misunderstood the information.



James,

There certainly seems to be something odd about your PC speed - it should be going a lot quicker. I see a few things which are of concern:
Firstly, your processor speed benchmarks (look on your computer page) are very slow and suggest your machine is running at a quarter of its proper speed. These should show numbers roughly equivalent to your processor speed, so about 3200 for integer speed (or maybe 1600 x 2, since you have a dual-core processor), but definitely not 600. This may mean your computer was set up wrongly in the first place (an upgrade, perhaps?), or that a heavy application was running when the benchmark ran. The suggestion about overheating is valid, but I think it unlikely. You may want to suspend the project and run the benchmarks manually (Advanced > Run CPU benchmarks) to see whether this is just anomalous reporting, or check your PC's BIOS settings.
Secondly, I see you are not running the latest version of the software. The latest version is 5.8.11, and I think I saw in one of your error messages that you are running 5.4.1. Perhaps an update will help.
A couple of the error messages from crashed models seem to point to files getting corrupted. I have no clue what this means, except to wonder whether the shutdown sequence is not completing properly. Perhaps the update will help.

Look at your hardware settings to start with, see whether you can get the CPU benchmarks up to the proper amount, and see what happens.
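To put a 7000-hour estimate in context, it can help to convert it into calendar days and compare that with the time left before the deadline. This is a minimal sketch assuming the PC crunches a fixed number of hours each day; the function names, the deadline date and the hours-per-day figure in the example are hypothetical placeholders, not values from the project.

```python
from datetime import date


def days_needed(estimated_hours: float, hours_per_day: float) -> float:
    """Calendar days to finish if the PC crunches `hours_per_day` each day."""
    if not 0 < hours_per_day <= 24:
        raise ValueError("hours_per_day must be in (0, 24]")
    return estimated_hours / hours_per_day


def can_finish(estimated_hours: float, hours_per_day: float,
               start: date, deadline: date) -> bool:
    """True if the estimated run time fits before the (hypothetical) deadline."""
    return days_needed(estimated_hours, hours_per_day) <= (deadline - start).days


# 7000 hours of crunching around the clock is roughly 292 days.
print(round(days_needed(7000, 24)))  # 292
# Hypothetical example: starting 15 Mar 2007 with a deadline of 30 Nov 2007.
print(can_finish(7000, 24, date(2007, 3, 15), date(2007, 11, 30)))  # False
```

If fixing the benchmarks brings the estimate down, rerunning the check with the new figure shows whether the deadline is still a worry.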

Good luck.

Anthony


©2025 cpdn.org