climateprediction.net home page
BOINC 4.19 + CPDN 4.04 PDB files?

Questions and Answers : Windows : BOINC 4.19 + CPDN 4.04 PDB files?

Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 9487 - Posted: 17 Feb 2005, 21:46:37 UTC
Last modified: 17 Feb 2005, 21:55:20 UTC

Is there such a file somewhere?

I expect a crash in about 6 hours on trickle 24 - it probably will freeze the computer but maybe it gets a chance to write some debug info before it does that.

Sorry, I should have asked earlier but I got the idea just now.

I'm on DSL and don't mind the big download.

Edit: no hurry anymore. I made a backup, so if the crash comes I can let it crash multiple times.
ID: 9487
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 9496 - Posted: 18 Feb 2005, 5:59:28 UTC
Last modified: 18 Feb 2005, 6:52:49 UTC

The freeze came as expected. I will now try to make it survive trickle 24 with the CLI, with nothing heavy running on the second CPU. I made another backup at 32%, so "something" should happen quite soon.


Update: it did happen very soon, and this time I watched:

It does that cursor flashing stuff and seems to write to the HD, then it freezes the whole system without any output to the console screen where the CLI is running. Reset required, no keyboard input accepted anymore, mouse frozen too.


Any idea how I can track down that bug? Some flag that makes the CPDN client be a bit more talkative?

Same as usual: 4128_000209885.xml has 13322 null bytes in it, and 4128aa.pa.gmts.x1.nc and 4128aa.pa.rmts.x1.nc are NULLed too.


The previous crash of the same type is here:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=1845


I will do a windows update now and - as it is not my primary computer - I will take everything Microsoft wants me to take. Then I will give this nasty trickle 24 another chance.
ID: 9496
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 9497 - Posted: 18 Feb 2005, 7:21:08 UTC
Last modified: 18 Feb 2005, 7:35:16 UTC

OK, that didn't help, it froze again - but another hint that might help locate the problem:

While the cursor was flashing and it was writing to HD, the CPU load was between 18% - 22%.

Just 2 seconds before the freeze occurred, the cursor had stopped flashing and was still movable. The CPU load went back to 50% (i.e. one CPU at 100%, the other one doing nothing), just as if it were crunching again. No major changes in memory usage.


And another hint: the crash always seems to happen while writing 4128aa.pa.rmts.x1.nc; only the number of NULLs it puts into that file differs from crash to crash.
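A check like the one described in this post (how many NULs ended up in each file after a crash) could be scripted roughly as follows. This is only a sketch: the function names `count_nul_bytes` and `nul_report` are made up for illustration, and it simply reads each file whole, which assumes the files fit comfortably in memory.

```python
import sys
from pathlib import Path


def count_nul_bytes(path):
    """Return (nul_count, total_size) for the file at `path`."""
    data = Path(path).read_bytes()
    return data.count(b"\x00"), len(data)


def nul_report(paths):
    """Map each existing file's name to its (nul_count, total_size)."""
    report = {}
    for p in paths:
        p = Path(p)
        if p.is_file():  # silently skip files that are missing
            report[p.name] = count_nul_bytes(p)
    return report


if __name__ == "__main__":
    # Pass the suspect files on the command line after each crash.
    for name, (nuls, size) in nul_report(sys.argv[1:]).items():
        print(f"{name}: {nuls} NUL bytes out of {size}")
```

Running it on the same files after every crash would make it easy to see which file differs from run to run. Note that binary files such as .nc can legitimately contain zero bytes, so a count is only suspicious when it covers most of the file or when a text file like the XML has any at all.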
ID: 9497
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 9500 - Posted: 18 Feb 2005, 8:32:56 UTC
Last modified: 18 Feb 2005, 8:34:12 UTC

Hi Ananas
I don't know what PDB is (and Google didn't help). However, when the cursor stops flashing, cp spends about a minute doing some 'housekeeping', not sure what. If you use the GUI, the Work tab shows the same message in Status during this period as during the disk writes.

Has your hd been defragged recently?
Have you tried running Prime95 to check for computer stability?
There are 2 other programs that can be run, one of them to test memory. Information about them is on the board which is down, and I can't remember their names.

Backing up was a good move: some people have done this, and restarted a lot of times to nurse it through.
Not to be done with other dc projects, apparently, but that's another story.
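The backup-and-retry routine mentioned here could be scripted along these lines. It is a rough sketch, not an official procedure: `backup_dir` and `restore_dir` are hypothetical helper names, the paths are whatever you pass in, and it assumes the BOINC client has been shut down before either function is called so that no files are copied half-written.

```python
import shutil
import time
from pathlib import Path


def backup_dir(src, dest_root):
    """Copy the directory `src` into a new timestamped folder under `dest_root`."""
    src = Path(src)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"{src.name}-{stamp}"
    shutil.copytree(src, dest)  # fails if dest already exists, so backups are never overwritten
    return dest


def restore_dir(backup, target):
    """Replace `target` with a fresh copy of `backup`."""
    target = Path(target)
    if target.exists():
        shutil.rmtree(target)  # throw away the crashed state
    shutil.copytree(backup, target)
    return target
```

Keeping each backup in its own timestamped folder means a bad restore never destroys an older good copy.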

The bit about a certain file may be a clue that someone more expert could use to pinpoint the problem.
Not sure what else to suggest, except keep at it.

Les
ID: 9500
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 9502 - Posted: 18 Feb 2005, 9:14:44 UTC - in response to Message 9500.  

Hi,

PDB means Program Database, I think. VC++ creates those files to help Dr. Watson figure out the crash location. They seem to be Microsoft's replacement for the memory map (.MAP) and symbol table (.SYM) files that other linkers produce.

=> http://climateapps2.oucs.ox.ac.uk/cpdnboinc/debug.php (but that's outdated)

Defrag isn't necessary, it is NTFS and the 8GB partition currently has only BOINC on it - which I just copied back for the second crash test. So fragmentation should be about 0%. I scanned the HD for errors after every crash to make sure there isn't anything inconsistent.

I have used the computer in question for several DC things for about 2 years, not doing anything else with it. CPDN (BOINC) runs stable too, except for trickle 24. This error, exactly the same one, happened a few times already with different models.

As it is a dual Tualatin server board, there's no risk of overheating. The coolers even have fans, which would not be required even under high load.

As I knew that CPDN would crash, I detached the other BOINC projects so I can test it better.

Those NULLed files - especially the XML files - look very much like a pointer or array initialisation problem. Or it is some resource shortage that has not been caught, like too many memory handles used or too many open files.


If I cannot deliver this trickle 24 I have to give up running CPDN on this computer. It just makes no sense to calculate 23.99 trickles on a bunch of models, and I never got further than that. On my other dual CPU machine (Athlon), CPDN doesn't run very smoothly either, whereas I could finish a model easily on a single CPU P4/2600.

That reminds me to back up the model on the AthlonMP - it's at trickle 47 now *sigh*

Volker
ID: 9502
Gareth Lock
Joined: 2 Sep 04
Posts: 51
Credit: 451,236
RAC: 0
Message 9513 - Posted: 18 Feb 2005, 15:43:59 UTC

Sounds to me like it could be an underlying hardware fault. Have you tried doing a memory test on the machine in question? If you are using an NT-based machine (NT/2K/XP), see if you can get it to do a minidump. If the crash address is the same each time, then try swapping out the RAM and see if this cures the problem.

I belong to a club that reconditions old machines. I had wiped the drive of a machine and installed Win98 and the drivers successfully, but when I tried to install one of the standard packages that we ship with our machines, the system got halfway through the installation and froze. I did the usual stuff... cleaned the install CD, switched out the CD-ROM etc., with no luck. I was just about to give up on the thing as a bad lot when, out of curiosity, I used our diagnostic floppy to do a complete memory scan and found a number of stuck bits on one of the RAM DIMMs. I wasn't paying much attention to the crash messages from the installer at the time, but thinking about it, the address and value the installer reported as an error were identical to the report I received from the memory checker.

If the RAM passes, then all I can think of is a heat problem. Maybe with the RAM rather than the CPU.

Hope all this helps.

ID: 9513
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 9522 - Posted: 18 Feb 2005, 19:58:01 UTC - in response to Message 9513.  
Last modified: 18 Feb 2005, 20:02:41 UTC

This test stuff is quite time consuming but finally I have a few results :

Memtest86 1.51: 2 passes without errors (after 3.5 hours) - but I guess that with registered ECC RAM the board would have reported errors anyway, and the log was empty

Temperature : 40° and 42° (Celsius) from BIOS, 44° and 46° while running two SETI Classic tasks for 20 minutes (MBM) - board alarm temperature is set to 70°
____

Then I tried to trick the bug by running the remaining 5 minutes of trickle 24 from a different machine, but it seems BOINC either doesn't like networks much or the CPU change confused it :-/
____

Now running prime95 ... will keep this entry updated with the results in a few hours ...
ID: 9522
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 9525 - Posted: 18 Feb 2005, 20:49:27 UTC

Ananas, you said 'networks'. Are you using a network drive for the data?
Several people using Linux had problems with this, and Carl said that network drives weren't fast enough.
(Paraphrasing a bit.) Their problems were the same as yours: crashing at phase change.
This is when a LOT of disk activity happens, and any slight interruption during this time will cause a crash.

Les
ID: 9525
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 9528 - Posted: 18 Feb 2005, 21:13:47 UTC - in response to Message 9525.  
Last modified: 19 Feb 2005, 0:49:53 UTC

Prime95 self test passed, no trouble - I will let it run for a few more hours (done - still no problems).

Well, it seems that it isn't my PC and as everything else likes the hardware it would really have surprised me. Server components should be quite reliable.

Well, 3D stuff doesn't like the hardware: it has a quality video card but no 3D accelerator - not needed on a crunch-only computer ;-)


The BOINC drive on the P3 is a local drive, a partition on an 80GB/7500rpm HD - I just thought I could jump over the critical trickle by running it on a mounted drive from a different PC. I knew that it would probably cause trouble but it was worth a try, would have saved me the copying part.

The crash happened when the HD activity was over anyway.

It didn't freeze in "remote mode" but went to 0% progress. I will try to copy the whole thing to the other PC, do the trickle24 there and then go back to the P3

As long as it does not report the damage to the CPDN server, I still see a chance of repairing it somehow.
ID: 9528
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 11092 - Posted: 18 Mar 2005, 20:08:13 UTC - in response to Message 9528.  

> ...
> As long as it does not report the damage to the CPDN server, I still see a
> chance of repairing it somehow.


Eureka - good news from me this time !!!

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/trickle.php?resultid=510945

Trickle 24 has been finished and sent back from a different machine !!! So it wasn't lost :-)

I will leave the model there now, it seems to like 1-CPU machines much better.
ID: 11092
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 11094 - Posted: 18 Mar 2005, 21:10:16 UTC

Ananas
Both your 1.266G Intel, and your AMD seem to be giving rather high integer speeds for 2 processors.
Are you overclocking them? Not a good idea with CPDN. It causes calculation errors.

Les
ID: 11094
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 11104 - Posted: 19 Mar 2005, 1:18:28 UTC - in response to Message 11094.  
Last modified: 19 Mar 2005, 1:30:53 UTC

> Ananas
> Both your 1.266G Intel, and your AMD seem to be giving rather high integer
> speeds for 2 processors.
> Are you overclocking them? Not a good idea with CPDN. It causes calculation
> errors.
>
> Les


Both have boards without any OC capability, Tyan Tiger 2466 and Supermicro P3TDER.

The benchmark results are very jumpy on the Tyan, they go up a lot when some 3D chat or file compression runs parallel to the benchmark.

The Tualatins have quite good values, which might come from the 512k cache. Currently the values are way too high though, as the Banias 1600 connected to the server when client_state still had the P3 information in it. The correct values are here:

http://einstein.phys.uwm.edu/show_host_detail.php?hostid=81559

http://setiweb.ssl.berkeley.edu/show_host_detail.php?hostid=654127

The Banias is the PC where the model will live now. I retried trickle 24 at least 5 times on the P3 and it froze every time, always with nulls in some files where data should be.


Both are prime stable, I just tested them lately. Einstein and S@H are happy with them too as I always received credits for the results.


There is no(!) problem with the CPDN calculation anyway, it must be something like a pointer error or array overflow. The crash happens when it isn't doing any major math calculation at all but while filling the files for trickle 24.

The curves look OK to me on the trickle 24 graphs :

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=510945

and 23.95 trickles have been crunched on the P3
ID: 11104
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 11107 - Posted: 19 Mar 2005, 4:24:14 UTC

Well, that's computers for you. Some are just plain difficult. As long as you get your model finished OK.

Les
ID: 11107


©2024 cpdn.org