Thread 'Again a crash on a Dual P3/1266'

Author	Message
Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 8841 - Posted: 7 Feb 2005, 17:17:23 UTC Last modified: 7 Feb 2005, 17:21:28 UTC
... and again it happened when it tried to send trickle 24: a complete freeze of the computer, cold reboot required. The second model that was active was still OK after the restart.
The system has enough HD space and no RAM shortage (1 GB registered RAM), Win2k SP4, BOINC 4.19. No other programs were running except for the stuff Win2k needs and those two CPDN tasks under BOINC.
I saved everything before I restarted BOINC, as I knew the model would be damaged after the restart. I can send any file you need to track down the bug.
_________
The worst damage I can see is 2t1g_100152270.xml being filled with 13322 null bytes (character 0x00) and nothing else. That looks very much like a pointer error on the buffer when trying to write the file back.
_________
There are a few of these too:
CLOSE: WARNING: Unit 67 Not Opened
OPEN: File dataout/2t1gaa.ph26c10 Created on Unit 67
but I have seen them quite often in stderr_um.txt.
_________
http://oct31.de/tmp/BoincCrash.png is not from this crash. I had only CPDN running when it happened this time, no concurrent SETI task, but it was the same kind of error.
_________
This morning I also delivered a trickle 24 from a Dual MP2600+, running only one CPDN task. The computer acted really strangely, with the mouse cursor flashing between hourglass and arrow like crazy, and I somehow expected a crash, but this time (surprise, surprise) the model survived.
CPDN is my favourite BOINC project, but >80% crashes somehow take the fun out of it :-/
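A quick way to confirm the kind of damage described in this post (a file that contains nothing but 0x00 bytes) is to count the null bytes directly. The following is only a minimal Python sketch, not anything from BOINC or CPDN; the function name null_byte_report is made up for illustration, and the path is whatever backup copy you want to inspect.

```python
from pathlib import Path


def null_byte_report(path):
    """Return (size, null_count, all_null) for the file at `path`.

    Hypothetical helper for illustration only. `all_null` is True when
    the file is non-empty and every byte in it is 0x00.
    """
    data = Path(path).read_bytes()
    null_count = data.count(b"\x00")
    all_null = len(data) > 0 and null_count == len(data)
    return len(data), null_count, all_null
```

Run against a backup copy of 2t1g_100152270.xml, a file that really is 13322 null bytes and nothing else would give (13322, 13322, True). A healthy XML file should normally report a null count of 0.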

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 8842 - Posted: 7 Feb 2005, 17:39:15 UTC
.....mouse cursor flashing between hourglass and arrow like crazy.... at trickle 24.
This is the software doing LOTS of disk writes at the end of phase 1. It will do the same at trickle 48 (end of phase 2).
When I have been watching at this point, the writing has lasted several minutes, followed by a minute of nothing. If you look at the Work tab of BOINC at this time, you will see that the model in question has Paused, with another word after Paused (I have forgotten what). It will change back to Running after the period of no activity. I've no idea what it is doing during the "nothing" time.
If you are having difficulties getting a model to finish, you should give it a fair go and stop using the computer until cp shows Running again.
Les

Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 8847 - Posted: 7 Feb 2005, 18:23:13 UTC - in response to Message 8842. Last modified: 7 Feb 2005, 18:36:02 UTC
Well, I know where the mouse cursor thing comes from. I did stop the other program (a 3D chat) that I was using on this machine, but all my other computers run unattended and I built them only for my DC hobby (blush), so no other programs are running on those anyway.
Edit: what does an application that runs from the tray do with the mouse cursor, by the way?
The crashed trickle 24 models were on a machine that has nothing on it but the OS and BOINC. Maybe it's too many open files when 2 CPDN tasks are running, or something like that. On a single-CPU machine I had a full run, but I like my dual-CPU machines and I want CPDN to work reliably on them.
There is something wrong with BOINC's process handling for compound applications on Windows. The errors are easily reproducible: within a few seconds I can crash a model without doing anything weird. Now there are also those errors with the null bytes in the XML files, which seem to be reproducible too (it has now happened a second time), but of course it takes a long time to reproduce them.
Other projects run smoothly; only the project I gave the highest priority keeps crashing :´(
When SETI Classic is over I will have a few single-CPU machines for CPDN, but not using the project on the dual machines isn't such a good "bugfix".

Thyme Lawn Volunteer moderator Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0	Message 8858 - Posted: 7 Feb 2005, 20:17:11 UTC
Your model 2t1g_100152270 (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=455966) has been uploaded with a startup error caused by a missing file. My guess is that something went wrong during the processing at the end of phase 1 and the file projects/climateprediction.net/2t1g_100152270/datain/phase2.start didn't get created (or, less likely, one of the raw output files was missing). But that's based on a hunch rather than any knowledge ;)
As for the cursor switching between the normal and working state, that seems to be down to the hadsm3se program, which runs multiple times at the end of each phase. I ran Task Manager at maximum refresh during post-phase processing, and the cursor changes definitely seem to coincide with hadsm3se running. The disk access at phase end is when the output files are being changed from their raw processing format into the final .x1.nc files, and the "idle" periods happen when there are intensive post-phase calculations going on.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 8863 - Posted: 7 Feb 2005, 20:52:37 UTC - in response to Message 8858. Last modified: 7 Feb 2005, 21:42:22 UTC
> Your model 2t1g_100152270 (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=455966)
> has been uploaded with a startup error caused by a missing file.
Yes, I deleted all model files but have a backup. I deleted them only after the error occurred and it showed 0% and several years to completion. 2t1g_100152270.xml had already been damaged before I deleted them.
> My guess is that something went wrong during the processing at the end
> of phase 1 and the file
> projects/climateprediction.net/2t1g_100152270/datain/phase2.start
> didn't get created (or, less likely, one of the raw output files was missing).
> But that's based on a hunch rather than any knowledge ;)
phase2.start is here in my backup and it looks normal:
00: 00 80 FF FF 01 00 00 00 │ 01 00 00 00 00 00 00 00
10: 01 00 00 00 00 00 00 00 │ 01 00 00 00 02 00 00 00
20: 02 00 00 00 00 80 FF FF │ 00 80 FF FF 95 01 00 00
Three more files in datain have a timestamp really close to the crash:
heatflux.ph1
phase3.start
restart.end1 (6 minutes older)
Maybe this gives a hint: the last two files (by timestamp) that it accessed in dataout are nulled too, which does not seem correct. Those are 2t1gaa.pa.gmts.x1.nc and 2t1gaa.pa.rmts.x1.nc, whereas 2t1gaa.pa.8yac.x1.nc has contents in the first MB but its second MB is nulled too. The other .nc files do not have those big 0x00 parts near the end.
The stuff in the tmp directory and the other files in dataout do not look damaged, although that is hard to tell with binaries. The same goes for slots/?/init_data.xml, which has the crash timestamp too but isn't damaged.
One more: stderr_um.txt ends like this:
OPEN: File dataout/4128aa.da168p0 Created on Unit 22
OPEN: File dataout/4128aa.da168s0 Created on Unit 22
OPEN: File data
but maybe the PC froze before the stderr buffer could be completely flushed.
> ...
Oh well, I can live with cursor and HD activity ;-)
p.s.: I doubt that it is a calculation error, as it has now happened a second time on phase 2 startup, with a complete Windows freeze and a cold reboot required. It is more likely an uninitialized pointer, a lack of file or memory allocation handles because they do not get closed properly, or some array overflow.
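To check binaries such as the .x1.nc files for the "big 0x00 parts" mentioned in this post without eyeballing a hex dump, one could scan for long runs of null bytes and print where they start. This is a rough Python sketch under assumptions: find_null_runs and the 4096-byte threshold are invented for illustration, the whole file is read into memory (fine for files of a few MB), and legitimate data can contain zero runs as well, so a hit is only a hint and not proof of corruption.

```python
import re
from pathlib import Path


def find_null_runs(path, min_run=4096):
    """Return a list of (offset, length) for runs of 0x00 bytes.

    Hypothetical helper for illustration only. Only runs that are at
    least `min_run` bytes long are reported.
    """
    data = Path(path).read_bytes()
    # Bytes pattern: the byte 0x00 repeated min_run times or more.
    pattern = re.compile(rb"\x00{%d,}" % min_run)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]
```

For a file whose first MB has contents and whose second MB is nulled, the result would be expected to contain a run starting near offset 1048576. A file that is nulled completely shows up as a single run starting at offset 0 with a length equal to the file size.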

Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 22857 - Posted: 22 May 2006, 20:25:06 UTC Last modified: 22 May 2006, 20:26:03 UTC
I am aware that this thread is ancient, but it may still be interesting, as I just found this in the BOINC release notes:
Rom 9 Jan 2006 (HEAD)
- Tag for 5.3.9 release, all platforms
boinc_core_release_5_3_9
Bruce 9 Jan 2006
- Fixes to BOINC zip library from Carl Christensen. Carl says: "I found a problem with boinc_zip; it seems some Linux STL's aren't very nice about classes that are inherited from their objects on multiple use; or huge file lists that we use on CPDN. So I rewrite it to just use "straight" std::string's in a vector. It's fully backwardly compatible and seems to work fine."
Maybe a fix for this old problem? Some day I'll try, the machine isn't highspeed but it is reliable and has enough RAM :-)