Thread 'exit code -5 (0xfffffffb)'

old_user12929 Send message Joined: 5 Sep 04 Posts: 8 Credit: 374,329 RAC: 0	Message 5478 - Posted: 19 Oct 2004, 11:26:00 UTC After 956 hrs. CPU time I get this:
2004-10-19 02:13:40 - Unrecoverable error for result 1rn7_000103326_0 ( - exit code -5 (0xfffffffb))
2004-10-19 02:13:40 - Deferring communication with project for 1 minutes and 0 seconds
2004-10-19 02:13:40 - Computation for result 1rn7_000103326 finished
2004-10-19 02:13:43 - Started upload of 1rn7_000103326_0_1.zip
(all zip files uploaded)
New model downloaded and started.
AMD Athlon MP-1800+ (dual) (no overclock)
512M RAM
No screensaver used.
Boinc Ver. 4.09
2 models running 100%
Win 2000 Pro
Any clue as to why? ID: 5478 · Reply Quote

crandles Volunteer moderator Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0	Message 5487 - Posted: 19 Oct 2004, 17:49:34 UTC Last modified: 19 Oct 2004, 17:51:52 UTC Thyme Lawn is frequently making posts like this: Exit code -5 is a catch all error code for computation errors. CPDN stresses your hardware more than anything else that you're likely to run on your system, and the most frequent causes of these errors are overclocking, overheating and flakey hardware. You might like to check out UK_Nick's hardware maintenance and hardware tests and checks stickies on the phpBB forums. http://www.climateprediction.net/board/viewtopic.php?t=2124 http://www.climateprediction.net/board/viewtopic.php?t=2126 I have seen that you wrote no overclock, but just because there is no overclock, this does not guarantee 100% stability. ID: 5487 · Reply Quote
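For readers wondering why the log shows both -5 and 0xfffffffb: the two are the same value, the hexadecimal form being the 32-bit two's-complement representation of the signed exit code. A minimal Python sketch of the conversion follows; the helper names are made up for illustration and are not part of BOINC or CPDN.

```python
def exit_code_to_hex(code: int) -> str:
    """Return the 32-bit two's-complement hex form of a signed exit code."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def hex_to_exit_code(text: str) -> int:
    """Convert a 32-bit hex string back to a signed integer."""
    value = int(text, 16)
    # If the top bit is set, the value represents a negative number.
    return value - (1 << 32) if value & 0x80000000 else value


print(exit_code_to_hex(-5))            # 0xfffffffb
print(hex_to_exit_code("0xfffffffb"))  # -5
```

This only explains the notation; it says nothing about what caused the error.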

adrianxw Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753	Message 5490 - Posted: 19 Oct 2004, 18:41:33 UTC - in response to Message 5487. Last modified: 19 Oct 2004, 18:51:56 UTC
> Thyme Lawn is frequently making posts like this:
>
> Exit code -5 is a catch all error code for computation errors. CPDN stresses your hardware more than anything else that you're likely to run on your system, and the most frequent causes of these errors are overclocking, overheating and flakey hardware.
>
> You might like to check out UK_Nick's hardware maintenance and hardware tests and checks stickies on the phpBB forums.
> http://www.climateprediction.net/board/viewtopic.php?t=2124
> http://www.climateprediction.net/board/viewtopic.php?t=2126
>
> I have seen that you wrote no overclock, but just because there is no overclock, this does not guarantee 100% stability.
I am running a non overclocked 2.53GHz P-IV. It ran my first model fine for about 200 hours before it crashed because of a file error - nothing to do with -5. The second model I got ran for 414 hours before crashing with a -5. Since then, I have been sent 7 models, all of them have failed with a -5, and all within 10 hours, sometimes a lot less than 10 hours.
During the same period, my machine has successfully crunched Seti@Home units without error, and LHC@Home units without error. Many of both in fact. These are also CPU intensive tasks. My own work on the system runs extremely CPU intense neural models - again without error.
I don't know if it is significant, but my BOINC software was upgraded to 4.13 a few days back, not sure exactly when.
It is always easy to blame people's hardware, but there could be more to it than that, remember. Why, if my h/w is flaky, was it not flaky before, and why is it flaky only for this project?
As advised in http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=1081 I have downloaded and am running the Super PI torture software. ID: 5490 · Reply Quote

Tony Wilson Send message Joined: 31 Aug 04 Posts: 5 Credit: 241,338 RAC: 0	Message 5504 - Posted: 20 Oct 2004, 6:52:00 UTC - in response to Message 5490. Last modified: 20 Oct 2004, 7:14:18 UTC I was also having this problem last week. As suggested in http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=1092, I have set the BOINC 4.13 preferences to leave the swapped out processes in memory. This might not be the cure but it has not crashed since. Tony. ID: 5504 · Reply Quote

adrianxw Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753	Message 5507 - Posted: 20 Oct 2004, 7:59:03 UTC Last modified: 20 Oct 2004, 9:24:00 UTC I have now run all versions of the Super PI program, results here...
+ 000h 00m 01s [ 16K]
+ 000h 00m 01s [ 32K]
+ 000h 00m 02s [ 64K]
+ 000h 00m 05s [ 128K]
+ 000h 00m 14s [ 256K]
+ 000h 00m 33s [ 512K]
+ 000h 01m 18s [ 1M]
+ 000h 03m 04s [ 2M]
+ 000h 07m 21s [ 4M]
+ 000h 16m 06s [ 8M]
+ 000h 37m 06s [ 16M]
+ 002h 53m 34s [ 32M]
... as you can see, my "flaky" hardware has no difficulty with BOINC's suggested stability test program. During the night, I have had another model from here go over with -5. At the same time it has happily crunched SETI units without fuss, (no work from LHC).
I think I have the faint aroma of software staff saying "it must be hardware" in my nostrils. As a professional software engineer, I learnt to spot that many years ago!!!
* EDIT * Noticed another oddity in the log...
climateprediction.net - 2004-10-19 23:12:11 - Result 2xpm_100158384_0 exited with zero status but no 'finished' file
climateprediction.net - 2004-10-19 23:12:11 - If this happens repeatedly you may need to reset the project.
climateprediction.net - 2004-10-19 23:12:13 - Restarting result 2xpm_100158384_0 using hadsm3 version 4.04
... don't recall seeing that before. ID: 5507 · Reply Quote

adrianxw Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753	Message 5511 - Posted: 20 Oct 2004, 13:09:13 UTC Okay, in this dialogue, I was setting my client prefs to "leave in memory"...
climateprediction.net - 2004-10-20 10:38:21 - Sending request to scheduler: http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
climateprediction.net - 2004-10-20 10:38:24 - Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
climateprediction.net - 2004-10-20 10:38:24 - General preferences have been updated
--- - 2004-10-20 10:38:24 - General prefs: from climateprediction.net (last modified 2004-10-20 10:37:16)
--- - 2004-10-20 10:38:24 - General prefs: using your defaults
... when the wu was next swapped, it shows...
climateprediction.net - 2004-10-20 11:31:19 - Pausing result 2xpm_100158384_0 (left in memory)
LHC@home - 2004-10-20 11:31:20 - Starting result v64lhc1000prothree11s8_1051.62_1_sixvf_18_3 using sixtrack version 4.46
... climateprediction staying in memory and an LHC wu starting. Later...
SETI@home - 2004-10-20 12:59:20 - Pausing result 26ap04aa.4387.5521.959636.44_4 (left in memory)
climateprediction.net - 2004-10-20 12:59:20 - Resuming result 2xpm_100158384_0 using hadsm3 version 4.04
climateprediction.net - 2004-10-20 13:12:17 - Unrecoverable error for result 2xpm_100158384_0 ( - exit code -5 (0xfffffffb))
... a Seti wu finishes, (left in memory), and the climateprediction wu resumes; shortly afterwards it fails with the -5 error.
Since I seem to be achieving nothing here at the moment, I have detached from the project. I will monitor the board and re-attach as soon as it is fixed, or if anyone wants me to try anything, feel free to contact me, I have subscribed to this thread so will get an e-mail. ID: 5511 · Reply Quote

old_user1 Send message Joined: 5 Aug 04 Posts: 907 Credit: 299,864 RAC: 0	Message 5512 - Posted: 20 Oct 2004, 13:39:03 UTC - in response to Message 5511. It could be a hard drive with errors: if one (out of 200) of the files that CPDN tries to open has an error, and an error again on a retry, then you will get a -5. I don't think there is any program out there (other than "scandisk" or "defrag") that will test your hard drive at the level that CPDN needs. ID: 5512 · Reply Quote

adrianxw Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753	Message 5513 - Posted: 20 Oct 2004, 16:07:51 UTC - in response to Message 5512. Last modified: 20 Oct 2004, 18:59:11 UTC
> It could be a hard drive with errors, if one (out of 200) files that CPDN tries to open has an error, and an error on a retry, than you will get a -5.
> I don't think there is any program out there (other than "scandisk" or "defrag") that will test your hard drive at the level that CPDN needs.
Hi Carl, I use Maxtor disks, and have a standalone bootable floppy with their own "PowerMax" diagnostics. Most disk manufacturers have similar tools. These are often more exhaustive than the OS tools. They are downloadable from their web sites. Suffice to say, my disks are fine.
My problems /seemed/ to start when I upgraded to the 4.13 client, although I can't be sure. I happened to notice my 400+ hour unit had crashed and I had a new one. It wasn't until yesterday that I looked at it and thought, "hmmm, that hasn't got very far". I was about to tweak the project settings to give CPDN more CPU when I saw all the failed units.
If the wu supply from SAH and LHC dries up, I'll reload 4.09 and see if it makes any difference.
PowerMax is here... http://www.maxtor.com/en/support/downloads/powermax.html ... works with Quantum drives as well. There is a separate version called SCSIMax if you have SCSI disks. ID: 5513 · Reply Quote

old_user156 Send message Joined: 5 Aug 04 Posts: 186 Credit: 1,612,182 RAC: 0	Message 5514 - Posted: 20 Oct 2004, 17:36:45 UTC Tracy has done two good runs and then dumped her third run halfway through with an 'error code -5'. I restarted that run (Result ID 276662) from my backup, twice, and it failed again with the same error code both times after running for some hours, even though I slowed her down some. The really weird thing is she didn't run the same length of time since the backup each time. :? The original run crashed 10.5 hours after the backup was made, second run 8.25 hours and third run 11.4 hours. Only thing I can think of that might do that is a dud sector on the hard disk that wasn't accessed at quite the same time each run - so I ran checkdisk with 'scan for and attempt recovery of bad sectors' checked but I didn't see any errors. :? ID: 5514 · Reply Quote

adrianxw Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753	Message 5515 - Posted: 20 Oct 2004, 19:01:06 UTC What client version are you using, Nick? ID: 5515 · Reply Quote

old_user156 Send message Joined: 5 Aug 04 Posts: 186 Credit: 1,612,182 RAC: 0	Message 5521 - Posted: 20 Oct 2004, 22:36:08 UTC BOINC v4.13, CP hadsm3 v4.04 ID: 5521 · Reply Quote

adrianxw Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753	Message 5529 - Posted: 21 Oct 2004, 7:49:16 UTC Okay, that is the same BOINC and CPDN client I have. I'm running XP SP2. On the BOINC board, they have suggested simply detaching and re-attaching with the same set-up rather than trying the 4.09 experiment. I'll try that first, but I really don't want to commit useless hours upon hours of CPU time just to generate error messages, (if I wanted that, I'd just leave my neural network trainer running 24/7!!!). ID: 5529 · Reply Quote

adrianxw Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753	Message 5540 - Posted: 21 Oct 2004, 19:02:16 UTC Okay, so I re-attached to the project and got a new wu, still running the 4.13 core as suggested. It ran for about 3 CPU hours then fell over with the -5 error. Before re-attaching, I ran another set of system diags, a full disk scan and a SiSoft Sandra to look for any oddities. The system was clean. I've detached again. ID: 5540 · Reply Quote

david gunnells Send message Joined: 1 Sep 04 Posts: 9 Credit: 549,543 RAC: 0	Message 5771 - Posted: 30 Oct 2004, 0:29:31 UTC I just saw this tonight: climateprediction.net - 2004-10-29 19:45:41 - Unrecoverable error for result 2tyf_100153469_0 ( - exit code -5 (0xfffffffb)) ~150 hours into it. Even though I don't recall seeing this error before, I'm going to run PowerMax on my Maxtor HD and Super PI and get back to this forum... david ID: 5771 · Reply Quote

old_user2101 Send message Joined: 27 Aug 04 Posts: 3 Credit: 118,160 RAC: 0	Message 7694 - Posted: 27 Jan 2005, 5:47:02 UTC I also had this problem. But since I set the write-to-disk interval in the global preferences to 50 seconds, I haven't had the error again. ID: 7694 · Reply Quote

old_user5480 Send message Joined: 31 Aug 04 Posts: 3 Credit: 318,314 RAC: 0	Message 8359 - Posted: 1 Feb 2005, 19:24:34 UTC - in response to Message 5487. I also got this problem. My machine is an Athlon (Barton core) at 2.5, with 1 GB RAM and an ASUS motherboard. I'm running Win XP SP2 and BOINC 4.19. Climateprediction crashed after around 190 hours of running the model. Here is my log:
climateprediction.net - 2005-02-01 20:53:14 - Deferring communication with project for 47 minutes and 10 seconds
climateprediction.net - 2005-02-01 21:30:01 - Unrecoverable error for result 3dzh_100179683_0 ( - exit code -5 (0xfffffffb))
climateprediction.net - 2005-02-01 21:30:01 - Deferring communication with project for 1 hours, 20 minutes, and 50 seconds
climateprediction.net - 2005-02-01 21:30:01 - Computation for result 3dzh_100179683 finished
climateprediction.net - 2005-02-01 21:30:01 - Started upload of 3dzh_100179683_0_1.zip
climateprediction.net - 2005-02-01 21:30:01 - Started upload of 3dzh_100179683_0_2.zip
climateprediction.net - 2005-02-01 21:30:07 - Finished upload of 3dzh_100179683_0_1.zip
climateprediction.net - 2005-02-01 21:30:07 - Throughput 256 bytes/sec
climateprediction.net - 2005-02-01 21:30:07 - Finished upload of 3dzh_100179683_0_2.zip
climateprediction.net - 2005-02-01 21:30:07 - Throughput 8050 bytes/sec
climateprediction.net - 2005-02-01 21:30:07 - Started upload of 3dzh_100179683_0_3.zip
climateprediction.net - 2005-02-01 21:30:07 - Started upload of 3dzh_100179683_0_4.zip
climateprediction.net - 2005-02-01 21:30:13 - Finished upload of 3dzh_100179683_0_3.zip
climateprediction.net - 2005-02-01 21:30:13 - Throughput 258 bytes/sec
climateprediction.net - 2005-02-01 21:30:13 - Finished upload of 3dzh_100179683_0_4.zip
climateprediction.net - 2005-02-01 21:30:13 - Throughput 258 bytes/sec
climateprediction.net - 2005-02-01 21:30:13 - Started upload of 3dzh_100179683_0_5.zip
climateprediction.net - 2005-02-01 21:30:19 - Finished upload of 3dzh_100179683_0_5.zip
climateprediction.net - 2005-02-01 21:30:19 - Throughput 8050 bytes/sec ID: 8359 · Reply Quote
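Several posts in this thread paste long message logs in which the one relevant line is easy to miss. As a rough sketch, assuming the message wording is exactly as quoted in the posts above, a short Python script can pull out every "Unrecoverable error" entry. The function name and the regular expression are illustrative only; they are not part of BOINC.

```python
import re

# Matches the message text seen in this thread, e.g.
# "2005-02-01 21:30:01 - Unrecoverable error for result 3dzh_100179683_0 ( - exit code -5 (0xfffffffb))"
ERROR_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) - "
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\( - exit code (?P<code>-?\d+) \((?P<hex>0x[0-9a-fA-F]+)\)\)"
)


def find_failures(log_text: str) -> list:
    """Return one dict per 'Unrecoverable error' entry found in the text."""
    failures = []
    for match in ERROR_RE.finditer(log_text):
        entry = match.groupdict()
        entry["code"] = int(entry["code"])  # signed exit code as an integer
        failures.append(entry)
    return failures


sample = (
    "climateprediction.net - 2005-02-01 21:30:01 - Unrecoverable error for "
    "result 3dzh_100179683_0 ( - exit code -5 (0xfffffffb))"
)
for failure in find_failures(sample):
    print(failure["date"], failure["time"], failure["result"], failure["code"])
```

Because the pattern is searched for anywhere in the text, it works whether the log is pasted one entry per line or run together on a single line.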

Friedrich S. Send message Joined: 22 Jan 05 Posts: 45 Credit: 4,685,931 RAC: 1,393	Message 9054 - Posted: 10 Feb 2005, 2:07:17 UTC Hello, I now can join the "Team of -5", too. On my Pentium 4, 2.8 GHz HT with BOINC 4.13 & CPDN I just lost a model just after the 6th trickle. Isn't there a way to deal with it by the way the files are written in CPDN? A strategy like:
1) Write data to temp file.
2) Verify.
3) Rename if verify successful, otherwise go back to 1).
4) Build model slowly in incremental files (e.g. every trickle).
And on load:
1) Load file.
2) Verify.
3) Reload if unsuccessful.
4) If still unsuccessful, load earlier time steps (of incremental files mentioned above) until you reach a stable point and restart from there.
That way you would lose a trickle rather than the whole model, and it would be easier to recover, e.g. by playing back an earlier backup.
Friedrich
I love CPDN! -- ID: 9054 · Reply Quote
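The strategy proposed in the post above (write to a temp file, verify, rename, and on load fall back to older incremental files) can be sketched in a few lines of Python. This is only an illustration of the idea, not how CPDN actually writes its files: the file naming, the SHA-256 header used for verification and the function names are all assumptions made up for this example.

```python
import hashlib
import os
from pathlib import Path
from typing import Optional, Tuple


def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def save_checkpoint(directory: Path, step: int, data: bytes, retries: int = 3) -> Path:
    """Write to a temp file, verify by reading it back, then rename into place."""
    final = directory / f"checkpoint_{step:06d}.bin"
    temp = final.with_suffix(".tmp")
    blob = _digest(data).encode("ascii") + b"\n" + data
    for _ in range(retries):
        with open(temp, "wb") as handle:
            handle.write(blob)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the data to disk
        if temp.read_bytes() == blob:  # verify step
            os.replace(temp, final)    # rename only after a good verify
            return final
    raise OSError(f"could not write a verified checkpoint for step {step}")


def _read_verified(path: Path) -> Optional[bytes]:
    """Return the payload if the stored digest matches, otherwise None."""
    try:
        header, _, payload = path.read_bytes().partition(b"\n")
    except OSError:
        return None
    return payload if header.decode("ascii", "replace") == _digest(payload) else None


def load_latest(directory: Path, retries: int = 2) -> Tuple[int, bytes]:
    """Load the newest checkpoint that verifies, falling back to older ones."""
    for path in sorted(directory.glob("checkpoint_*.bin"), reverse=True):
        for _ in range(retries):  # reload before giving up on this file
            payload = _read_verified(path)
            if payload is not None:
                return int(path.stem.split("_")[1]), payload
    raise FileNotFoundError("no checkpoint passed verification")
```

With one checkpoint file per trickle, a corrupted newest file costs only the work since the previous one, which is the trade-off the post argues for. A read-back straight after writing may be served from the operating system's cache, so it is a weaker check than it looks.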

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 9063 - Posted: 10 Feb 2005, 7:47:43 UTC The data gets written to the files lots of times per trickle; the default is every 60 seconds. And there ARE some built-in safeguards, but the programmers couldn't cover everything. I suspect that a lot of crashes, (of all types), are caused by a hardware hiccup. People running Linux, for instance, often get file errors because they use a network drive for data, and their system can't cope with the frequent data bursts. Also, -5 is a "catch all" error message, so it isn't necessarily a file write. Sometimes it's caused by a negative pressure in one of the cells. I had a -5 crash on my 1st model, and all the other 7 were successful. Plus 1 with which I had an accident and couldn't recover. (mumble, mutter). But there have been well over 23,0000 BOINC runs completed successfully. There are several threads about success rates on the phpBB, (which is down), and one of the admins said the ratio is about 1 in 7 successful, so don't get too discouraged. Les ID: 9063 · Reply Quote

adrianxw Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753	Message 9066 - Posted: 10 Feb 2005, 8:16:01 UTC Last modified: 10 Feb 2005, 8:22:44 UTC >>> the ratio is about 1 in 7 successful, so don't get too discouraged. I understand the sentiment, but would point out, that in my case at least, hundreds of hours have been consumed by CPDN models which have failed to finish. That same hundreds of hours could have been used for very many successful SETI, LHC and Predictor units. If CPDN is the "flaky" element, then it needs to resolve that to keep people on board. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. ID: 9066 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 9076 - Posted: 10 Feb 2005, 11:11:21 UTC adrianxw,
Personally, I think BOINC is the flakey part, especially when used to switch between multiple projects. I don't think Berkeley has gotten the switching part quite right, and there are also all the known bugs still to be fixed. They have come up with versions to fix problems with some of the other projects when used with CPDN, so maybe they need to look at CPDN's requirements. Something like: I'm going to switch now, but this is CPDN, so I need to wait until just after a save point, and THEN switch. Les ID: 9076 · Reply Quote
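The "wait until just after a save point, and THEN switch" idea in the post above can be illustrated with a toy Python model. Nothing here is BOINC code: the Task class, the step counter and the max_extra_steps limit (which stops a task that never saves from blocking the switch for ever) are all hypothetical.

```python
class Task:
    """Toy stand-in for a running model; not a BOINC class."""

    def __init__(self, name: str, checkpoint_every: int) -> None:
        self.name = name
        self.checkpoint_every = checkpoint_every
        self.step = 0
        self.last_checkpoint = 0

    def run_one_step(self) -> bool:
        """Advance one step; return True if a checkpoint was just written."""
        self.step += 1
        if self.step % self.checkpoint_every == 0:
            self.last_checkpoint = self.step
            return True
        return False


def run_until_safe_to_switch(task: Task, switch_requested_at: int, max_extra_steps: int) -> int:
    """Keep running after a switch request until the next checkpoint.

    Returns the step at which the task was paused. If no checkpoint arrives
    within max_extra_steps of the request, the task is paused anyway.
    """
    while True:
        just_saved = task.run_one_step()
        if task.step < switch_requested_at:
            continue  # no switch requested yet
        overdue = task.step - switch_requested_at >= max_extra_steps
        if just_saved or overdue:
            return task.step


task = Task("cpdn", checkpoint_every=10)
paused_at = run_until_safe_to_switch(task, switch_requested_at=13, max_extra_steps=100)
print(paused_at, task.last_checkpoint)  # 20 20
```

In this toy run the switch is requested at step 13 but the pause happens at step 20, immediately after a save, so no work done since the last checkpoint is left exposed while the task sits paused.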