Questions and Answers : Windows : exit code -5 (0xfffffffb)
Author | Message |
---|---|
Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753 |
I don't think that is necessarily true. I run 4 BOINC projects, and only have unexplained trouble with CPDN. An open architecture that accommodates many projects has to provide a standard container. Modifying the container to fit individual projects is the wrong way round. The BOINC API is published, and it should be up to the client applications to conform to it. That said, I agree there are still issues with BOINC, but I don't think this is one. I would also agree it would be nice if, when pre-emption was due, the client was informed and could checkpoint before being deleted (not as big an issue if the "leave in memory" option is enabled, of course). That is a wish for all projects, not just CPDN. I have a long wish list of BOINC improvements, but I think this thread is a CPDN issue. A start would be to differentiate between the various conditions that produce the catch-all -5 error message; then at least people would know what went wrong. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
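To illustrate the last point, here is a minimal sketch of how distinct failure conditions could be reported alongside the generic code. The reason names and their numeric values are hypothetical, invented for this example, and are not part of BOINC or CPDN. The only fact relied on is that -5 and 0xfffffffb are the same 32-bit value read as signed and unsigned.

```python
from enum import IntEnum


def as_unsigned32(code: int) -> int:
    """Return a signed 32-bit exit code as the unsigned value shown in hex."""
    return code & 0xFFFFFFFF


def as_signed32(value: int) -> int:
    """Inverse of as_unsigned32: recover the signed exit code."""
    return value - 0x100000000 if value >= 0x80000000 else value


class FailureReason(IntEnum):
    # Hypothetical sub-codes, invented for illustration only.
    FILE_LOCKED = 1
    FILE_WRITE_FAILED = 2
    MODEL_UNSTABLE = 3
    WORKER_ORPHANED = 4


def describe(exit_code: int, reason: FailureReason) -> str:
    """Keep the generic exit code but name the specific cause."""
    return f"exit code {exit_code} ({as_unsigned32(exit_code):#010x}): {reason.name}"


print(describe(-5, FailureReason.FILE_LOCKED))
```

With something like this, a locked file and an unstable model would no longer look identical in the results list.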
Send message Joined: 22 Jan 05 Posts: 45 Credit: 4,685,931 RAC: 1,393 |
Hello Les, > Also, -5 is a "catch all" error message, so it isn't necessarily a file > write. > Sometimes it's caused by a negative pressure in one of the cells. But is that not a big difference? A model that failed because of a hardware glitch or a file error could have been important and valid, whereas a negative pressure certainly shows the model is flawed. Do they check that and run models again if they failed because of hardware glitches etc.? > There are several threads about success rates on the phpBB, (which is down), > and one of the admins said the ratio > is about 1 in 7 successful, so don't get too discouraged. Well, I hope the scientists and their programmers know what they are doing. If they consider that acceptable, I can live with it. I am just surprised that it happens relatively often. And I am wondering whether there are strategies to compensate for the hardware-related failures. Friedrich I love CPDN! -- |
Send message Joined: 17 Aug 04 Posts: 753 Credit: 9,804,700 RAC: 0 |
The 1:7 ratio is a little misleading. Looking at the WUs that have been resent, which is possible when we are allocated one to do ourselves, it seems that most fail either immediately or within quite a short time. It is unlikely that these are the result of parameterisation, and the fact that they can generally be run by somebody else reinforces that. Clients have a download limit of 5 WUs a day. Somebody who is having trouble can easily reach this limit on successive days, perhaps without realising that there is a problem. In contrast, someone who runs CPDN successfully may go through one WU a month. Statistically, it would be possible to have a failure rate of 7 out of 8 measured on WU volume, and a success rate of 7 out of 8 measured by user. I'm not suggesting that the actual figures are that, as I haven't seen the data, but it is also true that those of us who run CPDN successfully tend to do so consistently, and where a WU does fail we know why. That is no consolation to those who are having trouble, but we should not run away with the idea that failure is the norm. If it were, few of us would bother. |
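The arithmetic behind that statement can be checked with a small sketch. The numbers are made up purely to reproduce the 7 out of 8 example and are not real project data: seven users each complete one WU in a month, while one user with a problem fails 49 WUs (about ten days at the 5 per day limit).

```python
from fractions import Fraction


def failure_rate_by_wu(successful_users: int, wus_per_successful_user: int,
                       failed_wus: int) -> Fraction:
    """Fraction of all work units that failed."""
    total = successful_users * wus_per_successful_user + failed_wus
    return Fraction(failed_wus, total)


def success_rate_by_user(successful_users: int, failing_users: int) -> Fraction:
    """Fraction of users whose models run successfully."""
    return Fraction(successful_users, successful_users + failing_users)


# Hypothetical month: 7 good users with 1 WU each, 1 troubled user with 49 failures.
print(failure_rate_by_wu(7, 1, 49))   # 7/8 of WUs fail
print(success_rate_by_user(7, 1))     # 7/8 of users succeed
```

So a per-WU figure and a per-user figure can both be correct while pointing in opposite directions.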
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Sorry Andrew, I was attempting to quote Carl from the phpBB forum. I may have the figures wrong. I still think that adrianxw's problem lies with the fact that he is running all the projects. But I don't know why it's a problem. I only run CP with BOINC V4.05, and don't have a problem. I'll stay out of this, I think. Les |
Send message Joined: 17 Aug 04 Posts: 753 Credit: 9,804,700 RAC: 0 |
> I was attempting to quote Carl from the phpBB forum. I may have the figures > wrong. No, I don't question that. But Carl was, I think, quoting figures for units, not users - my point is that his figure could be correct and this still be a minority experience for users. Maybe Carl wants to answer for himself ;) |
Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
> > > I was attempting to quote Carl from the phpBB forum. I may have the > figures > > wrong. > > No, I don't question that. But Carl was, I think, quoting figures for units, > not users - my point is that his figure could be correct and this still be a > minority experience for users. > > Maybe Carl wants to answer for himself ;) > Reducing the 7:1 ratio is easy: just change the maximum number of downloads per day down from 4 to 3 or 2, but this doesn't really make things any better. A much better measure is the percentage of model years that are in complete runs compared to the model years in all finished runs (including incomplete runs). This figure is something like 76% in completed runs, better than the 72.5% that classic manages. Unfortunately it is not easy to estimate or see which way, if at all, it is changing. 76% seems pretty good to me. Visit BOINC WIKI for help And join BOINC Synergy for all the news in one place. |
Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
I just caught a -5 error before it could happen. After closing (exiting) BOINC I checked the task manager and found one hadsm3um_4.04_windows_intelx86.exe still running. This is the best guarantee (I would give 100% on it!) for crashing a model with error -5 - the next BOINC restart would have destroyed the model as the still running program locked some files. Dual Athlon MP 2600+, Win2k SP4, one CPDN model and one Einstein WU running, BOINC 4.19 GUI I already lost several models from this nasty bug, it really needs to be fixed. |
Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
>I already lost several models from this nasty bug, it really needs to be fixed. The good news: Tolu has said he has 'resolved this for good'. The bad news: this was on the Sulphur cycle alpha test. Visit BOINC WIKI for help And join BOINC Synergy for all the news in one place. |
Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
> >I already lost several models from this nasty bug, it really needs to be > fixed. > > The good news Tolu has said he has 'resolved this for good'. The bad this was > on the Sulphur cycle alpha test. This is very good news. I had been getting quite frustrated, as nobody seemed to care about all my nagging about this problem. I can reproduce it anytime - the problem is more how not to reproduce it ;-) Now that I know it will be solved, I will be patient and not nag anymore :-) My idea for compound projects would have been to give control over the process IDs of secondary started programs back to BOINC, like a PID list that tells BOINC which additional tasks to kill on exit. Will it be something like this? |
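The PID list idea could look roughly like the sketch below. This is not how BOINC or CPDN works; the file format and the function names `register_pid`, `read_pids` and `kill_listed` are hypothetical. The application would record each secondary process it starts, and the client would read the list on exit and terminate whatever is still running. `os.kill` is the standard library call; the `kill` parameter exists only so the logic can be exercised without real processes.

```python
import json
import os
import signal
from pathlib import Path
from typing import Callable, List


def read_pids(pid_file: Path) -> List[int]:
    """Return the recorded process IDs, or an empty list if none were recorded."""
    if not pid_file.exists():
        return []
    return [int(p) for p in json.loads(pid_file.read_text())]


def register_pid(pid_file: Path, pid: int) -> None:
    """Called by the application whenever it spawns a secondary process."""
    pids = read_pids(pid_file)
    if pid not in pids:
        pids.append(pid)
    pid_file.write_text(json.dumps(pids))


def kill_listed(pid_file: Path,
                kill: Callable[[int, int], None] = os.kill) -> List[int]:
    """Called by the client on exit: terminate every listed process.

    Returns the PIDs that were signalled; processes that have already
    gone away are skipped. The list file is removed afterwards.
    """
    killed = []
    for pid in read_pids(pid_file):
        try:
            kill(pid, signal.SIGTERM)
            killed.append(pid)
        except OSError:
            pass  # process no longer exists or cannot be signalled
    pid_file.unlink(missing_ok=True)
    return killed
```

A real implementation would also have to guard against a PID being reused by an unrelated process between registration and exit.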
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Chris, Did Tolu say what the problem was? As a former programmer, I'm curious. Or is that nosy? :-) Les |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
> Did Tolu say what the problem was? As a former programer, I'm curious. Or is that nosy? :-) He didn't, but as another programmer I can imagine what it's likely to have been. The hadsm3_* process controls the complete model, spawning hadsm3um_* to do the work for each of the model phases. The controller can detect when the worker stops running relatively easily, but you need a custom mechanism for the worker to be able to detect an abnormal termination of the controller. The processes communicate using shared memory, and my guess is that Tolu fixed a problem with the 'are you there' handshake between the processes. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
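A minimal sketch of that kind of 'are you there' check, as an illustration of the general technique only and not the actual CPDN code: the controller periodically bumps a counter, and the worker treats the controller as dead if the counter has not moved within a timeout. The `Heartbeat` class here is an ordinary Python object standing in for a value kept in shared memory, and all names are hypothetical.

```python
import time
from typing import Callable


class Heartbeat:
    """Stand-in for a counter that would live in shared memory."""

    def __init__(self) -> None:
        self.counter = 0

    def beat(self) -> None:
        # Called periodically by the controller while it is healthy.
        self.counter += 1


class ControllerWatchdog:
    """Used by the worker to decide whether the controller is still there."""

    def __init__(self, heartbeat: Heartbeat, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._heartbeat = heartbeat
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen = heartbeat.counter
        self._last_change = clock()

    def controller_alive(self) -> bool:
        now = self._clock()
        if self._heartbeat.counter != self._last_seen:
            # The controller has beaten since the last check.
            self._last_seen = self._heartbeat.counter
            self._last_change = now
            return True
        # No new beat: alive only while still inside the timeout window.
        return (now - self._last_change) <= self._timeout_s
```

The worker would call `controller_alive()` in its main loop and shut itself down cleanly, releasing its files, once it returns `False`.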
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Thanks, Thyme Lawn. Les |
©2025 cpdn.org