All errors, and since May, nothing but errors are accumulating
Author | Message |
---|---|
Joined: 19 Mar 05 Posts: 8 Credit: 644,589 RAC: 408 |
It seems nothing good is happening: all my results since May have earned 0 credit, and all of them show this error: Result ID 640114; Name 12xw_000070997_0; Workunit 426951; Created 19 Mar 2005 16:17:19 UTC; Sent 19 Mar 2005 19:08:44 UTC; Received 28 May 2005 7:27:59 UTC; Server state Over; Outcome Client error; Client state Computing; Exit status -5 (0xfffffffb); Host ID 135493; Report deadline 2 Mar 2006 0:28:44 UTC; CPU time 1949466.45; stderr out 4.25 - exit code -5 (0xfffffffb); Granted credit 1701.32; Client version ---; Trickle # 18 |
Joined: 17 Aug 04 Posts: 753 Credit: 9,804,700 RAC: 0 |
If your WU crashed midway and all further ones fail to get anywhere, then it looks as though something happened to your machine around 27 May. The trouble with -5 errors is that -5 is a general error code, so the cause could be a software conflict or a hardware fault. The best place to look for advice is probably <a href="http://www.climateprediction.net/board/index.php?c=1">here</a>, especially in 'other problems' and 'hardware related', and also <a href="http://www.climateprediction.net/board/viewforum.php?f=4">here</a> for the thread on compatible software. |
Joined: 19 Mar 05 Posts: 8 Credit: 644,589 RAC: 408 |
> "If your WU crashed mid way and then all further ones fail to get anywhere, then it looks as though something happened to your machine around 27 May. The trouble with -5 errors is that it is a general error code, so it could be a software conflict or hardware. The best place to look for advice is probably <a href="http://www.climateprediction.net/board/index.php?c=1">here</a>, especially in 'other problems' and 'hardware related', and also <a href="http://www.climateprediction.net/board/viewforum.php?f=4">here</a> for the thread on compatible software." Well, somewhere back in March I went up from ME to XP, which I love except for the things (old peripherals and 16-bit apps) it can't run well enough. I think my platform is stable: it runs SETI and Protein under BOINC fine, and ran SETI the old way for years. I certainly stress test it enough, being a confirmed power user. The lack of problems elsewhere and the lack of a good diagnosis is troubling; after all, this is just a program that does some FP arithmetic and some net IO. If it gets a bad value at a checkpoint, it should send in the traces and move on, not assume hardware error. Even if there is a hardware error, the huge pool of machines should be doing redundant calculations for verification, and if two hosts fail a unit, maybe the unit shows a flaw in the underlying program. Of course, it'd be nice to tell a user if his box fails n of m units that all processed correctly on 2 other hosts. Maybe first, you should be checking to see if you have a pattern of failing on one flavor of FP CPU. Maybe this is a BOINC shortfall. Certainly, if all these CPUs were in a room at IBM or INTEL or Sun or HPO, and some started spitting out negatives, they'd figure out whether it was hardware or software. If we can't, it says the BOINC thing is not there yet! |
Joined: 10 Oct 04 Posts: 223 Credit: 4,664 RAC: 0 |
Hi David, my commiserations. Have a look at the discussion on the other CPDN BOINC message board, in the number crunching section: 'Huh?', started by Ilyanep. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
David, the programs used by the other projects are just toys compared with hadsm, so there is no point in comparing CPDN with them. Also, the -5 error covers several problems not covered more exactly by other error codes. And if the program encounters a problem, it rewinds a day and tries again, then a month, and then a year. Considering that hadsm is a million+ lines of Fortran written to run on 64-bit supercomputers, getting it to work on desktop machines is a real feat. The Met Office computers are described <a href="http://www.meto.gov.uk/research/nwp/numerical/computers/index.html">here</a>. Sigh. > "the lack of a good diagnosis is troubling" Andrew pointed you to a forum containing the pages about diagnostic programs. If you need a more exact link, try <a href="http://www.climateprediction.net/board/viewtopic.php?t=2126">this</a>. |
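The rewind behaviour described in the post above (back a day, then a month, then a year) can be pictured with a minimal sketch. This is not the CPDN or hadsm code; the function names, the 30- and 360-day lengths, and the choice to count attempts over the whole run are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Rewind distances in model days, tried in order: a day, a month, a year.
# The 30- and 360-day lengths are assumptions, not values from the project.
REWIND_DAYS = (1, 30, 360)


def restart_day(failed_day: int, attempt: int) -> Optional[int]:
    """Return the model day to restart from, or None once all rewinds are used."""
    if attempt >= len(REWIND_DAYS):
        return None
    return max(0, failed_day - REWIND_DAYS[attempt])


def run_model(total_days: int, step: Callable[[int], bool]) -> bool:
    """Advance day by day; on a failed step, rewind by escalating distances.

    `step(day)` is a hypothetical callback that returns True when the day
    computed cleanly. Attempts are counted over the whole run (an assumption).
    """
    day = 0
    attempt = 0
    while day < total_days:
        if step(day):
            day += 1
            continue
        restart = restart_day(day, attempt)
        if restart is None:
            return False  # every rewind has been tried: give up on this run
        day = restart
        attempt += 1
    return True
```

A step that fails once on a given day and then succeeds lets the run finish; a step that fails every time on the same day exhausts the three rewinds and the run is abandoned, which is roughly the point at which a generic exit code would be reported.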
Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275 |
> "... move on, not assume hardware error. Even if there is a hardware error, the huge pool of machines should be doing redundant calculations for verification, and if two hosts fail a unit, maybe the unit shows a flaw in the underlying program. Of course, it'd be nice to tell a user if his box failes n of m units that all processed correctly on 2 other hosts. Maybe first, you should ..." There are redundant calculations on some work units, but given the length of the model runs and the many runs needed, they can't send each model parameter set to 4 different computers to validate individual results. |
Joined: 19 Mar 05 Posts: 8 Credit: 644,589 RAC: 408 |
> "David The programs used by the other projects are just toys compared with hadsm, so there is no point in comparing CPDN with them. Also, the -5 error covers several problems not covered more exactly by other error codes. And if the program encounters a problem, it rewinds a day and tries again, then a month, and then a year. Considering that hadsm is a million+ lines of fortran written to run on 64bit supercomputers, getting it to work on desktop machines is a real feat. Met office computers <a href="http://www.meto.gov.uk/research/nwp/numerical/computers/index.html">here</a>. Sigh. Andrew pointed you to a forum containing the pages about diagnostic programs. If you need a more exact link, try <a href="http://www.climateprediction.net/board/viewtopic.php?t=2126">this</a>." Aside from the lack of parallel processing units, the Athlon's arithmetic abilities are about the same as those of any other 64-bit computer, super or otherwise, especially if you are talking about being predictable enough to program for robust calculation. Now, with a very big problem involving prediction, maybe negative pressure is a real possibility: rare, but real. I do not buy that there is an undiscovered lack of predictability in the Athlon FP results. I do not buy that getting closer to the problem will improve my perspective. I started in computers in the 60s, got close enough to fix them at the gate-circuit component level, got lots of work from people who were afraid of the mantissa, exponent, justify and normalize, and now I am far enough from this sort of problem to have perspective. My take is that the Internet computing model, as generalized by BOINC, is much like RAID5: lots of unreliable but redundant computers.
With all the wonder-stuff running around on gossamer inside these chips, never mind the hard life some systems have had (I am deep into my second power supply, and who knows what the last one did in its death throes), who should be surprised if FP units are prone to the occasional bad result? The computational model should be reasonably robust against that, or it can never reliably do any relatively large computation. I am just a volunteer host, and in that role, the only message this model should send me is that N of M of my calculations have been contradicted by other units and verified by a third to be wrong, so I probably have a flaky CPU. All I am getting is hearsay, rumor, and condolences from other lepers. Now, I said a third, I did not say four, so when that is said, that is called hyperbole, an appeal to emotion like "straw man" and "you're another," used when one has no real logical argument. They have courses on this in college, too, called rhetoric, I think. I just read the book. Please eschew such. I know nobody likes to go "back to the drawing board," but maybe the Athlon intolerance is an indication of an error in one of those so many lines of code: some bit done in the wrong order in some critical cases, or with too little precision. (Maybe it is just a way of beating up on friends of the cheaper underdog? Your Bentley is no good, you should have sprung for the Rolls!) Hopefully, you have read up on Horner's method (one of my favorites) and similar ways to make calculations robust, portable and fast. I am all too aware of the dangers of combining delicate computations, never mind taking these results and extrapolating from them over and over. Certainly, something as massive and delicate as climate prediction would be very sensitive to the accumulation of error. To paraphrase one researcher, climate prediction can be thrown out of whack by a bonfire at a beach party. Well, that is as close to Johnny Storm as I want to get! |
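For readers who have not met Horner's method, mentioned in the post above, here is a minimal sketch. It rewrites a polynomial so that each coefficient costs one multiply and one add, with no explicit powers of x. The function name and the coefficient ordering (highest degree first) are choices made for this illustration; nothing here is taken from the hadsm source.

```python
from typing import Sequence


def horner(coeffs: Sequence[float], x: float) -> float:
    """Evaluate a polynomial at x using Horner's method.

    `coeffs` runs from the highest-degree coefficient down to the constant
    term, so [2, -6, 2, -1] means 2x^3 - 6x^2 + 2x - 1.
    """
    result = 0.0
    for c in coeffs:
        # Fold in one coefficient per step: ((a*x + b)*x + c)*x + d ...
        result = result * x + c
    return result


# 2*27 - 6*9 + 2*3 - 1 = 5
print(horner([2, -6, 2, -1], 3))
```

Because the operations are performed in one fixed order, the same inputs go through the same sequence of roundings each time, which is part of why the scheme is attractive for portable numerical code.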
Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275 |
> "Now, I said a third, I did not say four, so when that is said, that is called hyperbole, an appeal to emotion like "straw man" and "you're another," used when one has no real logical argument. They have courses on this in college, too, called rhetoric, I think. I just read the book. Please eschew such." Huh? No hyperbole was involved. I was referring to what the CPDN folks did when sending out identical WUs to multiple hosts: in the validation WUs, each was sent out to 4 hosts. Since only 1 of 7 or 8 WUs sent out is completed, they sent out many thousands of validation WUs, each WU to 4 hosts, in the hope of getting 2 or 3 of the 4 back from some of those for comparison. There is no reason you should have known that, since you've only been in the project since March, so I should have explained the "4" further in my post. |
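The arithmetic behind "many thousands" can be sketched with a binomial model. The assumption, which is not stated in the post, is that each of the 4 hosts completes its copy independently with probability 1/7 or 1/8; the function name is made up for this example.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent copies come back,
    when each copy is completed with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance that a validation WU sent to 4 hosts yields at least 2 results
# to compare, for the two completion rates mentioned in the post.
for rate in (1 / 7, 1 / 8):
    print(f"completion rate {rate:.3f}: P(>=2 of 4) = {prob_at_least(2, 4, rate):.3f}")
```

Under those assumptions only about 8 to 10 percent of validation WUs return a comparable pair, so getting a useful number of comparisons does take thousands of WUs sent to 4 hosts each.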