All errors, and since May, nothing but errors are accumulating

Questions and Answers : Windows : All errors, and since May, nothing but errors are accumulating
David G. Pickett

Joined: 19 Mar 05
Posts: 8
Credit: 644,589
RAC: 408
Message 13330 - Posted: 11 Jun 2005, 11:06:49 UTC

It seems nothing good is happening: all my results since May have been 0 credit, and all of them show this error:

Result ID 640114
Name 12xw_000070997_0
Workunit 426951
Created 19 Mar 2005 16:17:19 UTC
Sent 19 Mar 2005 19:08:44 UTC
Received 28 May 2005 7:27:59 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -5 (0xfffffffb)
Host ID 135493
Report deadline 2 Mar 2006 0:28:44 UTC
CPU time 1949466.45
stderr out 4.25
- exit code -5 (0xfffffffb)



Granted credit 1701.32
Client version ---
Trickle # 18
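
For anyone puzzled by the two forms of the exit status above: -5 and 0xfffffffb are the same 32-bit value, printed once as a signed integer and once as unsigned hex. A minimal sketch of the conversion follows. The helper names are made up for illustration and are not part of BOINC.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed (two's complement) integer."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the value represents a negative number.
    return value - 0x100000000 if value & 0x80000000 else value


print(hex(to_unsigned32(-5)))   # 0xfffffffb
print(to_signed32(0xFFFFFFFB))  # -5
```

This only explains the notation; it says nothing about what caused the error.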

Andrew Hingston
Volunteer moderator

Joined: 17 Aug 04
Posts: 753
Credit: 9,804,700
RAC: 0
Message 13332 - Posted: 11 Jun 2005, 11:43:04 UTC
Last modified: 11 Jun 2005, 11:44:01 UTC

If your WU crashed midway and all further ones then fail to get anywhere, it looks as though something happened to your machine around 27 May.

The trouble with -5 errors is that -5 is a general error code, so the cause could be a software conflict or hardware. The best place to look for advice is probably <a href="http://www.climateprediction.net/board/index.php?c=1">here</a>, especially in 'other problems' and 'hardware related', and also <a href="http://www.climateprediction.net/board/viewforum.php?f=4">here</a> for the thread on compatible software.
David G. Pickett

Joined: 19 Mar 05
Posts: 8
Credit: 644,589
RAC: 408
Message 13345 - Posted: 12 Jun 2005, 1:15:46 UTC - in response to Message 13332.  

&gt; If your WU crashed mid way and then all further ones fail to get
&gt; anywhere, then it looks as though something happened to your machine around 27
&gt; May.
&gt;
&gt; The trouble with -5 errors is that it is a general error code, so it could be
&gt; a software conflict or hardware. The best place to look for advice is probably
&gt; <a href="http://www.climateprediction.net/board/index.php?c=1">here</a>,
&gt; especially in 'other problems' and 'hardware related', and also <a href="http://www.climateprediction.net/board/viewforum.php?f=4">here</a> for
&gt; the thread on compatible software.
&gt;
&gt;
Well, somewhere back in March I went up from ME to XP, which I love except for the things (old peripherals and 16-bit apps) it can't run well enough. I think my platform is stable: it runs SETI and Protein under BOINC fine, and ran SETI the old way for years. I certainly stress test it enough, being a confirmed power user.

The lack of problems elsewhere and the lack of a good diagnosis is troubling -- after all, this is just a program that does some FP arithmetic and some net IO. If it gets a bad value at a checkpoint, it should send in the traces and move on, not assume hardware error. Even if there is a hardware error, the huge pool of machines should be doing redundant calculations for verification, and if two hosts fail a unit, maybe the unit shows a flaw in the underlying program. Of course, it'd be nice to tell a user if his box fails n of m units that all processed correctly on 2 other hosts. Maybe first, you should be checking to see if you have a pattern of failing on one flavor of FP CPU. Maybe this is a BOINC shortfall. Certainly, if all these CPUs were in a room at IBM or Intel or Sun or HP, and some started spitting out negatives, they'd figure out whether it was hardware or software. If we can't, it says the BOINC thing is not there yet!
old_user23880
Volunteer tester

Joined: 10 Oct 04
Posts: 223
Credit: 4,664
RAC: 0
Message 13346 - Posted: 12 Jun 2005, 1:34:10 UTC

Hi David

My commiserations. Have a look at the discussion on the other CPDN BOINC message board, in the number crunching section: the thread "Huh?" started by Ilyanep.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 13352 - Posted: 12 Jun 2005, 3:47:40 UTC

David

The programs used by the other projects are just toys compared with hadsm, so there is no point in comparing CPDN with them.
Also, the -5 error covers several problems not covered more exactly by other error codes.
And if the program encounters a problem, it rewinds a day and tries again, then a month, and then a year.
Considering that hadsm is a million+ lines of Fortran written to run on 64-bit supercomputers, getting it to work on desktop machines is a real feat.
Met office computers <a href="http://www.meto.gov.uk/research/nwp/numerical/computers/index.html"> here.</a> Sigh.

&gt; the lack of a good diagnosis is troubling
Andrew pointed you to a forum containing the pages about diagnostic programs.
If you need a more exact link, try <a href="http://www.climateprediction.net/board/viewtopic.php?t=2126"> this.</a>


geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2184
Credit: 64,822,615
RAC: 5,275
Message 13355 - Posted: 12 Jun 2005, 5:14:36 UTC - in response to Message 13345.  

&gt; move on, not assume hardware error. Even if there is a hardware error, the
&gt; huge pool of machines should be doing redundant calculations for verification,
&gt; and if two hosts fail a unit, maybe the unit shows a flaw in the underlying
&gt; program. Of course, it'd be nice to tell a user if his box fails n of m
&gt; units that all processed correctly on 2 other hosts. Maybe first, you should


There are redundant calculations on some work units, but given the length of the model runs and the number of runs needed, they can't send each model parameter set to 4 different computers to validate individual results.
David G. Pickett

Joined: 19 Mar 05
Posts: 8
Credit: 644,589
RAC: 408
Message 13398 - Posted: 13 Jun 2005, 2:29:21 UTC - in response to Message 13352.  

&gt; David
&gt;
&gt; The programs used by the other projects are just toys compared with hadsm, so
&gt; there is no point in comparing CPDN with them.
&gt; Also, the -5 error covers several problems not covered more exactly by other
&gt; error codes.
&gt; And if the program encounters a problem, it rewinds a day and tries again,
&gt; then a month, and then a year.
&gt; Considering that hadsm is a million+ lines of fortran written to run on 64bit
&gt; supercomputers, getting it to work on desktop machines is a real feat.
&gt; Met office computers <a href="http://www.meto.gov.uk/research/nwp/numerical/computers/index.html">
&gt; here.</a> Sigh.
&gt;
&gt; &gt; the lack of a good diagnosis is troubling
&gt; Andrew pointed you to a forum containing the pages about diagnostic programs.
&gt; If you need a more exact link, try <a href="http://www.climateprediction.net/board/viewtopic.php?t=2126"> this.</a>
&gt;
&gt;
&gt;
&gt;

Aside from the lack of parallel processing units, the Athlon's arithmetic abilities are about the same as those of any other 64-bit computer, super or otherwise, especially if you are talking about being predictable enough to program for robust calculation. Now, with a very big problem involving prediction, maybe negative pressure is a real possibility, rare, but real. I do not buy that there is an undiscovered lack of predictability in the Athlon FP results. I do not buy that getting closer to the problem will improve my perspective. I started in computers in the 60's, got close enough to fix them at the gate circuit component level, got lots of work from people who were afraid of the mantissa, exponent, justify and normalize, and now I am far enough from this sort of problem to have perspective.

My take is that the Internet computing model, such as is generalized by BOINC, is much like RAID5 - lots of unreliable but redundant computers. With all the wonder-stuff running around on gossamer inside these chips, never mind the hard life some systems have had (I am deep into my second power supply, and who knows what the last one did in its death throes), who should be surprised if FP units are prone to the occasional bad result; the computational model should be reasonably robust against that, or it can never reliably do any relatively large computation.

I am just a volunteer host, and in that role, the only message this model should send me is that N of M of my calculations have been contradicted by other units and verified by a third to be wrong, so I probably have a flaky CPU. All I am getting is hearsay, rumor, and condolences from other lepers.
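
To make the "N of M" idea concrete, here is a minimal sketch of the kind of report I mean. It assumes every work unit is returned by several hosts and that results can be compared for exact equality; both are assumptions for illustration (a real climate model would presumably need a tolerance), and none of the names below come from BOINC or CPDN.

```python
from collections import Counter


def tally_contradictions(results, quorum=2):
    """Count, per host, how often its result disagreed with an agreed value.

    results: {workunit_id: {host_id: result_value}}
    Returns {host_id: (contradicted, compared)}, i.e. N of M per host.
    """
    tally = {}
    for by_host in results.values():
        ranked = Counter(by_host.values()).most_common(2)
        value, votes = ranked[0]
        # Skip units with no quorum or with a tie for the most common value.
        if votes < quorum or (len(ranked) > 1 and ranked[1][1] == votes):
            continue
        for host, result in by_host.items():
            bad, total = tally.get(host, (0, 0))
            tally[host] = (bad + (result != value), total + 1)
    return tally


example = {
    "wu1": {"A": 1.0, "B": 1.0, "C": 2.0},  # C contradicted by A and B
    "wu2": {"A": 3.0, "B": 3.0, "C": 3.0},  # all agree
    "wu3": {"A": 5.0, "C": 6.0},            # no quorum, ignored
}
print(tally_contradictions(example))
```

In this made-up example host C is contradicted on 1 of the 2 units that reached a quorum, which is the sort of message a volunteer could act on.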

Now, I said a third, I did not say four, so when that is said, that is called hyperbole, an appeal to emotion like "straw man" and "you're another," used when one has no real logical argument. They have courses on this in college, too, called rhetoric, I think. I just read the book. Please eschew such.

I know nobody likes to go "back to the drawing board," but maybe the Athlon intolerance is an indication of an error on one of those so many lines of code; some bit done in the wrong order in some critical cases or with too little precision. (Maybe it is just a way of beating up on friends of the cheaper underdog? Your Bentley is no good, you should have sprung for the Rolls!) Hopefully, you have read up on Horner's method (one of my favorites) and similar ways to make calculations robust, portable and fast. I am all too aware of the dangers of combining delicate computations, never mind taking these results and extrapolating from them over and over. Certainly, something as massive and delicate as climate prediction would be very sensitive to the accumulation of error. To paraphrase one researcher, climate prediction can be thrown out of whack by a bonfire at a beach party.
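
For those who have not met it, Horner's method evaluates a polynomial by nesting the multiplications, so a degree-n polynomial needs only n multiplies and n adds and no explicit powers. A minimal sketch, purely for illustration and with no connection to the hadsm code:

```python
def horner(coeffs, x):
    """Evaluate a polynomial at x using Horner's method.

    coeffs are ordered from the highest-degree term down to the constant.
    """
    result = 0.0
    for c in coeffs:
        # Fold in one coefficient per step: ((a_n*x + a_(n-1))*x + ...) + a_0
        result = result * x + c
    return result


# 2x^3 - 6x^2 + 2x - 1 evaluated at x = 3
print(horner([2, -6, 2, -1], 3))  # 5.0
```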

Well, that is as close to Johnny Storm as I want to get!
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2184
Credit: 64,822,615
RAC: 5,275
Message 13404 - Posted: 13 Jun 2005, 5:11:12 UTC - in response to Message 13398.  

&gt; Now, I said a third, I did not say four, so when that is said, that is called
&gt; hyperbole, an appeal to emotion like "straw man" and "you're another," used
&gt; when one has no real logical argument. They have courses on this in college,
&gt; too, called rhetoric, I think. I just read the book. Please eschew such.
&gt;
Huh? No hyperbole was involved. I was referring to what the CPDN folks did when sending out identical WUs to multiple hosts, i.e. in the validation WUs, each was sent out to 4 hosts. Since only 1 of every 7 or 8 WUs sent out is completed, they sent out many thousands of validation WUs, each WU to 4 hosts, in the hopes of getting 2 or 3 of the 4 back from some of those for comparison. There's no reason you should have known that, since you've only been in the project since March, so I should have explained the "4" further in my post.
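
A rough sketch of the arithmetic behind that, assuming each of the 4 hosts completes independently with the same probability of 1 in 7 or 1 in 8 (the independence is my simplification, not something the project has stated):

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability that at least k of n independent hosts complete,
    when each completes with probability p (binomial model)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


for rate in (1 / 7, 1 / 8):
    chance = prob_at_least(2, 4, rate)
    print(f"completion rate {rate:.3f}: P(at least 2 of 4 returned) = {chance:.3f}")
```

Under those assumptions only about 8 to 10 percent of validation WUs come back from 2 or more hosts, which is why so many thousands had to be sent out.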
