Message boards : Number crunching : New work Discussion
Author | Message |
---|---|
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
:) Just about to say that I have 7 that have just gone past 1 hour. 2 of these have failed before, both on AMD computers. These are series s8, the other 5 are series sa. So, luck of the draw. |
Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721 |
A little digging found that, of a little over 30 failures, just over half were on AMD. If the sample size is big enough to be significant, that would imply to me a higher failure rate for AMD. Edit: Got bored after finding that many, now getting back to work. |
Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
I have a Batch 742 model that's just coming up to 3 hours of processing. Early days, I know, but well past the initial fail zone. |
Joined: 9 Sep 04 Posts: 228 Credit: 30,748,306 RAC: 3,824 |
I have 12 workunits. All processing fine without any problems. Keep cooling down your hardware (35° celsius in Germany) and have a nice day, Bonsai911 |
Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721 |
35° celsius in Germany I am only running three out of four cores on my laptop as it has been so warm here in UK. I wouldn't be at all surprised if high temperatures in Europe at least are contributing to the current high failure rate, though I haven't seen suggestions that it is widespread apart from the latest batch. |
Joined: 1 Apr 12 Posts: 3 Credit: 15,024,721 RAC: 9,374 |
I have about 17 or 18 failures in a row. All are 742. They fail in less than 3 minutes. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Sardis73 All of yours that I looked at had: Suspended CPDN Monitor - Suspend request from BOINC..., which usually indicates that you're using the default setting for the option Suspend when non-BOINC CPU usage is above. It may be that these models are more sensitive than most to being interrupted at a crucial moment that's about 3 minutes into the running. Try setting it to 100% to turn it off. |
Joined: 22 Feb 06 Posts: 491 Credit: 30,967,615 RAC: 14,422 |
8 failed 5 mins after starting, all segment violation. |
Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721 |
Another small batch, 743, along the lines of 740, has been released. Just 120 work units. Not on the front page yet. |
Joined: 16 Jun 05 Posts: 16 Credit: 19,441,595 RAC: 9,251 |
Sardis73 I had 2 of the sam25 generate compute errors (Signal 11 received: Segment violation). It is easy to ignore a "sensitive" model when it generates an error and aborts. How do you detect when the "sensitive" model just gets the wrong answer and does NOT abort? Seems like something important to isolate. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Only the researchers can do that when they run various programs against the data received. With climate modeling, answers aren't just right or wrong. There's a wide range of possible answers. Which is what makes this project such a tricky little beast. **************** On reflection Perhaps I used words that have a different "obvious meaning" to others. So let's try again. Most of the failures are at about 3 minutes. (I've seen a few that were further along.) For the "average" computer speed, with the user doing something as well, and with BOINC set to start and stop the program frequently, perhaps the program is at a critical point at about 3 minutes (saving data, swapping data across cell boundaries, etc.), and just then BOINC says STOP. And then when the program is allowed to restart, data "or something else" is missing/corrupted/whatever, and the program goes to the next step in the current if/then/else decision statement and aborts. All of mine are running OK, so I don't need to worry/think about all of those that are failing. That's the researchers' job. |
Joined: 22 Feb 06 Posts: 491 Credit: 30,967,615 RAC: 14,422 |
Another 7 from batch 742 failed this morning with segmentation error. Unfortunately this exceeds my daily task quota so the computer concerned won't be allowed to get more tasks until tomorrow. Also had one fail on my other computer, again after about 4 mins. |
Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721 |
And 744 has been released now. Produced from sam25t 10 year restarts. Too early to say whether these will have the same high failure rate or not. |
Joined: 15 Feb 06 Posts: 137 Credit: 35,283,353 RAC: 13,234 |
Just to add some perspective. My Win 10 i-7 Intel computer is running 4 from the 742 batch. To check on the reported problems, having let them run for an hour, I deliberately suspended them and then resumed them. They did not fail. One has reached 1 day and 3 hours, the other 3 are at 9 hours, all with no problems. |
Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721 |
Perhaps not surprisingly, it does seem that batch 744 is also suffering from the higher than normal failure rate. I have looked at a few computers that have failed all of the tasks from 742 and 744 they have received, and there is no obvious common factor in either these or those that are successful. Unless someone with more patience than I have, and the skills to extract the data automagically, looks at a larger dataset than I have done manually, it is not going to be easy to work this one out. |
Joined: 16 Jun 05 Posts: 16 Credit: 19,441,595 RAC: 9,251 |
Only the researchers can do that when they run various programs against the data received. Thanks for the reply. It just seemed like a program bug where the results COULD possibly be from crunching some garbage. If it is a rogue pointer that points into random program code/data (does not SEGVIO), rather than the "SEGVIO" abort ... then it seems like the computed results will not be what the researcher wants. |
Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721 |
I don't pretend to understand it but they use some complex statistical package to analyse the results and decide if any need to be discarded. The program running the tasks will eliminate some, e.g. if an impossible climate is produced. At one time a fairly common example of this was -ve theta indicating that there was a negative atmospheric pressure. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The climate models used here are from the UK Met Office, where they run on supercomputers, so it's unlikely that there's still any bugs after all of this time. But there's a large number of ancillary files, and some smaller programs to make them work with the main program. It seems that I may have both "guessed right, and guessed wrong" in what's happening. It is something that happens at about 3 minutes, but it's to do with switching from the global model to the regional model. All of which is up to the researchers that are putting various bits together, ready for people to download, to solve. We just need to crunch on and provide the data for them, both good and bad. Hopefully, some/lots of the crashes also produce the "out files", and these get uploaded. They contain info about where the programs were up to at the time they ended. And batch 742 has been paused while thinking is in progress. |
Joined: 15 Feb 06 Posts: 137 Credit: 35,283,353 RAC: 13,234 |
And batch 742 has been paused while thinking is in progress. |
Joined: 18 Jun 05 Posts: 24 Credit: 2,500,676 RAC: 0 |
The climate models used here are from the UK Met Office, where they run on supercomputers, so it's unlikely that there's still any bugs after all of this time. At the risk of being thought disagreeable, I respectfully disagree. This situation regarding numerous failing tasks is purely the result of inadequate; nay, POOR CPDN software design, aka bugs... perhaps an entire nest of them! The entire purpose of BOINC is to enable multiple projects to be run on individual PC's, not supercomputers. Dinking around with the global settings inherent in BOINC to PERHAPS stabilize one project - i.e., CPDN - at the risk of destabilizing other BOINC-related projects - i.e., SETI, LHC, Cosmology, Milky Way, etc, etc, etc - is NOT a solution and is in fact foolhardy. The tasks may or may not contain garbage data - if they do, then it is up to the programmers to determine what that bad data may contain and adjust the operating code to compensate, OR to adjust the code creating the tasks to edit the data more courageously. In any event, comparing the operating system and processing software that may be running on whatever mainframe CPDN uses to the myriad operating systems being used by BOINC volunteers in a vain hope to stabilize CPDN is just simply useless. To reiterate a comment I made recently on this subject in another thread, NO ONE really understands what the problem is, let alone what a solution may be. To mis-quote the Bard, The fault, dear Brutus, is not in our PC's, but in CPDN, for we are underlings. |
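
For anyone wanting to check the AMD question raised earlier in the thread on a larger sample, here is a minimal Python sketch. It is not anything the project uses. It assumes you have already collected one (vendor, failed) pair per task, by hand or by script; the record layout, the function names `failure_rates` and `binomial_tail`, the 16-of-31 reading of "a little over 30 failures, just over half AMD", and the 30% AMD host share are all assumptions made for illustration, not figures from CPDN.

```python
from collections import defaultdict
from math import comb


def failure_rates(records):
    """Return {vendor: (failed, total, rate)} from (vendor, failed) pairs."""
    counts = defaultdict(lambda: [0, 0])  # vendor -> [failed, total]
    for vendor, failed in records:
        counts[vendor][1] += 1
        if failed:
            counts[vendor][0] += 1
    return {v: (f, t, f / t) for v, (f, t) in counts.items()}


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), i.e. a one-sided exact test."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical hand-collected records, NOT real CPDN results.
    sample = [
        ("AMD", True), ("AMD", True), ("AMD", False),
        ("Intel", True), ("Intel", False), ("Intel", False),
    ]
    for vendor, (failed, total, rate) in sorted(failure_rates(sample).items()):
        print(f"{vendor}: {failed}/{total} failed ({rate:.0%})")

    # Placeholder numbers: 16 AMD hosts among 31 failures, with an ASSUMED
    # 30% AMD share among all hosts that received the batch.
    print(f"P(>=16 of 31 on AMD by chance) = {binomial_tail(16, 31, 0.30):.4f}")
```

The tail probability only means something if the assumed AMD share matches the hosts that actually received the batch, which is the figure nobody in the thread has. A small value would support a higher AMD failure rate; it would not say why.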
©2024 cpdn.org