Thread 'New work Discussion'

Author	Message
Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 58498 - Posted: 1 Aug 2018, 9:56:27 UTC :) Just about to say that I have 7 that have just gone past 1 hour. 2 of these have failed before, both on AMD computers. These are series s8, the other 5 are series sa. So, luck of the draw. ID: 58498 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721	Message 58499 - Posted: 1 Aug 2018, 10:51:25 UTC - in response to Message 58498. Last modified: 1 Aug 2018, 10:52:17 UTC A little digging found that, out of a little over 30 failures, just over half were on AMD machines. If the sample size is big enough to be significant, that would imply to me a higher failure rate for AMD. Edit: Got bored after finding that many, now getting back to work. ID: 58499 ·
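Whether "just over half of about 30" is significant depends on what share of all participating hosts are AMD, which the thread does not give. As a rough sketch only, the check below computes an exact one-sided binomial tail. The counts (16 of 30) and the baseline AMD share (30%) are illustrative assumptions, not project figures.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed numbers for illustration only: "a little over 30" failures,
# "just over half" of them on AMD hosts.
failures = 30
amd_failures = 16

# Hypothetical share of AMD hosts among all hosts that received tasks.
# The real share is not stated anywhere in the thread.
assumed_amd_share = 0.30

p_value = binomial_upper_tail(amd_failures, failures, assumed_amd_share)
print(f"P(at least {amd_failures} AMD failures out of {failures}) = {p_value:.4f}")
```

If the real AMD share were close to one half, the same count would be unremarkable, so the baseline matters more than the raw tally.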

Lockleys Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 58500 - Posted: 1 Aug 2018, 11:31:09 UTC I have a Batch 742 model that's just coming up to 3 hours of processing. Early days, I know, but well past the initial fail zone. ID: 58500 ·

Bonsai911 Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,748,306 RAC: 3,824	Message 58503 - Posted: 1 Aug 2018, 13:23:57 UTC I have 12 workunits. All processing fine without any problems. Keep cooling down your hardware (35° celsius in Germany) and have a nice day, Bonsai911 ID: 58503 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721	Message 58505 - Posted: 1 Aug 2018, 13:46:35 UTC - in response to Message 58503.
"35° celsius in Germany"
I am only running three out of four cores on my laptop as it has been so warm here in the UK. I wouldn't be at all surprised if high temperatures, in Europe at least, are contributing to the current high failure rate, though I haven't seen suggestions that it is widespread apart from the latest batch. ID: 58505 ·

Sardis73 Send message Joined: 1 Apr 12 Posts: 3 Credit: 15,024,721 RAC: 9,374	Message 58507 - Posted: 1 Aug 2018, 13:50:44 UTC - in response to Message 58499. I have about 17 or 18 failures in a row. All are 742. They fail in less than 3 minutes. ID: 58507 ·

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 58509 - Posted: 1 Aug 2018, 15:05:00 UTC - in response to Message 58507. Sardis73, all of yours that I looked at had "Suspended CPDN Monitor - Suspend request from BOINC...", which usually indicates that you're using the default setting for the option "Suspend when non-BOINC CPU usage is above". It may be that these models are more sensitive than most to being interrupted at a crucial moment that's about 3 minutes into the running. Try setting it to 100% to turn it off. ID: 58509 ·

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,967,615 RAC: 14,422	Message 58512 - Posted: 1 Aug 2018, 15:56:39 UTC - in response to Message 58497. Last modified: 1 Aug 2018, 16:03:38 UTC 8 failed 5 mins after starting, all segment violation. ID: 58512 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721	Message 58514 - Posted: 1 Aug 2018, 16:45:36 UTC - in response to Message 58512. Another small batch, 743, along the lines of 740 has been released. Just 120 work units. Not on the front page yet. ID: 58514 ·

rjs5 Send message Joined: 16 Jun 05 Posts: 16 Credit: 19,441,595 RAC: 9,251	Message 58515 - Posted: 2 Aug 2018, 6:32:44 UTC - in response to Message 58509.
"Sardis73, all of yours that I looked at had 'Suspended CPDN Monitor - Suspend request from BOINC...', which usually indicates that you're using the default setting for the option 'Suspend when non-BOINC CPU usage is above'. It may be that these models are more sensitive than most to being interrupted at a crucial moment that's about 3 minutes into the running. Try setting it to 100% to turn it off."
I had 2 of the sam25 generate compute errors (Signal 11 received: Segment violation). It is easy to ignore a "sensitive" model when it generates an error and aborts. How do you detect when the "sensitive" model just gets the wrong answer and does NOT abort? Seems like something important to isolate. ID: 58515 ·

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 58516 - Posted: 2 Aug 2018, 7:09:01 UTC - in response to Message 58515. Last modified: 2 Aug 2018, 7:39:21 UTC Only the researchers can do that when they run various programs against the data received. With climate modeling, answers aren't just right or wrong. There's a wide range of possible answers. Which is what makes this project such a tricky little beast.
****************
On reflection
Perhaps I used words that have a different "obvious meaning" to others. So let's try again.
Most of the failures are at about 3 minutes. (I've seen a few that were further along.) For the "average" computer speed, with the user doing something as well, and with BOINC set to start and stop the program frequently, perhaps the program is at a critical point at about 3 minutes (saving data, swapping data across cell boundaries, etc.), and just then BOINC says STOP. And then when the program is allowed to restart, data "or something else" is missing/corrupted/whatever, and the program goes to the next step in the current if/then/else decision statement and aborts.
All of mine are running OK, so I don't need to worry/think about all of those that are failing. That's the researchers' job. ID: 58516 ·
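The failure mode guessed at above (a stop request landing in the middle of a save, leaving half-written data for the restart) is commonly guarded against by writing checkpoints atomically. The sketch below shows that general technique only. It is not how the CPDN models or BOINC actually checkpoint, and the function names and JSON format are hypothetical.

```python
import json
import os
import tempfile


def write_checkpoint_atomically(path, state):
    """Write state to path so that a reader sees the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap of the complete file into place
    except BaseException:
        # An interruption before the swap leaves the previous checkpoint untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path):
    """Load the most recent complete checkpoint."""
    with open(path) as handle:
        return json.load(handle)
```

With this pattern, a program stopped partway through a save restarts from the last complete checkpoint instead of reading truncated data.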

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,967,615 RAC: 14,422	Message 58517 - Posted: 2 Aug 2018, 8:47:59 UTC - in response to Message 58512. Last modified: 2 Aug 2018, 8:57:40 UTC Another 7 from batch 742 failed this morning with segmentation error. Unfortunately this exceeds my daily task quota so the computer concerned won't be allowed to get more tasks until tomorrow. Also had one fail on my other computer, again after about 4 mins. ID: 58517 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721	Message 58518 - Posted: 2 Aug 2018, 9:52:43 UTC - in response to Message 58517. And 744 has been released now. Produced from sam25t 10 year restarts. Too early to say whether these will have the same high failure rate or not. ID: 58518 ·

ed2353 Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,283,353 RAC: 13,234	Message 58519 - Posted: 2 Aug 2018, 11:41:12 UTC Just to add some perspective. My Win 10 i-7 Intel computer is running 4 from the 742 batch. To check on the reported problems, having let them run for an hour, I deliberately suspended them and then resumed them. They did not fail. One has reached 1 day and 3 hours, the other 3 are at 9 hours, all with no problems. ID: 58519 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721	Message 58520 - Posted: 2 Aug 2018, 12:01:53 UTC Perhaps not surprisingly, it does seem that batch 744 is also suffering from the higher than normal failure rate. I have looked at a few computers that have failed all of the tasks from 742 and 744 that they have received, and there is no obvious common factor in either these or those that are successful. Unless someone with more patience than I have, and the skills to extract the data automagically, looks at a larger dataset than I have done manually, it is not going to be easy to work this one out. ID: 58520 ·
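If someone did extract the task results into a list of records, the comparison described above could be tallied along these lines. This is a sketch under stated assumptions: the record fields (`cpu_vendor`, `os`, `outcome`) and the sample rows are made up for illustration and do not come from the project's results pages.

```python
from collections import defaultdict


def failure_rate_by_group(records, key):
    """Return {group: fraction of tasks that ended in error} for the given record field."""
    totals = defaultdict(int)
    failed = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if record["outcome"] == "error":
            failed[group] += 1
    return {group: failed[group] / totals[group] for group in totals}


# Hypothetical sample rows; real data would have to be gathered from task pages.
sample_records = [
    {"cpu_vendor": "AMD", "os": "Windows", "outcome": "error"},
    {"cpu_vendor": "AMD", "os": "Linux", "outcome": "error"},
    {"cpu_vendor": "AMD", "os": "Windows", "outcome": "success"},
    {"cpu_vendor": "Intel", "os": "Windows", "outcome": "success"},
    {"cpu_vendor": "Intel", "os": "Linux", "outcome": "error"},
    {"cpu_vendor": "Intel", "os": "Windows", "outcome": "success"},
]

for field in ("cpu_vendor", "os"):
    print(field, failure_rate_by_group(sample_records, field))
```

Grouping by one field at a time is the simplest way to spot a common factor. It would miss causes that only show up in combinations, such as one vendor on one operating system.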

rjs5 Send message Joined: 16 Jun 05 Posts: 16 Credit: 19,441,595 RAC: 9,251	Message 58521 - Posted: 2 Aug 2018, 15:08:26 UTC - in response to Message 58516.
"Only the researchers can do that when they run various programs against the data received. With climate modeling, answers aren't just right or wrong. There's a wide range of possible answers. Which is what makes this project such a tricky little beast. **************** On reflection Perhaps I used words that have a different 'obvious meaning' to others. So let's try again. Most of the failures are at about 3 minutes. (I've seen a few that were further along.) For the 'average' computer speed, with the user doing something as well, and with BOINC set to start and stop the program frequently, perhaps the program is at a critical point at about 3 minutes (saving data, swapping data across cell boundaries, etc.), and just then BOINC says STOP. And then when the program is allowed to restart, data 'or something else' is missing/corrupted/whatever, and the program goes to the next step in the current if/then/else decision statement and aborts. All of mine are running OK, so I don't need to worry/think about all of those that are failing. That's the researchers' job."
Thanks for the reply. It just seemed like a program bug where the results COULD possibly be from crunching some garbage. If it is a rogue pointer that points into random program code/data (does not SEGVIO), rather than the "SEGVIO" abort ... then it seems like the computed results will not be what the researcher wants. ID: 58521 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721	Message 58522 - Posted: 2 Aug 2018, 15:18:49 UTC - in response to Message 58521. I don't pretend to understand it but they use some complex statistical package to analyse the results and decide if any need to be discarded. The program running the tasks will eliminate some, e.g. if an impossible climate is produced. At one time a fairly common example of this was -ve theta indicating that there was a negative atmospheric pressure. ID: 58522 ·
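The kind of "impossible climate" rejection mentioned above can be pictured as a range check on model output. The sketch below is only an illustration of that idea: the function name, the hPa units, and the bounds are assumptions, and the real model and the researchers' statistical checks are not described in the thread.

```python
import math


def find_implausible(values, lower, upper):
    """Return the indices of values that are NaN, infinite, or outside [lower, upper]."""
    return [
        index
        for index, value in enumerate(values)
        if not math.isfinite(value) or not (lower <= value <= upper)
    ]


# Hypothetical surface pressures in hPa; the bounds are illustrative, not model limits.
pressures = [1013.2, 998.7, -12.0, float("nan")]
bad = find_implausible(pressures, lower=0.0, upper=1100.0)
print("implausible entries at indices:", bad)
```

A check like this catches values that are physically impossible. It cannot catch a result that is wrong but still plausible, which is the case raised earlier in the thread.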

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 58523 - Posted: 2 Aug 2018, 19:44:08 UTC - in response to Message 58521. The climate models used here are from the UK Met Office, where they run on supercomputers, so it's unlikely that there are still any bugs after all of this time. But there's a large number of ancillary files, and some smaller programs to make them work with the main program. It seems that I may have both "guessed right, and guessed wrong" about what's happening. It is something that happens at about 3 minutes, but it's to do with switching from the global model to the regional model. All of which is up to the researchers who are putting the various bits together, ready for people to download, to solve. We just need to crunch on and provide the data for them, both good and bad. Hopefully, some/lots of the crashes also produced the "out files", and they got uploaded. These contain info about where the programs were up to at the time they ended. And batch 742 has been paused while thinking is in progress. ID: 58523 ·

ed2353 Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,283,353 RAC: 13,234	Message 58524 - Posted: 2 Aug 2018, 22:08:18 UTC
"And batch 742 has been paused while thinking is in progress."
So should I continue crunching my 742s and those in my queue? ID: 58524 ·

Thund3rb1rd Send message Joined: 18 Jun 05 Posts: 24 Credit: 2,500,676 RAC: 0	Message 58525 - Posted: 2 Aug 2018, 23:01:02 UTC
"The climate models used here are from the UK Met Office, where they run on supercomputers, so it's unlikely that there are still any bugs after all of this time."
At the risk of being thought disagreeable, I respectfully disagree. This situation regarding numerous failing tasks is purely the result of inadequate; nay, POOR CPDN software design, aka bugs... perhaps an entire nest of them! The entire purpose of BOINC is to enable multiple projects to be run on individual PC's, not supercomputers. Dinking around with the global settings inherent in BOINC to PERHAPS stabilize one project - i.e., CPDN - at the risk of destabilizing other BOINC-related projects - i.e., SETI, LHC, Cosmology, Milky Way, etc, etc, etc - is NOT a solution and is in fact foolhardy. The tasks may or may not contain garbage data - if they do, then it is up to the programmers to determine what that bad data may contain and adjust the operating code to compensate, OR to adjust the code creating the tasks to edit the data more courageously. In any event, comparing the operating system and processing software that may be running on whatever mainframe CPDN uses to the myriad operating systems being used by BOINC volunteers in a vain hope to stabilize CPDN is just simply useless. To reiterate a comment I made recently on this subject in another thread, NO ONE really understands what the problem is, let alone what a solution may be. To mis-quote the Bard, The fault, dear Brutus, is not in our PC's, but in CPDN, for we are underlings. ID: 58525 ·