Thread 'Error while computing???'

Author	Message
Thund3rb1rd Send message Joined: 18 Jun 05 Posts: 24 Credit: 2,500,676 RAC: 0	Message 58430 - Posted: 19 Jul 2018, 10:11:30 UTC My tasks keep dying. They get just so far - in some cases VERY far - then die off. This was happening even before the Situation. Anyone have any ideas why? ID: 58430 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 58431 - Posted: 19 Jul 2018, 11:44:28 UTC - in response to Message 58430. The only obvious thing from your Tasks list is that you appear to be using the default setting for "Suspend when non-BOINC CPU usage is above". This causes BOINC to keep stopping and starting the models, which they don't like. They're from the UK Met Office, where they run on supercomputers, and are not coded to survive constant stopping and starting. So setting this option to 100% will "turn it off", and allow the tasks to run continuously. If you find this makes your computer sluggish, then reduce the number of tasks that run at the same time: find "Use at most 100% of the CPUs" and make this 50%. See what these 2 changes do for that computer. After that it may be down to something that you're using that computer for. ID: 58431 · Reply Quote
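
The two changes suggested above can also be expressed as a local preferences override rather than through the settings dialog. The sketch below only builds the XML text. It assumes the override file is BOINC's global_prefs_override.xml and that the tag names suspend_cpu_usage and max_ncpus_pct correspond to the two options named in the post; check both against the BOINC documentation before relying on it. The function name build_override is made up for this example.

```python
import xml.etree.ElementTree as ET


def build_override(suspend_cpu_usage: float = 100.0,
                   max_ncpus_pct: float = 50.0) -> str:
    """Return XML text for a preferences override (hypothetical helper).

    suspend_cpu_usage: threshold for "Suspend when non-BOINC CPU usage is
        above"; 100 follows the advice in the post to "turn it off".
    max_ncpus_pct: value for "Use at most N% of the CPUs".
    The tag names are assumed to match BOINC's override file format.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "suspend_cpu_usage").text = f"{suspend_cpu_usage:.1f}"
    ET.SubElement(root, "max_ncpus_pct").text = f"{max_ncpus_pct:.1f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML instead of writing it, so nothing on disk is changed.
    print(build_override())
```

Printing rather than writing the file is deliberate: the location of the BOINC data directory differs between installations, so the path is left to the reader.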

Thund3rb1rd Send message Joined: 18 Jun 05 Posts: 24 Credit: 2,500,676 RAC: 0	Message 58435 - Posted: 19 Jul 2018, 19:16:53 UTC - in response to Message 58431. First, thank you for the help. I appreciate your time. I wasn't using the suspend option at all - at least, the box wasn't checked. I've activated it and set it to 100% per your comment. I wonder if not having a setting at all may have been the problem. I was already using only 75% of the CPUs, but have set that to 50%. Okay. We'll see what we see, I guess. Again, thank you for your time. ID: 58435 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 58437 - Posted: 19 Jul 2018, 22:08:39 UTC - in response to Message 58435. 75% may be OK too; it all depends on how sluggish the computer "feels". I've got a quad core which is hyper-threaded, so I limit it to 50% and just use, hopefully, the "real" cores, leaving the others for housekeeping, etc. ID: 58437 · Reply Quote
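
The arithmetic behind the 50% figure can be sketched as follows. A hyper-threaded quad core presents 8 logical CPUs, so 50% leaves 4 tasks running, which matches the number of physical cores (although, as the post says, there is no guarantee the tasks land on the "real" cores). Rounding down and the minimum of one CPU are assumptions made for this example, not a statement of how the BOINC client rounds.

```python
def cpus_used(logical_cpus: int, percent: float) -> int:
    """Number of CPUs available to tasks at a given "use at most" percentage.

    Assumption for illustration: round down, but never below 1.
    """
    return max(1, int(logical_cpus * percent / 100))


if __name__ == "__main__":
    logical = 4 * 2  # quad core with hyper-threading, as described in the post
    for pct in (50, 75, 100):
        print(f"{pct}% of {logical} logical CPUs -> {cpus_used(logical, pct)} tasks")
```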

Thund3rb1rd Send message Joined: 18 Jun 05 Posts: 24 Credit: 2,500,676 RAC: 0	Message 58447 - Posted: 21 Jul 2018, 22:43:00 UTC A post in a different thread reminded me of something - Since October/November 2017, virtually every task I've attempted has died with a computing error. Before that, I had gone for at least a year with virtually every task running to completion - not all, of course, but surely more than 95%. I got the occasional time-out, and the occasional Error while Computing, but by and large, I had no real problems. After that period, out of 33 tasks attempted, I've had only 3 run to completion, with 2 hung up in purgatory. In looking over my task list, it's been a mixed bag as far as which of my three boxes had problems, but one thing is certain - the problems started just about a year ago regardless of which machine was running the task. I spent too many years as a programmer to make changes willy-nilly with no evidence to back up the changes. Up until I examined the situation, I was of the opinion that my machine was messed up somehow. Now, I'm not so sure, particularly if others started having problems about that time. ID: 58447 · Reply Quote
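
To put the figures in the post above side by side, here is a small worked calculation. It assumes the 2 tasks "hung up in purgatory" are counted among the 33 attempted, which the post does not state outright, so the failure count is an inference.

```python
def task_summary(attempted: int, completed: int, pending: int) -> dict:
    """Summarize task outcomes; assumes pending tasks are part of `attempted`."""
    failed = attempted - completed - pending
    finished = attempted - pending  # tasks with a final result either way
    return {
        "failed": failed,
        "completion_pct_of_attempted": round(100 * completed / attempted, 1),
        "failure_pct_of_finished": round(100 * failed / finished, 1),
    }


if __name__ == "__main__":
    # Figures quoted in the post: 33 attempted, 3 completed, 2 still pending.
    print(task_summary(attempted=33, completed=3, pending=2))
```

On those numbers about 9% of attempted tasks completed, compared with the "more than 95%" reported for the period before.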

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 58448 - Posted: 21 Jul 2018, 23:38:56 UTC - in response to Message 58447. That would make it the wah2 models. Perhaps they're more susceptible to the stop/start business. ID: 58448 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 58449 - Posted: 22 Jul 2018, 2:51:06 UTC There is something that I have been wondering about for some time. We know that CPDN does not like being started and stopped often. In fact the WU’s tend to crash. We tweak the settings to prevent repeated stops and starts due to CPU usage. But, if you run more than one project on the same machine and one of them is CPDN, they are being started and stopped frequently. The default setting is to switch between projects every 120 minutes. Wouldn’t this tend to increase the failure rate? Is it safe to run 2 or more projects alongside CPDN? ID: 58449 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 58450 - Posted: 22 Jul 2018, 5:50:04 UTC - in response to Message 58449. Yes, I think that would happen with multiple projects. But perhaps the longer time between switches lessens the chance that the program "gets caught at a bad time". I don't know where the sensitivity is, but one place may be where the calcs are paused so that the program can exchange data across the "cell" boundaries. The size of the cells for the "area of interest" is the smallest (the number at the end of the abbreviated name, such as the current nam50), then the rest of the globe, then the ocean cells. (These latter change quite slowly.) I know that lots of people do run a mix of projects with cpdn, but I've never been interested in how successful this is. ID: 58450 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 58451 - Posted: 22 Jul 2018, 5:53:13 UTC - in response to Message 58430. "My tasks keep dying. They get just so far - in some cases VERY far - then die off. This was happening even before the Situation." I have received essentially no work units in what seems to be a year. I run Linux, so that accounts for this. But if I remember correctly, I had no trouble running up to four work units at a time on a 4-core processor or, before that, on a machine with 2 hyper-threaded Xeon processors. And I had no such problem. I believe the secret was to have the "Leave non-GPU tasks in memory while suspended" option checked. ID: 58451 · Reply Quote
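
Several settings have now come up in this thread: the CPU-usage suspend threshold, the CPU percentage, the task-switch interval, and "leave in memory". A minimal sketch for checking them in one place is shown below. It assumes the settings live in an XML preferences file with the tag names suspend_cpu_usage, max_ncpus_pct, cpu_scheduling_period_minutes and leave_apps_in_memory, as they are believed to appear in BOINC's global_prefs_override.xml; treat those names as assumptions and verify them against your own file. The sample text and the function name read_prefs are illustrative only.

```python
import xml.etree.ElementTree as ET

# Tag names assumed to match BOINC's preferences file format.
SETTINGS = (
    "suspend_cpu_usage",
    "max_ncpus_pct",
    "cpu_scheduling_period_minutes",
    "leave_apps_in_memory",
)

# Illustrative sample, not taken from any poster's machine.
SAMPLE = """
<global_preferences>
  <suspend_cpu_usage>100.0</suspend_cpu_usage>
  <max_ncpus_pct>50.0</max_ncpus_pct>
  <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
"""


def read_prefs(xml_text: str) -> dict:
    """Return the settings of interest; a missing tag maps to None."""
    root = ET.fromstring(xml_text)
    return {name: root.findtext(name) for name in SETTINGS}


if __name__ == "__main__":
    for name, value in read_prefs(SAMPLE).items():
        print(f"{name}: {value if value is not None else '(not set here)'}")
```

A value of None only means the tag is absent from the text that was parsed; the client may still take that setting from elsewhere.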

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 58454 - Posted: 22 Jul 2018, 19:38:32 UTC It's been so long since I set my options that I'd forgotten about "leave in memory", which is also important. On another note, Jean, have a read of this post: "How to install Wine in Linux Mint and Ubuntu". There's a couple of tricky spots, but it gets the latest version, which is currently 3.0.something. ID: 58454 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 58457 - Posted: 23 Jul 2018, 6:17:17 UTC - in response to Message 58454. Like Les, I haven't paid particular attention to any other projects apart from CPDN. However, I do run WCG when no work is available for CPDN and haven't noticed any increase in the error rate if I leave the WCG tasks running when I do get Climate Prediction work. I have on occasion, with the same thoughts in mind, increased the switching time, but not for a number of years. ID: 58457 · Reply Quote

Thund3rb1rd Send message Joined: 18 Jun 05 Posts: 24 Credit: 2,500,676 RAC: 0	Message 58462 - Posted: 23 Jul 2018, 17:44:51 UTC Several new posts to this thread have postulated various causes for CPDN to error out. I've checked each of my machines and have found nothing that stands out as a smoking gun. I've been running CPDN since 2005 together with as many as 10 additional BOINC projects on various versions of Wintel machines, and up until last fall had no problems with any of them, apart from the occasional hiccups which go with any BOINC project. For the most part, they have all played nicely together for more than a decade. As I remarked below, I spent too many decades as a programmer to make system changes without investigating them thoroughly beforehand. Right now, I'm running eight BOINC projects - including CPDN - and the only project I'm currently having issues with is CPDN, and I didn't start having those problems until last fall. I do not make configuration changes to BOINC without good reason. However it wants to install itself is fine with me. In all the years I've been running BOINC projects, the only global change I've ever made to all of my machines is to disable GPU use. This was the case before last fall, and is true today. I've carefully investigated each of the clues and suggestions found in previous threads and come up with zilch. I'm not denying the fault MAY lie with my machine, but if that's the case, it's with all three machines and that isn't reasonable. The theory that the problem is caused by constant switching back and forth from CPDN to other projects doesn't hold up in my experience with multiple projects. ID: 58462 · Reply Quote

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,033,074 RAC: 14,759	Message 58487 - Posted: 29 Jul 2018, 22:03:17 UTC Had 3 batch 738 models fail with this error:
<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Suspended CPDN Monitor - Suspend request from BOINC...
Sorry, too many model crashes! :-(
00:33:07 (6124): called boinc_finish(22)
</stderr_txt>
]]>
ID: 58487 · Reply Quote
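
When comparing several failed tasks, it can help to reduce a stderr dump like the one above to a few counts. The sketch below tallies suspend requests and model crashes and picks out the exit code. It matches only the message strings visible in the posted log; the function name summarize_stderr and the shortened sample are made up for this example, and the counts by themselves do not show whether the suspends caused the crashes.

```python
import re

# Shortened sample in the same format as the posted log (illustrative only).
SAMPLE = """\
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Suspended CPDN Monitor - Suspend request from BOINC...
Sorry, too many model crashes! :-(
00:33:07 (6124): called boinc_finish(22)
"""

FINISH_RE = re.compile(r"called boinc_finish\((\d+)\)")


def summarize_stderr(text: str) -> dict:
    """Count suspend requests and model crashes in a CPDN stderr dump."""
    summary = {"suspends": 0, "crashes": 0, "gave_up": False, "exit_code": None}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Suspended CPDN Monitor"):
            summary["suspends"] += 1
        elif line.startswith("Model crashed:"):
            summary["crashes"] += 1
        elif "too many model crashes" in line:
            summary["gave_up"] = True
        match = FINISH_RE.search(line)
        if match:
            summary["exit_code"] = int(match.group(1))
    return summary


if __name__ == "__main__":
    print(summarize_stderr(SAMPLE))
```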

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 58488 - Posted: 30 Jul 2018, 7:27:21 UTC - in response to Message 58487. "Had 3 batch 738 models fail with this error:" 738 is one of the test batches that would, if it were running, have gone to the testing site. At least the one I looked at also failed on its two other attempts. I am afraid that this sort of thing is likely until the testing site is back up and running. ID: 58488 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 58489 - Posted: 30 Jul 2018, 8:24:51 UTC - in response to Message 58488. I have just had it confirmed that there was an issue with this batch, and if any are still out there they can be aborted. Further resends of any that haven't already failed three times will be stopped. ID: 58489 · Reply Quote

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 58490 - Posted: 30 Jul 2018, 14:28:57 UTC - in response to Message 58462. "The theory that the problem is caused by constant switching back and forth from CPDN to other projects doesn't hold up in my experience with multiple projects." But your experience thus far seems to be that you do have problems with multiple projects. Have you tried running CPDN by itself? ID: 58490 · Reply Quote

flashawk Send message Joined: 29 Jun 12 Posts: 31 Credit: 1,438,478 RAC: 0	Message 58493 - Posted: 1 Aug 2018, 5:04:37 UTC - in response to Message 58490. All the new WU's are failing - wah2_sam25. These are the new ones that were just released, computation error within 3 minutes of starting. ID: 58493 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 58494 - Posted: 1 Aug 2018, 5:30:09 UTC - in response to Message 58493. "All the new WU's are failing - wah2_sam25. These are the new ones that were just released, computation error within 3 minutes of starting." Checked, and certainly a lot are failing. None of those I have been able to find so far have been out long enough to fail on a second machine, but I found enough to justify informing the project. I imagine this will be seen in about 2-2.5 hours time. ID: 58494 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 58495 - Posted: 1 Aug 2018, 6:53:04 UTC - in response to Message 58494. At least one has failed twice with the segfault error, though this is not quite conclusive, as one of the two machines had above average failure rates and the other was a brand new machine with only 6 tasks listed. From Sihan at the project: "Thanks. I don't see what the problem is right away, will do some investigation." ID: 58495 · Reply Quote

flashawk Send message Joined: 29 Jun 12 Posts: 31 Credit: 1,438,478 RAC: 0	Message 58501 - Posted: 1 Aug 2018, 11:59:51 UTC - in response to Message 58495. I'm still running 8 WU's from the previous batch without any issues. All of the new ones failed within 3.5 minutes of being started, I'm downloading 11 more right now. ID: 58501 · Reply Quote