Tasks in progress limit

Message boards : Number crunching : Tasks in progress limit
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,780,446
RAC: 19,423
Message 66879 - Posted: 12 Dec 2022, 21:51:31 UTC

Just noticed something I haven't seen before at CPDN. I fired up a macOS Mojave VM to get some recently released Mac tasks, and after the initial batch of 7 failed within a minute, I noticed this in the Event Log when trying to get more work:
This computer has finished a daily quota of 1 tasks

And it kept coming up on subsequent attempts to get work. A tasks-in-progress limit?! Was this recently implemented? If so, it's long overdue. The limit seems to be 1 if you have 0 Consecutive valid tasks; otherwise it's 4 plus the number of Consecutive valid tasks.
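For what it's worth, those numbers are consistent with BOINC's adaptive daily quota, which cuts a host's quota on each errored task and raises it again on valid ones. A minimal sketch of that halve-on-error/raise-on-success idea (my reading of the behaviour, not the actual server code):

```python
# Sketch of a BOINC-style adaptive daily quota (illustrative, not the real server code).
def update_quota(quota: int, success: bool, max_quota: int = 100) -> int:
    """Raise the quota on a valid result, halve it on an error; never below 1."""
    if success:
        return min(quota * 2, max_quota)
    return max(quota // 2, 1)

quota = 8
for outcome in [False, False, False]:   # three crashed tasks in a row
    quota = update_quota(quota, outcome)
print(quota)  # quota collapses to 1 after repeated failures
```

That would explain why a VM that trashed its first batch of tasks is pinned at "a daily quota of 1 tasks" until it returns something valid.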
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 66880 - Posted: 12 Dec 2022, 22:41:10 UTC - in response to Message 66879.  

I think it has always been there. However, the limit is actually per CPU on the computer. And it still wasn't enough to stop machines with 64 CPUs trashing that many tasks every day due to missing 32-bit libraries.
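In other words, the ceiling scales with the core count, which is why the quota alone can't save a misconfigured many-core box. Roughly (illustrative numbers, not CPDN's actual settings):

```python
# Effective daily ceiling when the quota is per CPU rather than per host.
# The per-CPU figure here is illustrative, not CPDN's actual setting.
def max_daily_tasks(quota_per_cpu: int, ncpus: int) -> int:
    return quota_per_cpu * ncpus

print(max_daily_tasks(1, 64))  # even at the minimum per-CPU quota of 1, a 64-CPU host can trash 64 tasks a day
```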

The reason for the tasks crashing has been identified: the initial-condition files on some of the batch. Sara is going to deprecate the affected ones so they won't be reissued, and a new batch will be on the way. They may not be put on the servers, though, until the upload problem is resolved.
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,780,446
RAC: 19,423
Message 66881 - Posted: 13 Dec 2022, 9:53:39 UTC - in response to Message 66880.  

So you're saying that the limits stated on the Application details for host page are per core, not just per host?

After some looking around, it does seem like you might be right. It would also explain why I've never seen it even though I've had many tasks crash, and why sometimes I can get a bunch of tasks per request and other times only a few.

That's disappointing; it means nothing has changed. I really hope that after getting OIFS figured out, the project will take the time to adjust that setting so it's per host, not per core. It's hard for me to imagine that getting quality, timely results to the scientists isn't a high priority for the project, yet it allows such a high failure rate. Is a solution really that difficult? I remember seeing a few different options in the BOINC server manual for limiting the work sent to hosts. Are they that difficult to implement?
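For reference, the kinds of server-side knobs I remember from the BOINC server documentation look something like the fragment below. The option names are from memory and the values are made up for illustration, so check the server docs before taking them as gospel:

```xml
<!-- Sketch of a BOINC server config.xml fragment (names from memory, values illustrative) -->
<config>
  <!-- base daily quota per CPU; BOINC lowers a host's quota on errors and raises it on valid results -->
  <daily_result_quota>4</daily_result_quota>
  <!-- cap on tasks a host may have in progress, again applied per CPU -->
  <max_wus_in_progress>2</max_wus_in_progress>
</config>
```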
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 66882 - Posted: 13 Dec 2022, 10:15:39 UTC - in response to Message 66881.  

CPDN and some other projects work by sending out enough work units to get back enough results to fulfil their needs. I don't know whether the BOINC server code allows for increasingly strict sanctions on computers that crash tasks; it certainly doesn't seem to be handled any differently on the other projects where I have looked. Only a few projects have the 32-bit libraries problem on Linux that CPDN has, and if OIFS becomes the norm for work here, the biggest source of crashed tasks will be gone. The change to 30-day deadlines for these tasks is one I welcome, and shortening deadlines still further, more in line with what many other BOINC projects do, would I suspect be the easier route to getting results back quickly.
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,780,446
RAC: 19,423
Message 66884 - Posted: 13 Dec 2022, 12:09:42 UTC - in response to Message 66882.  

CPDN and some other projects work by sending out enough work units to get back enough results to fulfil their needs.

From what I remember of Glenn's posts on the subject, I don't think that's the case. I think he said that the scientists need pretty much all of the results, and I don't think he was only talking about OIFS.

... if OIFS becomes the norm for work here then the biggest source of crashed tasks will be gone.

Also from what Glenn said, OIFS isn't a replacement for the Hadley models; they are different models that do different things. OIFS may take priority in the short term due to contractual obligations, but I doubt the Hadley models are going anywhere.

I don't know whether the BOINC server code allows for increasingly strict sanctions on computers that crash tasks; it certainly doesn't seem to be handled any differently on the other projects where I have looked.

I'll have to find that site again to see which options are listed for restricting work to PCs. I'm only familiar with the few projects I contribute to, but it seems highly doubtful that any project comes close to CPDN's failure rate, which is mostly due to misconfigured machines. So it seems to me that CPDN needs to handle things differently than other projects do.

The change to 30-day deadlines for these tasks is one I welcome ...

I agree about the deadlines for OIFS. They should also be adjusted to something similar for the Hadley models, depending on the typical length of the model, so N216 might need a little longer. Deadlines are a separate problem from crashes, and both need to be addressed. Short deadlines would prevent the hogging of work like on this PC: https://www.cpdn.org/results.php?hostid=1521318. There's no way it'll finish all of that work within the year-long deadline given all of the N216s, and that was before it loaded up on a bunch of recent N144s. It might be a good idea to server-side abort/cancel the vast majority of tasks on that PC and let others finish the work.
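A back-of-the-envelope check shows why such a queue can't fit the deadline. The figures below are invented for illustration, not that host's real numbers:

```python
# Rough feasibility check: can a host clear its queue by the deadline?
# All figures here are illustrative assumptions, not real data for the linked host.
def days_to_finish(n_tasks: int, hours_per_task: float, cores: int,
                   hours_per_day: float = 24.0) -> float:
    """Wall-clock days needed if the host crunches flat out on all cores."""
    return n_tasks * hours_per_task / (cores * hours_per_day)

needed = days_to_finish(n_tasks=500, hours_per_task=200, cores=8)
print(round(needed))   # about 521 days of nonstop crunching for this hypothetical queue
print(needed <= 365)   # False: it cannot fit inside a one-year deadline
```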
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 66886 - Posted: 13 Dec 2022, 12:26:46 UTC

I probably should have been clearer. In the past, the Hadley models have been sent out with enough tasks to get sufficient results back. Yes, Glenn did imply that at least for this current lot of OIFS they are looking for a much higher return rate. It may mean that these do not get the statistical tools applied to them that the Hadley ones do to determine the probability of GHGs causing particular climate events, etc.
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 66914 - Posted: 15 Dec 2022, 13:22:34 UTC - in response to Message 66886.  

I probably should have been clearer. In the past, the Hadley models have been sent out with enough tasks to get sufficient results back. Yes, Glenn did imply that at least for this current lot of OIFS they are looking for a much higher return rate. It may mean that these do not get the statistical tools applied to them that the Hadley ones do to determine the probability of GHGs causing particular climate events, etc.
Yes, correct. For the current crop of OpenIFS batches if some tasks are never successful, they will get sent out again.
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,780,446
RAC: 19,423
Message 66928 - Posted: 15 Dec 2022, 21:20:54 UTC - in response to Message 66914.  

Glenn, do you happen to know about the Hadley models? Do they need pretty much a 100% return? Do their failures get resent?
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 66929 - Posted: 16 Dec 2022, 5:35:58 UTC - in response to Message 66928.  

Glenn, do you happen to know about the Hadley models? Do they need pretty much a 100% return? Do their failures get resent?
Depending on the batch, Hadley models on Linux get either 3 or 5 tries before being given up on. They use a statistical method to assess the validity of tasks, and I think they get useful data at a return rate of about 75% or 80%, though my memory of the figure may be wrong. They send out large enough batches that they get the data they need even at the lower return rate.
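The batch-sizing arithmetic is straightforward: to expect R usable results at a return rate p, send roughly R/p work units, rounded up. A quick sketch (the target figure is an example; the 75-80% rates are the from-memory numbers above):

```python
import math

# How many work units to send to expect `target` usable results at return rate `p`.
# The target of 10,000 is an example; 0.75 and 0.80 are the from-memory return rates.
def batch_size(target: int, p: float) -> int:
    return math.ceil(target / p)

print(batch_size(10000, 0.80))  # 12500 work units for 10,000 results at an 80% return rate
print(batch_size(10000, 0.75))  # 13334 at 75%
```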
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66931 - Posted: 16 Dec 2022, 10:09:09 UTC - in response to Message 66929.  

Here is one that I got right. The previous four all failed with Negative Theta.

Workunit 12161447
name 	hadsm4_a0hi_201312_3_941_012161447
application 	UK Met Office HadSM4 at N144 resolution
created 	16 Nov 2022, 11:16:14 UTC
canonical result 	22245006
granted credit 	0.00
minimum quorum 	1
initial replication 	1
max # of error/total/success tasks 	5, 5, 1
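The "max # of error/total/success tasks 5, 5, 1" line encodes the retry rule discussed above: the work unit is reissued until one task succeeds or five have errored out, whichever comes first. A toy model of that bookkeeping (a sketch, not the actual BOINC scheduler code):

```python
# Toy model of a BOINC work unit with max_error_results=5, max_success_results=1.
def workunit_outcome(results: list) -> str:
    """`results` is the per-try success/failure history, oldest first."""
    errors = successes = 0
    for ok in results:
        if ok:
            successes += 1
        else:
            errors += 1
        if successes >= 1:
            return "success"
        if errors >= 5:
            return "error (won't be re-sent)"
    return "in progress"

print(workunit_outcome([False, False, False, False, True]))  # the hadsm4 case above: 4 failures, then a good result
print(workunit_outcome([False] * 5))                         # five strikes and the work unit is abandoned
```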



©2024 cpdn.org