Thread 'Tasks in progress limit'

AndreyOR
Joined: 12 Apr 21
Posts: 318
Credit: 15,011,722
RAC: 7,015
Message 66879 - Posted: 12 Dec 2022, 21:51:31 UTC

Just noticed something I haven't seen before at CPDN. I fired up a macOS Mojave VM to get some recently released Mac tasks, and after the initial batch of 7 failed within a minute, I noticed this in the Event Log when trying to get more work:
This computer has finished a daily quota of 1 tasks

And it kept coming up on subsequent attempts to get work. Tasks in progress limit?! Was this recently implemented? If so, it's long overdue. It seems like the limit is 1 if you have 0 for Consecutive valid tasks. Otherwise it's 4 plus the number of Consecutive valid tasks.
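
A minimal sketch of the rule as guessed above. This only encodes the observation in this post; it is not taken from the BOINC server code, and the function and parameter names are made up for illustration.

```python
def guessed_daily_quota(consecutive_valid: int) -> int:
    """Daily quota as guessed in the post above (hypothetical helper).

    Assumption: a host with 0 consecutive valid tasks gets 1 task,
    otherwise it gets 4 plus its consecutive valid count.
    """
    if consecutive_valid < 0:
        raise ValueError("consecutive_valid must be non-negative")
    return 1 if consecutive_valid == 0 else 4 + consecutive_valid


# Show the guessed quota for a few consecutive-valid counts.
for count in (0, 1, 10):
    print(count, guessed_daily_quota(count))
```

Under this guess, a host whose first batch all fails drops straight to a quota of 1, which would match the Event Log message quoted above.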

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 66880 - Posted: 12 Dec 2022, 22:41:10 UTC - in response to Message 66879.  

I think it has always been there. However, the limit is actually per CPU on the computer. And it still wasn't enough to stop machines with 64 CPUs trashing that many tasks every day due to missing 32-bit libraries.
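
A rough sketch of what a per-CPU limit means for a large machine. It assumes the effective daily limit is simply the per-CPU quota multiplied by the CPU count; the function name is hypothetical, and the real scheduler may cap or adjust this.

```python
def daily_limit_for_host(per_cpu_quota: int, ncpus: int) -> int:
    """Effective daily task limit if the quota is applied per CPU.

    Assumption: plain multiplication, with no cap on the CPU count.
    """
    if per_cpu_quota < 0 or ncpus < 1:
        raise ValueError("quota must be >= 0 and ncpus must be >= 1")
    return per_cpu_quota * ncpus


# Even at the minimum quota of 1, a 64-CPU host could fetch 64 tasks a day.
print(daily_limit_for_host(1, 64))
```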

The reason for the tasks crashing has been identified as the files with the initial conditions on some of the batch. Sara is going to deprecate the affected ones so they won't be reissued and a new batch will be on the way. They may not be put on the servers though till the upload problem is resolved.

AndreyOR
Joined: 12 Apr 21
Posts: 318
Credit: 15,011,722
RAC: 7,015
Message 66881 - Posted: 13 Dec 2022, 9:53:39 UTC - in response to Message 66880.  

So you're saying that the limits stated in the Application detail for host page are per core, not just per host?

After some looking around it does seem like you might be right. It may also explain why I've never seen it even though I've had many tasks crash and why sometimes I can get a bunch of tasks per request and other times only a few.

That's disappointing; it means nothing has changed. I really hope that, after getting OIFS figured out, the project will take the time to adjust that setting so it's per host, not per core. It's hard for me to imagine that getting quality and timely results to the scientists isn't a high priority for the project. Yet it's allowing such a high failure rate. Is a solution really that difficult? I remember seeing a few different options in the BOINC server manual for settings on how to limit work to hosts. Are they that difficult to implement?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 66882 - Posted: 13 Dec 2022, 10:15:39 UTC - in response to Message 66881.  

CPDN and some other projects work on sending out a number of work units that gets back enough results to fulfil their needs. I don't know if the BOINC server code allows for increasingly strict sanctions on computers crashing tasks. It certainly doesn't seem to be handled any differently on other projects where I have looked at it. There are only a few projects with the 32-bit libraries problem on Linux that CPDN has, and if OIFS becomes the norm for work here then the biggest source of crashed tasks will be gone. The change to 30 day deadlines for these tasks is one I welcome, and shortening deadlines still further, more in line with what many other BOINC projects do, would I suspect be the easier route to take to get results back quickly.

AndreyOR
Joined: 12 Apr 21
Posts: 318
Credit: 15,011,722
RAC: 7,015
Message 66884 - Posted: 13 Dec 2022, 12:09:42 UTC - in response to Message 66882.  

Dave Jackson wrote: "CPDN and some other projects work on sending out a number of work units that gets back enough results to fulfil their needs."

From what I remember Glenn posting on the subject, I don't think that's the case. I think he said that the scientists do need pretty much all of the results, and I don't think he was only talking about OIFS.

Dave Jackson wrote: "... if OIFS becomes the norm for work here then the biggest source of crashed tasks will be gone."

Also from what Glenn said, OIFS isn't a replacement for Hadley. Those two are different models that do different things. OIFS may take priority in the short term due to contractual obligations, but I doubt Hadley models are going anywhere.

Dave Jackson wrote: "I don't know if the BOINC server code allows for increasingly strict sanctions on computers crashing tasks. It certainly doesn't seem to be handled any differently on other projects where I have looked at it."

I'll have to find that site again to see what options were listed for restricting work to PCs. I'm only familiar with the few projects that I contribute to, but it seems highly doubtful that any project comes close to the failure rate of CPDN, which is mostly due to misconfigured machines. Thus it'd seem to me that CPDN needs to handle things differently than other projects.

Dave Jackson wrote: "The change to 30 day deadlines for these tasks is one I welcome ..."

I agree about the deadlines for OIFS. They should also be adjusted for Hadley models to something similar, depending on the typical length of the model. So N216 might need a little longer. Deadlines are a separate problem from crashes, and both need to be addressed. Short deadlines will prevent hogging of work, as on this PC: https://www.cpdn.org/results.php?hostid=1521318. There's no way it'll finish all of the work by the year-long deadline with all of the N216s, and that was before it loaded up on a bunch of recent N144s. It might be a good idea to abort/cancel the vast majority of tasks on that PC server-side and let others finish the work.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 66886 - Posted: 13 Dec 2022, 12:26:46 UTC

I probably should have been clearer. In the past, the Hadley models have been sent out with enough tasks to get sufficient results back. Yes, Glenn did imply that, at least for this current lot of OIFS, they are looking for a much higher return rate. It may mean that these do not get the statistical tools applied to them that the Hadley ones do to determine probabilities of GHG causing particular climate events etc.

Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,726,716
RAC: 12,672
Message 66914 - Posted: 15 Dec 2022, 13:22:34 UTC - in response to Message 66886.  

Dave Jackson wrote: "I probably should have been clearer. In the past, the Hadley models have been sent out with enough tasks to get sufficient results back. Yes, Glenn did imply that, at least for this current lot of OIFS, they are looking for a much higher return rate. It may mean that these do not get the statistical tools applied to them that the Hadley ones do to determine probabilities of GHG causing particular climate events etc."

Yes, correct. For the current crop of OpenIFS batches, if some tasks are never successful, they will get sent out again.

AndreyOR
Joined: 12 Apr 21
Posts: 318
Credit: 15,011,722
RAC: 7,015
Message 66928 - Posted: 15 Dec 2022, 21:20:54 UTC - in response to Message 66914.  

Glenn, do you happen to know about Hadley models... Do they need pretty much 100% return? Do their failures get resent?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 66929 - Posted: 16 Dec 2022, 5:35:58 UTC - in response to Message 66928.  

AndreyOR wrote: "Glenn, do you happen to know about Hadley models... Do they need pretty much 100% return? Do their failures get resent?"

Depending on the batch, Hadley models on Linux get either 3 or 5 tries before being given up on. They use a statistical method to look at the validity of tasks, and I think they get useful data on a lower return rate of about 75% or 80%, but my memory of the figure may be wrong. They send out large enough batches that they get the data they need on a lower return rate.
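
To put rough numbers on the retry limits mentioned above, here is a small sketch. It assumes each attempt at a work unit succeeds independently with the same probability, which is a simplification (a bad initial-conditions file would fail every time), and the 50% figure is purely illustrative, not a CPDN statistic.

```python
def chance_of_a_result(p_success: float, max_tries: int) -> float:
    """Probability that at least one of max_tries attempts succeeds.

    Assumption: attempts are independent and share one success rate.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    return 1.0 - (1.0 - p_success) ** max_tries


# Compare the two retry limits for an illustrative 50% per-attempt rate.
for tries in (3, 5):
    print(tries, chance_of_a_result(0.5, tries))
```

With those made-up inputs, 3 tries give an 87.5% chance of getting a result back and 5 tries give about 96.9%, which shows why a retry limit helps even when many individual tasks crash.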

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66931 - Posted: 16 Dec 2022, 10:09:09 UTC - in response to Message 66929.  

Here is one that I got right. The previous four all failed with Negative Theta.

Workunit 12161447
name 	hadsm4_a0hi_201312_3_941_012161447
application 	UK Met Office HadSM4 at N144 resolution
created 	16 Nov 2022, 11:16:14 UTC
canonical result 	22245006
granted credit 	0.00
minimum quorum 	1
initial replication 	1
max # of error/total/success tasks 	5, 5, 1


©2024 cpdn.org