Thread 'One of my computers is hoarding tasks and I don't know why'

Author	Message
Steven Send message Joined: 28 Jun 14 Posts: 4 Credit: 8,570,955 RAC: 6	Message 67726 - Posted: 15 Jan 2023, 2:02:36 UTC Last modified: 15 Jan 2023, 2:04:51 UTC Hey, folks. One of my computers keeps getting more OIFS tasks and I really, truly don't understand why. It runs one CPDN task at a time via an app_config.xml file. It has the same work buffer settings as my other computers, which is to get 0.1 days of work plus up to an additional 1 day. The tasks take between 13 and 14 hours each, so we're definitely overshooting the "up to one extra day" rule. The only difference I can think of is that the problem child got a CPU upgrade from an i3-6100 (2c/4t) to an i5-6600 (4c/4t), but I don't see why that would cause it to queue up 29 extra work units while the others maintain single digits. I ran the BOINC CPU benchmark again just in case. Made things worse, if anything. The system that has enough RAM to run two tasks only has two more waiting, and when it contacts the server, it returns with "Not requesting tasks: don't need (job cache full)", as it should. Meanwhile, when the weird one updates, it often gets two more. What could be the issue? https://www.cpdn.org/show_host_detail.php?hostid=1525938 ID: 67726 · Reply Quote
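
For readers following along, here is a minimal sketch of an app_config.xml that limits a project to one running task at a time. The post does not show the actual file, so the use of project_max_concurrent here is an assumption (a later post in the thread confirms that this was the limiter). The file goes in the project's folder under the BOINC data directory.

```xml
<!-- app_config.xml: illustrative sketch, not the poster's actual file -->
<app_config>
    <!-- Run at most one task from this project at any one time -->
    <project_max_concurrent>1</project_max_concurrent>
</app_config>
```

The client picks the file up after a restart or after choosing "Read config files" in BOINC Manager. Note that this setting limits how many tasks run, not how many are downloaded.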

gemini8 Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,562,405 RAC: 1,841	Message 67733 - Posted: 15 Jan 2023, 9:52:41 UTC - in response to Message 67726. Hello. One of my computers keeps getting more OIFS tasks and I really, truly don't understand why. It runs one CPDN task at a time via an app_config.xml file. It has the same work buffer settings as my other computers, which is to get 0.1 days of work plus up to an additional 1 day. One of my machines has a similar behaviour on another project and downloads the full 1000 tasks per target it can get without being able to do them before the deadline. I think the BOINC client has a hiccup. - - - - - - - - - - Greetings, Jens ID: 67733 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533	Message 67734 - Posted: 15 Jan 2023, 10:20:39 UTC I've seen this sometimes too, but very rarely. When I see it, I sometimes try to set an extra Event Log option to try and see what's going on - and that immediately stops it happening, which rather defeats the object of the exercise! But it's a useful side-effect. ID: 67734 · Reply Quote

[SG]Felix Send message Joined: 4 Oct 15 Posts: 34 Credit: 9,075,151 RAC: 374	Message 67735 - Posted: 15 Jan 2023, 10:48:07 UTC That's exactly why one of my VMs pulled in such a large number of Had WUs; same problem. I also ran the benchmark and got even more work. ID: 67735 · Reply Quote

alanb1951 Send message Joined: 31 Aug 04 Posts: 38 Credit: 9,581,380 RAC: 3,853	Message 67737 - Posted: 15 Jan 2023, 10:58:12 UTC Steven, I notice that your system with more memory is running a fairly recent BOINC client (7.20.2) whereas the others that I looked at seem to be running 7.18.1. If I recall correctly, the fix for the "use of max_concurrent may lead to excess work being fetched" problem didn't make it into the Linux client until the 7.20 versions; it may well be that if you get hold of a 7.20 client the problem will go away! With or without use of [project_]max_concurrent, I used to have to moderate CPDN downloads by using No New Tasks as the default for CPDN, cutting the number of available "CPUs" before allowing new tasks, updating, then setting No New Tasks again and resetting the CPU count (in that order!)[1] -- it would always send at least enough work to occupy every visible "CPU"... As CPDN and WCG were my only projects doing CPU work, and I could limit work downloads for WCG at the server end, I wasn't seeing the overload issue until WCG went on hiatus, at which point the alternative projects I took CPU work from started to over-load if I used [project_]max_concurrent. However, once I found a 7.20 client that stopped. Cheers - Al. [1] Not very convenient -- much care needed to make sure no GPU tasks running at the time... ID: 67737 · Reply Quote

Steven Send message Joined: 28 Jun 14 Posts: 4 Credit: 8,570,955 RAC: 6	Message 67760 - Posted: 15 Jan 2023, 19:34:22 UTC - in response to Message 67737. Last modified: 15 Jan 2023, 19:38:05 UTC The system on 7.20.2 is my personal Windows 10 computer, which doesn't run CPDN right now. The one I was talking about was this other Ubuntu machine. I see now that both the first and second host are getting too many tasks. The one that's running two tasks and behaving properly, I now realize, has been set to use 50% of the CPU (or 2 threads), so BOINC is not seeing any unused threads. The two that are getting more work are set to 100% with the limiter being project_max_concurrent in app_config. After setting the work_fetch_debug flag in the event log, I see that the problematic systems are reporting three idle CPU threads and ignoring the effect of project_max_concurrent on fetch requests. I see in the release notes that this was fixed in 7.20.0. However, the most recent version I can get from the repository on the Ubuntu computers is 7.18.1. I see there are 7.20.X source code releases on GitHub. How would I go about updating my client software from the zip or tar file without breaking anything? ID: 67760 · Reply Quote
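
For anyone wanting to repeat this diagnosis, the work_fetch_debug flag can also be set in cc_config.xml in the BOINC data directory instead of through the Event Log options dialog. This is a minimal sketch that assumes no other options are already present in that file; if the file exists, merge the flag into the existing log_flags section rather than overwriting it.

```xml
<!-- cc_config.xml: minimal sketch that enables work-fetch logging only -->
<cc_config>
    <log_flags>
        <!-- Log the client's work-fetch decisions for each project and resource -->
        <work_fetch_debug>1</work_fetch_debug>
    </log_flags>
</cc_config>
```

After "Read config files" or a client restart, the event log should show [work_fetch] lines for each fetch cycle. The flag is verbose, so setting it back to 0 once the problem is understood keeps the log readable.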

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533	Message 67761 - Posted: 15 Jan 2023, 21:04:28 UTC - in response to Message 67760. Personally, I would go to https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc and install from Gianfranco's PPA. That would currently give you v7.20.5 (and future updates) with the minimum of fuss. It's kept up to date with new releases - it's a test version, so it can go wrong, but is generally reliable. ID: 67761 · Reply Quote

Steven Send message Joined: 28 Jun 14 Posts: 4 Credit: 8,570,955 RAC: 6	Message 67763 - Posted: 16 Jan 2023, 0:50:58 UTC - in response to Message 67761. Personally, I would go to https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc and install from Gianfranco's PPA. That would currently give you v7.20.5 (and future updates) with the minimum of fuss. It's kept up to date with new releases - it's a test version, so it can go wrong, but is generally reliable. This worked perfectly, thank you. Added the repository, updated and upgraded, added the client to startup applications, and now instead of getting another task, the event log says "can't fetch CPU: max concurrent job limit". Hopefully these little computers can work through their backlog without too much trouble. ID: 67763 · Reply Quote

AndreyOR Send message Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207	Message 67765 - Posted: 16 Jan 2023, 2:56:53 UTC I've experienced it before too on occasion. Actually, it seems to be happening now with LHC@home. I keep getting tasks for no reason; BOINC wants to keep me at 40 ATLAS & 460 Theory tasks for some reason. Luckily the deadlines won't be an issue, but still. I'll have to try Richard's Event Log trick. ID: 67765 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 67771 - Posted: 16 Jan 2023, 15:28:39 UTC - in response to Message 67765. Last modified: 16 Jan 2023, 15:30:29 UTC Interesting. I have never had this issue even with 7.18.1 on my Linux Mint machines, using boinc direct from Mint's repos, and I use an app_config.xml too. However, I do control my job cache by setting: Store at least 2 days of work with a 0.01 day's work, to keep a small, oft-reporting cache. Maybe that makes a difference. I vaguely remember this issue seemed to affect VMs running boinc. It has been noted by CPDN that some machines download huge numbers of tasks (deliberately or otherwise) and they are looking at imposing limits server-side. ID: 67771 · Reply Quote

wujj123456 Send message Joined: 14 Sep 08 Posts: 130 Credit: 44,254,664 RAC: 9,487	Message 67780 - Posted: 16 Jan 2023, 21:14:07 UTC - in response to Message 67771. Last modified: 16 Jan 2023, 21:20:12 UTC However, I do control my job cache by setting: Store at least 2 days of work with a 0.01 day's work, to keep a small, oft-reporting cache. Maybe that makes a difference. I think you are spot on with the buffer settings. From the description of the fix, I suspect one has to have a decently large work_buf_additional_days set. Otherwise, boinc will aggressively fetch each time buffer drops below work_buf_min_days, and any newly fetched work will likely bring the buffer above work_buf_min_days + work_buf_additional_days which immediately stops all fetches. If there is a big enough work_buf_additional_days, now it opens up the opportunity for boinc to repeatedly fetch from the top priority but concurrent limited project while never running the simulation to realize the work buffer is full. This could also mean setting a very small work_buf_additional_days can be a workaround for clients using web configs. I have hit this twice with WCG when I set concurrent limit for one of the apps, after running the same app_config for months. They have server side limiting which was how I solved the problem. I have a 0.2 work_buf_min_days and 0.3 work_buf_additional_days. Guess need to upgrade my client too now that a fix is available. ID: 67780 · Reply Quote
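
As a sketch of the workaround suggested above, the two buffer settings can be overridden locally in global_prefs_override.xml in the BOINC data directory. The values below are illustrative assumptions, not tested recommendations, and whether a small additional buffer really avoids the over-fetch on pre-7.20 clients is the poster's hypothesis rather than a confirmed fix.

```xml
<!-- global_prefs_override.xml: illustrative values only -->
<global_preferences>
    <!-- "Store at least N days of work" -->
    <work_buf_min_days>0.1</work_buf_min_days>
    <!-- "Store up to an additional N days of work"; kept very small here -->
    <work_buf_additional_days>0.01</work_buf_additional_days>
</global_preferences>
```

The client rereads this file on restart, or when told to with boinccmd --read_global_prefs_override. Settings in this file take precedence over the web preferences on that host only.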

AndreyOR Send message Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207	Message 67833 - Posted: 18 Jan 2023, 9:28:09 UTC It has been noted by CPDN that some machines download huge numbers of tasks (deliberately or otherwise) and they are looking at imposing limits server-side. This should definitely be done. There're some machines that got so much work that they'll miss the year long deadline for Hadley models! ID: 67833 · Reply Quote

wujj123456 Send message Joined: 14 Sep 08 Posts: 130 Credit: 44,254,664 RAC: 9,487	Message 68070 - Posted: 27 Jan 2023, 2:36:52 UTC - in response to Message 67761. Last modified: 27 Jan 2023, 3:30:29 UTC For people that downloaded the version in PPA (7.20.5), have you run into the issue of boinc not fetching tasks even when the host is idle? My general setup is: CPDN, share 1000, max concurrent < total CPU. Universe, WCG, Asteroids, LHC, share 100 or so, max concurrent for LHC on some hosts. When the issue happens, if I manually trigger a project update for non-CPDN projects, they would all refuse to fetch work due to not being top priority for CPU project, even when there was no WU running. I have to set no new work for CPDN before other projects would fetch during updates. For the two hosts that hit the problem, I set min buffer days to 0.3 and additional buffer days to 0.2 for one but only a min buffer day of 0.01 with no additional buffer for the other. My shares for projects haven't changed for a few weeks and this only happened on 7.20.5 twice so far. However, I am only running 7.20.5 at this point, so I am not sure it's really the newer version to blame. ~~Gonna downgrade some host to see if I catch the same problem again.~~ Curious if anyone else has seen similar problems. Edit: Changed my mind since I can't stay on old version forever. Will just enable the debug flags instead. Is work_fetch_debug the right flag, or do I also need priority_debug and cpu_sched_debug? ID: 68070 · Reply Quote

wujj123456 Send message Joined: 14 Sep 08 Posts: 130 Credit: 44,254,664 RAC: 9,487	Message 68072 - Posted: 27 Jan 2023, 3:46:31 UTC - in response to Message 68070. Last modified: 27 Jan 2023, 4:13:11 UTC Well, I probably should have debugged this first. <work_fetch_debug> did the trick and this likely has nothing to do with version. For the top priority project that was picked, I saw this repeatedly showing up in every fetch cycle. [work_fetch] deferring work fetch; upload active I was able to observe the same issue happening with LHC when it's slowly uploading after a batch of tasks finish. Turns out if the current project picked for fetching is constantly uploading, the fetch will just be deferred and boinc won't try the next project. It's the right behavior if we assume upload should be quick, but sometimes uploading is going to take forever... Guess I need to build up a job cache big enough for the entire period of uploads just in case the project uploading ends up getting picked for fetching. ID: 68072 · Reply Quote

Bryn Mawr Send message Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228	Message 68077 - Posted: 27 Jan 2023, 10:49:05 UTC - in response to Message 68072. Well, I probably should have debugged this first. <work_fetch_debug> did the trick and this likely has nothing to do with version. For the top priority project that was picked, I saw this repeatedly showing up in every fetch cycle. [work_fetch] deferring work fetch; upload active I was able to observe the same issue happening with LHC when it's slowly uploading after a batch of tasks finish. Turns out if the current project picked for fetching is constantly uploading, the fetch will just be deferred and boinc won't try the next project. It's the right behavior if we assume upload should be quick, but sometimes uploading is going to take forever... Guess I need to build up a job cache big enough for the entire period of uploads just in case the project uploading ends up getting picked for fetching. I must have been lucky and done a work fetch during a period of project backoff. During the several periods of upload problems I've been receiving tasks as normal. ID: 68077 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 68078 - Posted: 27 Jan 2023, 12:04:10 UTC - in response to Message 68077. I must have been lucky and done a work fetch during a period of project backoff. During the several periods of upload problems I've been receiving tasks as normal. I have had no problems downloading testing branch tasks while my bored band is smoking with 9 tasks worth of zips to slowly clear. ID: 68078 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 68079 - Posted: 27 Jan 2023, 13:02:34 UTC - in response to Message 67833. It has been noted by CPDN that some machines download huge numbers of tasks (deliberately or otherwise) and they are looking at imposing limits server-side. This should definitely be done. There're some machines that got so much work that they'll miss the year long deadline for Hadley models! Yes, it's actively being tested on the CPDN dev site. Whether from bugs or deliberate action, this caused a lot of lost tasks from the latest big batches. Plus we have the extra headache that the client doesn't respect the memory_bound set for the job and will start up as many tasks as there are free CPUs (without any app_config). That causes a lot of problems on low-memory machines. As we go to higher-memory tasks, server controls are the only viable option, along with project preference controls. ID: 68079 · Reply Quote