One of my computers is hoarding tasks and I don't know why
Author | Message |
---|---|
Joined: 28 Jun 14 Posts: 4 Credit: 8,570,955 RAC: 6 |
Hey, folks. One of my computers keeps getting more OIFS tasks and I really, truly don't understand why. It runs one CPDN task at a time via an app_config.xml file. It has the same work buffer settings as my other computers, which is to get 0.1 days of work plus up to an additional 1 day. The tasks take between 13 and 14 hours each, so we're definitely overshooting the "up to one extra day" rule. The only difference I can think of is that the problem child got a CPU upgrade from an i3-6100 (2c/4t) to an i5-6600 (4c/4t), but I don't see why that would cause it to queue up *29* extra work units while the others maintain single digits. I ran the BOINC CPU benchmark again just in case. That made things worse, if anything. The system that has enough RAM to run two tasks only has two more waiting, and when it contacts the server, it returns with "Not requesting tasks: don't need (job cache full)", as it should. Meanwhile, when the weird one updates, it often gets two more. What could be the issue? https://www.cpdn.org/show_host_detail.php?hostid=1525938 |
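As a rough check on the numbers in the post above, the following sketch compares the configured buffer with the reported backlog. It assumes 13.5 hours per task (the midpoint of the reported 13 to 14 hours) and one task running at a time; the variable names are illustrative only.

```python
# Rough arithmetic on the work buffer described in the post above.
# Assumption: 13.5 h per task (midpoint of the reported 13-14 h), one task at a time.
HOURS_PER_DAY = 24

min_buffer_days = 0.1      # "get 0.1 days of work"
extra_buffer_days = 1.0    # "plus up to an additional 1 day"
task_hours = 13.5          # assumed midpoint, not a measured value
queued_tasks = 29          # extra work units reported on the affected host

buffer_hours = (min_buffer_days + extra_buffer_days) * HOURS_PER_DAY
backlog_hours = queued_tasks * task_hours
backlog_days = backlog_hours / HOURS_PER_DAY

print(f"Configured buffer: {buffer_hours:.1f} h")
print(f"Queued work: {backlog_hours:.1f} h ({backlog_days:.1f} days)")
print(f"Backlog is about {backlog_hours / buffer_hours:.0f}x the buffer")
```

Under these assumptions the queue holds a little over 16 days of work against a buffer of roughly 26 hours.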
Joined: 4 Dec 15 Posts: 52 Credit: 2,488,619 RAC: 2,087 |
Hello. Quote: "One of my computers keeps getting more OIFS tasks and I really, truly don't understand why. It runs one CPDN task at a time via an app_config.xml file. It has the same work buffer settings as my other computers, which is to get 0.1 days of work plus up to an additional 1 day." One of my machines shows similar behaviour on another project and downloads the full 1000 tasks per target it can get, without being able to finish them before the deadline. I think the BOINC client has a hiccup. Greetings, Jens |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,716,561 RAC: 8,355 |
I've seen this sometimes too, but very rarely. When I see it, I sometimes try to set an extra Event Log option to try and see what's going on - and that immediately stops it happening, which rather defeats the object of the exercise! But it's a useful side-effect. |
Joined: 4 Oct 15 Posts: 34 Credit: 9,075,151 RAC: 374 |
That's exactly why one of my VMs pulled in such a large number of Had WUs; same problem. I also ran the benchmark, and got even more work. |
Joined: 31 Aug 04 Posts: 37 Credit: 9,581,380 RAC: 3,853 |
Steven, I notice that your system with more memory is running a fairly recent BOINC client (7.20.2) whereas the others that I looked at seem to be running 7.18.1. If I recall correctly, the fix for the "use of max_concurrent may lead to excess work being fetched" problem didn't make it into the Linux client until the 7.20 versions; it may well be that if you get hold of a 7.20 client the problem will go away! With or without use of [project_]max_concurrent, I used to have to moderate CPDN downloads by using No New Tasks as the default for CPDN, cutting the number of available "CPUs" before allowing new tasks, updating, then setting No New Tasks again and resetting the CPU count (in that order!)[1] -- it would always send at least enough work to occupy every visible "CPU"... As CPDN and WCG were my only projects doing CPU work, and I could limit work downloads for WCG at the server end, I wasn't seeing the overload issue until WCG went on hiatus, at which point the alternative projects I took CPU work from started to overload if I used [project_]max_concurrent. However, once I found a 7.20 client, that stopped. Cheers - Al. [1] Not very convenient -- much care is needed to make sure no GPU tasks are running at the time... |
Joined: 28 Jun 14 Posts: 4 Credit: 8,570,955 RAC: 6 |
The system on 7.20.2 is my personal Windows 10 computer, which doesn't run CPDN right now. The one I was talking about was this other Ubuntu machine. I see now that both the first and second host are getting too many tasks. The one that's running two tasks and behaving properly, I now realize, has been set to use 50% of the CPU (or 2 threads), so BOINC is not seeing any unused threads. The two that are getting more work are set to 100%, with the limiter being project_max_concurrent in app_config. After setting the work_fetch_debug flag in the event log, I see that the problematic systems are reporting three idle CPU threads and ignoring the effect of project_max_concurrent on fetch requests. I see in the release notes that this was fixed in 7.20.0. However, the most recent version I can get from the repository on the Ubuntu computers is 7.18.1. I see there are 7.20.X source code releases on GitHub. How would I go about updating my client software from the zip or tar file without breaking anything? |
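The symptom reported above (three idle threads on a four-thread host, with project_max_concurrent ignored by work fetch) can be illustrated with a toy model. This is only a sketch of the symptom, not BOINC's actual work-fetch code: the function name, the one-task-per-idle-CPU request rule, and the assumption that no task finishes between scheduler contacts are all invented for illustration.

```python
def simulate_fetches(n_cpus, max_concurrent, contacts, respect_limit):
    """Toy model: return the number of tasks queued after some scheduler contacts.

    Hypothetical simplification, not BOINC's real algorithm: each contact
    requests one task per CPU the client believes is idle, and no task
    completes during the simulated period.
    """
    queued = 0
    for _ in range(contacts):
        # The concurrency limit always caps how many tasks actually run.
        running = min(queued, max_concurrent)
        # The difference is whether work fetch also honours the limit.
        usable = max_concurrent if respect_limit else n_cpus
        idle = max(usable - running, 0)
        queued += idle
    return queued


# Four threads, one task allowed at a time, ten scheduler contacts.
print(simulate_fetches(4, 1, 10, respect_limit=False))
print(simulate_fetches(4, 1, 10, respect_limit=True))
```

In this model the host that ignores the limit keeps seeing three idle threads and asks for more work on every contact, while the host that honours the limit stops after its first task.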
Joined: 1 Jan 07 Posts: 1061 Credit: 36,716,561 RAC: 8,355 |
Personally, I would go to https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc and install from Gianfranco's PPA. That would currently give you v7.20.5 (and future updates) with the minimum of fuss. It's kept up to date with new releases - it's a test version, so it can go wrong, but is generally reliable. |
Joined: 28 Jun 14 Posts: 4 Credit: 8,570,955 RAC: 6 |
Quote: "Personally, I would go to https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc and install from Gianfranco's PPA. That would currently give you v7.20.5 (and future updates) with the minimum of fuss. It's kept up to date with new releases - it's a test version, so it can go wrong, but is generally reliable." This worked perfectly, thank you. Added the repository, updated and upgraded, added the client to startup applications, and now instead of getting another task, the event log says "can't fetch CPU: max concurrent job limit". Hopefully these little computers can work through their backlog without too much trouble. |
Joined: 12 Apr 21 Posts: 317 Credit: 14,884,880 RAC: 19,188 |
I've experienced it before too on occasion. Actually, it seems to be happening now with LHC@home. I keep getting tasks for no reason; BOINC wants to keep me at 40 ATLAS & 460 Theory tasks for some reason. Luckily the deadlines won't be an issue, but still. I'll have to try Richard's Event Log trick. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
Interesting. I have never had this issue even with 7.18.1 on my Linux Mint machines, using BOINC directly from Mint's repos, and I use an app_config.xml too. However, I do control my job cache by setting: Store at least 2 days of work with a 0.01 day's work, to keep a small, oft-reporting cache. Maybe that makes a difference. I vaguely remember this issue seemed to affect VMs running BOINC. It has been noted by CPDN that some machines download huge numbers of tasks (deliberately or otherwise) and they are looking at imposing limits server-side. |
Joined: 14 Sep 08 Posts: 127 Credit: 41,979,489 RAC: 68,046 |
Quote: "However, I do control my job cache by setting: Store at least 2 days of work with a 0.01 day's work, to keep a small, oft-reporting cache. Maybe that makes a difference." I think you are spot on with the buffer settings. From the description of the fix, I suspect one has to have a decently large work_buf_additional_days set. Otherwise, BOINC will aggressively fetch each time the buffer drops below work_buf_min_days, and any newly fetched work will likely bring the buffer above work_buf_min_days + work_buf_additional_days, which immediately stops all fetches. If work_buf_additional_days is big enough, it opens up the opportunity for BOINC to repeatedly fetch from the top-priority but concurrency-limited project while never running the simulation that would tell it the work buffer is full. This could also mean that setting a very small work_buf_additional_days can be a workaround for clients using web configs. I have hit this twice with WCG when I set a concurrent limit for one of the apps, after running the same app_config for months. They have server-side limiting, which was how I solved the problem. I have a 0.2 work_buf_min_days and 0.3 work_buf_additional_days. I guess I need to upgrade my client too, now that a fix is available. |
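The two buffer settings discussed in the post above can be sketched as a simple rule: ask for nothing while the buffer is at or above the minimum, otherwise ask for enough to reach the minimum plus the additional amount. This is a simplified reading of the poster's description, not the client's real work-fetch code; the function name and the idea of expressing the request in seconds of work are assumptions made for illustration.

```python
SECONDS_PER_DAY = 86400


def work_request_seconds(buffered_days, work_buf_min_days, work_buf_additional_days):
    """Return how many seconds of work to request under the simplified rule.

    Hypothetical helper: no request while the buffer holds at least
    work_buf_min_days; otherwise request up to min + additional.
    """
    if buffered_days >= work_buf_min_days:
        return 0.0
    target_days = work_buf_min_days + work_buf_additional_days
    return (target_days - buffered_days) * SECONDS_PER_DAY


# Settings quoted in the post: 0.2 days minimum, 0.3 days additional.
print(work_request_seconds(0.1, 0.2, 0.3))   # buffer below the minimum
print(work_request_seconds(0.25, 0.2, 0.3))  # buffer above the minimum
```

With a very small additional value, the target sits barely above the minimum, so each request is small; that is the workaround the post speculates about.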
Joined: 12 Apr 21 Posts: 317 Credit: 14,884,880 RAC: 19,188 |
Quote: "It has been noted by CPDN that some machines download huge numbers of tasks (deliberately or otherwise) and they are looking at imposing limits server-side." This should definitely be done. There are some machines that got so much work that they'll miss the year-long deadline for Hadley models! |
Joined: 14 Sep 08 Posts: 127 Credit: 41,979,489 RAC: 68,046 |
For people who downloaded the version in the PPA (7.20.5), have you run into the issue of BOINC not fetching tasks even when the host is idle? My general setup is: CPDN, share 1000, max concurrent < total CPU. Universe, WCG, Asteroids, LHC, share 100 or so, max concurrent for LHC on some hosts. When the issue happens, if I manually trigger a project update for non-CPDN projects, they all refuse to fetch work because they are not the top-priority project for CPU, even when there is no WU running. I have to set no new work for CPDN before other projects will fetch during updates. For the two hosts that hit the problem, I set min buffer days to 0.3 and additional buffer days to 0.2 for one, but only a min buffer of 0.01 days with no additional buffer for the other. My shares for projects haven't changed for a few weeks and this has only happened on 7.20.5 twice so far. However, I am only running 7.20.5 at this point, so I am not sure it's really the newer version to blame. Edit: Changed my mind since I can't stay on an old version forever. Will just enable the debug flags instead. Is work_fetch_debug the right flag, or do I also need priority_debug and cpu_sched_debug? |
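For reference, log flags such as the ones asked about above are set inside the log_flags section of cc_config.xml. The sketch below only builds that file's text in Python; the helper name is hypothetical, and where the file has to be placed (the BOINC data directory) and how the client is told to re-read it depend on the installation, so the code deliberately does not write anything to disk.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flag_names):
    """Return cc_config.xml text that enables the given log flags.

    Hypothetical helper for illustration; each flag becomes an element
    with the value 1 inside <log_flags>.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flag_names:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


# Start with the single flag named in the question; more can be appended
# to the list if the extra output turns out to be needed.
config_text = build_cc_config(["work_fetch_debug"])
print(config_text)
```

Starting with one flag keeps the Event Log readable; the other two flags mentioned in the post could be added to the list in the same way.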
Joined: 14 Sep 08 Posts: 127 Credit: 41,979,489 RAC: 68,046 |
Well, I probably should have debugged this first. <work_fetch_debug> did the trick and this likely has nothing to do with the version. For the top-priority project that was picked, I saw this repeatedly showing up in every fetch cycle: "[work_fetch] deferring work fetch; upload active". I was able to observe the same issue happening with LHC when it's slowly uploading after a batch of tasks finishes. It turns out that if the project currently picked for fetching is constantly uploading, the fetch will just be deferred and BOINC won't try the next project. It's the right behavior if we assume uploads should be quick, but sometimes uploading is going to take forever... I guess I need to build up a job cache big enough for the entire period of uploads, just in case the uploading project ends up getting picked for fetching. |
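One way to spot the situation described above is to count how often the deferral message appears in saved Event Log output. The sketch below assumes the log has already been read into a list of lines; the function name and the sample lines are made up for illustration, and only the message text itself comes from the post.

```python
# Message text as quoted in the post above.
DEFERRED_MARKER = "[work_fetch] deferring work fetch; upload active"


def count_deferrals(log_lines):
    """Count log lines that contain the upload-related deferral message."""
    return sum(1 for line in log_lines if DEFERRED_MARKER in line)


# Made-up sample input: real Event Log lines carry extra prefixes
# (timestamps, project names) that are not reproduced here.
sample_log = [
    "[work_fetch] deferring work fetch; upload active",
    "(some unrelated line)",
    "[work_fetch] deferring work fetch; upload active",
]

print(count_deferrals(sample_log))
```

A count that keeps growing across fetch cycles while the host sits idle would be consistent with the behaviour the post describes.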
Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228 |
Quote: "Well, I probably should have debugged this first. <work_fetch_debug> did the trick and this likely has nothing to do with version. For the top priority project that was picked, I saw this repeatedly showing up in every fetch cycle." I must have been lucky and done a work fetch during a period of project backoff. During the several periods of upload problems I've been receiving tasks as normal. |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Quote: "I must have been lucky and done a work fetch during a period of project backoff. During the several periods of upload problems I've been receiving tasks as normal." I have had no problems downloading testing branch tasks while my bored band is smoking with 9 tasks' worth of zips to slowly clear. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
Quote: "It has been noted by CPDN that some machines download huge numbers of tasks (deliberately or otherwise) and they are looking at imposing limits server-side." Quote: "This should definitely be done. There're some machines that got so much work that they'll miss the year long deadline for Hadley models!" Yes, it's actively being tested on the CPDN dev site. Whether from bugs or deliberate, this caused a lot of lost tasks from the latest big batches. Plus we have the extra headache that the client doesn't respect the memory_bound set for the job and will start up as many tasks as there are free CPUs (without any app_config). That causes a lot of problems on low-memory machines. As we go to higher-memory tasks, server controls are the only viable option, along with project preference controls. |
©2024 cpdn.org