Message boards : Number crunching : One of my computers is hoarding tasks and I don't know why

Steven

Joined: 28 Jun 14
Posts: 4
Credit: 8,570,955
RAC: 6
Message 67726 - Posted: 15 Jan 2023, 2:02:36 UTC
Last modified: 15 Jan 2023, 2:04:51 UTC

Hey, folks.

One of my computers keeps getting more OIFS tasks and I really, truly don't understand why. It runs one CPDN task at a time via an app_config.xml file. It has the same work buffer settings as my other computers, which is to get 0.1 days of work plus up to an additional 1 day. The tasks take between 13 and 14 hours each, so we're definitely overshooting the "up to one extra day" rule. The only difference I can think of is that the problem child got a CPU upgrade from an i3-6100 (2c/4t) to an i5-6600 (4c/4t), but I don't see why that would cause it to queue up *29* extra work units while the others stay in single digits. I ran the BOINC CPU benchmark again just in case; it made things worse, if anything.
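
For the record, the limiting is done with project_max_concurrent, so the CPDN side of that app_config.xml is essentially just this sketch (any per-app sections omitted):

<app_config>
    <project_max_concurrent>1</project_max_concurrent>
</app_config>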

The system that has enough RAM to run two tasks only has two more waiting, and when it contacts the server, it returns with "Not requesting tasks: don't need (job cache full)", as it should. Meanwhile, when the weird one updates, it often gets two more. What could be the issue?

https://www.cpdn.org/show_host_detail.php?hostid=1525938

gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,478,679
RAC: 1,748
Message 67733 - Posted: 15 Jan 2023, 9:52:41 UTC - in response to Message 67726.  

Hello.
One of my computers keeps getting more OIFS tasks and I really, truly don't understand why. It runs one CPDN task at a time via an app_config.xml file. It has the same work buffer settings as my other computers, which is to get 0.1 days of work plus up to an additional 1 day.

One of my machines shows similar behaviour on another project and downloads the full 1,000 tasks per target it can get, without being able to finish them before the deadline.
I think the BOINC client has a hiccup.
- - - - - - - - - -
Greetings, Jens
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 67734 - Posted: 15 Jan 2023, 10:20:39 UTC

I've seen this sometimes too, but very rarely. When I see it, I sometimes set an extra Event Log option to try to see what's going on - and that immediately stops it happening, which rather defeats the object of the exercise! But it's a useful side-effect.
[SG]Felix

Joined: 4 Oct 15
Posts: 34
Credit: 9,075,151
RAC: 374
Message 67735 - Posted: 15 Jan 2023, 10:48:07 UTC

That's exactly how one of my VMs sucked up such a big pile of Had WUs; same problem. I also ran the benchmark, and got even more work.
alanb1951

Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 67737 - Posted: 15 Jan 2023, 10:58:12 UTC

Steven,

I notice that your system with more memory is running a fairly recent BOINC client (7.20.2) whereas the others that I looked at seem to be running 7.18.1.

If I recall correctly, the fix for the "use of max_concurrent may lead to excess work being fetched" problem didn't make it into the Linux client until the 7.20 versions; it may well be that if you get hold of a 7.20 client the problem will go away!

With or without use of [project_]max_concurrent, I used to have to moderate CPDN downloads by using No New Tasks as the default for CPDN: cutting the number of available "CPUs" before allowing new tasks, updating, then setting No New Tasks again and resetting the CPU count (in that order!)[1] -- it would always send at least enough work to occupy every visible "CPU"... As CPDN and WCG were my only projects doing CPU work, and I could limit work downloads for WCG at the server end, I wasn't seeing the overload issue until WCG went on hiatus, at which point the alternative projects I took CPU work from started to overload the cache if I used [project_]max_concurrent. However, once I found a 7.20 client, that stopped.
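
In boinccmd terms, that dance is roughly the sketch below -- not exactly what I type, and it assumes the client knows CPDN by the master URL http://climateprediction.net/; the "CPU" count lives in <max_ncpus_pct> in global_prefs_override.xml:

boinccmd --project http://climateprediction.net/ nomorework
# edit global_prefs_override.xml to lower <max_ncpus_pct>, then:
boinccmd --read_global_prefs_override
boinccmd --project http://climateprediction.net/ allowmorework
boinccmd --project http://climateprediction.net/ update
boinccmd --project http://climateprediction.net/ nomorework
# restore <max_ncpus_pct> and run --read_global_prefs_override again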

Cheers - Al.

[1] Not very convenient -- much care needed to make sure no GPU tasks are running at the time...
Steven

Joined: 28 Jun 14
Posts: 4
Credit: 8,570,955
RAC: 6
Message 67760 - Posted: 15 Jan 2023, 19:34:22 UTC - in response to Message 67737.  
Last modified: 15 Jan 2023, 19:38:05 UTC

The system on 7.20.2 is my personal Windows 10 computer, which doesn't run CPDN right now. The one I was talking about was this other Ubuntu machine.

I see now that both the first and second hosts are getting too many tasks. The one that's running two tasks and behaving properly, I now realize, has been set to use 50% of the CPUs (or 2 threads), so BOINC is not seeing any unused threads. The two that are getting more work are set to 100%, with the limiter being project_max_concurrent in app_config.xml. After setting the work_fetch_debug flag in the Event Log options, I can see that the problematic systems are reporting three idle CPU threads and ignoring the effect of project_max_concurrent on fetch requests.
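
For anyone wanting to do the same, that flag can also be set in cc_config.xml in the BOINC data directory and picked up without a restart (boinccmd --read_cc_config, or Options -> Read config files in the Manager) -- a minimal sketch:

<cc_config>
  <log_flags>
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
</cc_config>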

I see in the release notes that this was fixed in 7.20.0. However, the most recent version I can get from the repository on the Ubuntu computers is 7.18.1. I see there are 7.20.x source code releases on GitHub. How would I go about updating my client from the zip or tar file without breaking anything?
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 67761 - Posted: 15 Jan 2023, 21:04:28 UTC - in response to Message 67760.  

Personally, I would go to https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc and install from Gianfranco's PPA. That would currently give you v7.20.5 (and future updates) with the minimum of fuss. It's kept up to date with new releases - it's a test version, so it can go wrong, but is generally reliable.
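
Assuming a stock Ubuntu setup and the usual Debian/Ubuntu package names, that amounts to something like:

sudo add-apt-repository ppa:costamagnagianfranco/boinc
sudo apt update
sudo apt install boinc-client boinc-manager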
Steven

Joined: 28 Jun 14
Posts: 4
Credit: 8,570,955
RAC: 6
Message 67763 - Posted: 16 Jan 2023, 0:50:58 UTC - in response to Message 67761.  

Personally, I would go to https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc and install from Gianfranco's PPA. That would currently give you v7.20.5 (and future updates) with the minimum of fuss. It's kept up to date with new releases - it's a test version, so it can go wrong, but is generally reliable.


This worked perfectly, thank you. Added the repository, updated and upgraded, added the client to startup applications, and now instead of getting another task, the event log says "can't fetch CPU: max concurrent job limit". Hopefully these little computers can work through their backlog without too much trouble.
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,819,420
RAC: 19,777
Message 67765 - Posted: 16 Jan 2023, 2:56:53 UTC

I've experienced it before too, on occasion. Actually, it seems to be happening now with LHC@home: I keep getting tasks I didn't ask for, and BOINC wants to keep me at 40 ATLAS & 460 Theory tasks for some reason. Luckily the deadlines won't be an issue, but still. I'll have to try Richard's Event Log trick.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67771 - Posted: 16 Jan 2023, 15:28:39 UTC - in response to Message 67765.  
Last modified: 16 Jan 2023, 15:30:29 UTC

Interesting. I have never had this issue, even with 7.18.1 on my Linux Mint machines using BOINC straight from Mint's repos, and I use an app_config.xml too. However, I do control my job cache by setting "Store at least 2 days of work" with only 0.01 days of additional work, to keep a small, oft-reporting cache. Maybe that makes a difference. I vaguely remember this issue seemed to affect VMs running BOINC.

It has been noted by CPDN that some machines download huge numbers of tasks (deliberately or otherwise) and they are looking at imposing limits server-side.
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,676,976
RAC: 61,408
Message 67780 - Posted: 16 Jan 2023, 21:14:07 UTC - in response to Message 67771.  
Last modified: 16 Jan 2023, 21:20:12 UTC

However, I do control my job cache by setting "Store at least 2 days of work" with only 0.01 days of additional work, to keep a small, oft-reporting cache. Maybe that makes a difference.

I think you are spot on with the buffer settings. From the description of the fix, I suspect one has to have a fairly large work_buf_additional_days set. Otherwise, BOINC will aggressively fetch each time the buffer drops below work_buf_min_days, and any newly fetched work will likely bring the buffer above work_buf_min_days + work_buf_additional_days, which immediately stops all fetches. With a big enough work_buf_additional_days, BOINC gets the opportunity to repeatedly fetch from the top-priority but concurrency-limited project without ever running the scheduling simulation that would reveal the work buffer is already full. This also means setting a very small work_buf_additional_days can be a workaround for clients using web configs.
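
That small-buffer workaround can be set in the website's computing preferences or locally in global_prefs_override.xml (the local file overrides the web values) -- a sketch with illustrative numbers, re-read afterwards with boinccmd --read_global_prefs_override:

<global_preferences>
   <work_buf_min_days>0.1</work_buf_min_days>
   <work_buf_additional_days>0.01</work_buf_additional_days>
</global_preferences>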

I have hit this twice with WCG when I set a concurrent limit for one of the apps, after running the same app_config for months. They have server-side limiting, which was how I solved the problem. I have a 0.2 work_buf_min_days and 0.3 work_buf_additional_days. Guess I need to upgrade my client too, now that a fix is available.
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,819,420
RAC: 19,777
Message 67833 - Posted: 18 Jan 2023, 9:28:09 UTC

It has been noted by CPDN that some machines download huge numbers of tasks (deliberately or otherwise) and they are looking at imposing limits server-side.

This should definitely be done. There are some machines that have queued up so much work that they'll miss the year-long deadline for the Hadley models!
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,676,976
RAC: 61,408
Message 68070 - Posted: 27 Jan 2023, 2:36:52 UTC - in response to Message 67761.  
Last modified: 27 Jan 2023, 3:30:29 UTC

For people who installed the PPA version (7.20.5), have you run into the issue of BOINC not fetching tasks even when the host is idle?

My general setup is:
CPDN: share 1000, max concurrent < total CPUs.
Universe, WCG, Asteroids, LHC: share 100 or so, with a max concurrent for LHC on some hosts.

When the issue happens, if I manually trigger a project update for the non-CPDN projects, they all refuse to fetch work because they are not the top-priority CPU project, even when no WU is running. I have to set No New Tasks for CPDN before the other projects will fetch during updates.

For the two hosts that hit the problem, I set the min buffer to 0.3 days and the additional buffer to 0.2 days on one, but only a min buffer of 0.01 days with no additional buffer on the other. My project shares haven't changed for a few weeks, and this has only happened on 7.20.5, twice so far. However, I am only running 7.20.5 at this point, so I am not sure the newer version is really to blame. Going to downgrade some hosts to see if I catch the same problem again. Curious whether anyone else has seen similar problems.

Edit: Changed my mind, since I can't stay on an old version forever. I'll just enable the debug flags instead. Is work_fetch_debug the right flag, or do I also need priority_debug and cpu_sched_debug?
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,676,976
RAC: 61,408
Message 68072 - Posted: 27 Jan 2023, 3:46:31 UTC - in response to Message 68070.  
Last modified: 27 Jan 2023, 4:13:11 UTC

Well, I probably should have debugged this first. <work_fetch_debug> did the trick, and this likely has nothing to do with the version. For the top-priority project that was picked, I saw this repeatedly in every fetch cycle:

[work_fetch] deferring work fetch; upload active

I was able to observe the same issue with LHC when it was slowly uploading after a batch of tasks finished. It turns out that if the project currently picked for fetching has an upload in progress, the fetch is simply deferred and BOINC won't try the next project. That's the right behavior if we assume uploads are quick, but sometimes uploading takes forever... I guess I need to build up a job cache big enough to cover an entire period of uploads, in case the uploading project ends up being the one picked for fetching.
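
To confirm that is what's going on, a couple of standard boinccmd calls help (run on the affected host):

boinccmd --get_file_transfers    # list pending uploads/downloads
boinccmd --network_available     # tell the client to retry network activity now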
Bryn Mawr

Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 68077 - Posted: 27 Jan 2023, 10:49:05 UTC - in response to Message 68072.  

Well, I probably should have debugged this first. <work_fetch_debug> did the trick, and this likely has nothing to do with the version. For the top-priority project that was picked, I saw this repeatedly in every fetch cycle:

[work_fetch] deferring work fetch; upload active

I was able to observe the same issue with LHC when it was slowly uploading after a batch of tasks finished. It turns out that if the project currently picked for fetching has an upload in progress, the fetch is simply deferred and BOINC won't try the next project. That's the right behavior if we assume uploads are quick, but sometimes uploading takes forever... I guess I need to build up a job cache big enough to cover an entire period of uploads, in case the uploading project ends up being the one picked for fetching.


I must have been lucky and done a work fetch during a period of project backoff. During the several periods of upload problems I’ve been receiving tasks as normal.
ProfileDave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,004,017
RAC: 21,574
Message 68078 - Posted: 27 Jan 2023, 12:04:10 UTC - in response to Message 68077.  

I must have been lucky and done a work fetch during a period of project backoff. During the several periods of upload problems I’ve been receiving tasks as normal.
I have had no problems downloading testing-branch tasks while my broadband is smoking with nine tasks' worth of zips to slowly clear.
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68079 - Posted: 27 Jan 2023, 13:02:34 UTC - in response to Message 67833.  

It has been noted by CPDN that some machines download huge numbers of tasks (deliberately or otherwise) and they are looking at imposing limits server-side.
This should definitely be done. There are some machines that have queued up so much work that they'll miss the year-long deadline for the Hadley models!
Yes, it's actively being tested on the CPDN dev site. Whether from bugs or deliberate hoarding, this caused a lot of lost tasks in the latest big batches. Plus we have the extra headache that the client doesn't respect the memory_bound set for the job and will start up as many tasks as there are free CPUs (without an app_config). That causes a lot of problems on low-memory machines. As we move to higher-memory tasks, server-side controls, together with project preference controls, are the only viable option.
