Message boards : Number crunching : The uploads are stuck
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"We are just running normal BOINC. By server I mean the BOINC Server that the Projects run." I still do not know what you are talking about. 1) What is "normal BOINC"? 2) What do you mean by "the BOINC Server that the Projects run"? Very few BOINC users would be running a BOINC server, and those that do would presumably know all about it. Why would a normal user run a BOINC server? |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,403,463 RAC: 2,263 |
You don't need to know what I am on about, as the question is not for you. If I had the exact reply from one of the core BOINC developers, I would obviously post it. Like I said, he confirmed there was a 100 GB limit, and that it should be resolved in the next "code release". |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,717,389 RAC: 8,111 |
BOINC has one limit you can't easily get round. Once you have "too many uploads pending", it won't ask for any more. That limit is a count, nothing to do with size, and it actually counts completed tasks only - so the massive backlog of individual files won't get in the way for a longish time. The disk usage limits are configurable, but I think one of them has an outdated hard cap of 100 GB if you say "no limit" - that's the one they're going to change, so 'unlimited' really means what it says. CPDN servers tend to fill up more quickly than other projects' simply because weather data files are big, and as you increase the resolution, they get even bigger. And these new IFS tasks are higher resolution and faster ... But just at the moment, that's the least of our problems, because they've got a big empty server, and nothing is getting through to it. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,717,389 RAC: 8,111 |
"If I had the exact reply from one of the core BOINC developers, I would obviously post it." https://github.com/BOINC/boinc/issues/4643#issuecomment-1049738451 https://github.com/BOINC/boinc/pull/4923 |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,885,708 RAC: 18,983 |
I may have found something. It seems that the total disk space limit set in BOINC is further subdivided, and per-project limits are set by some algorithm that uses resource share as a variable. ncoded.com, if you're willing to try this, log into your CPDN account and increase the resource share for CPDN, perhaps significantly (for the correct location), update CPDN in BOINC Manager, and see if your issue goes away. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,403,463 RAC: 2,263 |
AndreyOR, The resource share was already 899 but I increased it to 955 and did an update in BOINC. I just need to wait now until I can request new work, and then I'll try and go over the 100GB. Thank you Richard at least that shows that it is a known issue. Also thank you to everyone that has tried to help me get this resolved. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,885,708 RAC: 18,983 |
The resource share may not depend on the value itself (955) but on CPDN's percentage share compared to your other projects. That's why I mentioned that the change may need to be significant. If your initial change doesn't work, try something drastic, like giving CPDN a 90+% share, and see if that makes a difference. This is just a theory and I haven't tested it myself, but I think I did see evidence that disk space is subdivided among projects based on resource share. It's generally assumed that resource share affects only CPU time, but it might affect other things, like disk space, RAM, upload priority, etc. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,403,463 RAC: 2,263 |
CPDN is the only project that I have added to this server, so surely that would negate any priority/resource share change effects? Also, I think the resource share limit is 999, so 955 is a pretty high percentage already (96%). |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,403,463 RAC: 2,263 |
Richard, it says in the ticket (https://github.com/BOINC/boinc/issues/4643#issuecomment-1049738451): "Workaround for older clients: Don't leave disk_max_used_gb and disk_max_used_pct at "0". Instead use higher limits." Do you think that if no value is set then this would default to 0? |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,885,708 RAC: 18,983 |
So CPDN is the only project you have on that computer? If so, the resource share shouldn't make a difference, as it'd be 100% no matter what value is used, 1 or 999. Based on what Richard posted and those links, the only thing I can think of trying, if you haven't yet, is to make sure that each of the 3 settings for disk usage is independently set to give you over 100 GB, as BOINC will use the most restrictive of the 3. |
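[Moderator note] For reference, the three disk settings discussed here can also be set directly in the client's `global_prefs_override.xml` in the BOINC data directory; the values below are only illustrative examples, not recommendations:

```xml
<!-- global_prefs_override.xml: the most restrictive of these three wins.
     Values shown are examples only. -->
<global_preferences>
   <!-- absolute cap on BOINC's disk usage, in GB -->
   <disk_max_used_gb>500</disk_max_used_gb>
   <!-- cap as a percentage of the whole disk -->
   <disk_max_used_pct>90</disk_max_used_pct>
   <!-- always leave at least this much free space, in GB -->
   <disk_min_free_gb>10</disk_min_free_gb>
</global_preferences>
```

After editing the file, `boinccmd --read_global_prefs_override` (or the Manager's read-local-prefs option) applies it without restarting the client.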
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,403,463 RAC: 2,263 |
Yeah, I just changed those settings, just in case the options default to zero (rather than empty). |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,717,389 RAC: 8,111 |
"Richard, it says in the Ticket:" I'm away from my main machines at the moment, so I can't check. But I would think so, yes. On this laptop, I have <disk_max_used_pct> set to 90, but I don't run any big projects on it, and at the moment it's completely idle to save electricity. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,403,463 RAC: 2,263 |
Hopefully setting all 3 storage options to a value greater than 0 (and not an empty value) will provide a temporary fix, although the ticket in question is a few months old. I'll update later tonight on any changes. Thanks again. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,717,389 RAC: 8,111 |
I'm heading home this afternoon, so I may be able to try out the effect of various changes this evening, on machines where it'll make a difference. Code changes are cumulative, so the age of any particular ticket doesn't matter: the question is when, and how best, to get hold of a working copy that includes the patch. We're expecting v7.22.x 'Real Soon Now', and have been for a couple of months; alternatively, I can guide you through downloading one of the automated test builds that are used for checking changes as we go along, or you can compile your own from the master source code, as Dave does. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"or you can compile your own from the master source code, as Dave does." I must admit to not having checked to see whether that particular issue is fixed in the latest master. A couple of tasks have crashed, so I have reduced the number I am crunching. At the current rate of progress it will take another three days before I can see whether I hit that limit or not, but given how long it will take to clear the backlog, I don't intend to suspend network activity long enough to let things build up! |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,717,389 RAC: 8,111 |
"I must admit to not having checked to see whether that particular issue is fixed in the latest Master." Look back at the second link I posted - the actual fix, rather than the exploration of the problem. It says 'merged' and 'closed', so that's when it reached master. But it won't have reached any of the release branches yet. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
"I know and I feel you. The non-math projects have been dwindling over the years. WCG used to cover a whole lot more, but these days it's just two medical projects, with ARP occasionally trickling in. The migration off IBM certainly didn't go well. The projects I added in recent years (asteroid, universe, LHC) are all because at some point all the projects I contributed to ran out of work. Among the long list of math projects, I have yet to find anything I can remotely relate to. In addition, for winter, I'd rather run my computers than turn on the heater." Totally agree, especially about the heating value of desktop and workstation computers, with the recent cold snap here in North America (47 N latitude here). The less gas I need to burn to keep my home above 18 C, the better. My local electric supply is mostly long-established, safe nuclear fission, and reasonably cheap. Keep on crunching. E |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Look back at the second link I posted - the actual fix, rather than the exploration of the problem. It says 'merged' and 'closed', so that's when it reached master. But it won't have reached any of the release branches yet." Thanks Richard. I have downloaded new masters at least twice since then, so unless the patch doesn't work, which is probably unlikely, I am not going to get the problem. Edit: Actually, it will be quite a bit longer till I reach 100 GB of CPDN data, because of the files deleted once a task finishes and switches to <uploading>. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,403,463 RAC: 2,263 |
After adding non-zero/non-empty values to each of the 3 disk usage options in Preferences, the server then (shortly afterwards) downloaded and started running 22 new tasks. Current disk usage by CPDN is now 110.55 GB, so that does seem to have resolved the issue. Apologies to Paolo for hijacking their posting. |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 42,000,348 RAC: 68,683 |
"And we can't shut down the server in case tasks fail when restarted after the reboot." I've run into similar issues before, and the culprit is disk IO being too slow when all tasks start at the same time after a reboot. If IO times out, the tasks error out. This is especially painful for projects like CPDN that have a lot of data to load from disk at startup. The more tasks you run relative to the speed of the disk, the more likely it is to happen. It shouldn't be a problem for finished tasks, though. If you've depleted the work already anyway, a restart shouldn't cause any completed task to fail, in my experience. I ended up changing my systemd unit file to add an `ExecStartPre` that sets `max_ncpus_pct` to a low number, and then an `ExecStartPost` script that slowly increases `max_ncpus_pct` over the next minute. This has resolved all my reboot error problems. However, this won't cover suspend/resume, but that never happens for many tasks at once in my setup. The unit file that came with my distro also set `IOSchedulingClass` to `idle`; `IOSchedulingPriority` was not set. If your host is dedicated to BOINC, tuning that might help get more bandwidth from the disk, at the expense of other processes on the host. https://www.freedesktop.org/software/systemd/man/systemd.exec.html#IOSchedulingClass= Ideally boinc-client itself should understand how fast disk reads are and stagger the start of tasks, to cover any scenario where it needs to read from or flush to disk in large volume. |
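[Moderator note] A rough sketch of that ramp-up approach as a systemd drop-in, assuming a Debian-style `boinc-client.service` with its data directory in `/var/lib/boinc-client` and an existing `global_prefs_override.xml` containing a `max_ncpus_pct` element. The paths, percentages, and sleep times are illustrative, not taken from the post:

```ini
# /etc/systemd/system/boinc-client.service.d/rampup.conf (illustrative sketch)
[Service]
# Before the client starts, clamp the CPU limit so only a few tasks
# hit the disk at once after a reboot.
ExecStartPre=/bin/sh -c 'sed -i "s|<max_ncpus_pct>[0-9.]*<|<max_ncpus_pct>25<|" /var/lib/boinc-client/global_prefs_override.xml'
# After startup, raise the limit in steps over roughly a minute.
# Note: "$$" passes a literal "$" through systemd to the shell.
ExecStartPost=/bin/sh -c 'for p in 50 75 100; do sleep 20; sed -i "s|<max_ncpus_pct>[0-9.]*<|<max_ncpus_pct>$$p<|" /var/lib/boinc-client/global_prefs_override.xml; boinccmd --read_global_prefs_override; done'
```

One caveat: systemd waits for `ExecStartPost` to finish before considering the start job complete, so a long-running loop delays that; moving the loop into a separate oneshot unit or timer avoids it.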
©2024 cpdn.org