Message boards : Number crunching : The uploads are stuck
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"We are just running normal BOINC. By server I mean the BOINC Server that the Projects run." I still do not know what you are talking about. 1.) What is "normal BOINC"? 2.) What do you mean by "the BOINC Server that the Projects run"? Very few BOINC users would be running a BOINC server, and those that do would presumably know all about it. Why would a normal user run a BOINC server? |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038 |
You don't need to know what I am on about, as the question is not for you. If I had the exact reply from one of the core BOINC developers, I would obviously post it. Like I said, he confirmed there was a 100GB limit, and that it should be resolved in the next "code release". |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,729,836 RAC: 7,099 |
BOINC has one limit you can't easily get round. Once you have "too many uploads pending", it won't ask for any more. That limit is a count, nothing to do with size, and it actually counts completed tasks only - so the massive backlog of individual files won't get in the way for a longish time. The disk usage limits are configurable, but I think one of them has an outdated hard cap of 100 GB if you say "no limit" - that's the one they're going to change, so 'unlimited' really means what it says. CPDN servers tend to fill up more quickly than other projects simply because weather data files are big, and as you increase the resolution, they get even bigger. And these new IFS tasks are higher resolution and faster ... But just at the moment, that's the least of our problems, because they've got a big empty server, and nothing is getting through to it. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,729,836 RAC: 7,099 |
"If I had the exact reply from one of the core BOINC developers, I would obviously post it." https://github.com/BOINC/boinc/issues/4643#issuecomment-1049738451 https://github.com/BOINC/boinc/pull/4923 |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,913,871 RAC: 16,233 |
I may have found something. It seems that the total disk space limit set in BOINC is further subdivided and limits are set per project based on some algorithm that uses resource share as a variable. ncoded.com, if you're willing to try this, log into your CPDN account and increase the resource share for CPDN, perhaps significantly (for the correct location), update CPDN in BOINC Manager and see if your issue goes away. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038 |
AndreyOR, The resource share was already 899 but I increased it to 955 and did an update in BOINC. I just need to wait now until I can request new work, and then I'll try and go over the 100GB. Thank you Richard at least that shows that it is a known issue. Also thank you to everyone that has tried to help me get this resolved. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,913,871 RAC: 16,233 |
The resource share may not depend on the value itself (955) but on the percentage share of CPDN compared to other projects. That's why I mentioned that the change may need to be significant. If your initial change doesn't work try something drastic, like make CPDN something like 90+% share and see if that makes a difference. This is just a theory and I haven't tested it myself but I think I did see evidence that disk space is subdivided among projects based on resource share. It's generally assumed that resource share affects only CPU time but it might affect other things, like disk space, RAM, upload priority, etc. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038 |
CPDN is the only project that I have added to this server so surely that would negate any priority/resource share change effects? Also I think the resource limit is 999, so 955 is a pretty high % already (96%) |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038 |
Richard, it says in the Ticket: https://github.com/BOINC/boinc/issues/4643#issuecomment-1049738451 Workaround for older clients: Don't leave disk_max_used_gb and disk_max_used_pct at "0". Instead use higher limits. Do you think that if no value is set then this would default to: 0? |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,913,871 RAC: 16,233 |
So CPDN is the only project you have on that computer? If so, the resource share shouldn't make a difference, as it'd be 100% no matter what value is used, 1 or 999. Based on what Richard posted and those links, the only thing I can think of trying, if you haven't yet, is to make sure that each of the 3 settings for disk usage is independently set to give you over 100GB, as BOINC will use the most restrictive one of the 3. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038 |
Yeah, I just changed those settings, just in case the 'options' default to zero (rather than empty). |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,729,836 RAC: 7,099 |
"Richard, it says in the Ticket:" I'm away from my main machines at the moment, so I can't check. But I would think so, yes. On this laptop, I have <disk_max_used_pct> 90, but I don't run any big projects on it, and at the moment it's completely idle to save electricity. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038 |
Hopefully setting all 3 storage options with a value greater than 0 (and not an empty value) will provide a temp fix; although the ticket in question is a few months old. I'll update later tonight on any changes. Thanks again. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,729,836 RAC: 7,099 |
I'm heading home this afternoon, so I may be able to try out the effect of various changes on machines where it'll make a difference this evening. Code changes are cumulative, so the age of any particular ticket doesn't matter: the question is, when and how best to get hold of a working copy that includes the patch. We're expecting v7.22.x 'Real Soon Now', and have been for a couple of months: alternatively, I can guide you how to download one of the automated test programs that are used for checking changes as we go along, or you can compile your own from the master source code, as Dave does. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"or you can compile your own from the master source code, as Dave does." I must admit to not having checked to see whether that particular issue is fixed in the latest Master. A couple of tasks have crashed, so I have reduced the number I am crunching. At the current rate of progress it will take another three days before I can see whether I have that limit or not, but given how long it will take to clear the backlog, I don't intend to suspend network activity long enough to let things build up! |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,729,836 RAC: 7,099 |
"or you can compile your own from the master source code, as Dave does." "I must admit to not having checked to see whether that particular issue is fixed in the latest Master." Look back at the second link I posted - the actual fix, rather than the exploration of the problem. It says 'merged' and 'closed', so that's when it reached master. But it won't have reached any of the release branches yet. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
"I know and I feel you. The non-math projects have been dwindling over the years. WCG used to cover a whole lot more but these days are just two medical projects with ARP occasionally trickling in. The migration off IBM certainly didn't go well. The projects I added in recent years (asteroid, universe, LHC) are all because at some point, all projects I contributed to run out of work. Among the long list of math projects, I have yet found anything I can remotely relate to. In addition, for winter, I'd rather run my computers than turning on the heater." Totally agree. Especially about the heating value of desktop and workstation computers, with the recent cold snap here in North America (47 N latitude here). The less gas I need to burn to keep my home above 18C, the better. My local electric supply is mostly old-time safe fission nuke and reasonably cheap. Keep on crunching. E |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Look back at the second link I posted - the actual fix, rather than the exploration of the problem. It says 'merged' and 'closed', so that's when it reached master. But it won't have reached any of the release branches yet." Thanks Richard, I have downloaded new masters at least twice since then, so unless the patch doesn't work, which is unlikely, I am not going to get the problem. Edit: Actually, it would be quite a bit longer until I reach 100GB of CPDN data, because of the files deleted once a task finishes and switches to <uploading> |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,408,433 RAC: 2,038 |
After adding non-zero/non-empty values to each of the 3 disk usage options in Preferences, the Server then (shortly afterwards) downloaded and started running 22 new tasks. The current disk usage by CPDN is now at 110.55 GB, so that does seem to have resolved the issue. Apologies to Paolo for hijacking their posting. |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 42,267,135 RAC: 73,190 |
"And we can't shut down the server incase tasks fail when restarted after the reboot." I've run into similar issues before, and the culprit is disk IO being too slow when all tasks start at the same time after a reboot. If IO times out, the tasks error out. This is especially painful for projects like CPDN, whose tasks have a lot of data to load from disk when they start. The more tasks you run relative to the speed of the disk, the more likely it is to happen. It shouldn't be a problem for finished tasks though. If you've depleted the work already anyway, a restart shouldn't cause any completed task to fail, based on my experience. I ended up changing my systemd unit file to add an `ExecStartPre` that sets `max_ncpus_pct` to a low number and then an `ExecStartPost` script to slowly increase `max_ncpus_pct` over the next minute. This has resolved all my reboot error problems. However, this won't cover suspend/resume, but that never happens for many tasks at once in my setup. The unit file that came with my distro also set `IOSchedulingClass` to `idle`, and `IOSchedulingPriority` was not set. If your host is dedicated to BOINC, tuning that might help get more bandwidth from the disk, at the expense of other processes on the host. https://www.freedesktop.org/software/systemd/man/systemd.exec.html#IOSchedulingClass= Ideally boinc-client itself should understand how fast disk reads are and stage the start of tasks to cover any scenario where it needs to read from or flush to disk in large volume. |
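
The staged start described in the last post could be sketched roughly as below. This is not the poster's actual script: the function names, the `/var/lib/boinc` data directory and the step values are assumptions for illustration. It also assumes the client picks up `max_ncpus_pct` from `global_prefs_override.xml` when `boinccmd --read_global_prefs_override` is run from the data directory. Note that this sketch overwrites any existing override file, so real use would need to merge with the settings already there.

```python
import subprocess
import time
from pathlib import Path


def ramp_steps(start_pct: int, end_pct: int, steps: int) -> list[int]:
    """Return `steps` evenly spaced CPU percentages from start_pct to end_pct."""
    if steps < 2:
        return [end_pct]
    span = end_pct - start_pct
    return [start_pct + round(span * i / (steps - 1)) for i in range(steps)]


def override_xml(max_ncpus_pct: int) -> str:
    """Render a minimal override file containing only max_ncpus_pct."""
    return (
        "<global_preferences>\n"
        f"   <max_ncpus_pct>{max_ncpus_pct}</max_ncpus_pct>\n"
        "</global_preferences>\n"
    )


def apply_ramp(data_dir: Path, percentages: list[int], pause_s: float) -> None:
    """Write each percentage in turn and ask the running client to re-read it."""
    for pct in percentages:
        # Assumption: overwriting the whole override file is acceptable here.
        (data_dir / "global_prefs_override.xml").write_text(override_xml(pct))
        subprocess.run(
            ["boinccmd", "--read_global_prefs_override"],
            cwd=data_dir,  # boinccmd looks for its RPC password file here
            check=True,
        )
        time.sleep(pause_s)


# Hypothetical use from an ExecStartPost script, ramping up over about a minute:
# apply_ramp(Path("/var/lib/boinc"), ramp_steps(10, 100, 4), pause_s=20)
```

Raising the limit in a few steps means only a few tasks start reading their data at once, which is the point of the workaround; the number of steps and the pause would need tuning to the disk in question.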
©2024 cpdn.org
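
The point made earlier in the thread, that BOINC uses the most restrictive of the three disk settings, can be illustrated with a toy model. This is a simplified sketch and not BOINC's own code: the function name and the example figures are made up, and a value of 0 is treated here as "no limit", whereas the ticket linked in the thread reports that older clients did not always behave that way. The three preference names (`disk_max_used_gb`, `disk_min_free_gb`, `disk_max_used_pct`) are the ones BOINC uses.

```python
def allowed_boinc_disk_gb(
    total_gb: float,
    free_gb: float,
    boinc_used_gb: float,
    disk_max_used_gb: float,
    disk_min_free_gb: float,
    disk_max_used_pct: float,
) -> float:
    """Toy model: BOINC may use the smallest of the three configured limits."""
    limits = []
    if disk_max_used_gb > 0:
        # "Use no more than X GB"
        limits.append(disk_max_used_gb)
    if disk_max_used_pct > 0:
        # "Use no more than X% of total disk"
        limits.append(total_gb * disk_max_used_pct / 100.0)
    # "Leave at least X GB free": what BOINC already holds plus the free
    # space it may still take while keeping the reserve untouched.
    limits.append(boinc_used_gb + free_gb - disk_min_free_gb)
    return max(0.0, min(limits))


# Example with made-up numbers: a 1000 GB disk, 600 GB free, BOINC holding 95 GB.
print(allowed_boinc_disk_gb(1000, 600, 95, 500, 10, 90))  # limited by the 500 GB setting
print(allowed_boinc_disk_gb(1000, 600, 95, 100, 10, 90))  # limited by the 100 GB setting
```

In this model, getting past 100 GB requires every one of the three values to allow it, which matches the advice in the thread to set all three explicitly.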