climateprediction.net (CPDN) home page
Thread 'The uploads are stuck'

Message boards : Number crunching : The uploads are stuck


Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67078 - Posted: 28 Dec 2022, 11:01:40 UTC - in response to Message 67067.  

We are just running normal BOINC. By server I mean the BOINC Server that the Projects run.


I still do not know what you are talking about.

1.) What is "normal BOINC"?

2.) What do you mean by "the BOINC Server that the Projects run"? Very few BOINC users would be running a BOINC server, and those that do would presumably know all about it. Why would a normal user run a BOINC server?
ID: 67078
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,408,433
RAC: 2,038
Message 67079 - Posted: 28 Dec 2022, 11:15:04 UTC
Last modified: 28 Dec 2022, 11:18:56 UTC

You don't need to know what I am on about, as the question is not for you.

If I had the exact reply from one of the core BOINC developers, I would obviously post it.

Like I said, he confirmed there was a 100 GB limit, and that it should be resolved in the next "code release".
ID: 67079
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,729,836
RAC: 7,099
Message 67080 - Posted: 28 Dec 2022, 11:18:09 UTC

BOINC has one limit you can't easily get round. Once you have "too many uploads pending", it won't ask for any more. That limit is a count, nothing to do with size, and it actually counts completed tasks only - so the massive backlog of individual files won't get in the way for a longish time.

The disk usage limits are configurable, but I think one of them has an outdated hard cap of 100 GB if you say "no limit" - that's the one they're going to change, so 'unlimited' really means what it says.

CPDN servers tend to fill up more quickly than other projects', simply because weather data files are big, and as you increase the resolution, they get even bigger. And these new IFS tasks are higher resolution and faster ... But just at the moment, that's the least of our problems, because they've got a big empty server, and nothing is getting through to it.
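For reference, the disk preferences being discussed can be pinned explicitly instead of being left at the "no limit" default. A minimal sketch of a `global_prefs_override.xml` (placed in the BOINC data directory; the values below are illustrative, not recommendations):

```xml
<!-- global_prefs_override.xml: local overrides for the web-set preferences.
     Illustrative values only; choose limits that suit your own disk. -->
<global_preferences>
   <disk_max_used_gb>200</disk_max_used_gb>   <!-- absolute ceiling for BOINC -->
   <disk_max_used_pct>90</disk_max_used_pct>  <!-- max % of the disk BOINC may use -->
   <disk_min_free_gb>10</disk_min_free_gb>    <!-- always leave this much free -->
</global_preferences>
```

After editing, `boinccmd --read_global_prefs_override` (or Options → Read local prefs file in the Manager) makes the client pick the file up without a restart.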
ID: 67080
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,729,836
RAC: 7,099
Message 67081 - Posted: 28 Dec 2022, 11:24:25 UTC - in response to Message 67079.  

If I had the exact reply from one of the core BOINC developers, I would obviously post it.
https://github.com/BOINC/boinc/issues/4643#issuecomment-1049738451

https://github.com/BOINC/boinc/pull/4923
ID: 67081
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,913,871
RAC: 16,233
Message 67082 - Posted: 28 Dec 2022, 11:29:20 UTC - in response to Message 67077.  

I may have found something. It seems that the total disk space limit set in BOINC is further subdivided, with per-project limits set by an algorithm that uses resource share as a variable.

ncoded.com,
If you're willing to try this, log into your CPDN account and increase the resource share for CPDN, perhaps significantly (for the correct location), update CPDN in the BOINC Manager, and see if your issue goes away.
ID: 67082
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,408,433
RAC: 2,038
Message 67083 - Posted: 28 Dec 2022, 11:31:31 UTC
Last modified: 28 Dec 2022, 11:43:37 UTC

AndreyOR,

The resource share was already 899, but I increased it to 955 and did an update in BOINC. I just need to wait now until I can request new work, and then I'll try to go over the 100 GB.

Thank you, Richard; at least that shows it is a known issue.

Also thank you to everyone that has tried to help me get this resolved.
ID: 67083
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,913,871
RAC: 16,233
Message 67085 - Posted: 28 Dec 2022, 11:59:51 UTC - in response to Message 67083.  

The effect may depend not on the value itself (955) but on CPDN's percentage share relative to other projects. That's why I mentioned that the change may need to be significant. If your initial change doesn't work, try something drastic, like giving CPDN a 90+% share, and see if that makes a difference.

This is just a theory and I haven't tested it myself, but I think I did see evidence that disk space is subdivided among projects based on resource share. It's generally assumed that resource share affects only CPU time, but it might affect other things, like disk space, RAM, upload priority, etc.
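The theory above can be sketched in a few lines. This is hypothetical, illustrating the proposed proportional split rather than confirmed BOINC client behaviour; the function name is made up:

```python
def disk_allocation(total_allowed_gb, resource_shares):
    """Hypothetical split of the allowed disk space among projects,
    proportional to resource share (the theory being floated here,
    not confirmed BOINC client logic)."""
    total_share = sum(resource_shares.values())
    return {project: total_allowed_gb * share / total_share
            for project, share in resource_shares.items()}

# Two projects at a 75/25 share would split 100 GB as 75 GB / 25 GB ...
print(disk_allocation(100.0, {"CPDN": 75, "Other": 25}))
# ... while a single project gets everything, whatever its share value:
print(disk_allocation(100.0, {"CPDN": 955}))  # {'CPDN': 100.0}
```

If the theory is right, this would also explain why a single-project host should be unaffected by the share value.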
ID: 67085
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,408,433
RAC: 2,038
Message 67086 - Posted: 28 Dec 2022, 12:05:52 UTC
Last modified: 28 Dec 2022, 12:21:06 UTC

CPDN is the only project that I have added to this server, so surely that would negate any priority/resource-share change effects?

Also, I think the resource share limit is 999, so 955 is already a pretty high share (~96%).
ID: 67086
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,408,433
RAC: 2,038
Message 67088 - Posted: 28 Dec 2022, 12:26:48 UTC
Last modified: 28 Dec 2022, 12:40:52 UTC

Richard, it says in the Ticket:

https://github.com/BOINC/boinc/issues/4643#issuecomment-1049738451

Workaround for older clients:
Don't leave disk_max_used_gb and disk_max_used_pct at "0".
Instead use higher limits.

Do you think that if no value is set, it would default to 0?
ID: 67088
AndreyOR

Send message
Joined: 12 Apr 21
Posts: 317
Credit: 14,913,871
RAC: 16,233
Message 67089 - Posted: 28 Dec 2022, 12:35:02 UTC - in response to Message 67086.  

So CPDN is the only project you have on that computer? If so, the resource share shouldn't make a difference, as it would be 100% whether the value is 1 or 999.

Based on what Richard posted and those links, the only thing I can think of trying, if you haven't yet, is to make sure that each of the three disk usage settings is independently set to allow more than 100 GB, as BOINC will use the most restrictive of the three.
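That "most restrictive of the three wins" rule can be sketched as follows. This is a simplified illustration, not the actual client code; in particular, the keep-free term is my reading of how the minimum-free-space setting interacts with current usage:

```python
def allowed_boinc_disk_gb(total_disk_gb, free_disk_gb, boinc_used_gb,
                          max_used_gb, max_used_pct, min_free_gb):
    """Effective disk allowance for BOINC: the most restrictive of the
    three preferences wins (simplified sketch, not the client's code)."""
    limits = [
        max_used_gb,                                 # absolute cap in GB
        total_disk_gb * max_used_pct / 100.0,        # percentage-of-disk cap
        boinc_used_gb + free_disk_gb - min_free_gb,  # leave-this-much-free cap
    ]
    return max(0.0, min(limits))

# 500 GB disk, 300 GB free, BOINC already using 50 GB:
# here the 100 GB absolute cap is the binding limit.
print(allowed_boinc_disk_gb(500, 300, 50, max_used_gb=100,
                            max_used_pct=90, min_free_gb=10))
```

So even with a generous percentage setting, a stray 100 GB absolute cap (or a small keep-free margin on a full disk) still throttles the client.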
ID: 67089
ncoded.com

Send message
Joined: 16 Aug 16
Posts: 73
Credit: 53,408,433
RAC: 2,038
Message 67090 - Posted: 28 Dec 2022, 12:37:24 UTC
Last modified: 28 Dec 2022, 12:40:18 UTC

Yeah, I just changed those settings, just in case the 'options' default to zero (rather than empty).
ID: 67090
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,729,836
RAC: 7,099
Message 67091 - Posted: 28 Dec 2022, 12:42:03 UTC - in response to Message 67088.  

Richard, it says in the Ticket:

https://github.com/BOINC/boinc/issues/4643#issuecomment-1049738451

Workaround for older clients:
Don't leave disk_max_used_gb and disk_max_used_pct at "0".
Instead use higher limits.

Do you think that if no value is set then this would be considered as 0?
I'm away from my main machines at the moment, so I can't check. But I would think so, yes.

On this laptop, I have <disk_max_used_pct> set to 90, but I don't run any big projects on it, and at the moment it's completely idle to save electricity.
ID: 67091
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,408,433
RAC: 2,038
Message 67092 - Posted: 28 Dec 2022, 12:53:14 UTC
Last modified: 28 Dec 2022, 12:56:54 UTC

Hopefully setting all three storage options to a value greater than 0 (and not an empty value) will provide a temporary fix, although the ticket in question is a few months old.

I'll update later tonight on any changes.

Thanks again.
ID: 67092
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,729,836
RAC: 7,099
Message 67093 - Posted: 28 Dec 2022, 13:06:54 UTC - in response to Message 67092.  

I'm heading home this afternoon, so I may be able to try out the effect of various changes on machines where it'll make a difference this evening.

Code changes are cumulative, so the age of any particular ticket doesn't matter: the question is when and how best to get hold of a working copy that includes the patch. We've been expecting v7.22.x 'Real Soon Now' for a couple of months. Alternatively, I can guide you through downloading one of the automated test builds that are used for checking changes as we go along, or you can compile your own from the master source code, as Dave does.
ID: 67093
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67094 - Posted: 28 Dec 2022, 13:26:16 UTC

or you can compile your own from the master source code, as Dave does.
I must admit I haven't checked whether that particular issue is fixed in the latest master. A couple of tasks have crashed, so I have reduced the number I am crunching. At the current rate of progress it will take another three days before I can see whether I hit that limit, but given how long it will take to clear the backlog, I don't intend to suspend network activity long enough to let things build up!
ID: 67094
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,729,836
RAC: 7,099
Message 67095 - Posted: 28 Dec 2022, 13:50:18 UTC - in response to Message 67094.  

or you can compile your own from the master source code, as Dave does.
I must admit to not having checked to see whether that particular issue is fixed in the latest Master.
Look back at the second link I posted - the actual fix, rather than the exploration of the problem. It says 'merged' and 'closed', so that's when it reached master. But it won't have reached any of the release branches yet.
ID: 67095
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 67096 - Posted: 28 Dec 2022, 14:07:25 UTC - in response to Message 67048.  

I know, and I feel you. The non-math projects have dwindled over the years. WCG used to cover a whole lot more, but these days it's just two medical projects, with ARP occasionally trickling in. The migration off IBM certainly didn't go well. The projects I added in recent years (asteroid, universe, LHC) were all added because, at some point, all the projects I contributed to ran out of work. Among the long list of math projects, I have yet to find anything I can remotely relate to. In addition, in winter I'd rather run my computers than turn on the heater.

Still, though, BOINC projects are generally not run as high-availability services. That requires a level of funding and expertise generally not available to researchers, and it's also a very different focus from science research. Sure, we contribute compute power at our own cost, but I personally don't consider that enough to justify expecting people to troubleshoot during the holidays.


Totally agree.
Especially about the heating value of desktop and workstation computers, with the recent cold snap here in North America (47 N latitude here).
The less gas I need to burn to keep my home above 18C, the better. My local electric supply is mostly old-time safe fission nuke and reasonably cheap.

keep on crunching.

E
ID: 67096
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67097 - Posted: 28 Dec 2022, 14:07:26 UTC
Last modified: 28 Dec 2022, 17:55:35 UTC

Look back at the second link I posted - the actual fix, rather than the exploration of the problem. It says 'merged' and 'closed', so that's when it reached master. But it won't have reached any of the release branches yet.
Thanks Richard. I have downloaded new masters at least twice since then, so unless the patch doesn't work, which is unlikely, I am not going to hit the problem.
Edit: Actually, it will take quite a bit longer to reach 100 GB of CPDN data, because of the files deleted once a task finishes and switches to <uploading>.
ID: 67097
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,408,433
RAC: 2,038
Message 67099 - Posted: 28 Dec 2022, 15:17:11 UTC
Last modified: 28 Dec 2022, 16:04:12 UTC

After I set each of the three disk usage options in Preferences to a non-zero, non-empty value, the server shortly afterwards downloaded and started running 22 new tasks.

The current disk usage by CPDN is now 110.55 GB, so that does seem to have resolved the issue.

Apologies to Paolo for hijacking their thread.
ID: 67099
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 42,267,135
RAC: 73,190
Message 67101 - Posted: 28 Dec 2022, 20:27:48 UTC - in response to Message 67064.  

And we can't shut down the server in case tasks fail when restarted after the reboot.

I've run into similar issues before, and the culprit is disk I/O being too slow when all tasks start at the same time after a reboot. If I/O times out, the tasks error out. This is especially painful for projects like CPDN that load a lot of data from disk at startup. The more tasks you run relative to the speed of your disk, the more likely it is to happen. It shouldn't be a problem for finished tasks, though. If you've already depleted the work anyway, a restart shouldn't cause any completed task to fail, based on my experience.

I ended up changing my systemd unit file to add an `ExecStartPre` that sets `max_ncpus_pct` to a low number, and then an `ExecStartPost` script that slowly increases `max_ncpus_pct` over the next minute. This has resolved all my reboot error problems. However, it won't cover suspend/resume, but that never happens to many tasks at once in my setup. The unit file that came with my distro also sets `IOSchedulingClass` to `idle`, with `IOSchedulingPriority` not set. If your host is dedicated to BOINC, tuning that might help get more bandwidth from the disk, at the expense of other processes on the host. https://www.freedesktop.org/software/systemd/man/systemd.exec.html#IOSchedulingClass=

Ideally, the BOINC client itself would understand how fast disk reads are and stagger the start of tasks, to cover any scenario where it needs to read from or flush to disk in large volume.
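For anyone wanting to replicate the staggered start described above, it could look something like the systemd drop-in below. This is a sketch under assumptions: the drop-in path and the helper script names are illustrative, not the poster's actual files.

```ini
# /etc/systemd/system/boinc-client.service.d/stagger.conf (illustrative path)
[Service]
# Before the client starts, write a low <max_ncpus_pct> into
# global_prefs_override.xml so tasks don't all hammer the disk at once.
ExecStartPre=/usr/local/bin/boinc-set-ncpus-pct 10
# After startup, a helper script raises <max_ncpus_pct> in steps over the
# next minute, calling `boinccmd --read_global_prefs_override` each time.
ExecStartPost=/usr/local/bin/boinc-ramp-ncpus-pct
# Yield disk bandwidth to other processes (or drop this on a host
# dedicated to BOINC, per the tuning note above).
IOSchedulingClass=idle
```

The helper scripts here are hypothetical; the mechanism they would use (editing `global_prefs_override.xml` and telling the client to re-read it) is standard BOINC.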
ID: 67101


©2024 cpdn.org