climateprediction.net (CPDN) home page
Thread 'Tasks available, but I am not getting them.'

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 71185 - Posted: 5 Aug 2024, 0:58:04 UTC
Last modified: 5 Aug 2024, 1:22:58 UTC

I'm assuming that every time you made changes you updated the project (for website changes) and/or re-read the config files (for local changes).

Related to what Glenn said, it doesn't look like you've run benchmarks on BOINC. That may have contributed to BOINC not getting work, as it may have calculated that it couldn't complete the work in time or that the work was more than your cache size allowed. Especially if it's the first time this VM instance is getting the very large CPDN workunits.
ID: 71185

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 71187 - Posted: 5 Aug 2024, 5:51:44 UTC - in response to Message 71185.  

> Related to what Glenn said, it doesn't look like you've run benchmarks on BOINC.


Good point. Estimated time to completion using WINE is way out before running benchmarks. On my Ryzen 9, BOINC was estimating something like 25 days to completion here, so almost 5 times what they are actually taking. A computer would not have to be horrendously slow for it to think tasks would not complete.

Unfortunately, there isn't an option for WINE to identify itself as such, so we can't get a clue as to how many machines are using it from the credit statistics pages. I do notice that the most up-to-date version does let one pretend to be using Windows 11, which wasn't available last time I checked on my old machine.
ID: 71187

Dark Angel
Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 71188 - Posted: 5 Aug 2024, 7:31:00 UTC

Just to cover my bases, I suspended everything on my host machine and ran the benchmarks on the VM.
They're done now, but it's still not grabbing any more work, which makes sense given the expected completion for the tasks I have is still 38 days away. That's coming down several times faster than the elapsed time is going up; it will just take a while to level itself out, I suspect.
ID: 71188

Dark Angel
Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 71189 - Posted: 5 Aug 2024, 7:31:57 UTC

In case anyone is wondering, I've been careful to not overcommit the machine and keep a couple of cores free no matter what I'm crunching.
ID: 71189

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 71190 - Posted: 5 Aug 2024, 7:53:35 UTC - in response to Message 71189.  
Last modified: 5 Aug 2024, 7:56:07 UTC

> In case anyone is wondering, I've been careful to not overcommit the machine and keep a couple of cores free no matter what I'm crunching.

I always keep some spare too. Going into virtual cores seems not to gain anything with CPDN though there are projects where it does seem to help. I also limit the number of real cores because of temperature considerations.
ID: 71190

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 71191 - Posted: 5 Aug 2024, 8:07:39 UTC - in response to Message 71190.  

Presumably the key metric is "how many hardware Floating Point maths units does your chip have?"

Hyperthreading a single core won't get you an extra FPU: and some chip designs share one FPU between two cores to save money. They can get away with that if you're just surfing the web, or processing words: but real heavy-duty scientific maths needs the real McCoy.
ID: 71191

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71192 - Posted: 5 Aug 2024, 8:34:59 UTC

Ok, so it looks like running the benchmarks and setting the 'store at least' and 'an additional' client settings are the key to getting the tasks on the new VM.

But what I don't understand is why the client doesn't put a message in the log that a project's task would exceed the 'store' settings? (in the same way it does for lack of disk space for example).

The other thing that puzzles me is why it continues to get work when I drop the settings down to 1 day / 0.01 day once a task has completed. By the apparent logic of the lack of tasks issue this shouldn't be the case.
---
CPDN Visiting Scientist
ID: 71192

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 71194 - Posted: 5 Aug 2024, 9:00:56 UTC - in response to Message 71192.  

I think the answer will need serious analysis, but will probably boil down to BOINC (client and server together) failing to make reasonable assumptions in edge cases, like new app versions, new hardware, and non-standard configurations like Wine or VMs.

One particular issue that I might explore is the speed estimate of a newly installed Wine/BOINC combo. Why is no benchmark run at first startup, and what speed estimate does BOINC supply in the absence of a benchmark? I suspect it's one of David's magic numbers: my memory is playing around with figures like 1 MHz, but it might be 1 GHz. Some of that can be explored in the community by users, but some of it might require sight of the server logs of the work request and allocation decisions - Einstein provides handy access to those for their project apps.
ID: 71194

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 71195 - Posted: 5 Aug 2024, 9:20:09 UTC
Last modified: 5 Aug 2024, 9:25:05 UTC

> my memory is playing around with figures like 1 MHz, but it might be 1 GHz.

1 GHz would fit reasonably well with the initial estimate being off by a factor of 5, or a tad under, on my new machine.

What I have not determined is why in a VM it is only out by a factor of between two and three. I know George has noted that the initial guess without running benchmarks is much more pessimistic with WINE than with either a native Windows or Linux version of BOINC. I have done some playing and it makes no difference which version of Windows WINE reports to the server.
ID: 71195

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 71196 - Posted: 5 Aug 2024, 9:39:36 UTC - in response to Message 71195.  

1) Yes, 1 GHz is right. From cs_benchmark.cpp#L69:

    // defaults in case benchmarks fail or time out.
    // better to err on the low side so hosts don't get too much work

    #define DEFAULT_FPOPS   1e9
    #define DEFAULT_IOPS    1e9
    #define DEFAULT_MEMBW   1e9
    #define DEFAULT_CACHE   1e6

2) On my way to that, I found discussion #5128. Benchmark results are very sensitive to the compiler optimisation settings set for client compilation.
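
To illustrate why the fallback in point 1 matters, here is a minimal sketch in Python. It is not BOINC's actual estimator: it assumes the runtime estimate is simply the task's floating-point operation count divided by the host speed. The function name and the task size are hypothetical, chosen so that a host benchmarked at 5e9 FLOPS needs about 5 days, which roughly matches the factor of 5 reported earlier in the thread.

```python
# Fallback speed taken from the #define above: 1e9 floating-point ops per second.
DEFAULT_FPOPS = 1e9

SECONDS_PER_DAY = 86_400


def estimated_days(task_fpops: float, host_fpops: float) -> float:
    """Naive runtime estimate in days: work divided by speed (an assumption)."""
    return task_fpops / host_fpops / SECONDS_PER_DAY


# Hypothetical task size, picked so that a 5e9 FLOPS host needs 5 days.
task_fpops = 5 * SECONDS_PER_DAY * 5e9

measured = estimated_days(task_fpops, 5e9)            # host with a benchmark result
fallback = estimated_days(task_fpops, DEFAULT_FPOPS)  # host with no benchmark yet

print(f"benchmarked: {measured:.1f} days, fallback: {fallback:.1f} days")
```

Under these assumptions the unbenchmarked estimate is five times the benchmarked one, so a long CPDN task can look far bigger to the client than it really is.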
ID: 71196

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 71197 - Posted: 5 Aug 2024, 13:23:25 UTC

> The other thing that puzzles me is why it continues to get work when I drop the settings down to 1 day / 0.01 day once a task has completed. By the apparent logic of the lack of tasks issue this shouldn't be the case.

Mine is set to 0.5 + 0 and I still get a full complement of tasks (12 per app_config). They usually come in sets of 4. When tasks are close to completion (1/2 day or so left) I'll start getting new tasks. Maybe it's because 0.5 (days) x 24 (threads) = 12 days (of calculation time), but tasks usually take 10, so it fits?

Dark Angel,
We have the same CPU, so I'd guess your tasks will take about 10 days to complete. Seems like you already have a full complement (4). There's a limit of 4 per day initially so if you're going to get more it'll be the next day, if your cache is large enough. If you lowered your cache, you won't get any more until the current ones get close to completion. There are plenty of tasks in queue, I'd guess it'll be another 3 weeks before they're all gone.
ID: 71197

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71198 - Posted: 5 Aug 2024, 13:37:27 UTC - in response to Message 71197.  

>> The other thing that puzzles me is why it continues to get work when I drop the settings down to 1 day / 0.01 day once a task has completed. By the apparent logic of the lack of tasks issue this shouldn't be the case.
> Mine is set to .5 + 0 and I still get a full complement of tasks (12 per app_config). They usually come in sets of 4. When tasks are close to completion (1/2 day or so left) I'll start getting new tasks. Maybe it's because .5(days) x 24(threads) = 12 days(of calculation time) but tasks usually take 10 so it fits?

True, but each task is single-threaded, so I don't see how 'x24' threads makes any difference. If anything, running 24 CPUs/threads simultaneously will increase the task completion time, as the machine is more heavily loaded. It's as if BOINC is using different logic the first time a client asks for a task from a project. I'm not going to look into the code, but this is one part of BOINC that is hard to fathom (for me anyway...).
---
CPDN Visiting Scientist
ID: 71198

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 71201 - Posted: 5 Aug 2024, 20:49:26 UTC - in response to Message 71198.  

The reason I was thinking it may be calculating it like that is that, from message 71181 above, the cache is 5+5 but the shortfall is 40 (4 cpus).

With 1 GFLOPS capability (before benchmarks), it'd take ~44 days to complete a task, but the shortfall was 40 (and the cache 10). Expanding the cache to 10+10 allowed 44 days to fit. With the deadline being 70, it wasn't a factor in this case. It kind of seems like the number of cpus/threads doesn't really matter or that BOINC applies the shortfall to each cpu. But who knows.

I do wonder, though, whether work would've been sent if benchmarks had been run before requesting work. I kind of want to say it would have been. I feel like benchmarks were the key here, especially since CPDN tasks are generally very long compared to a typical BOINC project.

Looking at it that way kind of makes sense to me but I do agree with you that how BOINC determines the amount of work it requests can be puzzling. I've seen it do some hard to fathom things too. It also seems like BOINC doesn't run benchmarks by default right after installation which is a bit strange.
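
The arithmetic in this post can be written out as a small sketch. This is only one reading of the hypothesis above, not confirmed BOINC work-fetch logic: it assumes the shortfall is the cache window ('store at least' plus 'an additional') multiplied by the number of idle CPUs, and that a task is sent only if its estimated runtime fits within that shortfall. Both function names are hypothetical.

```python
def shortfall_days(n_cpus: int, store_at_least: float, additional: float) -> float:
    """Hypothesised shortfall: the cache window applied to every idle CPU."""
    return n_cpus * (store_at_least + additional)


def task_fits(task_days: float, shortfall: float) -> bool:
    """Hypothesised rule: a task is sent only if its estimate fits the shortfall."""
    return task_days <= shortfall


TASK_DAYS_UNBENCHMARKED = 44.0  # approximate estimate at 1 GFLOPS, from the post
N_CPUS = 4                      # CPUs assigned to the VM

small_cache = shortfall_days(N_CPUS, 5, 5)    # 5+5 cache gives 40 CPU-days
large_cache = shortfall_days(N_CPUS, 10, 10)  # 10+10 cache gives 80 CPU-days

print(task_fits(TASK_DAYS_UNBENCHMARKED, small_cache))  # 44 does not fit in 40
print(task_fits(TASK_DAYS_UNBENCHMARKED, large_cache))  # 44 fits in 80
```

The same rule would also cover the 0.5 + 0 case mentioned earlier in the thread: 24 threads x 0.5 days gives 12 days, which is enough for a task that takes about 10.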
ID: 71201

Dark Angel
Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 71202 - Posted: 5 Aug 2024, 20:50:23 UTC

I suspect it might also matter what other projects are running on the CPU at the time. Currently I'm working through a cache of WCG and Einstein (on GPU) work on the host, with work limits set in app_config to keep threads free for the VM. WCG is less demanding on the system than, say, LHC work, especially for RAM, disk, and network I/O. Milkyway used to be very hard on the FPU due to its need for double precision. Asteroids isn't memory intensive but it uses the latest CPU extensions.
ID: 71202

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 71203 - Posted: 5 Aug 2024, 21:06:15 UTC - in response to Message 71202.  

BOINC on the VM is independent of the one on the host. From the work fetch log you posted, the number of idle cpus was 4 (which is the number of cpus you assigned to this VM), and saturated and busy were both 0, so the VM BOINC was doing nothing. It's up to you to make sure the VM has sufficient computing resources, otherwise you can overload your system and things will slow to a crawl on both host and VM (it's happened to me). BOINC client installations are independent, even if managed via the same manager.
ID: 71203

Dark Angel
Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 71204 - Posted: 5 Aug 2024, 21:33:01 UTC - in response to Message 71203.  

Yes, the installations are independent, but they're still using the same CPU resources, so you (I) have to make sure the two BOINC installs don't overlap in what they're using. I've restricted my host BOINC from using the cores allocated to the VM, plus a couple of spares as a buffer.
I have my host BOINC installed on its own drive and my VMs reside on a different physical drive. Both are separate from the host OS drive. This is very deliberate, both to prevent I/O bottlenecks and to prevent my BOINC install from wearing out my NVMe host drive with excessive writes (a problem when running LHC CMS work in particular. I also run a network proxy to reduce the amount of data LHC downloads for each work unit when I'm running that project).
ID: 71204

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71205 - Posted: 5 Aug 2024, 23:38:41 UTC - in response to Message 71201.  

> I do wonder though if benchmarks were run before requesting work, if work would've been sent. I kind of want to say it would have been. I feel like benchmarks was the key here especially since CPDN tasks are generally very long compared to a typical BOINC project

Your explanation makes sense. I thought the client always runs a benchmark whenever it starts up, though? If they didn't run, that seems to be the root problem.
---
CPDN Visiting Scientist
ID: 71205

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 71206 - Posted: 6 Aug 2024, 9:22:11 UTC - in response to Message 71205.  

> Your explanation makes sense. I thought the client always runs a benchmark whenever it starts up though? That seems the root problem if they didn't run.

I think Richard wondered about that too. I think a newly installed client definitely should, but I don't think it always does, or if it does, the run fails without error messages. I've seen it re-run benchmarks periodically, but I'm not sure what the period is.
ID: 71206

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 71207 - Posted: 6 Aug 2024, 10:00:51 UTC - in response to Message 71206.  

Hang on a moment - this thread is getting very muddled. Where did we get the idea that benchmarks hadn't run from? And whose account, which host, were we talking about at the time?

It seems to have started with AndreyOR's message 71185 - but that wasn't replying to a specific prior post, and the reference to 'you' is ambiguous.

I'm assuming that the reference was to Dark Angel, who is showing two computers on his account:

Linux host 1534740
Windows host 1548438 - further assumed to be running in a VM on the Linux machine, per message 71172

As I type now, both hosts are showing a normal 'measured speed' - Linux 3.14GHz, Windows 5.73GHz (*)

Where exactly did the 'no benchmark' idea come from?

* the difference in speed between the two instances is itself interesting, but for another thread.
ID: 71207

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 71208 - Posted: 6 Aug 2024, 10:59:22 UTC - in response to Message 71204.  

> I have my host Boinc installed on it's own drive and my VMs reside on a different physical drive. Both are separate from the host OS drive. This is very deliberate to prevent both I/O bottlenecks as well as to prevent my Boinc install from wearing out my nvme host drive from excessive writes (problem when running LHC CMS work in particular. I also run a network proxy to mitigate the amount of data LHC downloads for each work unit when I'm running that project).

I've run LHC for a while before and will likely come back to it. Mostly ATLAS, some Theory; I did try a little bit of CMS but didn't do much of it, mostly because it's not available natively. It's definitely the hardest project to set up, and that's before Squid. I ran it on WSL2, and that has its own quirks. It's probably the most demanding project when it comes to resource use, too.

I think modern SSDs are pretty durable and will last years even with heavy use. As long as they're protected from overheating, power surges and frequent power failures, they'll likely endure heavy use for quite a long time.
ID: 71208

©2024 cpdn.org