Message boards : Number crunching : New Work Announcements 2024
Author | Message |
---|---|
Joined: 15 May 09 Posts: 4540 Credit: 19,029,695 RAC: 19,917 |
#1007 EASHA 6400 2024-02-15 WAH2 East Asia 25km 1986-2018 |
Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
v8.29 is much more stable than the old v8.24; for batch 1006 it's showing 7% task fails and only 9 hard fails out of 6044 workunits so far (a 'hard fail' is when all 3 attempted tasks fail). That is considerably less than the identical batch 1001; 121% and 1346 respectively. Excellent, that's far better results! Of those hard fails, are they still "code related crashes" (segfaults, failure to resume, etc), or are they things outside your control (AV rejection of the binary, world going impossible, looks-like-bad-hardware)? The linux version needs verifying against a Windows batch before we can deploy it to production. I'm always willing and able to throw Linux boxes (mostly AMD right now) at a problem! :) |
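The hard-fail figures quoted above can be turned into rates with a few lines of arithmetic. This is only a sketch: it assumes that batch 1001, described as "identical", also had 6044 workunits, which the post does not state outright. The helper names `is_hard_fail` and `hard_fail_rate` are made up for illustration.

```python
def is_hard_fail(task_failed, max_attempts=3):
    """A workunit hard-fails when all attempted tasks (3 per the post) have failed.

    task_failed is a list of booleans, one per attempted task.
    """
    return len(task_failed) >= max_attempts and all(task_failed)


def hard_fail_rate(hard_fails, workunits):
    """Fraction of workunits in a batch that hard-failed."""
    return hard_fails / workunits


# Figures quoted in the post. The workunit count for batch 1001 is an
# ASSUMPTION based on it being described as an identical batch.
WORKUNITS = 6044
rate_1006 = hard_fail_rate(9, WORKUNITS)      # v8.29 app
rate_1001 = hard_fail_rate(1346, WORKUNITS)   # v8.24 app

print(f"batch 1006 (v8.29): {rate_1006:.2%}")  # 0.15%
print(f"batch 1001 (v8.24): {rate_1001:.2%}")  # 22.27%
```

Under that assumption the hard-fail rate drops from roughly one workunit in five to roughly one in seven hundred.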
Joined: 29 Oct 17 Posts: 1049 Credit: 16,445,768 RAC: 14,599 |
Of those hard fails, are they still "code related crashes" (segfaults, failure to resume, etc), or are they things outside your control (AV rejection of the binary, world going impossible, looks-like-bad-hardware)? I'm analysing the failures. CPDN have a process which looks at the output from each failed task and plots a nice histogram of each failure type. If it wasn't such a faff to include an image here I'd show it. About 30% of fails from the new app are due to AV quarantining when it tries to start. About 10-15% are other Windows-related errors. Then it's download errors, user aborts etc. But about 40% are 'unclassified', which means we aren't able to easily determine what caused the task to fail judging from the log; could be our code, could be boinc, could be the machine. The 8.29 app is not producing any of the segmentation faults we saw before with the 8.24 app though, which is good. We should get a much more acceptable hard fail rate with the new app. There are at least 3 more EAS25 batches to come in the next couple of weeks. Plenty of time to have a look at its performance. --- CPDN Visiting Scientist |
Joined: 11 Jan 22 Posts: 2 Credit: 2,382,635 RAC: 673 |
Hi, I'm usually a set-and-forget user who has rarely seen Windows tasks and was just dumped a bunch of the EAS25 (had a few batch 1001 fail early on). It seems like the older 1001s have really slowed down in the last couple of days. Not sure if this is normal or if there are any configuration changes that would be a good idea. Happy to see that there is more coming out, just want to check if there are any suggestions for maximizing performance on this project. Right now I have 16 tasks from this project and 11 threads available for boinc, so at this moment 11 CPDN tasks are computing now that some urgent WCG tasks have finished. https://imgur.com/a/LWB3NAh Computer info: i7-12700K (8P cores active, with hyper-threading), 32GB RAM, 200GB dedicated SSD space (16GB in use). Simultaneously running FAH on two GPUs (using ~1 thread each) |
Joined: 15 May 09 Posts: 4540 Credit: 19,029,695 RAC: 19,917 |
so at this moment 11 CPDN tasks are computing now that some urgent WCG tasks have finished. My experience is that on my 16 thread Ryzen (8 real cores) going above running 8 tasks concurrently actually results in a reduction in overall throughput with CPDN tasks. (There are other projects however where going above the 8 real cores does scale in something close to a linear manner.) |
Joined: 11 Jan 22 Posts: 2 Credit: 2,382,635 RAC: 673 |
Thanks. I also run a lot of WCG which works fine with full thread usage for my computer, so I'd rather not set the overall boinc CPU thread usage at half of what should be available. Is there a convenient way to set how many cores a project uses per task? |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,445,768 RAC: 14,599 |
Additional workunits for batch 1007 are going out today. They were omitted from the original send due to a misconfiguration. --- CPDN Visiting Scientist |
Joined: 15 May 09 Posts: 4540 Credit: 19,029,695 RAC: 19,917 |
Is there a convenient way to set how many cores a project uses per task? I can't think of an easy way to do it offhand. By the way, if ARP tasks come back with WCG, they also suffer in the same way if you start using virtual cores. |
Joined: 22 Feb 11 Posts: 32 Credit: 226,546 RAC: 4,080 |
Create app_config.xml in the project directory and copy this into it: <app_config> <app> <name>wah2</name> <max_concurrent>4</max_concurrent> </app> </app_config> or, to limit the whole project instead: <app_config> <project_max_concurrent>4</project_max_concurrent> </app_config> |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,445,768 RAC: 14,599 |
Not quite, there are two apps for Weather@Home, wah2 & wah2_ri, and all the latest batches are using wah2_ri. You need two different <app> sections if you are going to use <app>. Also, you need to tell the client to 'Reread the config files', otherwise this won't take effect until the next time the client is started. CPDN models are very floating point intensive. Since a cpu core only has one set of floating point units, two threads have to compete for that resource. That's why your throughput drops. Check out this post https://www.cpdn.org/forum_thread.php?id=9184&postid=68081 on these forums for an illustration and more explanation. <app_config> <app> <name>wah2</name> <max_concurrent>4</max_concurrent> </app> <app> <name>wah2_ri</name> <max_concurrent>4</max_concurrent> </app> </app_config> --- CPDN Visiting Scientist |
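For anyone who would rather generate the file than hand-edit it, the two-app configuration above can be sketched in Python. The function name `build_app_config` is made up for illustration, the limit of 4 is simply the value used in the posts, and `ET.indent` needs Python 3.9 or later. The script only prints the XML; saving it as app_config.xml in the project directory and asking the client to reread its config files is still up to you.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_names, max_concurrent):
    """Return app_config.xml text with one <app> section per application name."""
    root = ET.Element("app_config")
    for name in app_names:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-print in place; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


# App names and the limit of 4 are taken from the posts above.
config_xml = build_app_config(["wah2", "wah2_ri"], 4)
print(config_xml)
```

Note that each per-app limit is separate, so two sections of 4 could in principle allow 8 Weather@Home tasks at once if both apps had work.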
Joined: 31 May 18 Posts: 53 Credit: 4,725,987 RAC: 9,174 |
So is there any word on when further new work will drop? |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,710,763 RAC: 8,968 |
Or, since all CPDN applications will be floating point intensive, and will all suffer from FPU congestion on a hyperthreaded CPU, you could use the single project-level tag instead: <project_max_concurrent>N</project_max_concurrent> For a full list of the available options, see the BOINC user manual. |
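One rough way to pick N for the project-level tag is to aim for one task per physical core, as suggested earlier in the thread. The sketch below assumes every core exposes the same number of hardware threads (two by default); CPUs that mix core types would need N chosen by hand. Both helper names are hypothetical.

```python
def suggest_max_concurrent(logical_cpus, threads_per_core=2):
    """Estimate physical cores from the logical CPU count.

    ASSUMPTION: every core exposes the same number of hardware threads.
    """
    return max(1, logical_cpus // threads_per_core)


def project_config(n):
    """Return app_config.xml text that caps the whole project at n running tasks."""
    return (
        "<app_config>\n"
        f"  <project_max_concurrent>{n}</project_max_concurrent>\n"
        "</app_config>\n"
    )


# Example: the 16-thread, 8-core Ryzen mentioned earlier in the thread.
n = suggest_max_concurrent(16)
print(project_config(n))
```

For that machine the sketch gives N = 8, which matches the point at which throughput was reported to stop improving.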
Joined: 22 Feb 11 Posts: 32 Credit: 226,546 RAC: 4,080 |
as of 21 Feb 2024, 10:06:32 UTC there were 1052 unsent wah tasks. |
Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
as of 21 Feb 2024, 10:06:32 UTC there were 1052 unsent wah tasks. Yeah, I lit up a new VM to chew on a few of those. I don't think there's more than a day or two before they're drained out, though (and it's a new machine, so it's in the "task quota limit" period - but should get 'em chewed pretty fast with few tasks on a big CPU). There's always resend work for a while after the count goes to zero, though. |
Joined: 15 May 09 Posts: 4540 Credit: 19,029,695 RAC: 19,917 |
So is there any word on when further new work will drop? Server status currently showing 704 tasks ready to send, though doubtless that has dropped a bit since the last server update. I am guessing it may not be till next week that we get another of the batches that was misconfigured sent out. The person who normally sends batches out is away and I don't know how much time Glenn has free to do this. If he doesn't have time it will have to wait till the person who normally does it is back. Edit: 704 was from the newest batch; there were also a few retreads from 1001. |
Joined: 31 May 18 Posts: 53 Credit: 4,725,987 RAC: 9,174 |
So is there any word on when further new work will drop? Server status currently showing 704 tasks ready to send, though doubtless that has dropped a bit since the last server update. I am guessing it may not be till next week that we get another of the batches that was misconfigured sent out. The person who normally sends batches out is away and I don't know how much time Glenn has free to do this. If he doesn't have time it will have to wait till the person who normally does it is back. For some reason it's not letting me have any. I upped the number of CPU cores and RAM in my VM last night to do more, extended my work cache settings, and freed up disk space, but it's still not giving me any more than the three I currently have. |
Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
For some reason it's not letting me have any. What's your client log say about the reason it's not requesting new work? There's usually some obvious-ish reason listed. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,445,768 RAC: 14,599 |
I've been sending out the WaH2 EAS25 batches as soon as they are ready. The previous mis-configured batches are still being checked and aren't ready. Linux batches are not far away, again, still under test on the dev site. So is there any word on when further new work will drop? Server status currently showing 704 tasks ready to send, though doubtless that has dropped a bit since the last server update. I am guessing it may not be till next week that we get another of the batches that was misconfigured sent out. The person who normally sends batches out is away and I don't know how much time Glenn has free to do this. If he doesn't have time it will have to wait till the person who normally does it is back. --- CPDN Visiting Scientist |
Joined: 31 May 18 Posts: 53 Credit: 4,725,987 RAC: 9,174 |
For some reason it's not letting me have any. 22/02/2024 00:52:38 | climateprediction.net | Sending scheduler request: To fetch work. 22/02/2024 00:52:38 | climateprediction.net | Requesting new tasks for CPU 22/02/2024 00:52:41 | climateprediction.net | Scheduler request completed: got 0 new tasks 22/02/2024 00:52:41 | climateprediction.net | No tasks sent 22/02/2024 00:52:41 | climateprediction.net | Project requested delay of 3636 seconds That's all I'm getting for now, I'll enable a few more logging options and see if anything new comes up at the next update. |
Joined: 31 May 18 Posts: 53 Credit: 4,725,987 RAC: 9,174 |
Log from latest work fetch request (I let BOINC do it on its own; I didn't click update, so it would do the full time-out): 22/02/2024 01:54:14 | climateprediction.net | [css] running wah2_eas25_a33x_200512_24_1007_012268885_0 ( ) 22/02/2024 01:54:14 | | [cpu_sched_debug] enforce_run_list: end 22/02/2024 01:54:26 | | choose_project(): 1708566866.014561 22/02/2024 01:54:26 | | [work_fetch] ------- start work fetch state ------- 22/02/2024 01:54:26 | | [work_fetch] target work buffer: 259200.00 + 259200.00 sec 22/02/2024 01:54:26 | | [work_fetch] --- project states --- 22/02/2024 01:54:26 | climateprediction.net | [work_fetch] REC 721.330 prio -0.699 can't request work: scheduler RPC backoff (3570.09 sec) 22/02/2024 01:54:26 | | [work_fetch] --- state for CPU --- 22/02/2024 01:54:26 | | [work_fetch] shortfall 1031812.16 nidle 0.00 saturated 2431.98 busy 0.00 22/02/2024 01:54:26 | climateprediction.net | [work_fetch] share 0.000 project is backed off (resource backoff: 5007.51, inc 4800.00) 22/02/2024 01:54:26 | | [work_fetch] ------- end work fetch state ------- 22/02/2024 01:54:26 | climateprediction.net | choose_project: scanning 22/02/2024 01:54:26 | climateprediction.net | skip: scheduler RPC backoff 22/02/2024 01:54:26 | | [work_fetch] No project chosen for work fetch |
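When a log is pasted as one long run like the one above, the reason work is not being requested can be pulled out with a small script. This is a sketch that assumes the `timestamp | project | message` layout seen in the paste and a message text of the form `scheduler RPC backoff (N sec)`; the function names are made up, and other client versions may format these lines differently.

```python
import re

# Timestamp format as it appears in the pasted log: DD/MM/YYYY HH:MM:SS
TIMESTAMP = r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}"
# One entry runs from its timestamp up to the next timestamp (or end of text).
ENTRY = re.compile(
    rf"({TIMESTAMP}) \| ([^|]*)\| (.*?)(?=\s*{TIMESTAMP} \||$)", re.S
)
BACKOFF = re.compile(r"scheduler RPC backoff \(([\d.]+) sec\)")


def parse_entries(text):
    """Split a pasted log into (timestamp, project, message) tuples."""
    return [(ts, proj.strip(), msg.strip()) for ts, proj, msg in ENTRY.findall(text)]


def backoff_seconds(entries):
    """Map each project to the scheduler RPC backoff it reported, in seconds."""
    result = {}
    for _, project, message in entries:
        match = BACKOFF.search(message)
        if project and match:
            result[project] = float(match.group(1))
    return result


# Three entries copied from the post above, kept on one line as pasted.
sample = (
    "22/02/2024 01:54:26 | | [work_fetch] --- project states --- "
    "22/02/2024 01:54:26 | climateprediction.net | [work_fetch] REC 721.330 "
    "prio -0.699 can't request work: scheduler RPC backoff (3570.09 sec) "
    "22/02/2024 01:54:26 | | [work_fetch] --- state for CPU ---"
)

entries = parse_entries(sample)
print(backoff_seconds(entries))
```

For the sample this reports that climateprediction.net still had 3570.09 seconds of scheduler backoff remaining, so the client was waiting out a delay rather than asking for work and being refused.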
©2024 cpdn.org