Message boards : Number crunching : New Work Announcements 2024

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 70407 - Posted: 16 Feb 2024, 5:50:55 UTC

#1007 EASHA 6400 2024-02-15 WAH2 East Asia 25km 1986-2018
ID: 70407
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70415 - Posted: 16 Feb 2024, 16:27:10 UTC - in response to Message 70405.  

v8.29 is much more stable than the old v8.24; for batch 1006 it's showing 7% task fails and only 9 hard fails out of 6044 workunits so far (a 'hard fail' is when all 3 attempted tasks fail). That is considerably less than the identical batch 1001; 121% and 1346 respectively.


Excellent, those are far better results! Of those hard fails, are they still "code related crashes" (segfaults, failure to resume, etc.), or are they things outside your control (AV rejection of the binary, world going impossible, looks-like-bad-hardware)?

The Linux version needs verifying against a Windows batch before we can deploy it to production.


I'm always willing and able to throw Linux boxes (mostly AMD right now) at a problem! :)
ID: 70415
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70433 - Posted: 18 Feb 2024, 21:45:07 UTC - in response to Message 70415.  
Last modified: 18 Feb 2024, 21:47:31 UTC

Of those hard fails, are they still "code related crashes" (segfaults, failure to resume, etc.), or are they things outside your control (AV rejection of the binary, world going impossible, looks-like-bad-hardware)?
I'm analysing the failures. CPDN have a process which looks at the output from each failed task and plots a nice histogram of each failure type; if it wasn't such a faff to include an image here I'd show it. About 30% of fails from the new app are due to AV quarantining the app when it tries to start. About 10-15% are other Windows-related errors. Then it's download errors, user aborts etc. But about 40% are 'unclassified', which means we aren't able to easily determine from the log what caused the task to fail; it could be our code, it could be BOINC, it could be the machine. The 8.29 app is not producing any of the segmentation faults we saw before with the 8.24 app though, which is good. We should get a much more acceptable hard fail rate with the new app.

There are at least 3 more EAS25 batches to come in the next couple of weeks. Plenty of time to have a look at its performance.
---
CPDN Visiting Scientist
ID: 70433
gutelius

Joined: 11 Jan 22
Posts: 2
Credit: 2,382,635
RAC: 673
Message 70494 - Posted: 21 Feb 2024, 2:28:08 UTC
Last modified: 21 Feb 2024, 2:31:53 UTC

Hi, I'm usually a set-and-forget user who has rarely seen Windows tasks, and I was just sent a bunch of the EAS25 work (I had a few batch 1001 tasks fail early on). It seems like the older 1001s have really slowed down in the last couple of days. I'm not sure if this is normal or if there are any configuration changes that would be a good idea. Happy to see that there is more work coming out; I just want to check whether there are any suggestions for maximizing performance on this project. Right now I have 16 tasks from this project and 11 threads available for BOINC, so at this moment 11 CPDN tasks are computing now that some urgent WCG tasks have finished.
https://imgur.com/a/LWB3NAh

Computer info:
i7-12700K (8 P-cores active, with hyper-threading)
32 GB RAM
200 GB dedicated SSD space (16 GB in use)
Simultaneously running FAH on two GPUs (using ~1 thread each)
ID: 70494
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 70495 - Posted: 21 Feb 2024, 7:14:19 UTC - in response to Message 70494.  

so at this moment 11 CPDN tasks are computing now that some urgent WCG tasks have finished.
My experience is that on my 16-thread Ryzen (8 real cores), going above 8 tasks running concurrently actually results in a reduction in overall throughput with CPDN tasks. (There are other projects, however, where going above the 8 real cores does scale in something close to a linear manner.)
ID: 70495
gutelius

Joined: 11 Jan 22
Posts: 2
Credit: 2,382,635
RAC: 673
Message 70496 - Posted: 21 Feb 2024, 7:22:25 UTC - in response to Message 70495.  
Last modified: 21 Feb 2024, 7:24:36 UTC

Thanks. I also run a lot of WCG, which works fine with full thread usage on my computer, so I'd rather not set the overall BOINC CPU thread usage at half of what should be available.
Is there a convenient way to set how many cores a project uses per task?
ID: 70496
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70497 - Posted: 21 Feb 2024, 8:30:30 UTC

Additional workunits for batch 1007 are going out today. They were omitted from the original send due to a misconfiguration.
---
CPDN Visiting Scientist
ID: 70497
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 70498 - Posted: 21 Feb 2024, 8:46:31 UTC

Is there a convenient way to set how many cores a project uses per task?
I can't think of an easy way to do it offhand. By the way, if ARP tasks come back with WCG, they also suffer in the same way if you start using virtual cores.
ID: 70498
kotenok2000

Joined: 22 Feb 11
Posts: 32
Credit: 226,546
RAC: 4,080
Message 70499 - Posted: 21 Feb 2024, 8:58:39 UTC

Create an app_config.xml file in the project directory and copy this into it:
<app_config>
    <app>
        <name>wah2</name>
        <max_concurrent>4</max_concurrent>
    </app>
</app_config>
or
<app_config>
    <project_max_concurrent>4</project_max_concurrent>
</app_config>
ID: 70499
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70500 - Posted: 21 Feb 2024, 10:00:05 UTC - in response to Message 70499.  
Last modified: 21 Feb 2024, 10:00:37 UTC

Not quite: there are two apps for Weather@Home, wah2 and wah2_ri, and all the latest batches are using wah2_ri. You need two separate <app> sections if you are going to use <app>.

Also, you need to tell the client to 'Reread the config files', otherwise this won't take effect until the next time the client is started.
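
If you're running headless, the command-line equivalent should be (assuming the standard boinccmd tool is installed alongside your client, as it normally is):

boinccmd --read_cc_config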

CPDN models are very floating-point intensive. Since a CPU core only has one set of floating-point units, two threads have to compete for the same resource; that's why your throughput drops. Check out this post https://www.cpdn.org/forum_thread.php?id=9184&postid=68081 on these forums for an illustration and more explanation.

<app_config>
    <app>
        <name>wah2</name>
        <max_concurrent>4</max_concurrent>
    </app>
    <app>
        <name>wah2_ri</name>
        <max_concurrent>4</max_concurrent>
    </app>
</app_config>

---
CPDN Visiting Scientist
ID: 70500
Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 70501 - Posted: 21 Feb 2024, 10:30:49 UTC

So is there any word on when further new work will drop?
ID: 70501
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,706,621
RAC: 9,524
Message 70502 - Posted: 21 Feb 2024, 10:37:03 UTC - in response to Message 70500.  

Or, since all CPDN applications will be floating point intensive, and will all suffer from FPU congestion on a hyperthreaded CPU, you could use the single project-level tag instead:

<project_max_concurrent>N</project_max_concurrent>
For a full list of the available options, see the BOINC user manual.
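
A complete file using just the project-wide cap might look like the sketch below; the value 8 is purely illustrative, so match it to your own physical core count:

<app_config>
    <project_max_concurrent>8</project_max_concurrent>
</app_config>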
ID: 70502
kotenok2000

Joined: 22 Feb 11
Posts: 32
Credit: 226,546
RAC: 4,080
Message 70503 - Posted: 21 Feb 2024, 10:38:25 UTC - in response to Message 70501.  

As of 21 Feb 2024, 10:06:32 UTC there were 1052 unsent WaH tasks.
ID: 70503
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70504 - Posted: 21 Feb 2024, 16:14:03 UTC - in response to Message 70503.  

as of 21 Feb 2024, 10:06:32 UTC there were 1052 unsent wah tasks.


Yeah, I lit up a new VM to chew on a few of those. I don't think there's more than a day or two before they're drained out, though (and it's a new machine, so it's in the "task quota limit" period - but it should chew through them pretty fast, with few tasks on a big CPU). There's always resend work for a while after the count goes to zero, anyway.
ID: 70504
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 70505 - Posted: 21 Feb 2024, 21:39:54 UTC - in response to Message 70501.  
Last modified: 21 Feb 2024, 21:46:44 UTC

So is there any word on when further new work will drop?
Server status is currently showing 704 tasks ready to send, though doubtless that has dropped a bit since the last server update. I am guessing it may not be until next week that another of the misconfigured batches gets sent out. The person who normally sends batches out is away, and I don't know how much time Glenn has free to do this. If he doesn't have time, it will have to wait until the person who normally does it is back.

Edit: 704 was from the newest batch. There were also a few retreads from 1001.
ID: 70505
Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 70506 - Posted: 21 Feb 2024, 22:16:02 UTC - in response to Message 70505.  

So is there any word on when further new work will drop?
Server status is currently showing 704 tasks ready to send, though doubtless that has dropped a bit since the last server update. I am guessing it may not be until next week that another of the misconfigured batches gets sent out. The person who normally sends batches out is away, and I don't know how much time Glenn has free to do this. If he doesn't have time, it will have to wait until the person who normally does it is back.

Edit: 704 was from the newest batch. There were also a few retreads from 1001.


For some reason it's not letting me have any.
I upped the number of CPU cores and RAM in my VM last night to do more, extended my work cache settings, and freed up disk space, but it's still not giving me any more than the three I currently have.
ID: 70506
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70507 - Posted: 21 Feb 2024, 23:14:51 UTC - in response to Message 70506.  

For some reason it's not letting me have any.
I upped the number of CPU cores and RAM in my VM last night to do more, extended my work cache settings, and freed up disk space, but it's still not giving me any more than the three I currently have.


What's your client log say about the reason it's not requesting new work? There's usually some obvious-ish reason listed.
ID: 70507
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70508 - Posted: 21 Feb 2024, 23:47:13 UTC - in response to Message 70505.  

So is there any word on when further new work will drop?
Server status is currently showing 704 tasks ready to send, though doubtless that has dropped a bit since the last server update. I am guessing it may not be until next week that another of the misconfigured batches gets sent out. The person who normally sends batches out is away, and I don't know how much time Glenn has free to do this. If he doesn't have time, it will have to wait until the person who normally does it is back.
I've been sending out the WaH2 EAS25 batches as soon as they are ready. The previous misconfigured batches are still being checked and aren't ready. Linux batches are not far away; again, still under test on the dev site.
---
CPDN Visiting Scientist
ID: 70508
Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 70509 - Posted: 22 Feb 2024, 1:02:49 UTC - in response to Message 70507.  

For some reason it's not letting me have any.
I upped the number of CPU cores and RAM in my VM last night to do more, extended my work cache settings, and freed up disk space, but it's still not giving me any more than the three I currently have.


What's your client log say about the reason it's not requesting new work? There's usually some obvious-ish reason listed.


22/02/2024 00:52:38 | climateprediction.net | Sending scheduler request: To fetch work.
22/02/2024 00:52:38 | climateprediction.net | Requesting new tasks for CPU
22/02/2024 00:52:41 | climateprediction.net | Scheduler request completed: got 0 new tasks
22/02/2024 00:52:41 | climateprediction.net | No tasks sent
22/02/2024 00:52:41 | climateprediction.net | Project requested delay of 3636 seconds

That's all I'm getting for now. I'll enable a few more logging options and see if anything new comes up at the next update.
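
For anyone wanting to do the same, I believe the relevant entries are the work_fetch_debug and cpu_sched_debug flags in cc_config.xml in the BOINC data directory, roughly like this (then use 'Reread the config files' to pick them up):

<cc_config>
    <log_flags>
        <work_fetch_debug>1</work_fetch_debug>
        <cpu_sched_debug>1</cpu_sched_debug>
    </log_flags>
</cc_config>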
ID: 70509
Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 70510 - Posted: 22 Feb 2024, 1:56:46 UTC

Log from the latest work fetch request (I let BOINC do it on its own; I didn't click update, so it would do the full time-out):

22/02/2024 01:54:14 | climateprediction.net | [css] running wah2_eas25_a33x_200512_24_1007_012268885_0 ( )
22/02/2024 01:54:14 | | [cpu_sched_debug] enforce_run_list: end
22/02/2024 01:54:26 | | choose_project(): 1708566866.014561
22/02/2024 01:54:26 | | [work_fetch] ------- start work fetch state -------
22/02/2024 01:54:26 | | [work_fetch] target work buffer: 259200.00 + 259200.00 sec
22/02/2024 01:54:26 | | [work_fetch] --- project states ---
22/02/2024 01:54:26 | climateprediction.net | [work_fetch] REC 721.330 prio -0.699 can't request work: scheduler RPC backoff (3570.09 sec)
22/02/2024 01:54:26 | | [work_fetch] --- state for CPU ---
22/02/2024 01:54:26 | | [work_fetch] shortfall 1031812.16 nidle 0.00 saturated 2431.98 busy 0.00
22/02/2024 01:54:26 | climateprediction.net | [work_fetch] share 0.000 project is backed off (resource backoff: 5007.51, inc 4800.00)
22/02/2024 01:54:26 | | [work_fetch] ------- end work fetch state -------
22/02/2024 01:54:26 | climateprediction.net | choose_project: scanning
22/02/2024 01:54:26 | climateprediction.net | skip: scheduler RPC backoff
22/02/2024 01:54:26 | | [work_fetch] No project chosen for work fetch
ID: 70510