Thread 'New Work Announcements 2024'

Author	Message
Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,022,240 RAC: 20,762	Message 70362 - Posted: 13 Feb 2024, 9:42:26 UTC Last modified: 13 Feb 2024, 10:05:58 UTC Trello is playing up a bit. I just saw there that the code hasn't changed, only been recompiled, but I managed to read that just before the thread disappeared, leaving an error message in its place. Should mean the code runs about 10% faster from what I recall reading. Edit: Trello boards seem to be behaving again now. ID: 70362

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 70366 - Posted: 13 Feb 2024, 12:52:25 UTC - in response to Message 70362. Depends which "code" you mean. The models themselves have had minor changes but nothing that would change the results. These are the wah2am and wah2rm processes. The monitor code, wah2_8.29, which interfaces the models to BOINC, has had major changes to fix multiple errors. To clear up a few other points. The models are very robust, fault tolerant codes as they come from production environments. They do create checkpoints but only every ~5 mins or so depending on the computer speed. So if the task is frequently restarted because it's not kept in memory it has to start all over again. Changing the compute time from 100% doesn't affect the risk of the task failing, just takes longer. I keep mine at 80% to manage CPU temp. --- CPDN Visiting Scientist ID: 70366

Mr. P Hucker Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 70367 - Posted: 13 Feb 2024, 12:57:32 UTC - in response to Message 70366. Depends which "code" you mean. The models themselves have had minor changes but nothing that would change the results. These are the wah2am and wah2rm processes. The monitor code, wah2_8.29, which interfaces the models to BOINC, has had major changes to fix multiple errors. To clear up a few other points. The models are very robust, fault tolerant codes as they come from production environments. They do create checkpoints but only every ~5 mins or so depending on the computer speed. So if the task is frequently restarted because it's not kept in memory it has to start all over again. Changing the compute time from 100% doesn't affect the risk of the task failing, just takes longer. I keep mine at 80% to manage CPU temp. Unless you live somewhere tropical, I can't imagine CPU temperature ever being a problem. A decent cooler keeps it 20C under the max, and if it hits the max, the clock auto-throttles without CPDN tasks even knowing about it. ID: 70367

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 70368 - Posted: 13 Feb 2024, 15:29:52 UTC I've just looked at the error rate for the 1001 & 1006 batches. Both are running the exact same forecasts, but 1001 uses the old WaH app and 1006 the new. The error rate is 115% for batch 1001 and 3% for batch 1006. That's a big improvement; we just need to compare the results from the two batches to make sure they are consistent, but thanks to everyone for their patience with the repeated earlier failures. p.s. the 115% is because we count failure 'tasks' relative to 'workunits'; a workunit can be up to 3 tasks. --- CPDN Visiting Scientist ID: 70368
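
A minimal sketch of the arithmetic behind that 115% figure, assuming the rate is simply failed tasks divided by workunits. The function name and the counts in the example are made up for illustration; they are not CPDN's actual batch numbers or code.

```python
MAX_TASKS_PER_WORKUNIT = 3  # from the post: a workunit can be up to 3 tasks


def error_rate_percent(failed_tasks: int, workunits: int) -> float:
    """Failed tasks as a percentage of workunits (hypothetical helper)."""
    if workunits <= 0:
        raise ValueError("workunits must be positive")
    return 100 * failed_tasks / workunits


# Illustrative counts only: 1,150 failed tasks across 1,000 workunits.
print(error_rate_percent(1150, 1000))

# Upper bound: every workunit fails on all of its attempts.
print(error_rate_percent(MAX_TASKS_PER_WORKUNIT * 1000, 1000))
```

Because the numerator counts tasks and the denominator counts workunits, the rate can go above 100%, and it tops out at 300% if every attempt of every workunit fails.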

SolarSyonyk Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 70369 - Posted: 13 Feb 2024, 16:56:45 UTC - in response to Message 70368. The error rate is 115% for batch 1001 and 3% for batch 1006. That's a big improvement... Indeed that is a massive improvement! Nice job! Hopefully the results all match and we can move forward with more reliable binaries! ID: 70369

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,022,240 RAC: 20,762	Message 70370 - Posted: 13 Feb 2024, 17:34:56 UTC I have one of 1006 running and three waiting to run. My 1001 tasks seem to be doing about 0.4%/hour as opposed to 0.32%/hour for the 1006. Both are on tiny10 using a VM. I had expected the recompiled code to be faster. I will check again when current work completes in case I have typed something stupid into the calculator. Will also check seconds/time step as reported, averaged over a trickle. ID: 70370
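
For comparing those rates, a rough sketch of how a %/hour figure translates into total run time, assuming progress stays linear for the whole task (which it may not). The helper name is hypothetical.

```python
def hours_to_complete(percent_per_hour: float) -> float:
    """Estimated total run time for a task progressing at a steady rate."""
    if percent_per_hour <= 0:
        raise ValueError("rate must be positive")
    return 100.0 / percent_per_hour


# Rates quoted in the post above.
for batch, rate in (("1001", 0.40), ("1006", 0.32)):
    hours = hours_to_complete(rate)
    print(f"batch {batch}: {rate}%/hour -> about {hours:.0f} hours ({hours / 24:.1f} days)")
```

At these rates the difference is roughly 250 hours versus a little over 310 hours per task, so it is worth re-measuring on a quiet machine before drawing conclusions.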

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 70371 - Posted: 13 Feb 2024, 20:21:12 UTC - in response to Message 70370. Are you timing them by suspending the other tasks so only one is running? Need to quieten the machine to get a more reliable performance value. ID: 70371

Mr. P Hucker Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 70372 - Posted: 13 Feb 2024, 20:28:40 UTC - in response to Message 70371. Last modified: 13 Feb 2024, 20:29:03 UTC Are you timing them by suspending the other tasks so only one is running? Need to quieten the machine to get a more reliable performance value. It's also important to know how well they run along with other tasks. No good one being fast if running three is slower. We know CPDN tasks hate hyperthreading, for example. ID: 70372

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,022,240 RAC: 20,762	Message 70373 - Posted: 13 Feb 2024, 21:45:18 UTC No, three of 1001 running plus 4 HADAM4H tasks. I am not going to draw firm conclusions till I have data from the other three that are waiting to run. What I posted is quite early data based on the first 10 hours of the 1006 task. What I have seen, however, does mean I will keep an eye on it. ID: 70373

SolarSyonyk Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 70374 - Posted: 13 Feb 2024, 22:43:18 UTC I've got some 1001, 1005, and 1006 tasks, they're all reporting 0.360%/hr on my compute VMs (3900X systems, running 12 core VMs, fully loaded). So no observed differences here, though system load has varied slightly depending on the # of tasks running. ID: 70374

Thomas McFarland Joined: 28 Feb 05 Posts: 20 Credit: 11,177,022 RAC: 18,222	Message 70388 - Posted: 15 Feb 2024, 14:10:12 UTC - in response to Message 70369. I have 3 1001 tasks still running. Should I terminate them? I don't want to waste resources if they are likely to fail. ID: 70388

Mr. P Hucker Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 70390 - Posted: 15 Feb 2024, 14:50:41 UTC - in response to Message 70388. I have 3 1001 tasks still running. Should I terminate them? I don't want to waste resources if they are likely to fail. I've found 90% of failures are in the first 10 minutes. If you've had them running longer than that, they will probably be fine. ID: 70390

kotenok2000 Joined: 22 Feb 11 Posts: 32 Credit: 226,546 RAC: 4,080	Message 70391 - Posted: 15 Feb 2024, 14:53:32 UTC They also upload intermediate checkpoints to the server, so progress will not be lost if the workunit crashes later. ID: 70391

Mr. P Hucker Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 70392 - Posted: 15 Feb 2024, 15:01:44 UTC - in response to Message 70391. Can these half-done workunits be handed out to others to complete? I've never been given a smaller workunit. Sounds like it should be possible. If my CPU can continue where it left off after rebooting the computer, why can't your CPU take my half-done unit and continue it? ID: 70392

Richard Haselgrove Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361	Message 70394 - Posted: 15 Feb 2024, 15:50:07 UTC - in response to Message 70392. Can these half done workunits be handed out to others to complete? I've never been given a smaller workunit. Sounds like it should be possible. If my CPU can continue where it left off after rebooting the computer, why can't your CPU take my half done unit and continue it? Restarting after a reboot relies on the presence of checkpoint files on your local hard drive. CPDN uploads data zips (intermediate weather reports) 24 times during each task in the current batch, but only one 'restart' file at about the mid-point. Glenn can confirm the technicals, but I would guess that a short completion run would only be feasible if the original host had uploaded that restart file before failing. ID: 70394

Mr. P Hucker Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 70395 - Posted: 15 Feb 2024, 15:57:04 UTC - in response to Message 70394. Last modified: 15 Feb 2024, 15:58:46 UTC Since I can shut down my computer after only doing say 15% of the task, then continue when I restart, it must be possible for you to continue my task if you had all the files. Are these files not also given to the server, so could be sent to you, to avoid you having to start from the beginning? Such a function would be useful for hosts which break, or the user aborts them for whatever reason, or they're just taking too long. ID: 70395

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 70398 - Posted: 15 Feb 2024, 16:13:46 UTC - in response to Message 70394. Last modified: 15 Feb 2024, 16:14:54 UTC It's not possible to restart a workunit halfway through on someone else's machine, for the simple reason that we can't be certain a different host would give the same end result as the first. Information about the host that ran the workunit is stored with the result. A failed workunit, for whatever reason, starts from the beginning on a new host every time. --- CPDN Visiting Scientist ID: 70398

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 70403 - Posted: 15 Feb 2024, 21:21:15 UTC CPDN will shortly be releasing another batch of WAH2 East Asia 25km (EAS25) using the new wah-ri app. --- CPDN Visiting Scientist ID: 70403

SolarSyonyk Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 70404 - Posted: 15 Feb 2024, 22:26:19 UTC "New wah-ri" - still Windows only, but hopefully far less crashy? ID: 70404

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 70405 - Posted: 15 Feb 2024, 22:50:27 UTC - in response to Message 70404. Last modified: 15 Feb 2024, 23:04:39 UTC The next batch is going out now. Wah2 v8.29 is the new app. It is currently deployed under the 'region independent' app, but it will eventually replace the normal WaH app (currently at v8.24) once batches using that version have finished or been closed. v8.29 is much more stable than the old v8.24; for batch 1006 it's showing 7% task fails and only 9 hard fails out of 6044 workunits so far (a 'hard fail' is when all 3 attempted tasks fail). That is considerably lower than for the identical batch 1001: 121% and 1346 respectively. The Linux version needs verifying against a Windows batch before we can deploy it to production. p.s. 30% of those fails for v8.29 are because the host antivirus software has quarantined the new exe and BOINC can't start it. --- CPDN Visiting Scientist ID: 70405
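
A small sketch of the 'hard fail' definition above, assuming each workunit is tracked as a list of per-task outcomes. The data layout and function names are hypothetical, not how the CPDN server stores results; only the 9 and 6044 figures come from the post.

```python
from typing import Sequence

MAX_ATTEMPTS = 3  # from the post: a hard fail is when all 3 attempted tasks fail


def is_hard_fail(task_succeeded: Sequence[bool]) -> bool:
    """True when all MAX_ATTEMPTS tasks were attempted and every one failed."""
    return len(task_succeeded) >= MAX_ATTEMPTS and not any(task_succeeded)


def hard_fail_percent(hard_fails: int, workunits: int) -> float:
    """Hard-failed workunits as a percentage of all workunits in the batch."""
    if workunits <= 0:
        raise ValueError("workunits must be positive")
    return 100 * hard_fails / workunits


print(is_hard_fail([False, False, False]))  # all three attempts failed
print(is_hard_fail([False, True]))          # second attempt succeeded
print(f"{hard_fail_percent(9, 6044):.2f}% of batch 1006 workunits hard-failed so far")
```

Unlike the task fail rate, this figure is counted per workunit, so it can never exceed 100%.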