New Work Announcements 2024
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,020,584 RAC: 20,684 |
Trello is playing up a bit. I've just seen there that the code hasn't changed, it has only been recompiled, but I only managed to read that just before the thread disappeared, leaving an error message in its place. That should mean the code runs about 10% faster, from what I recall reading. Edit: Trello boards seem to be behaving again now. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Depends which "code" you mean. The models themselves have had minor changes but nothing that would change the results. These are the wah2am and wah2rm processes. The monitor code, wah2_8.29, which interfaces the models to BOINC, has had major changes to fix multiple errors. To clear up a few other points: the models are very robust, fault-tolerant codes as they come from production environments. They do create checkpoints, but only every ~5 mins or so depending on the computer speed. So if the task is frequently restarted because it's not kept in memory, it has to start all over again. Changing the compute time from 100% doesn't affect the risk of the task failing, it just takes longer. I keep mine at 80% to manage CPU temp. --- CPDN Visiting Scientist |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Depends which "code" you mean. The models themselves have had minor changes but nothing that would change the results. These are the wah2am and wah2rm processes. Unless you live somewhere tropical, I can't imagine CPU temperature ever being a problem. A decent cooler keeps it 20C under the max, and if it hits the max, the clock auto-throttles without CPDN tasks even knowing about it. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
I've just looked at the error rate for the 1001 & 1006 batches. Both are running the exact same forecasts, but 1001 uses the old WaH app and 1006 the new one. The error rate is 115% for batch 1001 and 3% for batch 1006. That's a big improvement. We just need to compare the results from the two batches to make sure they are consistent, but thanks to everyone for their patience with the repeated earlier failures. p.s. the 115% is because we count failed 'tasks' relative to 'workunits'; a workunit can be up to 3 tasks. --- CPDN Visiting Scientist |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
The error rate is 115% for batch 1001 and 3% for batch 1006. That's a big improvement... Indeed that is a massive improvement! Nice job! Hopefully the results all match and we can move forward with more reliable binaries! |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,020,584 RAC: 20,684 |
I have one of 1006 running and three waiting to run. My 1001 tasks seem to be doing about 0.4%/hour as opposed to 0.32%/hour for the 1006. Both are on tiny10 using a VM. I had expected the recompiled code to be faster. I will check again when current work completes in case I have typed something stupid into the calculator. Will also check seconds/time step as reported, averaged over a trickle. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Are you timing them by suspending the other tasks so only one is running? Need to quieten the machine to get a more reliable performance value. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Are you timing them by suspending the other tasks so only one is running? Need to quieten the machine to get a more reliable performance value. It's also important to know how well they run along with other tasks. No good one being fast if running three is slower. We know CPDN tasks hate hyperthreading for example. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,020,584 RAC: 20,684 |
No, three of 1001 running plus 4 HADAM4H tasks. I am not going to draw firm conclusions till I have data from the other three that are waiting to run. What I posted is quite early data based on the first 10 hours of the 1006 task. What I have seen, however, does mean I will keep an eye on it. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
I've got some 1001, 1005, and 1006 tasks, they're all reporting 0.360%/hr on my compute VMs (3900X systems, running 12 core VMs, fully loaded). So no observed differences here, though system load has varied slightly depending on the # of tasks running. |
Send message Joined: 28 Feb 05 Posts: 20 Credit: 11,176,194 RAC: 18,215 |
I have 3 1001 tasks still running. Should I terminate them? I don't want to waste resources if they are likely to fail. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
I have 3 1001 tasks still running. Should I terminate them? I don't want to waste resources if they are likely to fail. I've found 90% of failures are in the first 10 minutes. If you've had them running longer than that, they will probably be fine. |
Send message Joined: 22 Feb 11 Posts: 32 Credit: 226,546 RAC: 4,080 |
They also upload intermediate checkpoints to the server, so progress will not be lost if the workunit crashes later. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Can these half done workunits be handed out to others to complete? I've never been given a smaller workunit. Sounds like it should be possible. If my CPU can continue where it left off after rebooting the computer, why can't your CPU take my half done unit and continue it? |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
Can these half done workunits be handed out to others to complete? I've never been given a smaller workunit. Sounds like it should be possible. If my CPU can continue where it left off after rebooting the computer, why can't your CPU take my half done unit and continue it? Restarting after a reboot relies on the presence of checkpoint files on your local hard drive. CPDN uploads data zips (intermediate weather reports) 24 times during each task in the current batch, but only one 'restart' file at about the mid-point. Glenn can confirm the technicals, but I would guess that a short completion run would only be feasible if the original host had uploaded that restart file before failing. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Since I can shut down my computer after only doing say 15% of the task, then continue when I restart, it must be possible for you to continue my task if you had all the files. Are these files not also given to the server, so could be sent to you, to avoid you having to start from the beginning? Such a function would be useful for hosts which break, or the user aborts them for whatever reason, or they're just taking too long. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
It's not possible to restart a workunit on someone else's machine halfway, for the simple reason we can't be certain a different host would give the same end result as the first. Information about the host that ran the workunit is stored with the result. A failed workunit, for whatever reason, starts from the beginning on a new host every time. --- CPDN Visiting Scientist |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
CPDN will shortly be releasing another batch of WAH2 East Asia 25km (EAS25) using the new wah-ri app. --- CPDN Visiting Scientist |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
"New wah-ri" - still Windows only, but hopefully far less crashy? |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
The next batch is going out now. Wah2 v8.29 is the new app. It is currently deployed under the 'region independent' app, but it will eventually replace the normal WaH app (currently at v8.24) once batches using that version have finished or been closed. v8.29 is much more stable than the old v8.24: for batch 1006 it's showing 7% task fails and only 9 hard fails out of 6044 workunits so far (a 'hard fail' is when all 3 attempted tasks fail). That is considerably fewer than for the identical batch 1001, which shows 121% and 1346 respectively. The Linux version needs verifying against a Windows batch before we can deploy it to production. p.s. 30% of those fails for v8.29 are because the host antivirus software has quarantined the new exe and BOINC can't start it. --- CPDN Visiting Scientist |
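
For anyone puzzled by how a failure rate can go over 100%, here is a rough Python sketch of the arithmetic described in the thread: failed tasks are counted relative to workunits, and a workunit can be up to 3 tasks. The function names and the toy batch of 100 workunits are made up for illustration and are not CPDN's actual code or data. Only the 9 hard fails out of 6044 workunits figure comes from the post above.

```python
MAX_TASKS_PER_WORKUNIT = 3  # from the thread: a workunit can be up to 3 tasks


def task_fail_rate(failed_tasks: int, workunits: int) -> float:
    """Failed tasks as a percentage of workunits (hypothetical helper).

    Because each workunit may be attempted up to three times, this
    percentage can range from 0% up to 300%.
    """
    if workunits <= 0:
        raise ValueError("workunits must be positive")
    if not 0 <= failed_tasks <= MAX_TASKS_PER_WORKUNIT * workunits:
        raise ValueError("failed_tasks is outside the possible range")
    return 100.0 * failed_tasks / workunits


def hard_fail_rate(hard_fails: int, workunits: int) -> float:
    """Workunits where every attempt failed, as a percentage of workunits."""
    if workunits <= 0:
        raise ValueError("workunits must be positive")
    if not 0 <= hard_fails <= workunits:
        raise ValueError("hard_fails is outside the possible range")
    return 100.0 * hard_fails / workunits


# Toy numbers (assumed, not real batch data): 100 workunits, 115 failed tasks.
toy_rate = task_fail_rate(failed_tasks=115, workunits=100)

# Worst case for the same toy batch: all 3 attempts of every workunit fail.
worst_case = task_fail_rate(failed_tasks=300, workunits=100)

# Figures quoted in the thread for batch 1006: 9 hard fails in 6044 workunits.
batch_1006_hard = hard_fail_rate(hard_fails=9, workunits=6044)

print(f"toy task fail rate:   {toy_rate:.0f}%")
print(f"toy worst case:       {worst_case:.0f}%")
print(f"batch 1006 hard fail: {batch_1006_hard:.2f}%")
```

The hard fail rate is the one that can never exceed 100%, since it counts whole workunits rather than individual attempts.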
©2024 cpdn.org