New Work Announcements 2024

Message boards : Number crunching : New Work Announcements 2024

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,971,712
RAC: 21,921
Message 70362 - Posted: 13 Feb 2024, 9:42:26 UTC
Last modified: 13 Feb 2024, 10:05:58 UTC

Trello is playing up a bit. I just saw there that the code hasn't changed, only been recompiled; I managed to read that just before the thread disappeared, leaving an error message in its place. From what I recall reading, that should mean the code runs about 10% faster.

Edit: Trello boards seem to be behaving again now.
ID: 70362
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 70366 - Posted: 13 Feb 2024, 12:52:25 UTC - in response to Message 70362.  

Depends which "code" you mean. The models themselves have had minor changes but nothing that would change the results. These are the wah2am and wah2rm processes.

The monitor code, wah2_8.29, which interfaces the models to BOINC, has had major changes to fix multiple errors.

To clear up a few other points: the models are very robust, fault-tolerant codes, as they come from production environments. They do create checkpoints, but only every ~5 mins or so, depending on the computer speed. So if the task is frequently restarted because it's not kept in memory, it has to start all over again. Changing the compute time from 100% doesn't affect the risk of the task failing; the task just takes longer. I keep mine at 80% to manage CPU temp.
---
CPDN Visiting Scientist
ID: 70366
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70367 - Posted: 13 Feb 2024, 12:57:32 UTC - in response to Message 70366.  

Depends which "code" you mean. The models themselves have had minor changes but nothing that would change the results. These are the wah2am and wah2rm processes.

The monitor code, wah2_8.29, which interfaces the models to BOINC, has had major changes to fix multiple errors.

To clear up a few other points: the models are very robust, fault-tolerant codes, as they come from production environments. They do create checkpoints, but only every ~5 mins or so, depending on the computer speed. So if the task is frequently restarted because it's not kept in memory, it has to start all over again. Changing the compute time from 100% doesn't affect the risk of the task failing; the task just takes longer. I keep mine at 80% to manage CPU temp.
Unless you live somewhere tropical, I can't imagine CPU temperature ever being a problem. A decent cooler keeps it 20C under the max, and if it hits the max, the clock auto-throttles without CPDN tasks even knowing about it.
ID: 70367
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 70368 - Posted: 13 Feb 2024, 15:29:52 UTC

I've just looked at the error rate for the 1001 & 1006 batches. Both are running the exact same forecasts, but 1001 uses the old WaH app and 1006 the new one. The error rate is 115% for batch 1001 and 3% for batch 1006. That's a big improvement; we just need to compare the results from the two batches to make sure they are consistent. Thanks to everyone for their patience with the repeated earlier failures.

P.S. The 115% is because we count failed 'tasks' relative to 'workunits'; a workunit can have up to 3 tasks.
---
CPDN Visiting Scientist
ID: 70368
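
For anyone puzzled by an error rate above 100%, the arithmetic described in the post above can be sketched roughly as follows. This is only an illustration, assuming the rate is simply failed tasks divided by workunits; the function name and the example counts are hypothetical and are not CPDN's actual figures or code.

```python
MAX_TASKS_PER_WORKUNIT = 3  # from the post: a workunit can have up to 3 tasks


def task_failure_rate(failed_tasks: int, workunits: int) -> float:
    """Failed tasks as a percentage of workunits (not of tasks)."""
    if workunits <= 0:
        raise ValueError("workunits must be positive")
    return 100.0 * failed_tasks / workunits


# Hypothetical counts, chosen only so the rate comes out above 100%.
rate = task_failure_rate(failed_tasks=230, workunits=200)

# If every one of the (up to) 3 tasks per workunit failed, the rate would
# reach this ceiling rather than 100%.
upper_bound = 100.0 * MAX_TASKS_PER_WORKUNIT

print(f"failure rate: {rate:.0f}% (ceiling: {upper_bound:.0f}%)")
```

Because the denominator is workunits rather than tasks, the figure is closer to "failed attempts per workunit" than to a probability, which is why it can pass 100%.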
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70369 - Posted: 13 Feb 2024, 16:56:45 UTC - in response to Message 70368.  

The error rate is 115% for batch 1001 and 3% for batch 1006. That's a big improvement...


Indeed that is a massive improvement! Nice job! Hopefully the results all match and we can move forward with more reliable binaries!
ID: 70369
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,971,712
RAC: 21,921
Message 70370 - Posted: 13 Feb 2024, 17:34:56 UTC

I have one 1006 task running and three waiting to run. My 1001 tasks seem to be doing about 0.4%/hour, as opposed to 0.32%/hour for the 1006 task. Both are on tiny10 in a VM. I had expected the recompiled code to be faster. I will check again when the current work completes, in case I have typed something stupid into the calculator. I will also check the seconds per time step as reported, averaged over a trickle.
ID: 70370
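
To put the two progress rates quoted above into wall-clock terms, here is a rough sketch of the conversion. It assumes progress stays constant for the whole run, which may well not hold on a loaded machine; the function name is made up for illustration.

```python
def hours_to_complete(percent_per_hour: float) -> float:
    """Estimated total run time in hours, assuming a constant progress rate."""
    if percent_per_hour <= 0:
        raise ValueError("progress rate must be positive")
    return 100.0 / percent_per_hour


# Rates as reported in the post above (percent of the task per hour).
reported_rates = {"1001": 0.40, "1006": 0.32}

for batch, rate in reported_rates.items():
    print(f"batch {batch}: about {hours_to_complete(rate):.1f} hours in total")

# Relative difference in total run time between the two batches.
slowdown = hours_to_complete(0.32) / hours_to_complete(0.40)
print(f"1006 would take about {slowdown:.2f}x as long as 1001")
```

These are early numbers from one loaded host, so the ratio is only as reliable as the two rates fed into it.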
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 70371 - Posted: 13 Feb 2024, 20:21:12 UTC - in response to Message 70370.  

Are you timing them by suspending the other tasks so only one is running? Need to quieten the machine to get a more reliable performance value.
ID: 70371
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70372 - Posted: 13 Feb 2024, 20:28:40 UTC - in response to Message 70371.  
Last modified: 13 Feb 2024, 20:29:03 UTC

Are you timing them by suspending the other tasks so only one is running? Need to quieten the machine to get a more reliable performance value.
It's also important to know how well they run along with other tasks. No good one being fast if running three is slower. We know CPDN tasks hate hyperthreading for example.
ID: 70372
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,971,712
RAC: 21,921
Message 70373 - Posted: 13 Feb 2024, 21:45:18 UTC

No, three of 2001 running plus 4 HADAM4H tasks. I am not going to draw firm conclusions till I have data from the other three that are waiting to run. What I posted is quite early data based on the first 10 hours of the 1006 task. What I have seen however does mean I will keep an eye on it.
ID: 70373
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70374 - Posted: 13 Feb 2024, 22:43:18 UTC

I've got some 1001, 1005, and 1006 tasks; they're all reporting 0.360%/hr on my compute VMs (3900X systems, running 12-core VMs, fully loaded). So no observed differences here, though system load has varied slightly depending on the number of tasks running.
ID: 70374
Thomas McFarland

Joined: 28 Feb 05
Posts: 20
Credit: 11,118,168
RAC: 18,106
Message 70388 - Posted: 15 Feb 2024, 14:10:12 UTC - in response to Message 70369.  

I have 3 1001 tasks still running. Should I terminate them? I don't want to waste resources if they are likely to fail.
ID: 70388
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70390 - Posted: 15 Feb 2024, 14:50:41 UTC - in response to Message 70388.  

I have 3 1001 tasks still running. Should I terminate them? I don't want to waste resources if they are likely to fail.
I've found 90% of failures are in the first 10 minutes. If you've had them running longer than that, they will probably be fine.
ID: 70390
kotenok2000

Joined: 22 Feb 11
Posts: 32
Credit: 226,546
RAC: 4,080
Message 70391 - Posted: 15 Feb 2024, 14:53:32 UTC

They also upload intermediate checkpoints to the server, so progress will not be lost if the workunit crashes later.
ID: 70391
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70392 - Posted: 15 Feb 2024, 15:01:44 UTC - in response to Message 70391.  

Can these half done workunits be handed out to others to complete? I've never been given a smaller workunit. Sounds like it should be possible. If my CPU can continue where it left off after rebooting the computer, why can't your CPU take my half done unit and continue it?
ID: 70392
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,691,690
RAC: 10,582
Message 70394 - Posted: 15 Feb 2024, 15:50:07 UTC - in response to Message 70392.  

Can these half done workunits be handed out to others to complete? I've never been given a smaller workunit. Sounds like it should be possible. If my CPU can continue where it left off after rebooting the computer, why can't your CPU take my half done unit and continue it?
Restarting after a reboot relies on the presence of checkpoint files on your local hard drive.

CPDN uploads data zips (intermediate weather reports) 24 times during each task in the current batch, but only one 'restart' file at about the mid-point. Glenn can confirm the technicals, but I would guess that a short completion run would only be feasible if the original host had uploaded that restart file before failing.
ID: 70394
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70395 - Posted: 15 Feb 2024, 15:57:04 UTC - in response to Message 70394.  
Last modified: 15 Feb 2024, 15:58:46 UTC

Since I can shut down my computer after only doing say 15% of the task, then continue when I restart, it must be possible for you to continue my task if you had all the files. Are these files not also given to the server, so could be sent to you, to avoid you having to start from the beginning?

Such a function would be useful for hosts which break, or the user aborts them for whatever reason, or they're just taking too long.
ID: 70395
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 70398 - Posted: 15 Feb 2024, 16:13:46 UTC - in response to Message 70394.  
Last modified: 15 Feb 2024, 16:14:54 UTC

It's not possible to restart a workunit on someone else's machine halfway, for the simple reason we can't be certain a different host would give the same end result as the first. Information about the host that ran the workunit is stored with the result.
A failed workunit, for whatever reason, starts from the beginning on a new host every time.
---
CPDN Visiting Scientist
ID: 70398
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 70403 - Posted: 15 Feb 2024, 21:21:15 UTC

CPDN will shortly be releasing another batch of WAH2 East Asia 25km (EAS25) using the new wah-ri app.
---
CPDN Visiting Scientist
ID: 70403
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 70404 - Posted: 15 Feb 2024, 22:26:19 UTC

"New wah-ri" - still Windows only, but hopefully far less crashy?
ID: 70404
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 70405 - Posted: 15 Feb 2024, 22:50:27 UTC - in response to Message 70404.  
Last modified: 15 Feb 2024, 23:04:39 UTC

The next batch is going out now.

Wah2 v8.29 is the new app. It is currently deployed under the 'region independent' (wah-ri) app, but it will eventually replace the normal WaH app (currently at v8.24) once batches using that version have finished or been closed.

v8.29 is much more stable than the old v8.24; for batch 1006 it's showing 7% task fails and only 9 hard fails out of 6044 workunits so far (a 'hard fail' is when all 3 attempted tasks fail). That is considerably lower than for the identical batch 1001: 121% and 1346 respectively.

The Linux version needs verifying against a Windows batch before we can deploy it to production.

P.S. 30% of those fails for v8.29 are because the host's antivirus software has quarantined the new exe and BOINC can't start it.
---
CPDN Visiting Scientist
ID: 70405
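
As a rough illustration of the 'hard fail' definition in the post above (all 3 attempted tasks for a workunit fail), the check and the batch 1006 figure can be written out like this. The function and the list-of-booleans representation are hypothetical, not how the CPDN server records results; only the counts 9 and 6044 come from the post.

```python
MAX_ATTEMPTS = 3  # from the post: a hard fail is when all 3 attempted tasks fail


def is_hard_fail(attempt_succeeded: list[bool]) -> bool:
    """True only when all MAX_ATTEMPTS tasks have run and none succeeded."""
    return len(attempt_succeeded) == MAX_ATTEMPTS and not any(attempt_succeeded)


# Hypothetical workunit histories: one flag per attempted task.
print(is_hard_fail([False, False, False]))  # every attempt failed
print(is_hard_fail([False, True]))          # second attempt succeeded
print(is_hard_fail([False]))                # still has attempts left

# Batch 1006 figures quoted in the post: 9 hard fails out of 6044 workunits.
hard_fail_pct = 100.0 * 9 / 6044
print(f"hard-fail share for batch 1006: {hard_fail_pct:.2f}% of workunits")
```

Unlike the task failure rate, this share is counted per workunit, so it cannot exceed 100%.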

©2024 cpdn.org