climateprediction.net (CPDN) home page
Thread 'East Asia testing.'

Message boards : Number crunching : East Asia testing.

Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 68589 - Posted: 14 Mar 2023, 10:38:56 UTC

Too early to say if main site work will come from this. (My instinct is yes but don't hold your breath!) I am running four East Asia 25 km resolution tasks under WINE from the testing branch. Assuming they finish, they will take 48 days, which is why I say, "don't hold your breath!" I have also been warned that the region crossing the Himalayas may cause some of them to go unstable, so there may be more physics failures than normal.
ID: 68589 · Report as offensive     Reply Quote
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68590 - Posted: 14 Mar 2023, 13:33:37 UTC - in response to Message 68589.  
Last modified: 14 Mar 2023, 13:35:42 UTC

I might as well hold my breath because I am not getting any other work from CPDN these days. (Nor from WCG.) And this after raising my RAM. The disk space listed here is just the partition allocated solely to Boinc. So "Ready when you are, chief!" (Punch line from a joke-story.)

Memory 	125.34 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	479.26 GB
Measured floating point speed 	6.04 billion ops/sec
Measured integer speed 	24.55 billion ops/sec
Average upload rate 	146.64 KB/sec
Average download rate 	15542.13 KB/sec

ID: 68590 · Report as offensive     Reply Quote
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68591 - Posted: 14 Mar 2023, 15:09:10 UTC - in response to Message 68590.  

Cecil B. DeMille wanted to make a really great scene of Moses and the Israelites crossing the Red Sea, so with all his skills, and all the producers' money he arranged with On High to have the sea part for long enough to film the scene. Not only that, but he had three camera crews filming the event. One on each side of the Red Sea, and one on top of the hill nearby. He then got Moses and the Israelites (actors and extras) to cross, but he was a little late and the sea closed in and drowned all of them. He turned to his cameraman and asked if he got the scene, and the cameraman apologized because, for the first time in his career, he forgot to take the lens cap off the camera lens. No problem said DeMille; Harry on the other side will have it. He shouts across, but Harry was so upset he could hardly answer: for the first time in his career, he forgot to load film into his camera. Thank goodness, said DeMille, Fred on the hill will have it. Fred! Did you get that? and Fred shouted back, "Ready when you are, chief!"
ID: 68591 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 68592 - Posted: 15 Mar 2023, 7:11:38 UTC

Because the region covered is much bigger than the ANZ region, these tasks will be long if they get here. Currently looking like about 50 days on my Ryzen. Slower machines will be well over 2 months even if running 24/7.
ID: 68592 · Report as offensive     Reply Quote
AndreyOR

Send message
Joined: 12 Apr 21
Posts: 317
Credit: 14,837,643
RAC: 19,879
Message 68593 - Posted: 15 Mar 2023, 10:22:05 UTC - in response to Message 68592.  

Because the region covered is much bigger than the ANZ region, these tasks will be long if they get here. Currently looking like about 50 days on my Ryzen. Slower machines will be well over 2 months even if running 24/7.

Sounds like these are Windows Hadley models?
ID: 68593 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 68594 - Posted: 15 Mar 2023, 10:48:21 UTC - in response to Message 68593.  

Sounds like these are Windows Hadley models?
Yes, the four I am running from the testing branch are running under WINE.
ID: 68594 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 68596 - Posted: 17 Mar 2023, 16:39:14 UTC - in response to Message 68595.  

If these Windows work units are long running as you suggest, I hope there will be a mechanism in place on the server to ensure that everyone who wants some will get them, shared fairly and equally, instead of by greedy, selfish users who download dozens of work units, or more, and then can't complete them by the deadline.
I have no idea on that. My hope is that those in charge have learned from the effectiveness of the shorter deadlines given to OIFS tasks. Slower machines will I would guess take over three months to complete these tasks so setting the deadline at 3 months would stop some users from downloading them at all because they wouldn't finish in time.
ID: 68596 · Report as offensive     Reply Quote
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68597 - Posted: 17 Mar 2023, 17:03:26 UTC - in response to Message 68595.  

greedy, selfish users who download dozens of work units, or more, and then can't complete them by the deadline.


Just so I do not be considered greedy, I do not wish to hog an unfair number of tasks, but OTOH, I do want a large enough number of tasks on my machine to coast me over the dead spots.

Right now, both CPDN and WCG have extremely long dead spots. CPDN just does not have tasks ready, and I would not wish to have many weeks of those work units that used to have 1-year deadlines. In the old days, new tasks were always available, so I did not need a large input queue.
WCG is just plain down for extremely long intervals (months), and has not really been running right in over a year. I do not remember how long their tasks take, but some of them were 8 hours or so, and others less. Some project had some 8-day tasks, but I do not remember if it was one of the WCG ones or not.

For one project I am on, DENIS, it downloaded about 100 tasks all at once. But they run pretty quickly (about 70 minutes) so I have no trouble completing them on time (deadline is about three days). Then they have several days of no work. This is not actually a complaint.
For another project (Einstein), I get only half a dozen at most, and I can complete them on time too. They take longer to run (about 11 hours each).

As far as I know, the only way to get more tasks to download would be to set Options->Computing Preferences->Computing->Other to have higher Days of Work settings. What do other people consider fair settings for these? My settings are:
At least   0.5 days of work
Additional 1.0 days of work.
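
For anyone who prefers a file to the Manager dialog: as far as I know the same two values can be set locally in a global_prefs_override.xml file in the BOINC data directory, using the work_buf_min_days and work_buf_additional_days elements. The sketch below only builds the XML text for the 0.5 / 1.0 values quoted above; the function name is made up for illustration, and where the file lives depends on the installation.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(min_days: float, additional_days: float) -> str:
    """Return XML text for a work-buffer override (hypothetical helper)."""
    root = ET.Element("global_preferences")
    # "Store at least N days of work"
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:g}"
    # "Store up to an additional N days of work"
    ET.SubElement(root, "work_buf_additional_days").text = f"{additional_days:g}"
    return ET.tostring(root, encoding="unicode")


# Values taken from the post above: 0.5 days plus 1.0 additional days.
prefs_xml = build_prefs_override(0.5, 1.0)
print(prefs_xml)
```

The client would need to re-read its local preferences (or be restarted) before a file like this takes effect, and project-side limits on tasks per host still apply whatever the buffer says.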

ID: 68597 · Report as offensive     Reply Quote
geophi
Volunteer moderator

Send message
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 68598 - Posted: 17 Mar 2023, 19:58:55 UTC - in response to Message 68596.  

If these Windows work units are long running as you suggest, I hope there will be a mechanism in place on the server to ensure that everyone who wants some will get them, shared fairly and equally, instead of by greedy, selfish users who download dozens of work units, or more, and then can't complete them by the deadline.
I have no idea on that. My hope is that those in charge have learned from the effectiveness of the shorter deadlines given to OIFS tasks. Slower machines will I would guess take over three months to complete these tasks so setting the deadline at 3 months would stop some users from downloading them at all because they wouldn't finish in time.

If they do it the way they did for the nz25 batches, the development site spinups ran for 113 model months and took about 20 days on my i7-4790K. When they later sent out stash/ancil test nz25 batches to the dev site, they were for 25 model months and took less than a quarter of the time that the spinups did. The nz25 batches sent to the main cpdn site were also 25 model months. Just guessing but I don't think 119 model month batches will come to the main site.
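
A rough check of the "less than a quarter" figure, assuming run time scales linearly with the number of model months (an assumption, since start-up and output costs are ignored). The function name is made up for illustration; the 20 days, 113 months and 25 months are the nz25 numbers above.

```python
def scale_runtime(ref_days: float, ref_model_months: int, target_model_months: int) -> float:
    """Scale a reference run time linearly by model length (hypothetical helper)."""
    if ref_model_months <= 0:
        raise ValueError("reference model length must be positive")
    return ref_days * target_model_months / ref_model_months


# nz25 spinups: 113 model months in about 20 days on the i7-4790K.
nz25_short_days = scale_runtime(20.0, 113, 25)
print(f"25 model months: about {nz25_short_days:.1f} days")
```

That works out to a little under four and a half days, which is indeed below a quarter of the 20-day spinup time.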
ID: 68598 · Report as offensive     Reply Quote
AndreyOR

Send message
Joined: 12 Apr 21
Posts: 317
Credit: 14,837,643
RAC: 19,879
Message 68599 - Posted: 17 Mar 2023, 21:24:46 UTC
Last modified: 17 Mar 2023, 21:25:45 UTC

My hope is that those in charge have learned from the effectiveness of the shorter deadlines given to OIFS tasks.

While it was helpful to add a 30 day grace period in addition to the 30 day deadline for OIFS tasks during the storage outage, it seems to me that it should've been removed once things stabilized. There are still almost 200 PS tasks out and from what I remember, the contract deadline was supposed to be end of February. BL, and regular OIFS apps also still have 100-150 tasks out each although I'm not sure if 30 days have passed yet on those.
Just so I do not be considered greedy, I do not wish to hog an unfair number of tasks, but OTOH, I do want a large enough number of tasks on my machine to coast me over the dead spots.

I agree, since work availability is not constant or consistent I also like to have a large enough cache store but not too big so that I can still finish all work by the 30 day deadline. It's not always possible as there are limits based on one's consecutive valid tasks.
ID: 68599 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 68671 - Posted: 18 Apr 2023, 19:41:08 UTC

If they do it the way they did for the nz25 batches, the development site spinups ran for 113 model months and took about 20 days on my i7-4790K. When they later sent out stash/ancil test nz25 batches to the dev site, they were for 25 model months and took less than a quarter of the time that the spinups did. The nz25 batches sent to the main cpdn site were also 25 model months. Just guessing but I don't think 119 model month batches will come to the main site.


I missed this in a post from George about a month ago. When the four I have running get nearer completion I might ask what the plan is. I also missed noticing the long spinup NZ tasks or I might have twigged that main site tasks might well be considerably shorter. Thanks for the reminder/hint by PM, George.
ID: 68671 · Report as offensive     Reply Quote
Hans Sveen

Send message
Joined: 31 Aug 04
Posts: 5
Credit: 17,401,474
RAC: 5,243
Message 68922 - Posted: 23 Jun 2023, 11:38:45 UTC
Last modified: 23 Jun 2023, 11:43:48 UTC

Hello!
Now that the testing is over and the "real" work starts, I have got some WUs from Batch #994
and alas, so far several have errored out with Segment violation (Signal 11), and when trying to upload the result: "upload failure: <file_xfer_error>"
"<error_code>-240 (stat() failed)</error_code>"
Is it my computers or the units that have some serious errors?

Example: https://www.cpdn.org/workunit.php?wuid=12216822
Hans Sveen
Oslo,Norway

ID: 68922 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 68923 - Posted: 23 Jun 2023, 12:25:01 UTC - in response to Message 68922.  
Last modified: 23 Jun 2023, 14:17:17 UTC

The file transfer error is just because the model crashed before the zip files could be created. One of the other moderators had the ones on his computer all crash, though at least one was due to a power outage while the task was running. I have one running so far and it is about two hours in. It is really too early to say if there are serious issues with these tasks or not yet. The ones of yours I looked at have all gone out again. I suspect the crashes you are experiencing, which are all in the same ballpark figure of CPU time, are happening at the end of the first model day, and as your machine has produced pretty consistent results in the past, it is probably something to do with the model. If it is, we can expect a flurry of reports over the next day or two. The region being examined includes the Himalayas and the researcher did say when the testing tasks went out that it was possible that some of the tasks might be pushing the limits of the model.

Edit: In about three quarters of an hour, I will be able to get some more work. (The one task I have running is under WINE in a Linux VM.) When the WCG work I have running finishes, I will shut down the Linux client and get some tasks under WINE in the host machine. That will give me a chance to see more than just one task at a time.

Edit2: A further four tasks downloaded and have gotten past the point at which yours crashed (all over one hour in). However, the sample size is still too small to draw any conclusions.

How many tasks are actually running at once? Is anything else using a lot of memory? Those are two things that potentially can cause problems.
ID: 68923 · Report as offensive     Reply Quote
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68924 - Posted: 23 Jun 2023, 14:59:11 UTC

I just got a WAH2 task on my Windows 10 machine. It failed after 3 minutes 31 seconds.

Also, sending the out.zip failure message is having trouble going up. I hit retry and it wants to wait. So I will wait.
ID: 68924 · Report as offensive     Reply Quote
geophi
Volunteer moderator

Send message
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 68927 - Posted: 23 Jun 2023, 16:28:36 UTC - in response to Message 68923.  

Looks like the regional model portion of the task takes 400+ MB of resident memory while the global portion of each task takes about 200 MB, so 600 to 700 MB total resident memory for each task.

My Ryzen 5600 is running 3 at a time and it works out to about 8.5 days to complete them.
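
A back-of-envelope way to turn the per-task figure into a task limit, assuming the upper 700 MB estimate and counting only resident memory (both assumptions; peak usage and other programs are ignored). The function name and the 8 GB example are made up for illustration.

```python
def max_tasks_by_memory(available_mb: float, per_task_mb: float = 700.0) -> int:
    """Return how many tasks fit in the given memory (hypothetical helper)."""
    if per_task_mb <= 0:
        raise ValueError("per-task memory must be positive")
    return int(available_mb // per_task_mb)


# Example: a machine with 8 GB (8192 MB) free for BOINC.
tasks_at_700 = max_tasks_by_memory(8192)
tasks_at_600 = max_tasks_by_memory(8192, per_task_mb=600.0)
print(tasks_at_700, tasks_at_600)
```

On most machines the core count will be the tighter limit, but the check is worth doing on hosts where other work is already using a lot of RAM.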
ID: 68927 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68928 - Posted: 23 Jun 2023, 17:20:29 UTC
Last modified: 23 Jun 2023, 17:24:34 UTC

Yep, I had 4 WaH tasks and all crashed with a segV. Unfortunately they disappeared too quickly for me to see the detailed logs but it was the model that crashed and not the wrapper process.

Maybe it's a Win11 issue, AFAIK the model has not been recompiled for quite some time on Windows.

Update: checking the WU I see the previous host also failed the task and was running Win10. Unlike OIFS there's very little returned to the task page, making it very difficult to diagnose the problem.
---
CPDN Visiting Scientist
ID: 68928 · Report as offensive     Reply Quote
DadX

Send message
Joined: 30 Aug 06
Posts: 27
Credit: 1,879,577
RAC: 1,213
Message 68929 - Posted: 23 Jun 2023, 17:41:36 UTC

I had 2 wah2_eas25_a3ny_201511_25_994_012220188 tasks and both crashed after 2 minutes. It's having trouble uploading them.

Regards,
DadX
ID: 68929 · Report as offensive     Reply Quote
DadX

Send message
Joined: 30 Aug 06
Posts: 27
Credit: 1,879,577
RAC: 1,213
Message 68930 - Posted: 23 Jun 2023, 17:43:56 UTC - in response to Message 68929.  

Correction
I had two tasks that failed wah2_eas25_a3ny_201511_25_994_012220188 and wah2_eas25_a1cm_199611_25_994_012217188 after 2 minutes.
Regards,
DadX
ID: 68930 · Report as offensive     Reply Quote
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68931 - Posted: 23 Jun 2023, 18:05:30 UTC - in response to Message 68924.  

I am still waiting for the out.zip failure message file to upload.

I checked my Windows 10 machine hardware. I do not fully understand what these mean -- especially why the first two are different.

It has
16.0 GBytes Installed
15.6 GBytes Total
8.78 GBytes Available
18.0 GBytes Virtual
10.1 GBytes Available Virtual

The computer is running 6 Boinc tasks (no CPDN at the moment). The Boinc-client allows 7 tasks to run, but the various app_config files are limiting them to only six. The one for CPDN allows only one of those to run at a time.
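
For anyone wanting a similar limit, here is a minimal sketch of an app_config.xml that caps a project at one running task. It uses the project_max_concurrent element, which is one way to do it; the actual file on the machine above is not shown in the post and may use per-app max_concurrent entries instead. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent: int) -> str:
    """Return app_config.xml text limiting a project's running tasks (hypothetical helper)."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    root = ET.Element("app_config")
    # Limit applies across all applications of the project.
    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


cpdn_config = build_app_config(1)
print(cpdn_config)
```

The file goes in that project's folder under the BOINC data directory, and the client has to re-read its config files before the limit applies.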
ID: 68931 · Report as offensive     Reply Quote
wateroakley

Send message
Joined: 6 Aug 04
Posts: 195
Credit: 28,342,480
RAC: 10,485
Message 68932 - Posted: 23 Jun 2023, 18:12:06 UTC

Five tasks downloaded on this WIN 11 PC. Four crashed almost immediately with the event log complaining about missing output files. The four took a little while to report and std.err complains about Segment violations. Number 5 task is ticking along nicely. After four hours the task is 2.1 per cent through, giving a run-time estimate of 189 hours or about 8 days.
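
The run-time estimate is just elapsed time divided by the fraction done. A small sketch, assuming progress is linear over the whole run (early percentages can be misleading); the function name is made up for illustration. With exactly four hours at 2.1 per cent it gives about 190 hours, in line with the 189 quoted, so the elapsed time was presumably a touch under four hours.

```python
def estimate_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Extrapolate total run time from progress so far (hypothetical helper)."""
    if percent_done <= 0:
        raise ValueError("percent_done must be greater than zero")
    return elapsed_hours * 100.0 / percent_done


total_hours = estimate_total_hours(4.0, 2.1)
print(f"about {total_hours:.0f} hours, or {total_hours / 24:.1f} days")
```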
ID: 68932 · Report as offensive     Reply Quote