climateprediction.net (CPDN) home page
Thread 'WCG African Rainfall Project (ARP) restart update Apr 25, 2024'

rob

Joined: 5 Jun 09
Posts: 97
Credit: 3,736,855
RAC: 4,073
Message 71835 - Posted: 5 Nov 2024, 16:54:43 UTC - in response to Message 71834.  

Their new project is running at 1km resolution. That's very high, 25km grid resolution


Is the "25km grid" a 5x5 square (25 square km), or a 25x25 square (625 square km)?
If it's the first then it's a fair size step, if the latter then it's a really big step in resolution.
ID: 71835
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71836 - Posted: 5 Nov 2024, 17:13:22 UTC - in response to Message 71835.  

Is the "25km grid" a 5x5 square (25 square km), or a 25x25 square (625 square km)?
If it's the first then it's a fair size step, if the latter then it's a really big step in resolution.
Yes, sorry, the WaH model grid is made up of 25km x 25km squares (roughly) (e.g. the East Asia batches). So the ARP grid is 1km x 1km squares.
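
As a rough check on the size of that step, the two cell sizes can be compared directly. This is only an arithmetic sketch using the approximate 25km and 1km figures quoted in this thread; the variable names are made up for illustration.

```python
# Approximate grid cell sizes quoted in this thread (side length in km).
wah_cell_km = 25   # WaH grid cell, roughly 25km x 25km
arp_cell_km = 1    # ARP grid cell, 1km x 1km

# Area of one WaH cell in sq.km, and how many ARP cells fit inside it.
wah_cell_area = wah_cell_km ** 2
arp_cells_per_wah_cell = wah_cell_area // arp_cell_km ** 2

print(f"One WaH cell covers {wah_cell_area} sq.km")
print(f"ARP cells per WaH cell: {arp_cells_per_wah_cell}")
```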

To run OpenIFS on a global 1km grid would take a top-end supercomputer. It has been done using the Oak Ridge Summit machine (see: https://www.ecmwf.int/en/about/media-centre/science-blog/2020/baseline-global-weather-and-climate-simulations-1-km) but the output was enormous.
---
CPDN Visiting Scientist
ID: 71836
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71837 - Posted: 5 Nov 2024, 19:53:29 UTC - in response to Message 71834.  
Last modified: 5 Nov 2024, 19:55:53 UTC

That means much shorter task times, greater number of available workunits, and lower memory overhead.


Perhaps so. Here is what one of them is doing on my Linux machine. It runs six of these at a time. They tend to take about 11 hours to run; between 10 and 12 hours, I suppose. It depends on what else the machine is doing. It is now cold outside, so I run 13 BOINC tasks at a time, if they are available. Each task uses about 0.6% of the 128 GBytes of RAM this machine has:
Application                 Africa Rainfall Project 7.32
Name                        ARP1_0034274_139
State                       Running
Received                    Mon 04 Nov 2024 02:56:52 AM EST
Report deadline             Sun 10 Nov 2024 02:56:51 AM EST
Estimated computation size  211,701 GFLOPs
CPU time                    10:15:46
CPU time since checkpoint   00:37:56
Elapsed time                10:20:49
Estimated time remaining    00:41:02
Fraction done               94.083%
Virtual memory size         815.74 MB
Working set size            742.96 MB
Directory                   slots/9
Process ID                  10579
Progress rate               8.280% per hour
Executable                  wcgrid_arp1_wrf_7.32_x86_64-pc-linux-gnu
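
The figures in the listing can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the listing and the post; it assumes 1 GByte = 1024 MB and that the task keeps progressing at its average rate so far, which is a simplification.

```python
# Figures copied from the task listing above.
total_ram_gb = 128                      # RAM in the machine, per the post
working_set_mb = 742.96                 # "Working set size"
fraction_done = 0.94083                 # "Fraction done"
elapsed_h = 10 + 20 / 60 + 49 / 3600    # "Elapsed time" 10:20:49

# Share of RAM used by one task (assumption: 1 GByte = 1024 MB).
ram_share = working_set_mb / (total_ram_gb * 1024)

# Projected total and remaining run time, assuming the average
# progress rate so far continues to the end of the task.
projected_total_h = elapsed_h / fraction_done
remaining_min = (projected_total_h - elapsed_h) * 60

print(f"RAM share per task:       {ram_share:.2%}")
print(f"Projected total run time: {projected_total_h:.1f} h")
print(f"Projected time remaining: {remaining_min:.0f} min")
```

The results are in line with the "about 0.6%" and "about 11 hours" quoted in the post.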

ID: 71837
pututu

Joined: 18 Jun 17
Posts: 18
Credit: 9,522,208
RAC: 46,093
Message 71839 - Posted: 6 Nov 2024, 4:17:16 UTC
Last modified: 6 Nov 2024, 4:25:12 UTC

FYI, the WCG ARP project is being led by researchers at Delft University of Technology, and Krembil is hosting it, as you all already know.
https://www.worldcommunitygrid.org/research/arp1/researchers.s

After the WCG migration, Krembil is not up to the task of supporting the networking infrastructure, servers and manpower needed for this sub-project and also other sub-projects (e.g. MCM). IBM would surely do better but they decided to quit supporting WCG.

Perhaps someone here can postulate why the researchers decided on a very high resolution grid of 1km.
ID: 71839
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 71841 - Posted: 6 Nov 2024, 8:13:23 UTC - in response to Message 71839.  

Perhaps someone here can postulate why the researchers decided on a very high resolution grid of 1km.
My guess is that the researchers have concluded they need this high resolution to develop really accurate forecasts for the areas in question.

With regard to the infrastructure, my completed tasks after the first couple have all uploaded without any intervention via the "Retry pending uploads" button. But longer term they do need a major upgrade of their infrastructure, and from CPDN we know that major changes to infrastructure can, at least initially, cause more problems than they solve!
ID: 71841
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71842 - Posted: 6 Nov 2024, 9:34:22 UTC - in response to Message 71839.  
Last modified: 6 Nov 2024, 9:37:04 UTC

Perhaps someone here can postulate why the researchers decided on a very high resolution grid of 1km.
The Delft project is studying rainfall over Africa. That's mainly convective rainfall; thunderstorms and mesoscale systems. So the 1km resolution is needed to resolve those features.

I can't connect to the WCG website at the moment - their network may be congested!
---
CPDN Visiting Scientist
ID: 71842
pututu

Joined: 18 Jun 17
Posts: 18
Credit: 9,522,208
RAC: 46,093
Message 71847 - Posted: 6 Nov 2024, 18:26:23 UTC
Last modified: 6 Nov 2024, 18:32:07 UTC

From Google, sub-Saharan Africa encompasses about 24M sq.km. Looking at the ARP stats here: https://www.worldcommunitygrid.org/stat/viewProjectsHistory.do?pageNum=1&numRecordsPerPage=365, on any of the past few days, not more than 2000 results were returned. Assuming 1 result is one sq.km of data (I honestly don't know, just a guess), it would take a very, very long time to finish crunching the sub-Saharan area. I'm not an expert in climate modeling, but perhaps they may skip grids and do interpolation?

If they can fix the network infrastructure issue and get 20,000 tasks processed per day, as in 2021 (with IBM), it would still take slightly more than 3 years to cover every 1 km grid cell (24M sq.km at 20,000 per day is 1,200 days).
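
For anyone who wants to redo that estimate, here is a minimal sketch of the arithmetic. It rests entirely on the guess stated above, that one returned result covers one square kilometre; that guess is not confirmed anywhere, and the variable names are just for illustration.

```python
# Inputs quoted in the post; the 1 sq.km per result figure is a guess.
area_sq_km = 24_000_000        # sub-Saharan Africa, approximate
sq_km_per_result = 1           # ASSUMPTION: one result = one sq.km
results_per_day_now = 2_000    # roughly what is being returned now
results_per_day_2021 = 20_000  # throughput quoted for 2021 (with IBM)

def years_to_cover(results_per_day: float) -> float:
    """Years needed to cover the whole area once at a given throughput."""
    days = area_sq_km / (sq_km_per_result * results_per_day)
    return days / 365

days_at_2021_rate = area_sq_km / (sq_km_per_result * results_per_day_2021)
years_at_2021_rate = years_to_cover(results_per_day_2021)
years_at_current_rate = years_to_cover(results_per_day_now)

print(f"At 20,000/day: {days_at_2021_rate:.0f} days, {years_at_2021_rate:.1f} years")
print(f"At  2,000/day: {years_at_current_rate:.1f} years")
```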
ID: 71847
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,749,041
RAC: 63,360
Message 71848 - Posted: 6 Nov 2024, 19:34:28 UTC - in response to Message 71847.  
Last modified: 6 Nov 2024, 19:36:45 UTC

So far I still have 60+ WUs stuck downloading, and only one or two finished downloading yesterday. Most of them will probably time out before they even start. My buffer setting is just 0.2 days, and the work fetch has already handed me more than the server can deliver. They should probably rate-limit how many tasks each host is handed at a time, with a cool-down period, as CPDN does.
ID: 71848
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71849 - Posted: 6 Nov 2024, 21:17:50 UTC - in response to Message 71847.  
Last modified: 6 Nov 2024, 21:20:20 UTC

From Google, sub-Saharan Africa encompasses about 24M sq.km. Looking at the ARP stats here: https://www.worldcommunitygrid.org/stat/viewProjectsHistory.do?pageNum=1&numRecordsPerPage=365, on any of the past few days, not more than 2000 results were returned. Assuming 1 result is one sq.km of data (I honestly don't know, just a guess), it would take a very, very long time to finish crunching the sub-Saharan area. I'm not an expert in climate modeling, but perhaps they may skip grids and do interpolation?
Not quite: 1 km is the resolution of the individual cells over the whole grid (whatever their total area of interest is). Each model instance will be running not a single grid cell but a small area of grid cells. I don't know what that is without checking, but typically it might be 100 cells x 100 cells, or, say, 100km by 100km per model instance. Put another way, each result returned covers 100km x 100km in this example. So that gives 2,400 model instances running to cover the 24M sq.km. If they get 2,000 results per day they are completing 48hrs of the forecast in just over a day (each model does 48hrs). That's doing ok.
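
The numbers in this example can be reproduced with a few lines. This is a sketch only: the 100km x 100km tile is the illustrative value used above, not the actual ARP domain size, and the variable names are hypothetical.

```python
# Figures from the example above.
area_sq_km = 24_000_000     # sub-Saharan Africa, approximate
tile_side_km = 100          # ILLUSTRATIVE tile size, not the real ARP value
results_per_day = 2_000     # results currently returned per day
forecast_hours_per_result = 48

# Number of model instances (tiles) needed to cover the area once.
tiles = area_sq_km // tile_side_km ** 2

# Wall-clock days needed to get one result back for every tile.
days_per_pass = tiles / results_per_day

# Forecast hours completed over the whole area per wall-clock day.
forecast_hours_per_day = forecast_hours_per_result / days_per_pass

print(f"Model instances needed:  {tiles}")
print(f"Days per 48hr pass:      {days_per_pass:.1f}")
print(f"Forecast hours per day:  {forecast_hours_per_day:.0f}")
```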
---
CPDN Visiting Scientist
ID: 71849
pututu

Joined: 18 Jun 17
Posts: 18
Credit: 9,522,208
RAC: 46,093
Message 71850 - Posted: 6 Nov 2024, 22:32:01 UTC - in response to Message 71849.  

Not quite: 1 km is the resolution of the individual cells over the whole grid (whatever their total area of interest is). Each model instance will be running not a single grid cell but a small area of grid cells. I don't know what that is without checking, but typically it might be 100 cells x 100 cells, or, say, 100km by 100km per model instance. Put another way, each result returned covers 100km x 100km in this example. So that gives 2,400 model instances running to cover the 24M sq.km. If they get 2,000 results per day they are completing 48hrs of the forecast in just over a day (each model does 48hrs). That's doing ok.


Thanks for sharing this.
ID: 71850
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,837,643
RAC: 19,879
Message 71851 - Posted: 8 Nov 2024, 22:19:55 UTC

Downloads are still very problematic. Manual retries do help some. Seems like the way to do it is to run 1 or 2 tasks at a time so that by the time those finish the downloads for the next 1 or 2 should be able to complete.
ID: 71851
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,749,041
RAC: 63,360
Message 71852 - Posted: 9 Nov 2024, 17:37:30 UTC - in response to Message 71851.  
Last modified: 9 Nov 2024, 17:37:52 UTC

Downloads are still very problematic. Manual retries do help some. Seems like the way to do it is to run 1 or 2 tasks at a time so that by the time those finish the downloads for the next 1 or 2 should be able to complete.

Well, at least my upload is stable at 10KB/s... Hopefully the transfer finishes before the deadline. 😂
ID: 71852
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71853 - Posted: 9 Nov 2024, 20:17:02 UTC - in response to Message 71852.  

Well, at least my upload is stable at 10KB/s... Hopefully the transfer finishes before the deadline. 😂


My upload speed is about ten times yours most of the time. I am on a gigabit/second fiber optic Internet link. I have seen upload speeds greater than 1000 KB/s once in a while, but those are very unusual and not long-lasting.

Download speeds are seldom over 200 KB/s, and usually 30 to 40 KB/s.

And it takes a lot of hand-holding to get the stuff downloaded.
ID: 71853
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 71856 - Posted: 10 Nov 2024, 16:26:38 UTC

Wow! I downloaded one task this morning in about 20 minutes, without clicking "Retry" on the transfers! A new record for this latest release of ARP tasks.
ID: 71856
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,837,643
RAC: 19,879
Message 71857 - Posted: 11 Nov 2024, 0:30:04 UTC - in response to Message 71856.  

Wow! I downloaded one task this morning in about 20 minutes, without clicking "Retry" on the transfers! A new record for this latest release of ARP tasks.

And then things went to being even worse than before. At least from what I'm experiencing.
ID: 71857
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71860 - Posted: 12 Nov 2024, 13:20:42 UTC
Last modified: 12 Nov 2024, 13:21:10 UTC

I saw some odd behaviour from the BOINC client (7.24.1, Linux). An ARP task was in the middle of downloading about 8 files; 4 completed OK but 4 timed out. Then the client switched the task to 'Download failed' instead of 'Downloading'. The log showed a checksum error on one of the files due to incorrect size.

What was odd was that I was still able to manually kick the remaining transfers to complete them all. The client didn't abort the remaining task transfers after marking the task as 'Download failed'?
ID: 71860
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,706,621
RAC: 9,524
Message 71861 - Posted: 12 Nov 2024, 14:45:51 UTC - in response to Message 71860.  

What was odd was that I was still able to manually kick the remaining transfers to complete them all. The client didn't abort the remaining task transfers after marking the task as 'Download failed'?
It's probably down to a twenty-year-old unspoken assumption from the early design days of BOINC. I would guess that, with slower download speeds available and smaller local disks, they didn't think ahead to projects using multiple multi-megabyte downloads per task. And if they didn't think about the problem, they wouldn't have bothered to code the solution.

We've had fixes for a few of these legacy oversights retro-fitted over the years: it's worth raising an issue on GitHub.
ID: 71861
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71862 - Posted: 12 Nov 2024, 16:17:35 UTC - in response to Message 71861.  

What was odd was that I was still able to manually kick the remaining transfers to complete them all. The client didn't abort the remaining task transfers after marking the task as 'Download failed'?
It's probably down to a twenty-year-old unspoken assumption from the early design days of BOINC. I would guess that, with slower download speeds available and smaller local disks, they didn't think ahead to projects using multiple multi-megabyte downloads per task. And if they didn't think about the problem, they wouldn't have bothered to code the solution.

We've had fixes for a few of these legacy oversights retro-fitted over the years: it's worth raising an issue on GitHub.
Will do, thanks Richard.
ID: 71862
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 71870 - Posted: 19 Nov 2024, 18:44:23 UTC

As the last of the CPDN resends finish and the number of ARP tasks running increases, I am building up a large backlog of uploads. These need to clear so that I don't get the "too many uploads in progress" message explaining why no more work is being sent. I think I will set WCG to no new tasks till they clear a bit.
ID: 71870
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 30,986,666
RAC: 14,307
Message 71871 - Posted: 19 Nov 2024, 23:15:52 UTC - in response to Message 71870.  

I've had similar problems with stuck uploads of WCG ARP results. I have also set it to no new tasks and am doing manual retries for the transfers. Will give up when they have cleared.
ID: 71871

©2024 cpdn.org