WCG African Rainfall Project (ARP) restart update Apr 25, 2024
Joined: 5 Jun 09 · Posts: 97 · Credit: 3,736,855 · RAC: 4,073
> Their new project is running at 1km resolution. That's very high, 25km grid resolution

Is the "25km grid" a 5x5 square (25 square km), or a 25x25 square (625 square km)? If it's the first, then it's a fair-sized step; if the latter, then it's a really big step in resolution.
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331
Is the "25km grid" a 5x5 square (25 square km), or a 25x25 square (625 square km)?Yes, sorry, the WaH model grid is made up of 25km x 25km squares (roughly) (e.g. the East Asia batches). So the ARP grid is 1km x 1km squares. To run OpenIFS on a global 1km grid would take a top-end supercomputer. It has been done using the Oak Ridge Summit machine (see: https://www.ecmwf.int/en/about/media-centre/science-blog/2020/baseline-global-weather-and-climate-simulations-1-km) but the output was enormous. --- CPDN Visiting Scientist |
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,202,915 · RAC: 2,154
> That means much shorter task times, greater number of available workunits, and lower memory overhead.

Perhaps so. Here is what one of them is doing on my Linux machine. It runs six of these at a time. They tend to take about 11 hours to run; between 10 and 12 hours, I suppose, depending on what else the machine is doing. It is now cold outside, so I run 13 BOINC tasks at a time, if they are available. Each task uses about 0.6% of the 128 GBytes of RAM this machine has.

Application: Africa Rainfall Project 7.32
Name: ARP1_0034274_139
State: Running
Received: Mon 04 Nov 2024 02:56:52 AM EST
Report deadline: Sun 10 Nov 2024 02:56:51 AM EST
Estimated computation size: 211,701 GFLOPs
CPU time: 10:15:46
CPU time since checkpoint: 00:37:56
Elapsed time: 10:20:49
Estimated time remaining: 00:41:02
Fraction done: 94.083%
Virtual memory size: 815.74 MB
Working set size: 742.96 MB
Directory: slots/9
Process ID: 10579
Progress rate: 8.280% per hour
Executable: wcgrid_arp1_wrf_7.32_x86_64-pc-linux-gnu
Joined: 18 Jun 17 · Posts: 18 · Credit: 9,520,551 · RAC: 46,126
FYI, the WCG ARP project is being led by researchers at Delft University of Technology, and Krembil is hosting it, as you all already know. https://www.worldcommunitygrid.org/research/arp1/researchers.s

After the WCG migration, Krembil is not up to the task of supporting the networking infrastructure, servers, and manpower needed for this sub-project and also the other sub-projects (e.g. MCM). IBM would surely do better, but they decided to quit supporting WCG.

Perhaps someone here can postulate why the researchers decided on a very high resolution grid of 1km.
Joined: 15 May 09 · Posts: 4540 · Credit: 19,016,442 · RAC: 21,024
> Perhaps someone here can postulate why the researchers decided on a very high resolution grid of 1km.

My guess is that the researchers have concluded they need this high resolution to develop really accurate forecasts for the areas in question.

With regard to the infrastructure, my completed tasks after the first couple have all uploaded without any intervention via the "retry pending uploads" button. But longer term they do need a major upgrade of their infrastructure, and from CPDN we know that major changes to infrastructure can, at least initially, cause more problems than they solve!
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331
> Perhaps someone here can postulate why the researchers decided on a very high resolution grid of 1km.

The Delft project is studying rainfall over Africa. That's mainly convective rainfall: thunderstorms and mesoscale systems. So the 1km resolution is needed to resolve those features.

I can't connect to the WCG website at the moment - their network may be congested!

--- CPDN Visiting Scientist
Joined: 18 Jun 17 · Posts: 18 · Credit: 9,520,551 · RAC: 46,126
From Google, sub-Saharan Africa encompasses about 24M sq.km. Looking at the ARP stats here: https://www.worldcommunitygrid.org/stat/viewProjectsHistory.do?pageNum=1&numRecordsPerPage=365, on any of the past few days no more than 2,000 results were returned. Assuming one result is one sq.km of data (I honestly don't know, just a guess), it would take a very, very long time to finish crunching the sub-Saharan area. I'm not an expert in climate modeling, but perhaps they may skip grids and do interpolation?

If they can fix the network infrastructure issue and get 20,000 tasks processed per day like in 2021 (with IBM), it would take slightly more than 3 years to complete each 1km grid.
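A quick sanity check of that 3-year figure, under the same one-result-per-sq.km guess:

```python
# Rough check of the "slightly more than 3 years" estimate, assuming
# (as guessed above) that one returned result covers 1 sq.km.
area_sq_km = 24_000_000    # approx. sub-Saharan Africa
results_per_day = 20_000   # 2021-era throughput under IBM hosting

days = area_sq_km / results_per_day
print(f"{days:.0f} days ~= {days / 365:.1f} years")  # -> 1200 days ~= 3.3 years
```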
Joined: 14 Sep 08 · Posts: 127 · Credit: 41,749,041 · RAC: 63,360
So far I still have 60+ WUs stuck in downloading, and only one or two finished downloading yesterday. Most of them will probably time out before they even start. My buffer setting is just 0.2 days, and the work fetch already handed me more than the server can handle. They should probably rate-limit how many tasks each host is handed at a time, with a cool-down period like CPDN's.
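For what it's worth, the standard BOINC server already has scheduler options along those lines. A sketch of what a project might set in its config.xml (the option names are standard BOINC project options; the values here are purely illustrative, not anything WCG actually uses):

```xml
<!-- Hypothetical fragment of a BOINC project's config.xml. Option names
     are standard BOINC scheduler settings; values are illustrative only. -->
<boinc>
  <config>
    <max_wus_in_progress>2</max_wus_in_progress>       <!-- in-progress job limit, per CPU -->
    <daily_result_quota>10</daily_result_quota>        <!-- max jobs per host-CPU per day -->
    <min_sendwork_interval>600</min_sendwork_interval> <!-- seconds between sending work to a host -->
  </config>
</boinc>
```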
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331
> From google, sub-Saharan Africa encompasses about 24M sq.km. ... Assuming 1 result is one sq.km of data (I honestly don't know, just a guess), it would take a very very long time to finish crunching the sub-Saharan area. I'm not an expert in climate modeling but perhaps they may skip grids and do interpolation?

Not quite: 1km is the resolution of the individual cells over the whole grid (whatever their total area of interest is). Each model instance will be running not a single grid cell but a small area of grid cells. I don't know what that is without checking, but typically it might be 100 cells x 100 cells, or, say, 100km by 100km per model instance. Put another way, each result returned covers 100km x 100km in this example. So that gives 2,400 model instances running to cover the 24M sq.km. If they get 2,000 results per day, they are completing 48hrs of the forecast in just over a day (each model does 48hrs). That's doing ok.

--- CPDN Visiting Scientist
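Working those numbers through (the 100km x 100km sub-domain per instance is the illustrative figure from the post above, not a confirmed ARP parameter):

```python
# Domain decomposition arithmetic; the 100km x 100km sub-domain size
# is an illustrative assumption, not a confirmed ARP parameter.
total_area_sq_km = 24_000_000   # sub-Saharan Africa, approx.
subdomain_sq_km = 100 * 100     # area covered by one model instance

instances = total_area_sq_km / subdomain_sq_km
print(f"{instances:.0f} model instances")            # -> 2400

# At ~2,000 results/day, one full 48h-forecast sweep of the domain
# takes 2400 / 2000 = 1.2 days.
print(f"{instances / 2000:.1f} days per 48h sweep")  # -> 1.2
```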
Joined: 18 Jun 17 · Posts: 18 · Credit: 9,520,551 · RAC: 46,126
> Not quite: 1 km is the resolution of the individual cells over the whole grid (whatever their total area of interest is). [...] If they get 2,000 results per day they are completing 48hrs of the forecast in just over a day (each model does 48hrs). That's doing ok.

Thanks for sharing this.
Joined: 12 Apr 21 · Posts: 317 · Credit: 14,837,643 · RAC: 19,879
Downloads are still very problematic. Manual retries do help some. It seems like the way to do it is to run 1 or 2 tasks at a time, so that by the time those finish, the downloads for the next 1 or 2 should be able to complete.
Joined: 14 Sep 08 · Posts: 127 · Credit: 41,749,041 · RAC: 63,360
> Downloads are still very problematic. Manual retries does help some. Seems like the way to do it is to run 1 or 2 tasks at a time so that by the time those finish the downloads for the next 1 or 2 should be able to complete.

Well, at least my upload is stable at 10KB/s... Hopefully the transfer finishes before the deadline. 😂
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,202,915 · RAC: 2,154
> Well, at least my upload is stable at 10KB/s... Hopefully the transfer finishes before the deadline. 😂

My upload speed is about ten times yours most of the time. I am on a gigabit/second fiber-optic Internet link. I have seen greater than 1000 KB/s upload speeds once in a while, but those are very unusual and not long-lasting. Download speeds are seldom over 200 KB/s, and usually 30 to 40 KB/s. And it takes a lot of hand-holding to get the stuff downloaded.
Joined: 7 Aug 04 · Posts: 2187 · Credit: 64,822,615 · RAC: 5,275
Wow! I downloaded one task this morning in about 20 minutes, without clicking "Retry" on the transfers! A new record for this latest release of ARP tasks.
Joined: 12 Apr 21 · Posts: 317 · Credit: 14,837,643 · RAC: 19,879
> Wow! I downloaded one task this morning in about 20 minutes, without clicking "Retry" on the transfers! A new record for this latest release of ARP tasks.

And then things got even worse than before. At least from what I'm experiencing.
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331
I had an odd behaviour from the BOINC client (7.24.1, Linux). An ARP task was in the middle of downloading about 8 files; 4 completed ok but 4 timed out. Then the client switched the task to 'Download failed' instead of 'Downloading'. The log showed a checksum error on one of the files due to incorrect size.

What was odd was that I was still able to manually kick the remaining transfers to complete them all. Shouldn't the client have aborted the remaining task transfers after marking the task as 'download failed'?
Joined: 1 Jan 07 · Posts: 1061 · Credit: 36,706,621 · RAC: 9,524
> What was odd was I was still able to manually kick the remaining transfers to complete them all. The client didn't abort remaining task transfers after marking the task as 'download failed'?

It's probably down to a twenty-year-old unspoken assumption from the early design days of BOINC. I would guess that, with slower download speeds available and smaller local disks, they didn't think ahead to projects using multiple multi-megabyte downloads per task. And if they didn't think about the problem, they wouldn't have bothered to code the solution.

We've had a few of these legacy oversights retro-fitted over the years: it's worth raising an issue on GitHub.
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331
> > What was odd was I was still able to manually kick the remaining transfers to complete them all. The client didn't abort remaining task transfers after marking the task as 'download failed'?
>
> It's probably down to a twenty year old unspoken assumption, from the early design days of BOINC. I would guess that, with slower download speeds available and smaller local disks, they didn't think ahead to projects using multiple, multi-megabyte, downloads per task. And if they didn't think about the problem, they wouldn't have bothered to code the solution.

Will do, thanks Richard.
Joined: 15 May 09 · Posts: 4540 · Credit: 19,016,442 · RAC: 21,024
As the last of the resends of CPDN work finish and the number of ARP tasks running increases, I am building up a large backlog of uploads. These need to clear to avoid the "too many uploads in progress" message that explains why no more work is being sent. I think I will set WCG to "no new tasks" till they clear a bit.
Joined: 22 Feb 06 · Posts: 491 · Credit: 30,986,666 · RAC: 14,307
I've had similar problems with stuck uploads on WCG ARP results. I have also set "no new tasks" and am doing manual retries for the transfers. I'll stop once they have cleared.