WCG African Rainfall Project (ARP) restart update Apr 25, 2024
Joined: 5 Jun 09 · Posts: 97 · Credit: 3,736,855 · RAC: 4,073
> Their new project is running at 1km resolution. That's very high, 25km grid resolution

Is the "25km grid" a 5x5 square (25 square km), or a 25x25 square (625 square km)? If it's the first, then it's a fair-sized step; if the latter, then it's a really big step in resolution.
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331
Is the "25km grid" a 5x5 square (25 square km), or a 25x25 square (625 square km)?Yes, sorry, the WaH model grid is made up of 25km x 25km squares (roughly) (e.g. the East Asia batches). So the ARP grid is 1km x 1km squares. To run OpenIFS on a global 1km grid would take a top-end supercomputer. It has been done using the Oak Ridge Summit machine (see: https://www.ecmwf.int/en/about/media-centre/science-blog/2020/baseline-global-weather-and-climate-simulations-1-km) but the output was enormous. --- CPDN Visiting Scientist |
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,202,915 · RAC: 2,154
> That means much shorter task times, greater number of available workunits, and lower memory overhead.

Perhaps so. Here is what one of them is doing on my Linux machine. It runs six of these at a time. They tend to take about 11 hours to run; between 10 and 12 hours, I suppose, depending on what else the machine is doing. It is now cold outside, so I run 13 BOINC tasks at a time, if they are available. Each task uses about 0.6% of the 128 GBytes of RAM this machine has.

Application: Africa Rainfall Project 7.32
Name: ARP1_0034274_139
State: Running
Received: Mon 04 Nov 2024 02:56:52 AM EST
Report deadline: Sun 10 Nov 2024 02:56:51 AM EST
Estimated computation size: 211,701 GFLOPs
CPU time: 10:15:46
CPU time since checkpoint: 00:37:56
Elapsed time: 10:20:49
Estimated time remaining: 00:41:02
Fraction done: 94.083%
Virtual memory size: 815.74 MB
Working set size: 742.96 MB
Directory: slots/9
Process ID: 10579
Progress rate: 8.280% per hour
Executable: wcgrid_arp1_wrf_7.32_x86_64-pc-linux-gnu
Joined: 18 Jun 17 · Posts: 18 · Credit: 9,520,551 · RAC: 46,126
FYI, the WCG ARP project is being led by researchers at Delft University of Technology, and Krembil is hosting it, as you all already know. https://www.worldcommunitygrid.org/research/arp1/researchers.s

After the WCG migration, Krembil is not up to the task of supporting the networking infrastructure, servers, and manpower needed for this sub-project and also the other sub-projects (e.g. MCM). IBM would surely do better, but they decided to quit supporting WCG.

Perhaps someone here can postulate why the researchers decided on a very high resolution grid of 1km.
Joined: 15 May 09 · Posts: 4540 · Credit: 19,016,442 · RAC: 21,024
> Perhaps someone here can postulate why the researchers decided on a very high resolution grid of 1km.

My guess is that the researchers have concluded they need this high resolution to develop really accurate forecasts for the areas in question.

With regard to the infrastructure, my completed tasks after the first couple have all uploaded without any intervention via the "retry pending uploads" button. But longer term they do need a major upgrade of their infrastructure, and from CPDN we know that major changes to infrastructure can, at least initially, cause more problems than they solve!
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331
> Perhaps someone here can postulate why the researchers decided on a very high resolution grid of 1km.

The Delft project is studying rainfall over Africa. That's mainly convective rainfall: thunderstorms and mesoscale systems. So the 1km resolution is needed to resolve those features.

I can't connect to the WCG website at the moment - their network may be congested!

--- CPDN Visiting Scientist
Joined: 18 Jun 17 · Posts: 18 · Credit: 9,520,551 · RAC: 46,126
From Google, sub-Saharan Africa encompasses about 24M sq.km. Looking at the ARP stats here: https://www.worldcommunitygrid.org/stat/viewProjectsHistory.do?pageNum=1&numRecordsPerPage=365, on any of the past few days no more than 2,000 results were returned. Assuming one result is one sq.km of data (I honestly don't know, just a guess), it would take a very, very long time to finish crunching the sub-Saharan area. I'm not an expert in climate modeling, but perhaps they may skip grids and do interpolation?

If they can fix the network infrastructure issue and get 20,000 tasks processed per day like in 2021 (with IBM), it would take slightly more than 3 years to complete each 1km grid.
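A quick sanity check of that 3-year figure, under the same one-result-per-sq.km guess:

```python
# Rough check of the "slightly more than 3 years" estimate, assuming
# (as guessed above) that one returned result covers 1 sq.km.
area_sq_km = 24_000_000    # approx. sub-Saharan Africa
results_per_day = 20_000   # 2021-era throughput under IBM hosting

days = area_sq_km / results_per_day
print(f"{days:.0f} days ~= {days / 365:.1f} years")  # -> 1200 days ~= 3.3 years
```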
Joined: 14 Sep 08 · Posts: 127 · Credit: 41,749,041 · RAC: 63,360
So far I still have 60+ WUs stuck in downloading, and only one or two finished downloading yesterday. Most of them will probably time out before they even start. My buffer setting is just 0.2 days, and the work fetch already handed me more than the server can handle. They should probably rate-limit how many tasks each host is handed at a time, with a cool-down period like CPDN's.
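For what it's worth, the standard BOINC server already has scheduler options along those lines. A sketch of what a project might set in its config.xml (the option names are standard BOINC project options; the values here are purely illustrative, not anything WCG actually uses):

```xml
<!-- Hypothetical fragment of a BOINC project's config.xml. Option names
     are standard BOINC scheduler settings; values are illustrative only. -->
<boinc>
  <config>
    <max_wus_in_progress>2</max_wus_in_progress>       <!-- in-progress job limit, per CPU -->
    <daily_result_quota>10</daily_result_quota>        <!-- max jobs per host-CPU per day -->
    <min_sendwork_interval>600</min_sendwork_interval> <!-- seconds between sending work to a host -->
  </config>
</boinc>
```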
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331
> From google, sub-Saharan Africa encompasses about 24M sq.km. ... Assuming 1 result is one sq.km of data (I honestly don't know, just a guess), it would take a very very long time to finish crunching the sub-Saharan area. I'm not an expert in climate modeling but perhaps they may skip grids and do interpolation?

Not quite: 1km is the resolution of the individual cells over the whole grid (whatever their total area of interest is). Each model instance will be running not a single grid cell but a small area of grid cells. I don't know what that is without checking, but typically it might be 100 cells x 100 cells, or, say, 100km by 100km per model instance. Put another way, each result returned covers 100km x 100km in this example. So that gives 2,400 model instances running to cover the 24M sq.km. If they get 2,000 results per day, they are completing 48hrs of the forecast in just over a day (each model does 48hrs). That's doing ok.

--- CPDN Visiting Scientist
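Working those numbers through (the 100km x 100km sub-domain per instance is the illustrative figure from the post above, not a confirmed ARP parameter):

```python
# Domain decomposition arithmetic; the 100km x 100km sub-domain size
# is an illustrative assumption, not a confirmed ARP parameter.
total_area_sq_km = 24_000_000   # sub-Saharan Africa, approx.
subdomain_sq_km = 100 * 100     # area covered by one model instance

instances = total_area_sq_km / subdomain_sq_km
print(f"{instances:.0f} model instances")            # -> 2400

# At ~2,000 results/day, one full 48h-forecast sweep of the domain
# takes 2400 / 2000 = 1.2 days.
print(f"{instances / 2000:.1f} days per 48h sweep")  # -> 1.2
```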
Joined: 18 Jun 17 · Posts: 18 · Credit: 9,520,551 · RAC: 46,126
> Not quite: 1 km is the resolution of the individual cells over the whole grid (whatever their total area of interest is). [...] If they get 2,000 results per day they are completing 48hrs of the forecast in just over a day (each model does 48hrs). That's doing ok.

Thanks for sharing this.
Joined: 12 Apr 21 · Posts: 317 · Credit: 14,837,643 · RAC: 19,879
Downloads are still very problematic. Manual retries do help some. It seems like the way to do it is to run 1 or 2 tasks at a time, so that by the time those finish, the downloads for the next 1 or 2 should be able to complete.
Joined: 14 Sep 08 · Posts: 127 · Credit: 41,749,041 · RAC: 63,360
> Downloads are still very problematic. Manual retries does help some. Seems like the way to do it is to run 1 or 2 tasks at a time so that by the time those finish the downloads for the next 1 or 2 should be able to complete.

Well, at least my upload is stable at 10KB/s... Hopefully the transfer finishes before the deadline. 😂
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,202,915 · RAC: 2,154
> Well, at least my upload is stable at 10KB/s... Hopefully the transfer finishes before the deadline. 😂

My upload speed is about ten times yours most of the time. I am on a gigabit/second fiber-optic Internet link. I have seen greater than 1000 KB/s upload speeds once in a while, but those are very unusual and not long-lasting. Download speeds are seldom over 200 KB/s, and usually 30 to 40 KB/s. And it takes a lot of hand-holding to get the stuff downloaded.
Joined: 7 Aug 04 · Posts: 2187 · Credit: 64,822,615 · RAC: 5,275
Wow! I downloaded one task this morning in about 20 minutes, without clicking "Retry" on the transfers! A new record for this latest release of ARP tasks.
Joined: 12 Apr 21 · Posts: 317 · Credit: 14,837,643 · RAC: 19,879
> Wow! I downloaded one task this morning in about 20 minutes, without clicking "Retry" on the transfers! A new record for this latest release of ARP tasks.

And then things got even worse than before. At least from what I'm experiencing.
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331
I had an odd behaviour from the BOINC client (7.24.1, Linux). An ARP task was in the middle of downloading about 8 files; 4 completed ok but 4 timed out. Then the client switched the task to 'Download failed' instead of 'Downloading'. The log showed a checksum error on one of the files due to incorrect size.

What was odd was that I was still able to manually kick the remaining transfers to complete them all. Shouldn't the client have aborted the remaining task transfers after marking the task as 'download failed'?
Joined: 1 Jan 07 · Posts: 1061 · Credit: 36,706,621 · RAC: 9,524
> What was odd was I was still able to manually kick the remaining transfers to complete them all. The client didn't abort remaining task transfers after marking the task as 'download failed'?

It's probably down to a twenty-year-old unspoken assumption from the early design days of BOINC. I would guess that, with slower download speeds available and smaller local disks, they didn't think ahead to projects using multiple multi-megabyte downloads per task. And if they didn't think about the problem, they wouldn't have bothered to code the solution.

We've had a few of these legacy oversights retro-fitted over the years: it's worth raising an issue on GitHub.
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331
> > What was odd was I was still able to manually kick the remaining transfers to complete them all. The client didn't abort remaining task transfers after marking the task as 'download failed'?
>
> It's probably down to a twenty year old unspoken assumption, from the early design days of BOINC. I would guess that, with slower download speeds available and smaller local disks, they didn't think ahead to projects using multiple, multi-megabyte, downloads per task. And if they didn't think about the problem, they wouldn't have bothered to code the solution.

Will do, thanks Richard.
Joined: 15 May 09 · Posts: 4540 · Credit: 19,016,442 · RAC: 21,024
As the last of the resends of CPDN work finish and the number of ARP tasks running increases, I am building up a large backlog of uploads. These need to clear to avoid the "too many uploads in progress" message that explains why no more work is being sent. I think I will set WCG to "no new tasks" till they clear a bit.
Joined: 22 Feb 06 · Posts: 491 · Credit: 30,986,666 · RAC: 14,307
I've had similar problems with stuck uploads on WCG ARP results. I have also set "no new tasks" and am doing manual retries for the transfers. I'll stop once they have cleared.