Message boards : Number crunching : OpenIFS Discussion
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
"One of my boxes completed all uploads, the second however is now stuck with 2(sic) out of those thousands result files. Is/will the 'gate' be opened permanently? After all, it seems to be the sole purpose of an upload server to be open for uploads?" CPDN are talking to JASMIN, who provide the cloud storage and servers for CPDN to manage. I don't think JASMIN have worked out exactly what the problem is yet. A fast connection would have let me clear everything before uploads stopped again. Given that they got it up and running yesterday, I am hopeful that later today it will be working again. I don't know whether, like yesterday, some data needs to be moved first. I am hoping not, as David has asked for another 25GB of storage to run alongside what is already there. I am pretty certain that JASMIN does not routinely deal with the number of connections involved in this project or the amount of data involved, certainly not both together, which I suspect is the root of the problem. On another subject, away from the upload problems, I notice I have two zips that are zero bytes in length from different tasks. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Just out of a meeting with CPDN. They understand what the problem is. The upload server will be restarted but with reduced max http connections to keep it as stable as possible. At some point very soon (today/tomorrow) they will move the upload server to a new RAID disk array (the current array is what's causing the problem). They may decide to do the move before restarting, depending on how quickly the JASMIN cloud provider can give them the temporary space they need to move the files whilst setting up the new server. Regarding JASMIN's capacity for number of simultaneous connections, I'm told these OpenIFS batches are not the highest load CPDN have ever seen and JASMIN has plenty of capacity. It's just the underlying disk system that was the issue (it's not the only BOINC project to suffer RAID disk array issues). |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Upload server update, 9/1/23 10:49 GMT. Following a meeting this morning with CPDN: they do not expect the upload server to be available until 17:00 GMT TOMORROW (10th) at the earliest. The server itself is running, but they have to move many TB of data and also want to monitor the newly configured server to check it is stable. As already said, these are issues caused by the cloud provider, not CPDN themselves. |
Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,476,194 RAC: 1,633 |
I don't care for the content of this info, but thanks anyway. ;-) - - - - - - - - - - Greetings, Jens |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Upload server status: 10/Jan 16:00GMT Just spoken with CPDN. They have successfully migrated 1,000,000 of the 1,300,000 files onto the new block storage for the upload server. That process should be complete today but they will run checks first before opening up the upload server. I'll get an update tomorrow. |
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,324,235 RAC: 10,236 |
Thank you Glenn for the regular communication and updates. As an IT programme director in a previous life, 'communicate, communicate, communicate' was the most important requirement for an effective programme team. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Thank you Glenn for the regular communication and updates." Absolutely. Before retirement, in my role at ECMWF I managed the international project to provide their IFS model as a community model (i.e. OpenIFS). Good communication was essential to make that work both internally (the hardest part!) and externally, not least because there's an amazing talent pool of users willing to get involved if they have the right means to do so. To their credit, CPDN know it's important, they just don't have enough resources to do it. In fact, they have barely any resources! They are quite dependent on volunteers helping out, which they acknowledge. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
"To their credit, CPDN know it's important, they just don't have enough resources to do it. In fact, they have barely any resources! They are quite dependent on volunteers helping out, which they acknowledge." How much processing does the upload pipeline actually take? Would a random desktop with a software RAID of USB3 external drives be a viable backup for when the cloud falls to the ground again? Or even just a bunch of internal SATA drives. It's clear that the cloud provider is... less competent than desired, so having an alternative backup server that could easily get expanded storage capacity if needed would keep the volunteer CPUs purring away through disruptions. |
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
"To their credit, CPDN know it's important, they just don't have enough resources to do it. In fact, they have barely any resources! They are quite dependent on volunteers helping out, which they acknowledge." My back-of-an-envelope (mental arithmetic) calculation is that the uploads from the 40 batches will total about 80TB before taking into account retreads. Not sure what the maximum transfer rate required would be. |
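To put a rough number on the transfer-rate question, the sketch below converts the 80TB estimate into an average sustained rate. The 30-day window is purely an assumption for illustration (the thread gives no time span), and real peak rates would be higher than this average.

```python
def average_rate_mb_per_s(total_tb: float, days: float) -> float:
    """Average sustained rate in MB/s needed to move total_tb (decimal TB) in `days`."""
    total_mb = total_tb * 1_000_000      # 1 TB = 1,000,000 MB in decimal units
    seconds = days * 24 * 60 * 60
    return total_mb / seconds


# 80 TB is the estimate from the post above; 30 days is a hypothetical window.
rate = average_rate_mb_per_s(80, 30)
print(f"{rate:.1f} MB/s average")  # prints "30.9 MB/s average"
```

Halving the window doubles the required rate, so the answer depends heavily on how quickly the batches are returned.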
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
Glenn mentioned that the upload server has got 25 TB storage attached. My understanding is that data are moved off to other storage in time. My guess is that the critical requirements on the upload server's storage subsystem are high IOPS and perhaps limited latency. (Accompanied by, of course, data integrity = error detection and correction, as a common requirement on file servers.) |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
"My back of an envelope (mental arithmetic) calculation is that the uploads from the 40 batches will total about 80TB before taking into account retreads. Not sure what the maximum transfer rate required would be." 16TB USB3 external drives are under $300 on NewEgg. A set of 10 of those, RAID6 with a hot spare or two, would be plenty of space, and a couple of USB3 controllers and hubs would get you reasonable performance from a bog-standard desktop. Then duplicate them and ship a box of them... I've done roughly this before for some large data sets I was working with. |
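As a quick check of the "plenty of space" claim, here is a minimal sketch of the usable capacity of that drive set. It assumes a single RAID6 array (two drives' worth of capacity goes to parity) and ignores filesystem overhead and the TB/TiB difference; the function name is just for illustration.

```python
def raid6_usable_tb(total_drives: int, hot_spares: int, drive_tb: float) -> float:
    """Usable capacity of one RAID6 array; hot spares sit idle outside the array."""
    active = total_drives - hot_spares
    if active < 4:
        raise ValueError("RAID6 needs at least 4 active drives")
    return (active - 2) * drive_tb   # two drives' worth of space holds parity


for spares in (1, 2):
    print(f"{spares} hot spare(s): {raid6_usable_tb(10, spares, 16)} TB usable")
# 1 hot spare(s): 112 TB usable
# 2 hot spare(s): 96 TB usable
```

Either layout is above the 80TB estimate, though with little headroom for retreads in the two-spare case.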
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,622,177 RAC: 59,768 |
"I've done roughly this before for some large data sets I was working with." This sounds like you were doing large sequential reads. It could be very different when seeking across a lot of 15MB files while doing both reads and writes. That would require good random I/O and latency. Anyway, guessing isn't all that useful unless the team publishes actual workload telemetry, at minimum including read/write IOPS and byte rate, ideally also some trace to show the pattern. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,698,338 RAC: 10,100 |
Glenn did also report (message 67362): "The upload server will be restarted but with reduced max http connections." What we don't yet know is how that reduced number of concurrent connections will compare with the number of individual uploads being attempted simultaneously by the massed ranks of CPDN volunteers. If it's fewer, connection attempts will be rejected, BOINC backoffs will follow, and the whole process will be slowed. If the number of concurrent connections is significantly lower than the number of simultaneous upload attempts (not computers attempting to upload), the congestion will be even worse. |
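The concern above can be made concrete with a deliberately simple model: if each upload attempt needs its own connection, anything beyond the server's limit is refused at that instant. The numbers below are hypothetical, and the sketch does not model BOINC's actual backoff behaviour.

```python
def rejected_attempts(simultaneous_attempts: int, max_connections: int) -> int:
    """Upload attempts refused at one instant when each needs its own connection."""
    return max(0, simultaneous_attempts - max_connections)


# Hypothetical figures: 2,000 simultaneous attempts against limits of 500 and 100.
for limit in (500, 100):
    refused = rejected_attempts(2000, limit)
    print(f"limit {limit}: {refused} refused ({refused / 2000:.0%})")
# limit 500: 1500 refused (75%)
# limit 100: 1900 refused (95%)
```

The lower the limit relative to the number of attempts, the larger the share of attempts that end up in backoff.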
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Update. 22:30. 10/Jan Update from CPDN. The data move has been completed and the upload server has been enabled. Uploads should get moving again. The previous message & comment about restricting http connections is old news and referred to an earlier attempt to restart the original configuration. This is a new implementation of the data store for the upload server which does not have any restrictions. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,698,338 RAC: 10,100 |
I tried 'retry transfer' - it tried 2x2, and they all went into 'project backoff'. Event Log reports: 10/01/2023 22:44:20 | climateprediction.net | [http] [ID#15306] Info: connect to 192.171.169.187 port 80 failed: No route to host 10/01/2023 22:44:20 | climateprediction.net | [http] [ID#15306] Info: Failed to connect to upload11.cpdn.org port 80: No route to host 10/01/2023 22:44:20 | climateprediction.net | [http] HTTP error: Couldn't connect to server |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
I'm seeing 'no route to host' errors too. Possibly the DNS cache needs updating on our machines (or their end) for the new incarnation. Maybe something at the upload server needs re-enabling. I've let them know. Should be a quick one to sort out. Edit: actually that's not entirely accurate. I can ping upload11.cpdn.org and 'traceroute upload11.cpdn.org' also works. But port 80 doesn't appear to be open: $ echo > /dev/tcp/192.171.169.187/80 && echo 'port open' bash: connect: No route to host bash: /dev/tcp/192.171.169.187/80: No route to host. Should be a quick fix come office hours. |
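For anyone without bash's /dev/tcp, the same check can be sketched in Python using only the standard library. The function name and the 5-second default timeout are my own choices, not anything from CPDN.

```python
import socket


def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", connection refused, timeouts and DNS failures.
        return False


# Example, using the host and port from the post above:
#   port_open("upload11.cpdn.org", 80)
```

A False result only says the connection could not be made; it does not distinguish a firewall rule from a server that is down.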
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,698,338 RAC: 10,100 |
Possibly a 'secure connections only' policy, or DDoS protection, either in the server or the firewall? |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Just had confirmation from CPDN that the upload server is now fully functional. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Upcoming work... There are two other OpenIFS projects with work about ready to go. One will be the OpenIFS 'BL: baroclinic lifecycle' app. This looks at idealized storms in a changing climate. The model runs are much shorter than the OpenIFS PS app. The other project uses the standard OpenIFS model for some atmospheric perturbation studies. Neither of these will involve as many batches as the current PS app. Release of these workunits is pending testing of some code changes I'm making following feedback & study of the issues arising from the OpenIFS PS app batches. In short, there'll be no shortage of OpenIFS work for some time. And just a reminder, please do not over-provision memory for these OpenIFS tasks, and if you have a low memory machine (virtual or real), 8GB or less, only allow 1 task at a time, or better, use an app_config.xml file to control how many OpenIFS tasks are started simultaneously. The BOINC client does not understand the memory needs of these tasks well enough and can start too many at once, crashing the tasks. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"And just a reminder, please do not over-provision memory for these OpenIFS tasks and if you have a low memory machine (virtual or real), 8GB or less, only allow 1 task at a time, or best use an app_config.xml file to control how many OpenIFS tasks are started simultaneously. The boinc client does not understand the memory needs of these tasks well enough and can start too many at once crashing the tasks." At the moment, my app_config.xml file for CPDN is like this. [/var/lib/boinc/projects/climateprediction.net]$ cat app_config.xml <app_config> <project_max_concurrent>6</project_max_concurrent> <app> <name>oifs_43r3_bl</name> <max_concurrent>1</max_concurrent> </app> <app> <name>oifs_43r3_ps</name> <max_concurrent>5</max_concurrent> </app> <app> <name>oifs_43r3</name> <max_concurrent>1</max_concurrent> </app> </app_config> Lately it has been running all 5 oifs_43r3_ps at a time, and once it also ran one oifs_43r3_bl at the same time. Also, it ran the 5 oifs_43r3_ps and one hadsm4 at the same time. It usually runs other BOINC tasks from other projects too. I have been allowing BOINC to use 12 cores of my 16-core machine. My RAM usage is shown below; it looks like this will be enough. At the time it was said that there would be lots more _ps tasks than _bl tasks. Do you think, once the new _bl tasks become available, I should change the balance between the two? $ free -hw total used free shared buffers cache available Mem: 62Gi 21Gi 3.9Gi 98Mi 110Mi 36Gi 39Gi |
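For anyone who would rather generate the file than edit it by hand, here is a small sketch that builds the same app_config.xml layout from a dictionary of limits. The helper name is hypothetical; the element names and app names are the ones shown in the post above, and it needs Python 3.9 or later for ET.indent.

```python
import xml.etree.ElementTree as ET


def build_app_config(project_max: int, per_app: dict[str, int]) -> str:
    """Build app_config.xml text with a project-wide cap and per-app caps."""
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = str(project_max)
    for name, limit in per_app.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        ET.SubElement(app, "max_concurrent").text = str(limit)
    ET.indent(root)  # pretty-print in place (Python 3.9+)
    return ET.tostring(root, encoding="unicode")


# Same limits as the file shown in the post above.
print(build_app_config(6, {"oifs_43r3_bl": 1, "oifs_43r3_ps": 5, "oifs_43r3": 1}))
```

On a low-memory machine the same call with every limit set to 1 would match the "only allow 1 task at a time" advice.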
©2024 cpdn.org