Message boards : Number crunching : OpenIFS Discussion
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
One of my boxes completed all uploads; the second, however, is now stuck with 2 (sic) of those thousands of result files. Is/will the "gate" be opened permanently? After all, it seems to be the sole purpose of an upload server to be open for uploads?

CPDN are talking to JASMIN, who provide the cloud storage and servers for CPDN to manage. I don't think JASMIN have worked out exactly what the problem is yet. A fast connection would have let me clear everything before uploads stopped again. Given that they got it up and running yesterday, I am hopeful that it will be working again later today. I don't know whether, as yesterday, some data needs to be moved first. I am hoping not, as David has asked for another 25GB of storage to run alongside what is already there. I am pretty certain that JASMIN does not routinely deal with either the number of connections involved in this project or the amount of data involved, and certainly not both together, which I suspect is the root of the problem. On another subject, away from the upload problems: I notice I have two zips of zero bytes in length, from different tasks.
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Just out of a meeting with CPDN. They understand what the problem is. The upload server will be restarted, but with a reduced maximum number of HTTP connections to keep it as stable as possible. At some point very soon (today/tomorrow) they will move the upload server to a new RAID disk array (the current array is what's causing the problem). They may decide to do the move before restarting, depending on how quickly the JASMIN cloud provider can give them the temporary space they need to hold the files whilst setting up the new server. Regarding JASMIN's capacity for numbers of simultaneous connections, I'm told these OpenIFS batches are not the highest load CPDN have ever seen and JASMIN has plenty of capacity. It's just the underlying disk system that was the issue (this is not the only BOINC project to suffer RAID disk array issues).
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Upload server update 9/1/23 10:49 GMT

From a meeting this morning with CPDN: they do not expect the upload server to be available until 17:00 GMT TOMORROW (10th) at the earliest. The server itself is running, but they have many TB of data to move, and they also want to monitor the newly configured server to check that it is stable. As already said, these are issues caused by the cloud provider, not CPDN themselves.
Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,475,366 RAC: 1,610 |
I don't care for the content of this info, but thanks anyway. ;-)

Greetings, Jens
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Upload server status: 10/Jan 16:00 GMT

Just spoken with CPDN. They have successfully migrated 1,000,000 of the 1,300,000 files onto the new block storage for the upload server. That process should be complete today, but they will run checks before opening up the upload server. I'll get an update tomorrow.
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,322,579 RAC: 10,225 |
Thank you Glenn for the regular communication and updates. In a previous life as an IT programme director, I learned that 'communicate, communicate, communicate' was the most important requirement for an effective programme team.
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Thank you Glenn for the regular communication and updates.

Absolutely. Before retirement, in my role at ECMWF I managed the international project to provide their IFS model as a community model (i.e. OpenIFS). Good communication was essential to making that work, both internally (the hardest part!) and externally, not least because there's an amazing talent pool of users willing to get involved if they have the right means to do so. To their credit, CPDN know it's important; they just don't have enough resources to do it. In fact, they have barely any resources! They are quite dependent on volunteers helping out, which they acknowledge.
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
To their credit, CPDN know it's important, they just don't have enough resources to do it. in fact, they have barely any resources! They are quite dependent on volunteers helping out, which they acknowledge.

How much processing does the upload pipeline actually take? Would a random desktop with a software RAID of USB3 external drives be a viable backup for when the cloud falls to the ground again? Or even just a bunch of internal SATA drives. It's clear that the cloud provider is... less competent than desired, so having an alternative backup server that could easily get expanded storage capacity if needed would keep the volunteer CPUs purring away through disruptions.
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
To their credit, CPDN know it's important, they just don't have enough resources to do it. in fact, they have barely any resources! They are quite dependent on volunteers helping out, which they acknowledge.

My back-of-an-envelope (mental arithmetic) calculation is that the uploads from the 40 batches will total about 80TB, before taking retreads into account. I'm not sure what the maximum transfer rate required would be.
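The arithmetic behind a back-of-envelope figure like this is simple enough to sketch. The per-batch and per-task numbers below are purely illustrative assumptions chosen to land on the ~80TB estimate, not CPDN's actual figures:

```python
# Back-of-envelope estimate of total upload volume.
# All figures here are illustrative assumptions, not actual CPDN numbers.
batches = 40
tasks_per_batch = 2000       # assumed workunits per batch
upload_per_task_gb = 1.0     # assumed total upload size per task, in GB

total_tb = batches * tasks_per_batch * upload_per_task_gb / 1000
print(f"Estimated total uploads: {total_tb:.0f} TB")  # 80 TB, before retreads
```

Retreads (resent tasks) would add on top of this, which is why the estimate is a lower bound.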
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
Glenn mentioned that the upload server has 25 TB of storage attached. My understanding is that data are moved off to other storage over time. My guess is that the critical requirements on the upload server's storage subsystem are high IOPS and perhaps low latency (accompanied, of course, by data integrity, i.e. error detection and correction, a common requirement on file servers).
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
My back of an envelope (mental arithmetic) calculation is that the uploads from the 40 batches will total about 80TB before taking into account retreads. Not sure what the maximum transfer rate required would be.

16TB USB3 external drives are under $300 on NewEgg. A set of 10 of those in RAID6, with a hot spare or two, would be plenty of space, and a couple of USB3 controllers and hubs would get you reasonable performance from a bog-standard desktop. Then duplicate them and ship a box of them... I've done roughly this before for some large data sets I was working with.
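Under the usual RAID6 rules (two drives' worth of capacity reserved for parity), the usable space of the proposed array is easy to estimate. The drive and spare counts below just follow the suggestion above, assuming one of the ten drives is kept as a hot spare:

```python
# Usable capacity of the proposed array: 10 x 16 TB drives,
# RAID6 (two drives' worth of parity), one drive kept as a hot spare.
drives = 10
drive_tb = 16
hot_spares = 1

in_array = drives - hot_spares   # 9 drives actually in the RAID6 set
data_drives = in_array - 2       # RAID6 reserves two drives' worth for parity
usable_tb = data_drives * drive_tb
print(f"Usable capacity: {usable_tb} TB")  # 112 TB, above the ~80 TB estimate
```

So even with a spare, the array would comfortably hold the estimated upload volume.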
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,608,053 RAC: 59,615 |
I've done roughly this before for some large data sets I was working with.

This sounds like you were doing large sequential reads. It could be very different when seeking across a lot of 15MB files while doing both reads and writes; that would require good random I/O and low latency. Anyway, guessing isn't all that useful unless the team publishes actual workload telemetry: at minimum read/write IOPS and byte rates, and ideally also a trace showing the access pattern.
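As a rough illustration of why file size matters here: the file-level operation rate scales with the byte rate divided by the file size. The target ingest rate below is a made-up example figure, not a measured CPDN number:

```python
# Rough relation between ingest byte rate and file-level operations
# for ~15 MB result files. The target rate is an illustrative assumption.
file_mb = 15
target_rate_mbps = 500           # assumed sustained ingest, MB/s

files_per_sec = target_rate_mbps / file_mb
# Each incoming file also costs metadata writes and, on RAID6,
# read-modify-write cycles on partial-stripe updates, so the real
# IOPS load on the disks is a multiple of this file rate.
print(f"~{files_per_sec:.1f} files/s at {target_rate_mbps} MB/s")
```

That multiplier between file rate and underlying disk IOPS is exactly the kind of thing only real telemetry can pin down.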
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,696,681 RAC: 10,226 |
Glenn did also report (message 67362): "The upload server will be restarted but with reduced max http connections."

What we don't yet know is how that reduced number of concurrent connections will compare with the number of individual uploads being attempted simultaneously by the massed ranks of CPDN volunteers. If it's fewer, connection attempts will be rejected, BOINC backoffs will follow, and the whole process will be slowed. If the number of concurrent connections is significantly lower than the number of simultaneous upload attempts (not the number of computers attempting to upload), the congestion will be even worse.
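The backoff behaviour referred to is, in spirit, exponential with randomization: each rejected attempt pushes the next retry further out, which spreads load but also slows overall throughput. A minimal sketch of that mechanism, with illustrative constants that are not BOINC's actual policy values:

```python
import random

# Minimal sketch of exponential backoff with jitter, in the spirit of
# what the BOINC client does after a rejected transfer. The base delay
# and cap are illustrative assumptions, not BOINC's real constants.
def next_backoff(attempt, base=60, cap=86400):
    """Return a randomized retry delay in seconds for a given attempt count."""
    delay = min(cap, base * 2 ** attempt)
    # Jitter spreads the retries of many clients apart in time,
    # so they don't all hammer the server at the same instant.
    return random.uniform(delay / 2, delay)

for attempt in range(5):
    print(f"attempt {attempt}: wait ~{next_backoff(attempt):.0f}s")
```

The practical consequence is the one described above: once the connection limit is exceeded, rejected clients retreat for progressively longer periods, so the queue drains more slowly than the raw bandwidth would suggest.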
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Update. 22:30. 10/Jan Update from CPDN. The data move has been completed and the upload server has been enabled. Uploads should get moving again. The previous message & comment about restricting http connections is old news and referred to an earlier attempt to restart the original configuration. This is a new implementation of the data store for the upload server which does not have any restrictions. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,696,681 RAC: 10,226 |
I tried 'retry transfer' - it tried 2x2, and they all went into 'project backoff'. Event Log reports:

10/01/2023 22:44:20 | climateprediction.net | [http] [ID#15306] Info: connect to 192.171.169.187 port 80 failed: No route to host
10/01/2023 22:44:20 | climateprediction.net | [http] [ID#15306] Info: Failed to connect to upload11.cpdn.org port 80: No route to host
10/01/2023 22:44:20 | climateprediction.net | [http] HTTP error: Couldn't connect to server
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
I'm seeing 'no route to host' errors too. Possibly the DNS cache needs updating on our machines (or at their end) for the new incarnation, or maybe something on the upload server needs re-enabling. I've let them know; should be a quick one to sort out.

Edit: actually that's not entirely accurate. I can ping upload11.cpdn.org, and 'traceroute upload11.cpdn.org' also works, but port 80 doesn't appear to be open:

$ echo > /dev/tcp/192.171.169.187/80 && echo 'port open'
bash: connect: No route to host
bash: /dev/tcp/192.171.169.187/80: No route to host

Should be a quick fix come office hours.
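For what it's worth, the same reachability check can be written portably without bash's /dev/tcp trick. The helper below is a generic TCP probe I'm sketching for illustration, not a CPDN tool; it reports False on any connection failure (refused, unreachable, DNS error, or timeout):

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused/unreachable/timeout as well as DNS failures.
        return False

# Same check as the bash one-liner above:
print(port_open("upload11.cpdn.org", 80))
```

Unlike /dev/tcp, this distinguishes nothing about *why* the connection failed, but it works the same on any platform with Python installed.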
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,696,681 RAC: 10,226 |
Possibly a 'secure connections only' policy, or DDoS protection - either in the server or the firewall?
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Just had confirmation from CPDN that the upload server is now fully functional. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,430,837 RAC: 17,503 |
Upcoming work...

There are two other OpenIFS projects with work about ready to go. One will be the OpenIFS 'BL: baroclinic lifecycle' app, which looks at idealized storms in a changing climate; its model runs are much shorter than the OpenIFS PS app's. The other project uses the standard OpenIFS model for some atmospheric perturbation studies. Neither of these will involve as many batches as the current PS app. Release of these workunits is pending testing of some code changes I'm making following feedback and study of the issues arising from the OpenIFS PS app batches. In short, there'll be no shortage of OpenIFS work for some time.

And just a reminder: please do not over-provision memory for these OpenIFS tasks. If you have a low-memory machine (virtual or real), 8 GB or less, only allow 1 task at a time, or better still use an app_config.xml file to control how many OpenIFS tasks are started simultaneously. The BOINC client does not understand the memory needs of these tasks well enough and can start too many at once, crashing the tasks.
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
And just a reminder, please do not over-provision memory for these OpenIFS tasks and if you have a low memory machine (virtual or real), 8Gb or less, only allow 1 task at a time, or best use an app_config.xml file to control how many OpenIFS tasks are started simultaneously. The boinc client does not understand the memory needs of these tasks well enough and can start too many at once crashing the tasks.

At the moment, my app_config.xml file for CPDN looks like this:

[/var/lib/boinc/projects/climateprediction.net]$ cat app_config.xml
<app_config>
  <project_max_concurrent>6</project_max_concurrent>
  <app>
    <name>oifs_43r3_bl</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>oifs_43r3_ps</name>
    <max_concurrent>5</max_concurrent>
  </app>
  <app>
    <name>oifs_43r3</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>

Lately it has been running all 5 oifs_43r3_ps at a time, and once it also ran one oifs_43r3_bl at the same time. It has also run the 5 oifs_43r3_ps and one hadsm4 at the same time. It usually runs other BOINC tasks from other projects too. I have been allowing BOINC to use 12 cores of my 16-core machine. My RAM usage is as follows; it looks like this will be enough. At the time it was said that there would be lots more _ps tasks than _bl tasks. Do you think, once the new _bl tasks become available, I should change the balance between the two?

$ free -hw
            total   used    free    shared  buffers  cache   available
Mem:        62Gi    21Gi    3.9Gi   98Mi    110Mi    36Gi    39Gi
©2024 cpdn.org