climateprediction.net (CPDN) home page
Thread 'OpenIFS Discussion'

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 67360 - Posted: 5 Jan 2023, 10:57:55 UTC - in response to Message 67351.  

One of my boxes completed all uploads, the second however is now stuck with 2(sic) out of those thousands result files. Is/will the "gate" be opened permanently? After all, it seems to be the sole purpose of an upload server to be open for uploads?
CPDN are talking to JASMIN, who provide the cloud storage and servers that CPDN manage. I don't think JASMIN have worked out exactly what the problem is yet. A fast connection would have let me clear everything before uploads stopped again. Given that they got it up and running yesterday, I am hopeful that later today it will be working again. I don't know whether, like yesterday, some data needs to be moved first. I am hoping not, as David has asked for another 25GB of storage to run alongside what is already there. I am pretty certain that JASMIN does not routinely deal with the number of connections involved in this project or the amount of data involved, certainly not both together, which I suspect is the root of the problem.

On another subject, away from the upload problems I notice I have two zips that are zero bytes in length from different tasks.
ID: 67360
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67362 - Posted: 5 Jan 2023, 14:20:30 UTC
Last modified: 5 Jan 2023, 14:22:35 UTC

Just out of a meeting with CPDN. They understand what the problem is. The upload server will be restarted, but with a reduced maximum number of HTTP connections to keep it as stable as possible. At some point very soon (today/tomorrow) they will move the upload server to a new RAID disk array, because the current disk array is what's causing the problem. They may decide to do the move before restarting, depending on how quickly the JASMIN cloud provider can give them the temporary space they need to move the files whilst setting up the new server.

Regarding JASMIN's capacity for the number of simultaneous connections, I'm told these OpenIFS batches are not the highest load CPDN have ever seen and JASMIN has plenty of capacity. It's just the underlying disk system that was the issue (CPDN is not the only BOINC project to suffer RAID disk array issues).
ID: 67362
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67455 - Posted: 9 Jan 2023, 11:01:38 UTC

Upload server update 9/1/23 10:49GMT
From a meeting this morning with CPDN: they do not expect the upload server to be available until 17:00GMT TOMORROW (10th) at the earliest. The server itself is running, but they have to move many TB of data and also want to monitor the newly configured server to check it is stable. As already said, these are issues caused by the cloud provider, not CPDN themselves.
ID: 67455
gemini8
Joined: 4 Dec 15
Posts: 52
Credit: 2,476,194
RAC: 1,633
Message 67462 - Posted: 9 Jan 2023, 17:00:40 UTC

I don't care for the content of this info, but thanks anyway. ;-)
- - - - - - - - - -
Greetings, Jens
ID: 67462
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67507 - Posted: 10 Jan 2023, 16:00:15 UTC

Upload server status: 10/Jan 16:00GMT
Just spoken with CPDN. They have successfully migrated 1,000,000 of the 1,300,000 files onto the new block storage for the upload server. That process should be complete today but they will run checks first before opening up the upload server. I'll get an update tomorrow.
ID: 67507
wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,325,064
RAC: 10,197
Message 67510 - Posted: 10 Jan 2023, 16:14:25 UTC - in response to Message 67507.  

Thank you Glenn for the regular communication and updates.

As an IT programme director in a previous life, 'communicate, communicate, communicate' was the most important requirement for an effective programme team.
ID: 67510
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67511 - Posted: 10 Jan 2023, 18:22:06 UTC - in response to Message 67510.  

Thank you Glenn for the regular communication and updates.

As an IT programme director in a previous life, 'communicate, communicate, communicate' was the most important requirement for an effective programme team.
Absolutely. Before retirement, in my role at ECMWF I managed the international project to provide their IFS model as a community model (i.e. OpenIFS). Good communication was essential to make that work both internally (the hardest part!) and externally, not least because there's an amazing talent pool of users willing to get involved if they have the right means to do so.

To their credit, CPDN know it's important, but they just don't have enough resources to do it. In fact, they have barely any resources! They are quite dependent on volunteers helping out, which they acknowledge.
ID: 67511
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 67515 - Posted: 10 Jan 2023, 19:30:45 UTC - in response to Message 67511.  

To their credit, CPDN know it's important, but they just don't have enough resources to do it. In fact, they have barely any resources! They are quite dependent on volunteers helping out, which they acknowledge.


How much processing does the upload pipeline actually take? Would a random desktop with a software RAID of USB3 external drives be a viable backup for when the cloud falls to the ground again? Or even just a bunch of internal SATA drives. It's clear that the cloud provider is... less competent than desired, so having an alternative backup server that could easily get expanded storage capacity if needed would keep the volunteer CPUs purring away through disruptions.
ID: 67515
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 67516 - Posted: 10 Jan 2023, 20:11:37 UTC - in response to Message 67515.  

To their credit, CPDN know it's important, but they just don't have enough resources to do it. In fact, they have barely any resources! They are quite dependent on volunteers helping out, which they acknowledge.


How much processing does the upload pipeline actually take? Would a random desktop with a software RAID of USB3 external drives be a viable backup for when the cloud falls to the ground again? Or even just a bunch of internal SATA drives. It's clear that the cloud provider is... less competent than desired, so having an alternative backup server that could easily get expanded storage capacity if needed would keep the volunteer CPUs purring away through disruptions.


My back of an envelope (mental arithmetic) calculation is that the uploads from the 40 batches will total about 80TB before taking into account retreads. Not sure what the maximum transfer rate required would be.
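
To put a rough number on the transfer rate, here is a minimal Python sketch that converts the 80TB estimate into an average sustained rate. The upload windows of 30, 60 and 90 days are assumptions chosen only for illustration, not project figures, and an average says nothing about the peak rate the server would actually see.

```python
def average_rate_mbit_per_s(total_tb: float, days: float) -> float:
    """Average rate in Mbit/s needed to move total_tb (decimal TB) in `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    total_bits = total_tb * 1e12 * 8      # decimal terabytes -> bits
    seconds = days * 86_400               # days -> seconds
    return total_bits / seconds / 1e6     # bits/s -> Mbit/s


TOTAL_TB = 80.0   # estimate quoted above for the 40 batches
BATCHES = 40

print(f"Per batch: about {TOTAL_TB / BATCHES:.1f} TB")
for window_days in (30, 60, 90):          # hypothetical upload windows
    rate = average_rate_mbit_per_s(TOTAL_TB, window_days)
    print(f"{window_days} days: about {rate:.0f} Mbit/s sustained")
```

Retries and bursts would push the real requirement above these averages.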
ID: 67516
xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67517 - Posted: 10 Jan 2023, 20:19:45 UTC - in response to Message 67516.  
Last modified: 10 Jan 2023, 20:23:43 UTC

Glenn mentioned that the upload server has got 25 TB storage attached. My understanding is that data are moved off to other storage in time.

My guess is that the critical requirements on the upload server's storage subsystem are high IOPS and perhaps limited latency. (Accompanied by, of course, data integrity = error detection and correction, as a common requirement on file servers.)
ID: 67517
SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 67518 - Posted: 10 Jan 2023, 20:30:31 UTC - in response to Message 67516.  

My back of an envelope (mental arithmetic) calculation is that the uploads from the 40 batches will total about 80TB before taking into account retreads. Not sure what the maximum transfer rate required would be.


16TB USB3 external drives are under $300 on NewEgg. A set of 10 of those, RAID6 with a hot spare or two, would be plenty of space, and a couple USB3 controllers and hubs would get you reasonable performance from a bog standard desktop. Then duplicate them and ship a box of them... I've done roughly this before for some large data sets I was working with.
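
A quick check of the capacity arithmetic, as a sketch: RAID6 gives up two drives' worth of space to parity, and hot spares sit outside the array until needed. The function name and the spare counts below are illustrative only, and filesystem overhead is ignored.

```python
def raid6_usable_tb(total_drives: int, drive_tb: float, hot_spares: int = 0) -> float:
    """Usable capacity of one RAID6 array, ignoring filesystem overhead."""
    in_array = total_drives - hot_spares
    if in_array < 4:
        raise ValueError("RAID6 needs at least 4 drives in the array")
    return (in_array - 2) * drive_tb      # two drives' worth of parity


for spares in (1, 2):
    usable = raid6_usable_tb(total_drives=10, drive_tb=16, hot_spares=spares)
    print(f"10 x 16TB, {spares} hot spare(s): {usable:.0f} TB usable")
```

Under these assumptions either layout leaves more than the 80TB estimated for the 40 batches.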
ID: 67518
wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 41,625,512
RAC: 59,849
Message 67519 - Posted: 10 Jan 2023, 20:45:54 UTC - in response to Message 67518.  

I've done roughly this before for some large data sets I was working with.

This sounds like you are doing large sequential reads. It could be very different when seeking across a lot of 15MB files while doing both reads and writes. That would require good random I/O and low latency. Anyway, guessing isn't all that useful unless the team publishes actual workload telemetry, at minimum including read/write IOPS and byte rate, and ideally also a trace to show the pattern.
ID: 67519
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,698,338
RAC: 10,100
Message 67520 - Posted: 10 Jan 2023, 22:04:21 UTC

Glenn did also report (message 67362):

The upload server will be restarted but with reduced max http connections
What we don't yet know is how that reduced number of concurrent connections will compare with the number of individual uploads being attempted simultaneously by the massed ranks of CPDN volunteers. If it's fewer, connection attempts will be rejected, BOINC backoffs will follow, and the whole process will be slowed. If the number of concurrent connections is significantly lower than the number of simultaneous upload attempts (not computers attempting to upload), the congestion will be even worse.
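
To illustrate why rejected attempts plus backoff slow everything down, here is a toy Python model. It is only a sketch under stated assumptions: every upload takes one round, the server accepts at most max_connections attempts per round, and a rejected upload doubles its waiting time after each rejection. This is not BOINC's actual backoff schedule, and rounds_to_drain is a hypothetical name.

```python
def rounds_to_drain(num_uploads: int, max_connections: int) -> int:
    """Toy model: rounds needed until every upload has been accepted once."""
    if max_connections < 1:
        raise ValueError("max_connections must be at least 1")
    next_try = [0] * num_uploads   # round in which each upload next attempts
    delay = [1] * num_uploads      # current backoff, in rounds (toy doubling)
    pending = set(range(num_uploads))
    round_no = 0
    while pending:
        attempting = sorted(i for i in pending if next_try[i] <= round_no)
        for i in attempting[:max_connections]:   # accepted this round
            pending.discard(i)
        for i in attempting[max_connections:]:   # rejected: back off
            next_try[i] = round_no + delay[i]
            delay[i] *= 2
        round_no += 1
    return round_no


for cap in (10, 5, 2):
    print(f"10 uploads, {cap} connections: {rounds_to_drain(10, cap)} rounds")
```

In this model a cap of 2 could in principle clear 10 uploads in 5 rounds, but the growing backoffs leave the server idle for many rounds in between, so the total is much longer.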
ID: 67520
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67522 - Posted: 10 Jan 2023, 22:30:17 UTC
Last modified: 10 Jan 2023, 22:42:18 UTC

Update. 22:30. 10/Jan
Update from CPDN. The data move has been completed and the upload server has been enabled. Uploads should get moving again.

The previous message & comment about restricting http connections is old news and referred to an earlier attempt to restart the original configuration. This is a new implementation of the data store for the upload server which does not have any restrictions.
ID: 67522
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,698,338
RAC: 10,100
Message 67523 - Posted: 10 Jan 2023, 22:42:41 UTC
Last modified: 10 Jan 2023, 22:46:34 UTC

I tried 'retry transfer' - it tried 2x2, and they all went into 'project backoff'.

Event Log reports:

10/01/2023 22:44:20 | climateprediction.net | [http] [ID#15306] Info:  connect to 192.171.169.187 port 80 failed: No route to host
10/01/2023 22:44:20 | climateprediction.net | [http] [ID#15306] Info:  Failed to connect to upload11.cpdn.org port 80: No route to host
10/01/2023 22:44:20 | climateprediction.net | [http] HTTP error: Couldn't connect to server
ID: 67523
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67524 - Posted: 10 Jan 2023, 22:50:06 UTC - in response to Message 67523.  
Last modified: 10 Jan 2023, 23:14:06 UTC

I'm seeing 'no route to host' errors too. Possibly the DNS cache needs updating on our machines (or at their end) for the new incarnation. Maybe something at the upload server needs re-enabling.

I've let them know. Should be a quick one to sort out.

Edit: actually that's not entirely accurate. I can ping upload11.cpdn.org and 'traceroute upload11.cpdn.org' also works. But port 80 doesn't appear to be open:
$ echo > /dev/tcp/192.171.169.187/80 && echo 'port open'
bash: connect: No route to host
bash: /dev/tcp/192.171.169.187/80: No route to host
Should be a quick fix come office hours.
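
For anyone who wants to repeat the port check with a timeout, or on a machine without bash's /dev/tcp, here is a minimal Python sketch using only the standard library. port_open is a hypothetical helper name, and the 5 second timeout is an arbitrary choice.

```python
import socket


def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, "No route to host", timeouts and DNS failures.
        return False


# Example call (needs network access):
#   port_open("upload11.cpdn.org", 80)
```

A False result only says the TCP connection failed; it does not distinguish a firewall from a stopped web server.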
ID: 67524
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,698,338
RAC: 10,100
Message 67530 - Posted: 11 Jan 2023, 7:32:54 UTC

Possibly a 'secure connections only' policy, or DDoS protection, either in the server or the firewall?
ID: 67530
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67534 - Posted: 11 Jan 2023, 10:43:49 UTC

Just had confirmation from CPDN that the upload server is now fully functional.
ID: 67534
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67788 - Posted: 17 Jan 2023, 10:03:06 UTC
Last modified: 17 Jan 2023, 10:05:21 UTC

Upcoming work...

There are two other OpenIFS projects with work about ready to go. One will be the OpenIFS 'BL: baroclinic lifecycle' app. This looks at idealized storms in a changing climate. The model runs are much shorter than those of the OpenIFS PS app. The other project uses the standard OpenIFS model for some atmospheric perturbation studies. Neither of these will involve as many batches as the current PS app.

Release of these workunits is pending testing of some code changes I'm making following feedback & study of the issues arising from the OpenIFS PS app batches.

In short, there'll be no shortage of OpenIFS work for some time.

And just a reminder: please do not over-provision memory for these OpenIFS tasks. If you have a low-memory machine (virtual or real) with 8GB or less, only allow 1 task at a time, or better, use an app_config.xml file to control how many OpenIFS tasks are started simultaneously. The BOINC client does not understand the memory needs of these tasks well enough and can start too many at once, crashing the tasks.
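
As a rough way to pick a max_concurrent value for app_config.xml, here is a minimal Python sketch that divides the RAM left after a reserve by an assumed peak memory per task. The 5 GB per-task figure and the reserve sizes are placeholders for illustration only, not measured or official values; substitute what you observe on your own machine.

```python
def max_concurrent_tasks(total_ram_gb: float, per_task_gb: float,
                         reserve_gb: float = 2.0) -> int:
    """Largest task count whose combined peak memory fits in RAM minus a reserve."""
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    usable = total_ram_gb - reserve_gb    # keep some RAM for the OS and other work
    return max(0, int(usable // per_task_gb))


PER_TASK_GB = 5.0   # placeholder assumption, not an official figure

print(max_concurrent_tasks(8, PER_TASK_GB, reserve_gb=2))    # low-memory machine
print(max_concurrent_tasks(62, PER_TASK_GB, reserve_gb=4))   # larger machine
```

The result is an upper bound under these assumptions; other BOINC projects running at the same time reduce it further.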
ID: 67788
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67805 - Posted: 17 Jan 2023, 14:43:01 UTC - in response to Message 67788.  

And just a reminder: please do not over-provision memory for these OpenIFS tasks. If you have a low-memory machine (virtual or real) with 8GB or less, only allow 1 task at a time, or better, use an app_config.xml file to control how many OpenIFS tasks are started simultaneously. The BOINC client does not understand the memory needs of these tasks well enough and can start too many at once, crashing the tasks.


At the moment, my app_config.xml file for CPDN is like this.
[/var/lib/boinc/projects/climateprediction.net]$ cat app_config.xml 
<app_config>
    <project_max_concurrent>6</project_max_concurrent>
    <app>
        <name>oifs_43r3_bl</name>
        <max_concurrent>1</max_concurrent>
    </app>
    <app>
        <name>oifs_43r3_ps</name>
        <max_concurrent>5</max_concurrent>
    </app>
    <app>
        <name>oifs_43r3</name>
        <max_concurrent>1</max_concurrent>
    </app>
</app_config>


Lately it has been running all 5 oifs_43r3_ps tasks at a time, and once it also ran one oifs_43r3_bl at the same time. It has also run the 5 oifs_43r3_ps and one hadsm4 at the same time. It usually runs other BOINC tasks from other projects too. I have been allowing BOINC to use 12 cores of my 16-core machine. My RAM usage is shown below; it looks like this will be enough. At the time it was said that there would be lots more _ps tasks than _bl tasks.

Do you think, once the new _bl tasks become available, I should change the balance between the two?

$ free -hw
              total        used        free      shared     buffers       cache   available
Mem:           62Gi        21Gi       3.9Gi        98Mi       110Mi        36Gi        39Gi

ID: 67805

©2024 cpdn.org