climateprediction.net (CPDN) home page
Thread 'OpenIFS Discussion'

SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 66642 - Posted: 29 Nov 2022, 23:31:44 UTC

ADSL in the US can be very, very badly asymmetric. Not that many years back, I had 25Mbit down, 768kbit up (yes, not even 1Mbit up).

Are the current OpenIFS models supposed to be multithreaded? If so, is there an incantation to make them do it? Mine seem to be just single thread right now.
ID: 66642
Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 66644 - Posted: 29 Nov 2022, 23:56:31 UTC

Hi!
Finally got BOINC running again, but only version 7.16.6 from the repos on Mint 20.3, in my limited system directory.
Error: " climateprediction.net: Notice from server
OpenIFS 43r3 Perturbed Surface needs 21968.66MB more disk space. You currently have 16178.31 MB available and it needs 38146.97 MB.
Tue 29 Nov 2022 09:56:52 AM CET"
38 GB sounds like a lot. Is that normal? Asteroids@Home runs fine.
ID: 66644
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66646 - Posted: 30 Nov 2022, 1:06:23 UTC - in response to Message 66644.  

Finally got BOINC running again, but only version 7.16.6 from the repos on Mint 20.3, in my limited system directory.
Error: " climateprediction.net: Notice from server
OpenIFS 43r3 Perturbed Surface needs 21968.66MB more disk space. You currently have 16178.31 MB available and it needs 38146.97 MB.
Tue 29 Nov 2022 09:56:52 AM CET"
38 GB sounds like a lot. Is that normal? Asteroids@Home runs fine.


Sounds mighty greedy to me. I am running 5 projects and BOINC is currently using only 6.64 GBytes. My CPDN is running three OIFS tasks and CPDN is using only 4.36 GBytes.

Here is one that finished recently.

Application version 	OpenIFS 43r3 Perturbed Surface v1.01
                        x86_64-pc-linux-gnu
Peak working set size 	4,769.45 MB
Peak disk usage 	1,220.29 MB

ID: 66646
Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 66647 - Posted: 30 Nov 2022, 1:33:47 UTC

Computation error: oifs_43r3_ps_08078_221050100_123945
It fails at the end, and two trickles refuse to upload.
Oops, after a long wait they disappeared from the upload list.
ID: 66647
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66648 - Posted: 30 Nov 2022, 2:55:38 UTC - in response to Message 66641.  

Is it the amount of trickles that's an issue? Or the total amount of data? The model output for the complete forecast is split into the smaller trickle files (to ease the data upload burden). We could do fewer trickles but the total data size would be the same (each trickle would be larger).


I wonder how the Internet links, and Internet Service Providers supporting the CPDN community are distributed. It seems to me that the problem, if there really is a problem, is the product of the number of trickles times the size of a trickle. So making the trickles larger so as to send fewer of them would make no difference. But that is just my intuition; I have no data to back up my opinion.

I'm assuming it's the total size of the upload (sum of all trickle sizes) that's a problem? We can ask the scientist to reduce the model output if necessary.


Since I do not have trouble uploading my trickles as things are now (one trickle per work unit every 8 minutes or so, and a trickle takes an average of 5 seconds to send), I sure do not want the model output to be reduced on my account. But I have a wide-band fiber-optic Internet connection and am running only 3 OIFS work units at a time.

And what happens if some high powered user has 10 computers each running 10 work-units on the same Internet connection? And if that is not enough, what if the models start running multi-processing work units like some Milky-way ones do?

On the one hand, one could argue that they could just give up on users with slow connections. That seems a shame and unfair. On the other hand, since at least some users have no trouble with their Internet connections easily keeping up with the (current) demands of sending the trickles, there should be no reason to compromise the amount of data they are allowed to send.

Has the project received sufficient work to know and understand the population's Internet connections so a reasonable policy can be determined?
ID: 66648
Dark Angel
Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 66649 - Posted: 30 Nov 2022, 6:19:39 UTC - in response to Message 66637.  

I've got a few of these new units. So far two completed ok and two with errors.
The first error log ends with:
Uploading trickle at timestep: 1900800
00:22:36 STEP 530 H= 530:00 +CPU= 15.541
double free or corruption (out)

The other:
18:58:37 STEP 482 H= 482:00 +CPU= 10.168
free(): invalid next size (fast)
Ah! Excellent. I've been trying to understand why some tasks are apparently stopping with nothing in the stderr.txt returned to the server to explain why it stopped.

@DarkAngel - can you tell me which resultids those were so I can look them up?
Also, what machine & OS are you using these on?

This kind of error message indicates a memory problem, often caused by a bug in the code but I've also seen it caused by certain versions of compilers/system libraries. I've never seen it with the model itself but then I've never run the model on such a wide range of systems like this. Could also be the wrapper code we use.

Quick question. When the tasks are running, if you do 'ps -ef' you should see the same number of 'master.exe' processes as 'oifs_43r3_ps_1.01_x86_64-pc-linux-gnu'. The latter is the 'controller' for the model itself (master.exe). Do you have the same number of each? I ask because we know of one issue that can kill the 'oifs_43r3....' process running but still leave the model 'master.exe' running.
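The process check described above can be sketched in Python. This is only an illustration: it assumes the usual Linux `ps -ef` column layout (UID PID PPID C STIME TTY TIME CMD), and the helper names `count_processes`, `controller_model_counts` and `live_counts` are made up for this example.

```python
import subprocess
from typing import Tuple

# Process names as quoted in the post above.
CONTROLLER = "oifs_43r3_ps_1.01_x86_64-pc-linux-gnu"
MODEL = "master.exe"


def count_processes(ps_output: str, name_fragment: str) -> int:
    """Count rows of `ps -ef` output whose CMD column contains name_fragment."""
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # Assumed columns: UID PID PPID C STIME TTY TIME CMD (CMD may hold spaces)
        fields = line.split(None, 7)
        if len(fields) == 8 and name_fragment in fields[7]:
            count += 1
    return count


def controller_model_counts(ps_output: str) -> Tuple[int, int]:
    """Return (number of controller processes, number of model processes)."""
    return count_processes(ps_output, CONTROLLER), count_processes(ps_output, MODEL)


def live_counts() -> Tuple[int, int]:
    """Run `ps -ef` on this machine and count both kinds of process."""
    out = subprocess.run(["ps", "-ef"], capture_output=True, text=True, check=True).stdout
    return controller_model_counts(out)
```

If `live_counts()` returns two different numbers, that would match the known issue mentioned above, where the controller is killed but `master.exe` keeps running.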

Thanks for your help.


https://www.cpdn.org/show_host_detail.php?hostid=1534740

$ uname -a
Linux Zen 5.15.0-53-generic #59~20.04.1-Ubuntu SMP Thu Oct 20 15:10:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

https://www.cpdn.org/result.php?resultid=22245900
https://www.cpdn.org/result.php?resultid=22245702

I have the same number of each running at the moment.
ID: 66649
xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66650 - Posted: 30 Nov 2022, 7:23:35 UTC
Last modified: 30 Nov 2022, 7:25:52 UTC

On the issue of result data size:

I have a few dual-processor Xeon and Epyc computers at home, but at the same time a simple cable modem connection. I am observing an upload speed on the order of 8 Mbit/s and figure that the result rate from a single 32-core Epyc already saturates this link. That is, I can only put a fraction of my CPUs to OpenIFS in steady state due to my upload bandwidth limitation.


On computation errors:

None so far here, with this exception: I have OpenIFS on three computers for now (but see above) and took the risk of rebooting one of them. These were the steps I took (automated):

1) Switch "leave non-GPU tasks in memory while suspended" from on to off.
2) For each task in order of ascending elapsed time, request to suspend the task and wait until it did.
3) Shut down the client.
4) Reboot the computer.
5) Revert the prior steps in reverse order.

After that, the tasks seemed to resume OK at first, but exited with a computation error later. Alas, I didn't note whether only some or all of the tasks that went through the suspend-resume cycle were affected. Example: task 22245231 (oifs_43r3_ps_0199_2021050100_123_945_12163288_0).

Next time I want to reboot a computer with OpenIFS on it, I'll let these tasks complete first, which is manageable as long as the task durations are as short as with the current work.
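For anyone wanting to reproduce the suspend step of that procedure, here is a rough Python sketch that only builds the `boinccmd --task <url> <name> suspend` command lines, ordered by ascending elapsed time. The project URL and the task names are placeholders, and `suspend_commands` is a hypothetical helper, not the poster's actual automation. Bear in mind that the post reports the tasks failing after this cycle, so this is for investigation rather than a recommendation.

```python
from typing import List, Tuple

# Assumption: placeholder URL. Use the project URL your own client reports for CPDN.
PROJECT_URL = "https://www.cpdn.org/"


def suspend_commands(
    tasks: List[Tuple[str, float]], project_url: str = PROJECT_URL
) -> List[List[str]]:
    """Build one boinccmd suspend command per task.

    tasks holds (task_name, elapsed_seconds) pairs; commands are returned in
    order of ascending elapsed time, as in the procedure described above.
    """
    ordered = sorted(tasks, key=lambda task: task[1])
    return [
        ["boinccmd", "--task", project_url, name, "suspend"]
        for name, _elapsed in ordered
    ]
```

Each returned list could be passed to `subprocess.run`, waiting for the task to show as suspended before issuing the next one.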
ID: 66650
AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,845,098
RAC: 19,856
Message 66651 - Posted: 30 Nov 2022, 7:38:56 UTC - in response to Message 66650.  

... request to suspend the task and wait until it did. ...

At what time point (BOINC elapsed time) did you suspend the referenced task? About 16 minutes?
ID: 66651
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 66652 - Posted: 30 Nov 2022, 8:41:06 UTC
Last modified: 30 Nov 2022, 8:51:06 UTC

At least for me, the issue with uploads is the total amount of data. After a while running just two at once, I ended up with a massive queue. I will let my machine carry on working once I have cleared some of the backlog. Three completed, 2 failed though nothing I can see in stderr to explain it. Both uploaded final zips. Another is uploading. I have stopped tasks running as new zips seem to be jumping the queue.

Edit: I have something over 300 zips waiting in the queue!
ID: 66652
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 66653 - Posted: 30 Nov 2022, 8:44:09 UTC

Just thinking: if there are problems with some versions of GCC on hosts, maybe the long-term answer would be to have Linux hosts, as well as Windows ones, use VirtualBox (VB)?
ID: 66653
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 66654 - Posted: 30 Nov 2022, 9:17:04 UTC

The metric we need to be looking at is the total average data flow rate. That's the integral of the upload file size over time. The problem arises if the time taken to upload a single file is longer than the interval between file production. Making the files 'bigger but further apart' won't help that average speed.

I'm currently running 8 tasks - 4 each on two machines - and uploading them all over the same broadband link. I'm generating more than 1 file per minute, but each file uploads in 10 or 11 seconds. My line can sustain that indefinitely - it's fibre running at 79 Mbps download, 15.8 Mbps upload - but if the Oxford servers take a break, there'll be one heck of a car-crash.
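The argument above can be put into a few lines of Python as a rough check. It is a sketch assuming steady production and a constant upload speed; the function names are invented for this example.

```python
def link_utilisation(upload_seconds_per_file: float, seconds_between_files: float) -> float:
    """Fraction of the time the uplink is busy.

    A value of 1.0 or more means files are produced faster than they can be
    sent, so the upload queue grows without limit.
    """
    return upload_seconds_per_file / seconds_between_files


def backlog_after(hours: float, upload_seconds_per_file: float,
                  seconds_between_files: float) -> float:
    """Approximate number of files still queued after `hours` of steady running."""
    produced = hours * 3600 / seconds_between_files
    uploaded = hours * 3600 / upload_seconds_per_file
    return max(0.0, produced - uploaded)
```

With the figures quoted above (roughly one file per minute, about 11 seconds per upload), `link_utilisation(11, 60)` comes out at about 0.18, so that line keeps up comfortably. Making files twice as large and half as frequent doubles both arguments and leaves the ratio unchanged.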
ID: 66654
AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,845,098
RAC: 19,856
Message 66655 - Posted: 30 Nov 2022, 9:26:46 UTC

I haven't seen a single upload yet (they show as going through in Event Log though) but I do have a good internet connection.

My BOINC seems to be learning quickly as estimated completion time for OpenIFS has gone down from 2+ days to just under 11h:45m after 1 completed but failed task that ran just over 14 hours and 1 newly downloaded task.

Unfortunately, after looking at error files, I expect to have more failed OpenIFS tasks. I do have a developing hypothesis for a possible connection for at least some of the failures.
ID: 66655
klepel
Joined: 9 Oct 04
Posts: 82
Credit: 69,923,532
RAC: 8,011
Message 66656 - Posted: 30 Nov 2022, 10:06:25 UTC - in response to Message 66642.  

I bought 64 GB RAM for one of my Linux computers so that I would be able to process more of the new OpenIFS WUs in parallel. Was it ever mentioned, in the earlier conversation preparing for these OpenIFS WUs, that each WU would generate about 1 GB of trickles?

ADSL:
ADSL in the US can be very, very badly asymmetric. Not that many years back, I had 25Mbit down, 768kbit up (yes, not even 1Mbit up).
Same here (2Mbit down, 768kbit up)! I am not able to change to something faster as I work with it! I have 3 working Linux computers behind this ADSL line and intended to start 2 more. But this will not happen; I am not even able to upload the trickles of 2 WUs running in parallel.
That is, I can only put a fraction of my CPUs to OpenIFS in steady state due to my upload bandwidth limitation.
This wraps it up nicely! This data volume is not sustainable! Now it is not 32bit-libs, now it is bandwidth!
Error: " climateprediction.net: Notice from server
OpenIFS 43r3 Perturbed Surface needs 21968.66MB more disk space. You currently have 16178.31 MB available and it needs 38146.97 MB.
Tue 29 Nov 2022 09:56:52 AM CET"
38 GB sounds like a lot. Is that normal?
Same problem with my WSL2 installation – this computer would not have a pronounced bandwidth problem, but it does not download any tasks because OpenIFS constantly asks for 38146.97 MB! The climateprediction.net project folder on the Linux computer needs only 8.1 GB!

I think fewer trickles of around 100 MB each would be easier for BOINC to handle. But there are more knowledgeable participants in the forum. And yes, the overall size should be reduced to something more manageable, as bandwidth is limiting the throughput of WUs across all participants.
OpenIFS Errors:
https://www.cpdn.org/result.php?resultid=22245289
https://www.cpdn.org/result.php?resultid=22245630
After adjusting app_config from max 4 WUs to 1 WU and afterwards to 2 WUs.
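For readers who have not used app_config before, here is a small Python sketch that generates an `app_config.xml` limiting how many tasks of one application run at once. The `<app_config>`, `<app>`, `<name>` and `<max_concurrent>` elements are standard BOINC client configuration; the application name `oifs_43r3_ps` is an assumption inferred from the task names in this thread, so check the name your client actually reports.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return the text of an app_config.xml limiting concurrent tasks of one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# Assumed application name; the limit of 2 matches the last setting mentioned above.
config_text = build_app_config("oifs_43r3_ps", 2)
```

The resulting text would be saved as `app_config.xml` in the project's folder inside the BOINC data directory, after which the client needs to re-read its config files (or be restarted) for the limit to apply.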
ID: 66656
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66657 - Posted: 30 Nov 2022, 10:18:41 UTC - in response to Message 66654.  

The metric we need to be looking at is the total average data flow rate. That's the integral of the upload file size over time. The problem arises if the time taken to upload a single file is longer than the interval between file production. Making the files 'bigger but further apart' won't help that average speed.
That's what wasn't clear to me - whether the sheer number of zipfiles (Dave reported having over 300) could cause problems for the client & filespace, or whether the dataflow was the issue.

We can reduce the data output from the model, but maybe we need to control how many tasks run at once on a volunteer machine? Anyone saturating their threadrippers with OpenIFS tasks is going to have a big outgoing queue. It's not so easy to control what the volunteers do. I'm open to suggestions.

Am about to post on the 'New work discussion 2' thread about other ongoing problems with the OIFS tasks as people are reporting problems there too.
ID: 66657
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66658 - Posted: 30 Nov 2022, 10:29:43 UTC - in response to Message 66656.  

38 GB is very high. I will look into this. I am running multiple tasks and not seeing anything like this number. I take your point about ADSL. I've been on fibre for so long I'd forgotten about these speeds.

Obviously there should have been more testing on a larger scale. Apologies for the inconvenience.

Error: " climateprediction.net: Notice from server
OpenIFS 43r3 Perturbed Surface needs 21968.66MB more disk space. You currently have 16178.31 MB available and it needs 38146.97 MB.
Tue 29 Nov 2022 09:56:52 AM CET"
38 GB sounds like a lot. Is that normal?
Same problem with my WSL2 installation – this computer would not have a pronounced bandwidth problem, but it does not download any tasks because OpenIFS constantly asks for 38146.97 MB! The climateprediction.net project folder on the Linux computer needs only 8.1 GB!

I think fewer trickles of around 100 MB each would be easier for BOINC to handle. But there are more knowledgeable participants in the forum. And yes, the overall size should be reduced to something more manageable, as bandwidth is limiting the throughput of WUs across all participants.
OpenIFS Errors:
https://www.cpdn.org/result.php?resultid=22245289
https://www.cpdn.org/result.php?resultid=22245630
After adjusting app_config from max 4 WUs to 1 WU and afterwards to 2 WUs.
ID: 66658
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 66659 - Posted: 30 Nov 2022, 10:36:55 UTC
Last modified: 30 Nov 2022, 11:01:03 UTC

38 GB is very high. I will look into this. I am running multiple tasks and not seeing anything like this number. I take your point about ADSL. I've been on fibre for so long I'd forgotten about these speeds.
Strange, with 9 tasks in progress and another 21 in the queue CPDN is only using up 18.25GB. I could try reducing the amount of disk available to BOINC to see when it starts to object. (Currently the disk tab is showing 120GB free and available to BOINC.)

Edit: Setting BOINC to use no more than 28GB leaving only 5GB free available to BOINC still lets OpenIFS tasks run without problems.
ID: 66659
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 66661 - Posted: 30 Nov 2022, 11:10:07 UTC - in response to Message 66659.  

There are possibly two separate issues here.

1) On the server, when it's considering whether a task can be issued to the host requesting it.
2) On the local host, when the client is deciding whether an allocated and downloaded task can be run.

Either or both may be related to the XML specification for the workunit, which contains

<rsc_disk_bound> 40,000,000,000
My calculator makes that 37.25 GB, using 1024x steps.
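That conversion, and its link to the server notice quoted earlier in the thread, can be checked with a few lines of Python. The byte value is the one quoted from the workunit XML above, and the 16178.31 MB figure is taken from the server notice; the function names are just for illustration.

```python
RSC_DISK_BOUND_BYTES = 40_000_000_000  # value quoted from the workunit XML
AVAILABLE_MB = 16178.31                # "currently available" figure from the server notice


def bytes_to_mib(n_bytes: float) -> float:
    """Convert bytes to MiB using 1024x steps."""
    return n_bytes / 1024**2


def bytes_to_gib(n_bytes: float) -> float:
    """Convert bytes to GiB using 1024x steps."""
    return n_bytes / 1024**3


needed_mib = bytes_to_mib(RSC_DISK_BOUND_BYTES)
needed_gib = bytes_to_gib(RSC_DISK_BOUND_BYTES)
shortfall_mib = needed_mib - AVAILABLE_MB

print(f"needed:    {needed_mib:.2f} MiB ({needed_gib:.2f} GiB)")  # 38146.97 MiB (37.25 GiB)
print(f"shortfall: {shortfall_mib:.2f} MiB")                      # 21968.66 MiB
```

Both results match the numbers in the server notice (38146.97 MB needed, 21968.66 MB more required), which supports the idea that the message comes straight from `rsc_disk_bound`.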
ID: 66661
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 66662 - Posted: 30 Nov 2022, 11:48:37 UTC - in response to Message 66656.  

Error: " climateprediction.net: Notice from server
OpenIFS 43r3 Perturbed Surface needs 21968.66MB more disk space. You currently have 16178.31 MB available and it needs 38146.97 MB.
Tue 29 Nov 2022 09:56:52 AM CET"
38 GB sounds like a lot. Is that normal?


I have the same problem, but it may be because "Store at least" was set to 10 days and "Store up to an additional" was also set to 10 days (so both at their maximum values). So the server might be trying to push more WUs than the space available for storage. I've changed these to see whether I will get any WUs.
ID: 66662
klepel
Joined: 9 Oct 04
Posts: 82
Credit: 69,923,532
RAC: 8,011
Message 66663 - Posted: 30 Nov 2022, 11:58:40 UTC - in response to Message 66661.  

Either or both may be related to the XML specification for the workunit, which contains:
<rsc_disk_bound> 40,000,000,000
My calculator makes that 37.25 GB, using 1024x steps.
Glenn, this is the problem: the model asks for too much disk space before it even starts to download an OpenIFS WU. It is not, as you wrote in the other thread, that the model has already downloaded OpenIFS WUs and, after several crashes, run out of allocated disk space in WSL2. BOINC simply sees that not enough disk space is allowed on a particular computer and therefore refuses to start downloading a WU.
I have already checked whether my WSL2 disk is full of crashed models, and there are none! As far as I understand, the researcher has to reduce this to the size that is really needed:
<rsc_disk_bound> 40,000,000,000
It might need just one digit fewer.

In the meantime another model crashed. As there was a “traffic jam” with another BOINC project (SIDOCK produces small WUs with a huge data output, and releases them once a day), the climateprediction.net trickles were not uploaded in time. OpenIFS reports no trickles – they were waiting in line…

I have another question: once the multicore models start to be released, will the trickles be even larger and appear more frequently, or is there no direct link between multithreading and trickle size/frequency?
ID: 66663
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66664 - Posted: 30 Nov 2022, 14:28:10 UTC - in response to Message 66663.  
Last modified: 30 Nov 2022, 14:38:12 UTC

Great, thanks for checking that. 37 GB is too high. This value is not set by the scientist but by the CPDN team. I do recall a conversation with Andy where he said he 'guess-timates' these sizes and in the past has wondered why the tasks are never sent by the server. I've also found I had to remove any disk limits in my boincmgr in order to start getting tasks.

I will bring this up with CPDN. In the meantime, I'll run the model standalone and see how much it produces. That covers the worst-case scenario where all the model output is stored locally until the model ends.

Multi-core. For the same model resolution the trickle size will be the same (unless we change it). The trickle production rate will increase roughly in proportion to the number of cores.
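As a rough illustration of that scaling, here is a short Python sketch. It assumes ideal scaling (N cores reach each output step N times sooner) and an unchanged trickle size; any trickle size or interval passed in is the caller's own figure, not a project number.

```python
def trickle_interval_seconds(single_core_interval: float, cores: int) -> float:
    """Seconds between trickles for one task, assuming ideal scaling with cores."""
    return single_core_interval / cores


def upload_rate_mbit(trickle_megabytes: float, interval_seconds: float) -> float:
    """Sustained upload rate (Mbit/s) needed for one task to keep up."""
    return trickle_megabytes * 8 / interval_seconds
```

For example, a task that produces a trickle every 480 seconds on one core would produce one every 120 seconds on four cores, so the sustained upload rate needed per task would be about four times higher, while the total data per task stays the same.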

The real reason to go multi-core is for the higher resolution model configurations I've discussed with CPDN in order to complete them faster. Not the lower resolutions CPDN is currently running. That could mean increases in file sizes but not necessarily. We could consider running multiple batches with different output to complete the full dataset output the scientist wants.

Regardless of the no. of cores, we can still adjust trickle rate & size (or data-flow rate as Richard eloquently puts it - put another way, data-production rate from the model). We are some way off multi-core at the moment :D

Either or both may be related to the XML specification for the workunit, which contains:
<rsc_disk_bound> 40,000,000,000
My calculator makes that 37.25 GB, using 1024x steps.
Glenn, this is the problem: the model asks for too much disk space before it even starts to download an OpenIFS WU. It is not, as you wrote in the other thread, that the model has already downloaded OpenIFS WUs and, after several crashes, run out of allocated disk space in WSL2. BOINC simply sees that not enough disk space is allowed on a particular computer and therefore refuses to start downloading a WU.
I have already checked whether my WSL2 disk is full of crashed models, and there are none! As far as I understand, the researcher has to reduce this to the size that is really needed:
<rsc_disk_bound> 40,000,000,000
It might need just one digit fewer.

In the meantime another model crashed. As there was a “traffic jam” with another BOINC project (SIDOCK produces small WUs with a huge data output, and releases them once a day), the climateprediction.net trickles were not uploaded in time. OpenIFS reports no trickles – they were waiting in line…

I have another question: once the multicore models start to be released, will the trickles be even larger and appear more frequently, or is there no direct link between multithreading and trickle size/frequency?
ID: 66664
©2024 cpdn.org