Message boards : Number crunching : OpenIFS Discussion
Author | Message |
---|---|
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
ADSL in the US can be very, very badly asymmetric. Not that many years back, I had 25Mbit down, 768kbit up (yes, not even 1Mbit up). Are the current OpenIFS models supposed to be multithreaded? If so, is there an incantation to make them do it? Mine seem to be just single thread right now. |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Hi! Finally got BOINC running again, but only version 7.16.6 from the repos on Mint 20.3, in my limited system directory. Error: " climateprediction.net: Notice from server OpenIFS 43r3 Perturbed Surface needs 21968.66MB more disk space. You currently have 16178.31 MB available and it needs 38146.97 MB. Tue 29 Nov 2022 09:56:52 AM CET" 38 GB sounds like a lot. Is that normal? Asteroids@Home runs fine. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Finally got boinc running again but only the 7.16.6 from the repos on Mint 20.3 in my limited system directory" Sounds mighty greedy to me. I am running 5 projects and BOINC is currently using only 6.64 GBytes. My CPDN is running three OIFS tasks and CPDN is using only 4.36 GBytes. Here is one that finished recently. Application version: OpenIFS 43r3 Perturbed Surface v1.01 x86_64-pc-linux-gnu. Peak working set size: 4,769.45 MB. Peak disk usage: 1,220.29 MB. |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Computation error: oifs_43r3_ps_08078_221050100_123945 cracks up at the end, and two trickles refuse to upload. Oops, after a long wait they disappeared from the upload list. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Is it the amount of trickles that's an issue? Or the total amount of data? The model output for the complete forecast is split into the smaller trickle files (to ease the data upload burden). We could do fewer trickles but the total data size would be the same (each trickle would be larger)." I wonder how the Internet links and Internet Service Providers supporting the CPDN community are distributed. It seems to me that the problem, if there really is a problem, is the number of trickles times the size of a trickle. So making the trickles larger so as to send fewer of them would make no difference. But that is just my intuition; I have no data to back up my opinion. "I'm assuming it's the total size of the upload (sum of all trickle sizes) that's a problem? We can ask the scientist to reduce the model output if necessary." Since I do not have trouble uploading my trickles as things are now (one trickle per work unit every 8 minutes or so, and a trickle takes an average of 5 seconds to send), I sure do not want the model output to be reduced on my account. But I have a wide-band fiber-optic Internet connection and am running only 3 OIFS work units at a time. And what happens if some high-powered user has 10 computers each running 10 work units on the same Internet connection? And if that is not enough, what if the models start running multi-processing work units like some Milky-way ones do? On the one hand, one could argue that they could just give up on users with slow connections. That seems a shame and unfair. On the other hand, since at least some users have Internet connections that easily keep up with the (current) demands of sending the trickles, there should be no reason to compromise the amount of data they are allowed to send. Has the project received sufficient work to know and understand the population's Internet connections so a reasonable policy can be determined? |
Send message Joined: 31 May 18 Posts: 53 Credit: 4,725,987 RAC: 9,174 |
I've got a few of these new units. So far two completed ok and two with errors. Ah! Excellent. I've been trying to understand why some tasks are apparently stopping with nothing in the stderr.txt returned to the server to explain why it stopped. https://www.cpdn.org/show_host_detail.php?hostid=1534740 $ uname -a Linux Zen 5.15.0-53-generic #59~20.04.1-Ubuntu SMP Thu Oct 20 15:10:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux https://www.cpdn.org/result.php?resultid=22245900 https://www.cpdn.org/result.php?resultid=22245702 I have the same number of each running at the moment. |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
On the issue of result data size: I have a few dual-processor Xeon and Epyc computers at home, but only a simple cable modem connection. I am observing an upload speed on the order of 8 Mbit/s and figure that the result rate from a single 32-core Epyc already saturates this link. That is, I can only put a fraction of my CPUs to OpenIFS in steady state due to my upload bandwidth limitation. On computation errors: None so far here, with this exception: I have OpenIFS on three computers for now (but see above) and took the risk of rebooting one of them. These were the steps I took, automated: (1) Switch "leave non-GPU tasks in memory while suspended" from on to off. (2) For each task in order of ascending elapsed time, request to suspend the task and wait until it did. (3) Shut down the client. (4) Reboot the computer. (5) Revert the prior steps in reverse order. After that, the tasks seemingly resumed OK at first, but exited with a computation error later. Alas, I didn't take note of whether only some or all of the tasks which went through the suspend-resume cycle were affected. Example: task 22245231 (oifs_43r3_ps_0199_2021050100_123_945_12163288_0). Next time I want to reboot a computer with OpenIFS on it, I'll let these tasks complete first, which is manageable as long as the task durations are as short as with the current work. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,816,935 RAC: 19,934 |
"... request to suspend the task and wait until it did. ..." At what point (BOINC elapsed time) did you suspend the referenced task? About 16 minutes? |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
At least for me, the issue with uploads is the total amount of data. After a while running just two at once, I ended up with a massive queue. I will let my machine carry on working once I have cleared some of the backlog. Three completed, 2 failed, though there is nothing I can see in stderr to explain it. Both uploaded final zips. Another is uploading. I have stopped tasks running, as new zips seem to be jumping the queue. Edit: I have something over 300 zips waiting in the queue! |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
Just thinking, if there are problems with some versions of GCC on hosts, maybe long term the answer would be to have Linux hosts as well as Windows ones use VB? |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
The metric we need to be looking at is the total average data flow rate. That's the integral of the upload file size over time. The problem arises if the time taken to upload a single file is longer than the interval between file production. Making the files 'bigger but further apart' won't help that average speed. I'm currently running 8 tasks - 4 each on two machines - and uploading them all over the same broadband link. I'm generating more than 1 file per minute, but each file uploads in 10 or 11 seconds. My line can sustain that indefinitely - it's fibre running at 79 Mbps download, 15.8 Mbps upload - but if the Oxford servers take a break, there'll be one heck of a car-crash. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,816,935 RAC: 19,934 |
I haven't seen a single upload yet (they show as going through in Event Log though) but I do have a good internet connection. My BOINC seems to be learning quickly as estimated completion time for OpenIFS has gone down from 2+ days to just under 11h:45m after 1 completed but failed task that ran just over 14 hours and 1 newly downloaded task. Unfortunately, after looking at error files, I expect to have more failed OpenIFS tasks. I do have a developing hypothesis for a possible connection for at least some of the failures. |
Send message Joined: 9 Oct 04 Posts: 82 Credit: 69,921,875 RAC: 8,353 |
I bought 64 GB of RAM for one of my Linux computers so that I would be able to process more of the new OpenIFS WUs in parallel. Was it ever mentioned, in the earlier conversations preparing for these OpenIFS WUs, that each WU would generate about 1 GB of trickles? ADSL: "ADSL in the US can be very, very badly asymmetric. Not that many years back, I had 25Mbit down, 768kbit up (yes, not even 1Mbit up)." Same here (2Mbit down, 768kbit up)! I am not able to change to something faster, as I work with it! I have 3 working Linux computers behind this ADSL line and intended to start 2 more. But this will not happen; I am not even able to upload the trickles of 2 WUs running in parallel. "That is, I can only put a fraction of my CPUs to OpenIFS in steady state due to my upload bandwidth limitation." This wraps it up nicely! This data volume is not sustainable! This time it is not the 32-bit libs; now it is bandwidth! Error: " climateprediction.net: Notice from server" Same problem with my WSL2 installation: this computer would not have a pronounced bandwidth problem, but it does not download any tasks, as OpenIFS constantly asks for 38146.97 MB! The climateprediction.net project folder on the Linux computer needs only 8.1 GB! I think fewer trickles of around 100 MB each would be easier for BOINC to handle. But there are more knowledgeable participants in the forum. And yes, the overall size should be reduced to something more manageable, as bandwidth is limiting the throughput of WUs across all participants. OpenIFS errors: https://www.cpdn.org/result.php?resultid=22245289 https://www.cpdn.org/result.php?resultid=22245630 After adjusting app_config from max 4 WUs to 1 WU and afterwards to 2 WUs. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"The metric we need to be looking at is the total average data flow rate. That's the integral of the upload file size over time. The problem arises if the time taken to upload a single file is longer than the interval between file production. Making the files 'bigger but further apart' won't help that average speed." That's what wasn't clear to me - whether the sheer number of zipfiles (Dave reported having over 300) could cause problems for the client & filespace, or whether the dataflow was the issue. We can reduce the data output from the model, but maybe we need to control how many tasks run at once on a volunteer machine? Anyone saturating their Threadrippers with OpenIFS tasks is going to have a big outgoing queue. It's not so easy to control what the volunteers do. I'm open to suggestions. I'm about to post on the 'New work discussion 2' thread about other ongoing problems with the OIFS tasks, as people are reporting problems there too. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
38 GB is very high. I will look into this. I am running multiple tasks and not seeing anything like this number. I take your point about ADSL. I've been on fibre for so long I'd forgotten about these speeds. Obviously there should have been more testing on a larger scale. Apologies for the inconvenience. Error: " climateprediction.net: Notice from server" Same problem with my WSL2 installation: this computer would not have a pronounced bandwidth problem, but does not download any tasks as OpenIFS asks constantly for 38146.97 MB! climateprediction.net project-folder on the Linux Computer needs only 8.1 GB! |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
"38 GB is very high. I will look into this. I am running multiple tasks and not seeing anything like this number. I take your point about ADSL. I've been on fibre for so long I'd forgotten about these speeds." Strange, with 9 tasks in progress and another 21 in the queue, CPDN is only using up 18.25 GB. I could try reducing the amount of disk available to BOINC to see when it starts to object. (Currently the disk tab is showing 120 GB free and available to BOINC.) Edit: Setting BOINC to use no more than 28 GB, leaving only 5 GB free and available to BOINC, still lets OpenIFS tasks run without problems. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
There are possibly two separate issues here. 1) On the server, when it's considering whether a task can be issued to the host requesting it. 2) On the local host, when the client is deciding whether an allocated and downloaded task can be run. Either or both may be related to the XML specification for the workunit, which contains <rsc_disk_bound> 40,000,000,000. My calculator makes that 37.25 GB, using 1024x steps. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Error: " climateprediction.net: Notice from server" I have the same problem, but it may be because "Store at least" was set to 10 days and "Store up to an additional" to 10 days (so both at MAX values). So the server might be trying to push more WUs than the available storage space allows. I've changed these to see if I will get any WUs. |
Send message Joined: 9 Oct 04 Posts: 82 Credit: 69,921,875 RAC: 8,353 |
"Either or both may be related to the XML specification for the workunit, which contains: <rsc_disk_bound> 40,000,000,000. My calculator makes that 37.25 GB, using 1024x steps." Glenn, this is the problem: the model asks for too much disk space before it even starts to download an OpenIFS WU. It is not, as you have written in the other thread, that the model has already downloaded OpenIFS WUs and, after several crashes, has run out of allocated disk space in WSL2. BOINC simply sees that there is not sufficient space allowed on the hard disk of a particular computer and therefore refuses to start downloading a WU. I have already checked whether my WSL2 disk is full of crashed models, and there are none! As far as I understand, the researcher has to reduce this to the size that is really needed: <rsc_disk_bound> 40,000,000,000. It might be just one decimal place less. In the meantime another model crashed. As there has been a "traffic jam" with another BOINC project (SIDOCK produces small WUs with a huge data output, and releases these once a day), climateprediction.net trickles have not been uploaded in time. OpenIFS reports no trickles; they have been waiting in line. I have another question: once the multicore models start to be released, will the trickles be even larger and appear at a higher frequency, or is there no direct link between multithreading and trickle size/frequency? |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Great, thanks for checking that. 37 GB is too high. This value is not set by the scientist but by the CPDN team. I do recall a conversation with Andy where he said he 'guess-timates' these sizes and in the past has wondered why the tasks are never sent by the server. I've also found I had to remove any disk limits in my boincmgr in order to start getting tasks. I will bring this up with CPDN; however, I'll first run it standalone and see how much it produces. That covers the worst-case scenario where all the model output is stored locally until the model ends. Multi-core: for the same model resolution the trickle size will be the same (unless we change it). The trickle production rate will increase by a factor roughly equal to the number of cores. The real reason to go multi-core is to complete the higher-resolution model configurations I've discussed with CPDN faster, not the lower resolutions CPDN is currently running. That could mean increases in file sizes, but not necessarily. We could consider running multiple batches with different output to complete the full dataset the scientist wants. Regardless of the number of cores, we can still adjust trickle rate & size (or data-flow rate as Richard eloquently puts it - put another way, data-production rate from the model). We are some way off multi-core at the moment :D "Either or both may be related to the XML specification for the workunit, which contains: <rsc_disk_bound> 40,000,000,000. My calculator makes that 37.25 GB, using 1024x steps." "Glenn, this is the problem, the model asks for too much disk space before it even starts to download a OpenIFS Wu. It is not, as you have written in the other thread, that the model has already downloaded OpenIFS Wus and after several crashes has run out of allocated disk space in WSL2. BOINC simple sees there is not sufficient space allowed on the hard-disk on a particular computer and therefor refuses starting to download a WU." |
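
For anyone wanting to check the disk-space arithmetic discussed above, here is a minimal Python sketch. It assumes the `<rsc_disk_bound>` value quoted in the thread (40,000,000,000) is in bytes and that the "MB" in the server notice means steps of 1024, which is what the quoted numbers suggest. The function names are made up for illustration and are not part of BOINC.

```python
# Sketch only: reproduces the numbers quoted in the thread, assuming
# rsc_disk_bound is in bytes and the server notice uses 1024-based "MB".

RSC_DISK_BOUND_BYTES = 40_000_000_000  # value quoted from the workunit XML


def to_mib(n_bytes: float) -> float:
    """Convert bytes to mebibytes (1024 * 1024 bytes)."""
    return n_bytes / 1024**2


def to_gib(n_bytes: float) -> float:
    """Convert bytes to gibibytes (1024 ** 3 bytes)."""
    return n_bytes / 1024**3


def shortfall_mib(required_bytes: float, available_mib: float) -> float:
    """Extra space (in MiB) needed before the requirement is met."""
    return max(0.0, to_mib(required_bytes) - available_mib)


print(f"required:  {to_mib(RSC_DISK_BOUND_BYTES):.2f} MiB")
print(f"required:  {to_gib(RSC_DISK_BOUND_BYTES):.2f} GiB")
# 16178.31 is the "available" figure from the server notice in the thread.
print(f"shortfall: {shortfall_mib(RSC_DISK_BOUND_BYTES, 16178.31):.2f} MiB")
```

Under those assumptions the three printed figures line up with the 38146.97 MB, 37.25 GB and 21968.66 MB mentioned earlier, which supports the idea that the notice comes straight from the workunit's disk bound rather than from actual disk usage.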
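
The data-flow point made in this thread (what matters is the average rate, not the number of files) can also be put into a few lines of Python. This is only a rough sketch: the file sizes and rates below are illustrative placeholders, not measured OpenIFS values, and it ignores protocol overhead, server downtime and other traffic on the line. `core_speedup` is a hypothetical parameter standing in for the statement that multi-core runs would produce trickles faster by roughly the number of cores.

```python
import math


def average_rate_mbit_s(file_mb: float, files_per_hour: float) -> float:
    """Average upload rate (Mbit/s) needed to keep pace with file production."""
    return file_mb * 8 * files_per_hour / 3600


def uplink_utilisation(file_mb: float, files_per_hour: float,
                       uplink_mbit_s: float, n_tasks: int = 1,
                       core_speedup: float = 1.0) -> float:
    """Fraction of the uplink consumed by n_tasks concurrent tasks.

    A value above 1.0 means files are produced faster than they can be
    sent, so the upload queue keeps growing.
    """
    per_task = average_rate_mbit_s(file_mb, files_per_hour * core_speedup)
    return n_tasks * per_task / uplink_mbit_s


# 'Bigger but further apart' gives the same average rate (placeholder sizes).
small_often = average_rate_mbit_s(file_mb=10, files_per_hour=6)
large_rare = average_rate_mbit_s(file_mb=20, files_per_hour=3)
print(math.isclose(small_often, large_rare))

# Placeholder task: 45 MB files, 10 per hour, on a 2 Mbit/s uplink.
print(uplink_utilisation(45, 10, uplink_mbit_s=2.0))
print(uplink_utilisation(45, 10, uplink_mbit_s=2.0, n_tasks=2, core_speedup=2.0))
```

To apply it to a real host, plug in your own measured trickle size, trickle interval and upload speed; anything close to or above 1.0 suggests running fewer tasks at once on that connection.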