Message boards : Number crunching : OpenIFS Discussion
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
ADSL in the US can be very, very badly asymmetric. Not that many years back, I had 25Mbit down, 768kbit up (yes, not even 1Mbit up). Are the current OpenIFS models supposed to be multithreaded? If so, is there an incantation to make them do it? Mine seem to be just single thread right now. |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Hi! Finally got BOINC running again, but only version 7.16.6 from the repos on Mint 20.3, in my limited system directory. Error: "climateprediction.net: Notice from server: OpenIFS 43r3 Perturbed Surface needs 21968.66 MB more disk space. You currently have 16178.31 MB available and it needs 38146.97 MB. Tue 29 Nov 2022 09:56:52 AM CET" 38 GB sounds like a lot; is that normal? Asteroids@Home runs fine. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Finally got boinc running again but only the 7.16.6 from the repos on Mint 20.3 in my limited system directory" Sounds mighty greedy to me. I am running 5 projects and BOINC is currently using only 6.64 GB. My CPDN is running three OIFS tasks, and CPDN is using only 4.36 GB. Here is one that finished recently: Application version: OpenIFS 43r3 Perturbed Surface v1.01 x86_64-pc-linux-gnu; Peak working set size: 4,769.45 MB; Peak disk usage: 1,220.29 MB |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Computation error: oifs_43r3_ps_08078_221050100_123945 cracks up at the end; two trickles refuse to upload. Oops: after a long wait they disappeared from the upload list. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Is it the amount of trickles that's an issue? Or the total amount of data? The model output for the complete forecast is split into the smaller trickle files (to ease the data upload burden). We could do fewer trickles but the total data size would be the same (each trickle would be larger)." I wonder how the Internet links and Internet Service Providers supporting the CPDN community are distributed. It seems to me that the problem, if there really is one, is the product of the number of trickles times the size of a trickle, so making the trickles larger in order to send fewer of them would make no difference. But that is just my intuition; I have no data to back it up. "I'm assuming it's the total size of the upload (sum of all trickle sizes) that's a problem? We can ask the scientist to reduce the model output if necessary." Since I have no trouble uploading my trickles as things are now (one trickle per work unit every 8 minutes or so, and a trickle takes an average of 5 seconds to send), I certainly do not want the model output reduced on my account. But I have a wide-band fibre-optic Internet connection and am running only 3 OIFS work units at a time. What happens if some high-powered user has 10 computers, each running 10 work units, on the same Internet connection? And if that is not enough, what if the models start running as multi-processing work units, like some MilkyWay ones do? On the one hand, one could argue that the project could just give up on users with slow connections; that seems a shame, and unfair. On the other hand, since at least some users' Internet connections easily keep up with the (current) demands of sending the trickles, there is no reason to compromise the amount of data they are allowed to send. Has the project received enough information about the population's Internet connections for a reasonable policy to be determined? |
Send message Joined: 31 May 18 Posts: 53 Credit: 4,725,987 RAC: 9,174 |
"I've got a few of these new units. So far two completed ok and two with errors." Ah! Excellent. I've been trying to understand why some tasks are apparently stopping with nothing in the stderr.txt returned to the server to explain why they stopped. https://www.cpdn.org/show_host_detail.php?hostid=1534740 $ uname -a Linux Zen 5.15.0-53-generic #59~20.04.1-Ubuntu SMP Thu Oct 20 15:10:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux https://www.cpdn.org/result.php?resultid=22245900 https://www.cpdn.org/result.php?resultid=22245702 I have the same number of each running at the moment. |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
On the issue of result data size: I have a few dual-processor Xeon and Epyc computers at home, but only a simple cable modem connection. I am observing an upload speed on the order of 8 Mbit/s and figure that the result rate from a single 32-core Epyc already saturates this link. That is, I can only put a fraction of my CPUs on OpenIFS in steady state due to my upload bandwidth limitation. On computation errors: none so far here, with this exception. I have OpenIFS on three computers for now (but see above) and took the risk of rebooting one of them. These were the steps I took, automated: switch "leave non-GPU tasks in memory while suspended" from on to off; for each task, in order of ascending elapsed time, request to suspend the task and wait until it did; shut down the client; reboot the computer; revert the prior steps in reverse order. After that, the tasks resumed seemingly OK at first, but exited with a computation error later. Alas, I didn't note whether only some or all of the tasks that went through the suspend-resume cycle were affected. Example: task 22245231 (oifs_43r3_ps_0199_2021050100_123_945_12163288_0). Next time I want to reboot a computer with OpenIFS on it, I'll let the tasks complete first, which is manageable as long as task durations are as short as with the current work. |
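The suspend-before-reboot procedure described above can be sketched with BOINC's `boinccmd` command-line tool (the "leave non-GPU tasks in memory" toggle still has to be changed in the manager). The function names and the placeholder task name are mine, not from the post; by default the sketch does a dry run that only prints the commands it would issue.

```shell
# Hedged sketch of the suspend-then-reboot procedure using boinccmd.
# BOINCCMD defaults to a dry run ('echo boinccmd') that prints commands
# instead of issuing them; set BOINCCMD=boinccmd to run for real.
# The function names are made up; the boinccmd subcommands are real.
BOINCCMD="${BOINCCMD:-echo boinccmd}"
PROJECT_URL="https://www.cpdn.org/"

pre_reboot() {
    # Suspend each named task (boinccmd --task URL name suspend),
    # then shut the client down cleanly before rebooting.
    for task in "$@"; do
        $BOINCCMD --task "$PROJECT_URL" "$task" suspend
    done
    $BOINCCMD --quit
}

post_reboot() {
    # After the reboot, restart the client, then resume the tasks
    # in reverse of the order in which they were suspended.
    for task in "$@"; do
        $BOINCCMD --task "$PROJECT_URL" "$task" resume
    done
}

pre_reboot oifs_task_placeholder    # dry run: prints the commands
```

Task names for the loop can be read from `boinccmd --get_tasks` output.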
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,780,446 RAC: 19,423 |
"... request to suspend the task and wait until it did. ..." At what time point (BOINC elapsed time) did you suspend the referenced task; about 16 minutes? |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,966,742 RAC: 21,869 |
At least for me, the issue with uploads is the total amount of data. After a while running just two at once, I ended up with a massive queue. I will let my machine carry on working once I have cleared some of the backlog. Three completed, two failed, though nothing I can see in stderr explains it; both uploaded their final zips. Another is uploading. I have stopped tasks running, as new zips seem to be jumping the queue. Edit: I have something over 300 zips waiting in the queue! |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,966,742 RAC: 21,869 |
Just thinking: if there are problems with some versions of GCC on hosts, maybe the long-term answer would be to have Linux hosts, as well as Windows ones, use VirtualBox? |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,861 RAC: 10,559 |
The metric we need to be looking at is the total average data flow rate. That's the integral of the upload file size over time. The problem arises if the time taken to upload a single file is longer than the interval between file production. Making the files 'bigger but further apart' won't help that average speed. I'm currently running 8 tasks - 4 each on two machines - and uploading them all over the same broadband link. I'm generating more than 1 file per minute, but each file uploads in 10 or 11 seconds. My line can sustain that indefinitely - it's fibre running at 79 Mbps download, 15.8 Mbps upload - but if the Oxford servers take a break, there'll be one heck of a car-crash. |
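As a sanity check on the data-flow argument above, a few lines of illustrative arithmetic using the figures quoted in the post (8 tasks, just over one file per minute, 10-11 s per upload on a 15.8 Mbit/s uplink); the exact 1.2 files/min value is my assumption:

```python
# Back-of-envelope check of the post's data-flow reasoning.
# All figures are taken or estimated from the post itself.

uplink_bps = 15.8e6          # upload capacity, bits/s
upload_seconds = 10.5        # observed time to send one trickle file
files_per_minute = 1.2       # assumed production rate across all 8 tasks

# Implied size of one trickle file, in megabytes
file_mb = uplink_bps * upload_seconds / 8 / 1e6

# Fraction of the uplink kept busy; >1.0 means the queue grows without bound
utilization = (files_per_minute / 60) * upload_seconds

print(f"~{file_mb:.0f} MB per file, uplink {utilization:.0%} busy")
```

On these numbers the link is only about a fifth utilized, which matches the post's claim that it can sustain the load indefinitely; the queue only explodes when uploads stall.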
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,780,446 RAC: 19,423 |
I haven't seen a single upload yet (they show as going through in Event Log though) but I do have a good internet connection. My BOINC seems to be learning quickly as estimated completion time for OpenIFS has gone down from 2+ days to just under 11h:45m after 1 completed but failed task that ran just over 14 hours and 1 newly downloaded task. Unfortunately, after looking at error files, I expect to have more failed OpenIFS tasks. I do have a developing hypothesis for a possible connection for at least some of the failures. |
Send message Joined: 9 Oct 04 Posts: 82 Credit: 69,916,905 RAC: 9,226 |
I bought 64 GB of RAM for one of my Linux computers so I would be able to process more of the new OpenIFS WUs in parallel. Was it even mentioned in the previous conversation, in preparation for these OpenIFS WUs, that each WU would generate about 1 GB of trickles? ADSL: "ADSL in the US can be very, very badly asymmetric. Not that many years back, I had 25Mbit down, 768kbit up (yes, not even 1Mbit up)." Same here (2Mbit down, 768kbit up)! I am not able to change to something faster, as I work with it! I have 3 working Linux computers behind this ADSL line and intended to start 2 more. But that will not happen; I am not even able to upload the trickles of 2 WUs running in parallel. "That is, I can only put a fraction of my CPUs to OpenIFS in steady state due to my upload bandwidth limitation." This wraps it up nicely! This data volume is not sustainable! Now it is not 32-bit libs, now it is bandwidth! "Error: climateprediction.net: Notice from server" Same problem with my WSL2 installation; this computer would not have a pronounced bandwidth problem, but it does not download any tasks, as OpenIFS constantly asks for 38146.97 MB! The climateprediction.net project folder on the Linux computer needs only 8.1 GB! I think fewer trickles, of around 100 MB each, would be easier for BOINC to handle. But there are more knowledgeable participants in the forum. And yes, the overall size should be reduced to something more manageable, as bandwidth is limiting the throughput of WUs across all participants. OpenIFS errors: https://www.cpdn.org/result.php?resultid=22245289 https://www.cpdn.org/result.php?resultid=22245630 After adjusting app_config from a maximum of 4 WUs to 1 WU, and afterwards to 2 WUs. |
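To put the ADSL numbers above in perspective, a rough sketch: the ~1 GB of trickles per WU is from this post, and the ~12 h task runtime is from other reports in this thread; both are loose assumptions, not measured values.

```python
# Rough check: how long does ~1 GB of trickle data per work unit take
# on a 768 kbit/s ADSL uplink, and how many concurrent tasks can such
# a line sustain? All inputs are approximate figures from the thread.

uplink_bps = 768e3            # ADSL upload, bits/s
data_per_wu_bytes = 1e9       # ~1 GB of trickles per work unit
runtime_hours = 12            # approximate task duration

upload_hours = data_per_wu_bytes * 8 / uplink_bps / 3600
max_sustained_tasks = runtime_hours / upload_hours

print(f"{upload_hours:.1f} h of upload per WU -> at most "
      f"{max_sustained_tasks:.0f} concurrent tasks in steady state")
```

So on these assumptions the line spends nearly three hours uploading each WU's output, which is consistent with the poster's report that even 2 parallel WUs outrun a 768 kbit/s uplink once other traffic shares it.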
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403 |
"The metric we need to be looking at is the total average data flow rate. That's the integral of the upload file size over time. The problem arises if the time taken to upload a single file is longer than the interval between file production. Making the files 'bigger but further apart' won't help that average speed." That's what wasn't clear to me: whether the sheer number of zip files (Dave reported having over 300) could cause problems for the client and filespace, or whether the data flow was the issue. We can reduce the data output from the model, but maybe we need to control how many tasks run at once on a volunteer machine? Anyone saturating their Threadrippers with OpenIFS tasks is going to have a big outgoing queue. It's not so easy to control what the volunteers do; I'm open to suggestions. I'm about to post on the 'New work discussion 2' thread about other ongoing problems with the OIFS tasks, as people are reporting problems there too. |
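For volunteers who want to cap this on their own machines, BOINC's standard `app_config.xml` mechanism can limit concurrent tasks per application. A sketch, assuming the application's short name matches the task-name prefix seen in this thread (`oifs_43r3_ps`); the real name should be checked in `client_state.xml`, and the limits shown are arbitrary examples:

```xml
<!-- Hypothetical app_config.xml, placed in the client's
     projects/climateprediction.net/ directory.
     The <name> value is inferred from task names like
     oifs_43r3_ps_0199_...; verify it in client_state.xml. -->
<app_config>
  <app>
    <name>oifs_43r3_ps</name>
    <max_concurrent>2</max_concurrent>
  </app>
  <!-- Optional overall cap across all CPDN apps -->
  <project_max_concurrent>4</project_max_concurrent>
</app_config>
```

The client picks this up on restart or via "Options → Read config files" in the manager, so a volunteer with a fast CPU but a slow uplink can throttle task concurrency without the project changing anything server-side.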
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403 |
"Error: climateprediction.net: Notice from server ... Same problem with my WSL2 installation; this computer would not have a pronounced bandwidth problem, but it does not download any tasks, as OpenIFS constantly asks for 38146.97 MB! The climateprediction.net project folder on the Linux computer needs only 8.1 GB!" 38 GB is very high; I will look into this. I am running multiple tasks and not seeing anything like this number. I take your point about ADSL; I've been on fibre for so long I'd forgotten about those speeds. Obviously there should have been more testing on a larger scale. Apologies for the inconvenience. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,966,742 RAC: 21,869 |
"38Gb is very high. I will look into this. I am running multiple tasks and not seeing anything like this number. I take your point about ADSL. I've been on fibre for so long I'd forgotten about these speeds." Strange; with 9 tasks in progress and another 21 in the queue, CPDN is only using 18.25 GB. I could try reducing the amount of disk available to BOINC to see when it starts to object. (Currently the disk tab shows 120 GB free and available to BOINC.) Edit: Setting BOINC to use no more than 28 GB, leaving only 5 GB free available to BOINC, still lets OpenIFS tasks run without problems. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,690,861 RAC: 10,559 |
There are possibly two separate issues here: 1) on the server, when it's considering whether a task can be issued to the requesting host; 2) on the local host, when the client is deciding whether an allocated and downloaded task can be run. Either or both may be related to the XML specification for the workunit, which contains <rsc_disk_bound>40,000,000,000</rsc_disk_bound>. My calculator makes that 37.25 GB, using 1024-based steps. |
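The two figures quoted in this thread are in fact the same `<rsc_disk_bound>` value expressed in different binary units, as a couple of lines of Python confirm:

```python
# The server notice's "38146.97 MB" and the "37.25 GB" computed above
# are both 40,000,000,000 bytes converted with 1024-based units.

rsc_disk_bound = 40_000_000_000          # bytes, from the workunit XML

mb = rsc_disk_bound / 1024**2            # matches the server notice's "MB"
gb = rsc_disk_bound / 1024**3

print(f"{mb:.2f} MB = {gb:.2f} GB")      # 38146.97 MB = 37.25 GB
```

So the disk-space refusals and the 38146.97 MB notices trace back to this single per-workunit bound, not to space actually consumed on disk.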
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Error: " climateprediction.net: Notice from server I have the same problem, but it may be because Store at was set to 10 days and Store up to a 10 days (so at MAX values). So the server might be trying to push more WUs than the available space for Storage. I've changed these to see if I will get any WU. |
Send message Joined: 9 Oct 04 Posts: 82 Credit: 69,916,905 RAC: 9,226 |
"Either or both may be related to the XML specification for the workunit, which contains: <rsc_disk_bound> 40,000,000,000. My calculator makes that 37.25 GB, using 1024x steps." Glenn, this is the problem: the model asks for too much disk space before it even starts to download an OpenIFS WU. It is not, as you wrote in the other thread, that the model has already downloaded OpenIFS WUs and, after several crashes, has run out of allocated disk space in WSL2. BOINC simply sees that there is not sufficient space allowed on the hard disk of a particular computer and therefore refuses to start downloading a WU. I have already checked whether my WSL2 disk is full of crashed models, and it is not! As far as I understand, the researcher has to reduce <rsc_disk_bound> to the size that is really needed; it might be just one decimal place less. In the meantime another model crashed. As there was a "traffic jam" with another BOINC project (SiDock produces small WUs with a huge data output and releases them once a day), climateprediction.net trickles were not uploaded in time; OpenIFS reported no trickles, as they were waiting in line... I have another question: once the multi-core models start to be released, will the trickles be even larger and appear at a higher frequency, or is there no direct link between multi-threading and trickle size/frequency? |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403 |
"Glenn, this is the problem: the model asks for too much disk space before it even starts to download an OpenIFS WU..." Great, thanks for checking that. 37 GB is too high. This value is not set by the scientist but by the CPDN team; I recall a conversation with Andy where he said he 'guess-timates' these sizes and has in the past wondered why tasks were never sent by the server. I've also found I had to remove any disk limits in my boincmgr in order to start getting tasks. I will bring this up with CPDN; meanwhile, I'll run the model standalone and see how much output it produces. That covers the worst-case scenario where all the model output is stored locally until the model ends. Multi-core: for the same model resolution, the trickle size will be the same (unless we change it), and the trickle production rate will increase by roughly the number of cores. The real reason to go multi-core is the higher-resolution model configurations I've discussed with CPDN, in order to complete them faster; not the lower resolutions CPDN is currently running. That could mean increases in file sizes, but not necessarily; we could consider running multiple batches with different output to complete the full dataset the scientist wants. Regardless of the number of cores, we can still adjust trickle rate and size (or data-flow rate, as Richard eloquently puts it; put another way, the data-production rate of the model). We are some way off multi-core at the moment :D |
©2024 cpdn.org