Message boards :
Number crunching :
OpenIFS Discussion
Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403
Just in the final stage of testing a new version of the OpenIFS wrapper/controller code. Thanks to the bug reports and some debugging, we've isolated and, hopefully, fixed most of the issues. The task XML has also been corrected: memory_bound has gone up and disk_bound is now 10x lower. There are about 6,500 tasks of the oifs_43r3_bl app waiting to go, with another 39,000 of the oifs_43r3_ps app. These should go out once we're happy with the tests of the new code.
Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403
Just successfully finished a task that failed on 2 other machines: https://www.cpdn.org/workunit.php?wuid=12165996. One of the PCs it failed on has crashed all of the few dozen OIFS tasks it attempted and still has about a dozen to go (https://www.cpdn.org/results.php?hostid=1536378&offset=0&show_names=0&state=0&appid=39). It has a pretty old CPU, a Xeon E5530; I wonder if that has something to do with it. I don't regard OpenIFS as a specialized application. I had a look: the disk filled up.

08:49:03 STEP 96 H= 96:00 +CPU=109.067
forrtl: No space left on device
forrtl: severe (38): error during write, unit 20, file /var/lib/boinc-client/slots/14/NODE.001_01
Joined: 2 Oct 19 Posts: 21 Credit: 47,674,094 RAC: 24,265
I used an app_config.xml file in the CPDN project directory to control the number of OpenIFS tasks running simultaneously on each computer. I started out with 8 tasks on my 16-core Zen3 and Zen4 computers, each with 64 GB of RAM, and eventually reduced that to 6 tasks to be absolutely sure there was enough RAM to support 6 tasks running simultaneously. My experience with my old dual Ivy Bridge was poor at best: I ran 10 tasks simultaneously (1 task for each real core). That computer has 96 GB of ECC RAM, so I assumed that would be enough. My internet connection is rated at 110 Mbit/s down and up, so that should be enough to handle the trickle and final uploads of the 34-42 total tasks running simultaneously.

As posted by others in the CPDN forum, rebooting the computer is a bad idea, since all tasks running at the time will eventually fail; I lost 10 tasks on my dual Ivy Bridge testing this. Another observation made by others is that enabling "leave non-GPU tasks in memory while suspended" in the BOINC options is a necessity for successfully completing tasks that have been temporarily suspended for any reason.

Summary of OpenIFS results by computer:

3950X, 64 GB RAM, Linux Mint 20.3, MW running on 4 instances (Radeon VII): 33 tasks completed successfully, no errors. This was the only computer that had "leave non-GPU tasks in memory while suspended" enabled in the BOINC options from the start.

3950X, 64 GB RAM, Linux Mint 20.3, F@H running on an nvidia GPU: 32 successful tasks, 1 error (suspended and restarted without "leave non-GPU tasks in memory while suspended" enabled).

5950X, 64 GB ECC RAM, Linux Mint 21, F@H running on an nvidia GPU: 29 successful tasks, 2 errors. One error was suspended and restarted without "leave non-GPU tasks in memory while suspended" enabled; the other was 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, and it's not clear from the stderr output what the problem was.

5950X, 64 GB ECC RAM, Kubuntu 20.04, F@H running on an nvidia GPU: 41 successful tasks, 7 errors. 4 ended with "double free or corruption (out)" in the stderr output, 2 with "free(): invalid pointer", and 1 with 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT. This computer is my main computer, so I have other crap running from time to time, including Chrome, boinctasks-js, Discord and BOINC Manager. 8 simultaneous tasks was not sustainable due to RAM limits (90% available to BOINC), 7 were borderline, and 6 seemed about right. Also, F@H core 22 tasks reserve 2.5 GB of system RAM. The other Zens are headless dedicated DC rigs.

The dual Ivy Bridge was a comedy of errors, mostly my fault: 6 completed tasks, 16 errors (10 errors due to a system reboot and 2 due to "leave non-GPU tasks in memory while suspended" being disabled in the BOINC options).
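For reference, a minimal sketch of the kind of app_config.xml described above (the file goes in the climateprediction.net project directory; the app name oifs_43r3_ps and the limit of 6 are taken from this thread, but check your client's client_state.xml for the exact app names on your machine):

```xml
<!-- Sketch: limit how many OpenIFS tasks run at once.
     App name and limit are examples from this thread, not
     authoritative values. -->
<app_config>
   <app>
      <name>oifs_43r3_ps</name>
      <max_concurrent>6</max_concurrent>
   </app>
</app_config>
```

The client picks the file up after "Options -> Read config files" in the BOINC Manager, or after a client restart.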
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
I used an app_config.xml file in the CPDN project directory to control the number of OpenIFS tasks running simultaneously on each computer. I started out with 8 tasks on my 16 core Zen3 and Zen4 computers each with 64 GB of RAM and eventually reduced that to 6 tasks to be absolutely sure there was enough RAM to support 6 tasks running simultaneously.

I am currently running 4 OpenIFS tasks simultaneously, and 8 non-CPDN tasks as well, on my 64 GB 16-core machine. The machine is not doing much else at the moment other than following my typing here in Firefox. Here is my memory usage (in megabytes). Note that in addition to the 1,428 MB of free RAM, there is more space quickly available in the disk cache currently loaded on the machine, so there is actually 41,882 MB of RAM available.

         total   used   free  shared  buffers  cache  available
Mem:     63772  20995   1428     181      459  40889      41882
Swap:    15991    247  15744
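To make the arithmetic explicit, the "available" column already counts memory the kernel can reclaim from buffers and cache, which is why it dwarfs the "free" column. A quick sketch pulling the relevant columns out of the Mem: line above (the echoed numbers are the ones from this post; the same awk one-liner works on live `free -m` output):

```shell
# Columns after "Mem:": total used free shared buffers cache available.
# free (1428 MiB) + buffers (459) + cache (40889) is roughly the
# "available" figure (41882 MiB); available is a little lower because
# not all cache is reclaimable.
echo "Mem: 63772 20995 1428 181 459 40889 41882" |
  awk '{printf "free=%d MiB, cache=%d MiB, available=%d MiB\n", $4, $7, $8}'
```

So when sizing the number of simultaneous OpenIFS tasks, "available" is the realistic budget, not "free".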
Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449
I think setting "leave non-GPU tasks in memory while suspended" is definitely recommended, unless the computer absolutely needs that memory to carry out its day job. I saw some distinctive errors on the dev site which would only have been triggered if 'leave in memory' wasn't set. I saw those only in Baroclinic Lifecycle jobs, which I don't think we've seen here on the main site, but it won't hurt to set it anyway. I've just picked up two resends from dev - one plain, and one Baroclinic Lifecycle. Both are set to use new application versions deployed today - hopefully they contain the fixes Glenn mentioned. I'll give them a thorough workout as the day progresses.

Edit - those two errored tasks, and three more on my other machine, failed immediately with error code 193 'SIGSEGV: segmentation violation'. Once each on machines belonging to Dave and to Glenn. That testing may not take long after all!
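For anyone who runs a headless client without the Manager GUI, this preference can also be set in a global_prefs_override.xml file in the BOINC data directory; a minimal sketch, assuming a stock BOINC client (the element name is the standard BOINC one, not something specific to CPDN):

```xml
<!-- global_prefs_override.xml: keep suspended (non-GPU) tasks in
     memory instead of removing them, so they do not have to restart
     from the last checkpoint when resumed. -->
<global_preferences>
   <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
```

After editing, tell the client to re-read it with `boinccmd --read_global_prefs_override` (or restart the client).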
Joined: 15 May 09 Posts: 4535 Credit: 18,976,682 RAC: 21,948
Edit - those two errored tasks, and three more on my other machine, failed immediately with error code 193 'SIGSEGV: segmentation violation'. Once each on machines belonging to Dave and to Glenn. That testing may not take long after all!

It will come as no surprise, then, that the two _2 tasks I ran from the last chance saloon also errored out. (Now added "errored" to the Firefox dictionary after putting up with being told it is a misspelling for more years than I care to remember.)
Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449
So have mine. Just waiting for a core to come free on the second machine.
Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449
Next test application Baroclinic Lifecycle v1.07 has started running properly.
Joined: 2 Oct 19 Posts: 21 Credit: 47,674,094 RAC: 24,265
I'd like to participate in the development project. How do I join?
Joined: 15 May 09 Posts: 4535 Credit: 18,976,682 RAC: 21,948
Next test application Baroclinic Lifecycle v1.07 has started running properly.

Two from ~D537 completed, two uploading (believed to be successes), and one still running.

Edit: now up to four out of five successes with one still uploading.
Joined: 15 May 09 Posts: 4535 Credit: 18,976,682 RAC: 21,948
I'd like to participate in the development project. How do I join?

Participation in the testing side of things has in the past been by invitation, extended to those who have demonstrated over time that they run tasks reliably and that they know enough about the issues involved in both BOINC and CPDN in particular to be helpful to others with issues. I had been active here for many years before I joined the testing programme. I think that the suggestion of a significant expansion of the number of people involved will need to be agreed by the project before it comes to pass. Most testing batches of work are relatively small, ten tasks or fewer, so even with the current low numbers involved I can go for months without receiving any testing work, even when I notice some going on.
Joined: 2 Oct 19 Posts: 21 Credit: 47,674,094 RAC: 24,265
Fair enough. Let me know if you need more help.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
Failure to upload. This morning my Transfers tab filled up with messages that have failed to upload for at least two hours. I have no trouble accessing the CPDN web site, writing posts here, even getting new work. But I get stuff like this in the Event Log:

Fri 09 Dec 2022 11:16:47 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_3071_2021050100_123_947_12166160_1_r505834834_12.zip
Fri 09 Dec 2022 11:16:47 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1532_2021050100_123_946_12164621_1_r1430342566_10.zip
Fri 09 Dec 2022 11:18:48 AM EST | | Project communication failed: attempting access to reference site
Fri 09 Dec 2022 11:18:48 AM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_3071_2021050100_123_947_12166160_1_r505834834_12.zip: transient HTTP error
Fri 09 Dec 2022 11:18:48 AM EST | climateprediction.net | Backing off 00:03:12 on upload of oifs_43r3_ps_3071_2021050100_123_947_12166160_1_r505834834_12.zip
Fri 09 Dec 2022 11:18:48 AM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_1532_2021050100_123_946_12164621_1_r1430342566_10.zip: transient HTTP error
Fri 09 Dec 2022 11:18:48 AM EST | climateprediction.net | Backing off 00:02:30 on upload of oifs_43r3_ps_1532_2021050100_123_946_12164621_1_r1430342566_10.zip
Fri 09 Dec 2022 11:18:49 AM EST | | Internet access OK - project servers may be temporarily down.

Should I just wait it out, or do something? If so, what?

Here is when it started:

Fri 09 Dec 2022 05:45:05 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_2451_2021050100_123_947_12165540_1_r1207928083_91.zip
Fri 09 Dec 2022 05:45:10 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_2451_2021050100_123_947_12165540_1_r1207928083_91.zip
Fri 09 Dec 2022 05:48:37 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1353_2021050100_123_946_12164442_2_r637909205_94.zip
Fri 09 Dec 2022 05:48:53 AM EST | | Project communication failed: attempting access to reference site
Fri 09 Dec 2022 05:48:53 AM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_1353_2021050100_123_946_12164442_2_r637909205_94.zip: can't resolve hostname
Fri 09 Dec 2022 05:48:53 AM EST | climateprediction.net | Backing off 00:03:59 on upload of oifs_43r3_ps_1353_2021050100_123_946_12164442_2_r637909205_94.zip
Fri 09 Dec 2022 05:48:55 AM EST | | Internet access OK - project servers may be temporarily down.
Joined: 15 May 09 Posts: 4535 Credit: 18,976,682 RAC: 21,948
Should I just wait it out, or do something? If so, what?

Sit tight, I have messaged Andy. I suspect you are not the only one. I don't have any work from the main site at the moment so can't test for myself.

Edit: Glenn has posted on the Trello card and he is getting the same; he has emailed Andy.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
I have run 4 OpenIFS tasks that completed, and am running four more that each try to send an upload every 8 minutes, so a new one gets added roughly every 2 minutes. They are just a bit over 14 megabytes each. But do not worry: I run BOINC in a partition all its own with 387 gigabytes available, so I should be OK for quite a while. I do hope it is fixed before the January 7 deadline expires. If it isn't, do I lose credit for the first four tasks?
Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403
OpenIFS apps improvements

The updated apps have now been tested, so more OpenIFS tasks will go to the main site next week (both the oifs_43r3_ps & oifs_43r3_bl apps). The BL app tasks have shorter runtimes than the PS app tasks. There should be more info regarding the BL experiment next week as well. Again, thanks to all who reported & helped analyse the problems. If anyone's interested, the changes/fixes:

The 'double free or corruption' & 'free()' errors seen before come from corrupt XML files used to describe the task. It's not entirely clear what's causing the corruption on volunteer machines, but I was able to reproduce it, and it's possible the other fixes may solve this. Time will tell. If anyone does see this again, please PM me and try to capture any files with a .xml suffix in the slot directory (I won't be following the forums that closely from now until next year).

Upload files missing after a model restart (appears at the end of the task). This has been fixed. The problem was that the controlling wrapper code lost track of where the model was after a restart and miscalculated the upload file sequence. It should now work to shut down the client/PC, with the model resuming from the previous checkpoint when the boinc client is restarted.

stoi error. This has been fixed. It occurred because of garbage appearing in one of the model output files.

Task XML. The memory bound has been revised up; we don't feel OpenIFS is suitable for 8 GB memory machines, as we saw a lot of tasks fail on them. The disk bound is now a sensible value (it was 37 GB), worked out assuming the worst case that all model output is stored pending upload.

The model process 'master.exe' has been renamed to 'oifs_43r3_model.exe' to make it clearer in the process list.

I still recommend turning on 'keep non-GPU processes in memory' for now, as this potentially stops the model process being killed and then restarting from a checkpoint (which involves rerunning a few previous steps and will cause the tasks to take longer).

p.s. I'd prefer that results from dev-site tests are not discussed here. They are not meant for general consumption, and often I know the dev tasks will fail before they go out - it's how they fail that I'm trying to understand.
Joined: 2 Oct 19 Posts: 21 Credit: 47,674,094 RAC: 24,265
Should I just wait it out, or do something? If so, what?

Sit tight, I have messaged Andy. I suspect you are not the only one. I don't have any work from Main site at the moment so can't test for myself.

I have 3 OpenIFS tasks running and all trickle uploads are failing due to a server issue.

Sat 10 Dec 2022 05:28:49 AM EST | climateprediction.net | Backing off 00:02:48 on upload of oifs_43r3_ps_1850_2021050100_123_946_12164939_1_r1444946898_85.zip
Sat 10 Dec 2022 05:28:51 AM EST | | Internet access OK - project servers may be temporarily down.
Sat 10 Dec 2022 05:30:46 AM EST | | Project communication failed: attempting access to reference site
Sat 10 Dec 2022 05:30:46 AM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_1850_2021050100_123_946_12164939_1_r1444946898_86.zip: transient HTTP error
Sat 10 Dec 2022 05:30:46 AM EST | climateprediction.net | Backing off 00:03:00 on upload of oifs_43r3_ps_1850_2021050100_123_946_12164939_1_r1444946898_86.zip
Sat 10 Dec 2022 05:30:47 AM EST | | Internet access OK - project servers may be temporarily down.
Joined: 15 May 09 Posts: 4535 Credit: 18,976,682 RAC: 21,948
I have 3 OpenIFS tasks running and all trickle uploads are failing due to a server issue.

It may be that Andy needs to move data off the server, or that the script handling the uploads needs to be restarted, or indeed something else to do with Oxford's IT system. I know Andy sometimes sorts things out outside his normal working hours, but it may be a case of waiting till Monday.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
It seems to me that everything else at CPDN is working just fine. These discussion boards work just fine. I even got a new CPDN OpenIFS task downloaded about three hours ago; it is now running and frustrating me as the number of failing trickles accumulates. I estimate I have around 500 trickles waiting to be uploaded. So I am impatiently waiting.
Joined: 9 Oct 04 Posts: 82 Credit: 69,918,562 RAC: 8,825
Glenn, will you release the oifs_43r3_bl and oifs_43r3_ps apps in parallel or in sequence? As I am bandwidth limited, I can only run a maximum of 3 WUs in parallel on the 3 computers assigned to climateprediction.net. My app_config.xml is configured as follows:

<app_config>
   <project_max_concurrent>4</project_max_concurrent>
   […………………………………….]
   <app>
      <name>oifs_43r3</name>
      <max_concurrent>2</max_concurrent>
      <report_results_immediately/>
   </app>
   <app>
      <name>oifs_43r3_bl</name>
      <max_concurrent>2</max_concurrent>
      <report_results_immediately/>
   </app>
   <app>
      <name>oifs_43r3_ps</name>
      <max_concurrent>2</max_concurrent>
      <report_results_immediately/>
   </app>
</app_config>

So a total of 4 WUs of oifs_43r3_bl and oifs_43r3_ps might run in parallel. I was hesitant to limit project_max_concurrent further, as some HadSM4 WUs might appear (they do not have the bandwidth problem) and I happily crunch them in parallel with the oIFS, or I might forget to increase it again after the oIFS disappear.

For the BOINC specialists: if I set one of the two apps in app_config.xml to 0 (zero), for example:

<app>
   <name>oifs_43r3_bl</name>
   <max_concurrent>0</max_concurrent>
   <report_results_immediately/>
</app>
<app>
   <name>oifs_43r3_ps</name>
   <max_concurrent>2</max_concurrent>
   <report_results_immediately/>
</app>

will WUs for that app name then not be downloaded to the computer at all, i.e. would this limit the climateprediction.net WUs on a particular computer further? Thanks!
©2024 cpdn.org