climateprediction.net (CPDN) home page
Thread 'OpenIFS Discussion'

Message boards : Number crunching : OpenIFS Discussion

Previous · 1 . . . 4 · 5 · 6 · 7 · 8 · 9 · 10 . . . 32 · Next

Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 66733 - Posted: 2 Dec 2022, 20:00:57 UTC

A few more errors
These ones are different:
https://www.cpdn.org/result.php?resultid=22246542
https://www.cpdn.org/result.php?resultid=22248013
https://www.cpdn.org/result.php?resultid=22247983
No logs at all.

Noted activities:
I ran kernel updates on the rig and rebooted. As a precaution I shut down the BOINC client first and gave it time to clean things up before restarting, but the tasks still errored out.
The error itself I can understand, as it's what CPDN tasks do on a reboot, but the lack of any stderr contents is odd.
ID: 66733
cetus

Joined: 7 Aug 04
Posts: 10
Credit: 148,051,051
RAC: 37,629
Message 66734 - Posted: 3 Dec 2022, 1:11:13 UTC

I noticed an issue that I don't think has been reported yet.
One of my machines has run 46 OpenIFS jobs: 12 ended with computation errors and the rest appear to have completed successfully. After the BOINC client finished all the jobs, there were still three oifs processes running, but no master.exe processes.

$ ps  -flU boinc
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY      TIME CMD
4 S boinc       2449       1  0  99  19 - 76998 -      Nov29 ?    00:16:34 /usr/bin/boinc --gui_rpc_port 31418
0 S boinc    1657533    2449  0  99  19 - 35465 -      Dec01 ?    00:01:18 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1734 946 12164823 123 oifs_43r3_ps 1
0 S boinc    2001745    2449  0  99  19 - 35464 -      Nov30 ?    00:02:00 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1833 946 12164922 123 oifs_43r3_ps 1
0 S boinc    2147924    2449  0  99  19 - 35465 -      Nov30 ?    00:02:15 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1942 946 12165031 123 oifs_43r3_ps 1

Two of the slots directories still have 6 files in them; another has more than 300. The rest of the slots are empty.
In the projects/climateprediction.net directory there are 9 directories with names like oifs_43r3_ps_12163845 that appear to be job folders which did not get deleted after the job finished. I have the BOINC directory archived if the contents are of interest.
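For anyone wanting to check their own client for the same leftovers, a quick survey along these lines should work. This is only a sketch: the /var/lib/boinc data directory is an assumption based on a typical Linux package install, and the oifs_43r3_ps_* pattern is taken from the directory names quoted above.

```shell
#!/bin/sh
# Count files left behind in each BOINC slot directory and list any
# orphaned OpenIFS job folders. Adjust BOINC_DIR for your install.
BOINC_DIR=/var/lib/boinc
for d in "$BOINC_DIR"/slots/*/; do
  n=$(find "$d" -type f | wc -l)
  [ "$n" -gt 0 ] && echo "$d: $n files remaining"
done
# Job folders that should have been deleted when their task finished
ls -d "$BOINC_DIR"/projects/climateprediction.net/oifs_43r3_ps_* 2>/dev/null
```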

The computer is a 12-core 5900X with 64 GB of RAM. The oifs jobs were run 8 at a time. I never saw less than 15 GB of free RAM while 8 were running, though of course I wasn't watching most of the time.
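Free RAM during a run can be sampled without sitting and watching. A minimal watcher, assuming a Linux host (it reads MemAvailable straight from /proc/meminfo; the 60-second interval and log name are arbitrary choices):

```shell
#!/bin/sh
# Append a timestamped "available memory" sample every 60 seconds.
# Stop with Ctrl-C; inspect mem_watch.log afterwards for the minimum.
while true; do
  printf '%s ' "$(date +'%F %T')"
  awk '/^MemAvailable:/ {printf "available: %.1f GiB\n", $2/1048576}' /proc/meminfo
  sleep 60
done >> mem_watch.log
```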

Here are all the error jobs:
https://www.cpdn.org/result.php?resultid=22248825 double free or corruption
https://www.cpdn.org/result.php?resultid=22248783 this one looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12164823
https://www.cpdn.org/result.php?resultid=22248662 double free or corruption
https://www.cpdn.org/result.php?resultid=22246507 double free or corruption
https://www.cpdn.org/result.php?resultid=22248118 double free or corruption
https://www.cpdn.org/result.php?resultid=22246441 double free or corruption
https://www.cpdn.org/result.php?resultid=22246293 double free or corruption
https://www.cpdn.org/result.php?resultid=22247041 this one looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12165031
https://www.cpdn.org/result.php?resultid=22246587 double free or corruption
https://www.cpdn.org/result.php?resultid=22248533 this one looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12164922
https://www.cpdn.org/result.php?resultid=22248053 double free or corruption
https://www.cpdn.org/result.php?resultid=22246923 double free or corruption

The three jobs that were "aborted" look like the same three processes that are still running. I did not abort any manually.
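The leftover processes can be tied to their job numbers directly from the command line. In the ps listing above, the job number (e.g. 12164823) appears as the fifth program argument; the field position is inferred from that listing, not from any wrapper documentation, so treat this as a sketch:

```shell
#!/bin/sh
# Print the PID and CPDN job number of every oifs worker still alive.
# With pid prepended, the job number is the 7th whitespace field.
ps -o pid= -o args= -U boinc | awk '/oifs_43r3/ {print "pid " $1 ": job " $7}'
```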
ID: 66734
Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 66735 - Posted: 3 Dec 2022, 5:48:50 UTC

Three more. No idea what happened, as there's nothing in the stderr:
https://www.cpdn.org/result.php?resultid=22247956
https://www.cpdn.org/result.php?resultid=22246962
https://www.cpdn.org/result.php?resultid=22247713

I'm running three of these at a time alongside LHC ATLAS (5 tasks of 4 cores each) and SRBase on the GPU.
There don't appear to be any extra master.exe instances running, and as far as I can see I'm not running out of RAM (64 GB).
ID: 66735
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66736 - Posted: 3 Dec 2022, 10:23:55 UTC - in response to Message 66732.  

xii5ku wrote:
About OpenIFS failure modes:
All of [my] error results come from only one out of three hosts. All three hosts have the same hardware, OS, boinc client configs, and same split workload of OpenIFS and PrimeGrid llrSGS. The one host with errors was the only one on which I suspended all tasks to disk, rebooted the host, and resumed the tasks. I strongly believe that all of these 54 tasks went through this suspend–resume cycle. [...] The stderr.txts of these tasks are of two types: One type contains just "--". The other shows that the last one to five zip files were missing.

The host with errors has reported only successful tasks for a while now, which is another hint that the error episode was just the aftermath of the suspend-resume cycle.

I now got a single new error:
    terminate called after throwing an instance of 'std::invalid_argument'
    what(): stoi

result ID 22248133, workunit name oifs_43r3_ps_3007_2021050100_123_947_12166096

This happened at ~50% computing progress, whereas all of my previous suspend/resume related errors happened at ~100% progress.

ID: 66736
Conan

Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 66737 - Posted: 3 Dec 2022, 10:26:20 UTC - in response to Message 66718.  

Just downloaded a resend of a Work Unit that failed due to an error.

This Task 22245903

It failed due to running longer than 5 minutes after the work unit had finished.

The WU was run by mikey and, apart from continuing to run after finishing, seemed to have completed successfully after over 2 days of run time.

That run time seems overly long for a Ryzen, but it did complete.

It is now running as Task 22249047 on my Ryzen computer.

Will see how it runs for me.

Conan


Completed successfully after 16 1/2 hours.

Conan
ID: 66737
Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 66738 - Posted: 3 Dec 2022, 10:58:56 UTC

An observation ... I just sat here and watched a unit upload a trickle. Waited until it was well and truly done. Checked all my work units listed as "in progress" and ... none of them have trickles registered.
ID: 66738
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,845,098
RAC: 19,856
Message 66739 - Posted: 3 Dec 2022, 12:03:56 UTC - in response to Message 66738.  

OpenIFS trickles don't show up on the website like they do for the Hadley models. Progress can be viewed in the stderr.txt file in the appropriate slots directory for a given task. That file gets uploaded and shows up on the website once the task reports as completed.
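A quick way to spot-check that progress from the command line; the /var/lib/boinc data directory below is an assumption for a typical Linux install, so adjust the path to your own setup:

```shell
#!/bin/sh
# Show the last few stderr.txt lines for every active slot, which is
# where OpenIFS task progress appears before the task reports.
for f in /var/lib/boinc/slots/*/stderr.txt; do
  [ -f "$f" ] || continue
  echo "== $f =="
  tail -n 3 "$f"
done
```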
ID: 66739
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 66741 - Posted: 3 Dec 2022, 12:34:40 UTC

Technically, a 'trickle' is not the same as an 'upload'. Uploads are processed immediately they're ready, but trickles are bunched and reported to the server at the end of the 'server backoff' delay - possibly up to an hour later. Both processes can be seen, with timestamps, in the Event Log.

Having said that, none of my completed (and valid) tasks are showing any trickles yet, either. I think that trickles for other task types are processed by one of the innumerable scripts which run in the background on the server. That may have failed; it may require extending for the IFS tasks; or it may be scheduled to run tomorrow, as part of the credit updating process. Let's take another look on Monday.
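Both kinds of event can also be pulled from the client without opening the Manager. A sketch using boinccmd, assuming the client accepts local RPC connections; the exact message wording varies between client versions, so the grep pattern is a guess:

```shell
#!/bin/sh
# List recent client messages mentioning trickles or uploads.
# boinccmd must be able to reach the running client (local RPC).
boinccmd --get_messages | grep -Ei 'trickle|upload' | tail -n 20
```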
ID: 66741
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 66742 - Posted: 3 Dec 2022, 14:04:48 UTC - in response to Message 66741.  

Technically, a 'trickle' is not the same as an 'upload'. Uploads are processed immediately they're ready, but trickles are bunched and reported to the server at the end of the 'server backoff' delay - possibly up to an hour later. Both processes can be seen, with timestamps, in the Event Log.

Having said that, none of my completed (and valid) tasks are showing any trickles yet, either. I think that trickles for other task types are processed by one of the innumerable scripts which run in the background on the server. That may have failed; it may require extending for the IFS tasks; or it may be scheduled to run tomorrow, as part of the credit updating process. Let's take another look on Monday.


I had some HadSM4s running at the start of the week. They are showing trickles but, I assume, won't show credit till tomorrow. My last OpenIFS tasks on testing did show trickles. Often in the past, trickles have not shown up on the website but credit has still been awarded, so I suspect a script needs the new model types added before the trickles will show. Whether credit will show tomorrow, for those who care about such things, who knows? I will put a note on the Trello card for these batches tomorrow, when I will know whether credit is being awarded or not. Andy is on the Trello card, so he will see it there fairly quickly.
ID: 66742
klepel

Joined: 9 Oct 04
Posts: 82
Credit: 69,923,532
RAC: 8,011
Message 66743 - Posted: 3 Dec 2022, 14:27:36 UTC - in response to Message 66732.  

About OpenIFS failure modes:
The one host with errors was the only one on which I suspended all tasks to disk, rebooted the host, and resumed the tasks. I strongly believe that all of these 54 tasks went through this suspend–resume.[…]
The host with errors has reported only successful tasks for a while now, which is another hint that the error episode was just the aftermath of the suspend-resume cycle.
I think I had the same error. The computer was shut down and restarted. Although BOINCView reported progress, no trickle/upload file was created.
https://www.cpdn.org/result.php?resultid=22248984 As you can see, there is a difference between CPU time and execution time.

Then there was a WU with an exit code 9 error, https://www.cpdn.org/result.php?resultid=22248970:
9 (0x00000009) Unknown error code

Hope this helps.
ID: 66743
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66748 - Posted: 3 Dec 2022, 17:56:24 UTC - in response to Message 66743.  

'Fraid I don't have time to read all these posts right now as I'm busy looking into the various problems. I think they have all been pretty much captured by posts here. And the detailed forensic reports are extremely helpful, so thanks very much for that. I have fed the list back to CPDN.

There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shut down and restarted. However, the controlling wrapper code then appears to miscalculate where the model is in the forecast, and this leads to the 'missing file' problem that's been reported.

So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)
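For anyone wanting to try that option without going through BOINC Manager, it maps to the standard client preference leave_apps_in_memory. A minimal global_prefs_override.xml fragment (placed in the BOINC data directory; check your existing override file before replacing it):

```xml
<!-- global_prefs_override.xml: keep suspended (non-GPU) tasks in memory -->
<global_preferences>
   <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
```

Then have the client re-read it with `boinccmd --read_global_prefs_override`.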
ID: 66748
PDW

Joined: 29 Nov 17
Posts: 82
Credit: 14,559,938
RAC: 89,548
Message 66753 - Posted: 3 Dec 2022, 19:34:54 UTC - in response to Message 66748.  

Had a task fail after 12 hours with error: free(): unaligned chunk detected in tcache 2

Have completed 5 okay so far, running the same additional project all the time, so no change in memory usage. No suspending of work (manually or by BOINC).

The master process for the aborted task was still running, so I terminated it; it showed the same slot as the new task that had started up.
ID: 66753
wateroakley

Joined: 6 Aug 04
Posts: 195
Credit: 28,347,450
RAC: 10,508
Message 66755 - Posted: 3 Dec 2022, 22:44:31 UTC - in response to Message 66748.  

There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shut down and restarted. However, the controlling wrapper code then appears to be miscalculating where the model is in the forecast and this leads to the 'missing file' problem that's been reported.

So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)
Thanks Glenn. After a lot of hit-and-miss tasks, the last six 'uninterrupted' OpenIFS tasks here have completed!
ID: 66755
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66756 - Posted: 3 Dec 2022, 22:47:58 UTC - in response to Message 66732.  

All three of the hosts which I have active at OpenIFS have plenty of RAM, and are set to "leave non-GPU tasks in memory while suspended".¹ That's possibly a factor why they run error-free.


Me too.

All 21 of the OpenIFS tasks I have received have now completed successfully. They ran three at a time when I had enough tasks available (which was most of the time); by the end, the last one was running all by itself.

I, too, have plenty of RAM (64 GB), and "leave non-GPU tasks in memory while suspended" is enabled.
ID: 66756
Vato

Joined: 4 Oct 19
Posts: 15
Credit: 9,174,915
RAC: 3,722
Message 66757 - Posted: 3 Dec 2022, 22:51:09 UTC - in response to Message 66748.  

So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)


I deliberately suspended a task for 8+ hours with the 'keep non-GPU tasks in memory' option enabled, and the task completed and returned seemingly cleanly and successfully.
ID: 66757
Vato

Joined: 4 Oct 19
Posts: 15
Credit: 9,174,915
RAC: 3,722
Message 66763 - Posted: 4 Dec 2022, 10:16:31 UTC

Credit granted to all my tasks - 2,353 each
ID: 66763
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 66765 - Posted: 4 Dec 2022, 10:48:19 UTC - in response to Message 66763.  

Credit granted to all my tasks - 2,353 each
Yes, but still no trickles visible. They are visible on the dev site, so I think it's a scripting issue on the main server.
ID: 66765
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66766 - Posted: 4 Dec 2022, 11:58:08 UTC - in response to Message 66763.  

Credit granted to all my tasks - 2,353 each


Yes, but too late. My boincmgr Statistics screen shows a big drop last night, and the Projects screen shows about a 50% drop since my previous report. Sad, since I completed 21 or so of the OpenIFS tasks successfully. Sigh!
ID: 66766
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 66767 - Posted: 4 Dec 2022, 11:58:48 UTC - in response to Message 66765.  

Credit granted to all my tasks - 2,353 each
Yes, but still no trickles visible. They are visible on the dev site, so I think it's a scripting issue on the main server.
Message left for Andy on Trello board.
ID: 66767
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,845,098
RAC: 19,856
Message 66777 - Posted: 5 Dec 2022, 2:43:16 UTC - in response to Message 66748.  

There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shut down and restarted. However, the controlling wrapper code then appears to be miscalculating where the model is in the forecast and this leads to the 'missing file' problem that's been reported.

So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)

I wonder whether the Hadley models' restart issues are also due to a similar wrapper problem, or whether it's an issue with the models themselves?
ID: 66777

©2024 cpdn.org