Message boards : Number crunching : OpenIFS Discussion
Joined: 31 May 18 · Posts: 53 · Credit: 4,725,987 · RAC: 9,174

A few more errors. These ones are different:
https://www.cpdn.org/result.php?resultid=22246542
https://www.cpdn.org/result.php?resultid=22248013
https://www.cpdn.org/result.php?resultid=22247983
No logs at all. Noted activity: I ran kernel updates on the rig and rebooted. As a precaution I shut down the BOINC client and gave it time to clean up before restarting, but the tasks still errored out. The error itself I can understand; it's what CPDN units do on a reboot. The lack of any stderr contents is odd, though.
Joined: 7 Aug 04 · Posts: 10 · Credit: 148,051,879 · RAC: 37,571

I noticed an issue that I don't think has been reported yet. One of my machines has run 46 OpenIFS jobs: 12 of them ended in computation errors, and the rest appear to have completed successfully. After the BOINC client finished all the jobs, three oifs processes were still running, with no master.exe processes:

$ ps -flU boinc
F S UID       PID    PPID C PRI NI ADDR    SZ WCHAN STIME TTY      TIME CMD
4 S boinc    2449       1 0  99 19 -    76998 -     Nov29 ?    00:16:34 /usr/bin/boinc --gui_rpc_port 31418
0 S boinc 1657533    2449 0  99 19 -    35465 -     Dec01 ?    00:01:18 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1734 946 12164823 123 oifs_43r3_ps 1
0 S boinc 2001745    2449 0  99 19 -    35464 -     Nov30 ?    00:02:00 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1833 946 12164922 123 oifs_43r3_ps 1
0 S boinc 2147924    2449 0  99 19 -    35465 -     Nov30 ?    00:02:15 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1942 946 12165031 123 oifs_43r3_ps 1

Two of the slot directories still have 6 files in them, and another has more than 300; the rest of the slots are empty. In the projects/climateprediction.net directory there are 9 directories with names like oifs_43r3_ps_12163845 that appear to be job folders that did not get deleted after the job finished. I have the BOINC directory archived if the contents are of interest. The computer is a 12-core 5900X with 64 GB of RAM. The OpenIFS jobs were run 8 at a time. I never noticed less than 15 GB of free RAM while 8 were running, though of course I wasn't watching most of the time.
Here are all the error jobs:
https://www.cpdn.org/result.php?resultid=22248825 double free or corruption
https://www.cpdn.org/result.php?resultid=22248783 looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12164823
https://www.cpdn.org/result.php?resultid=22248662 double free or corruption
https://www.cpdn.org/result.php?resultid=22246507 double free or corruption
https://www.cpdn.org/result.php?resultid=22248118 double free or corruption
https://www.cpdn.org/result.php?resultid=22246441 double free or corruption
https://www.cpdn.org/result.php?resultid=22246293 double free or corruption
https://www.cpdn.org/result.php?resultid=22247041 looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12165031
https://www.cpdn.org/result.php?resultid=22246587 double free or corruption
https://www.cpdn.org/result.php?resultid=22248533 looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12164922
https://www.cpdn.org/result.php?resultid=22248053 double free or corruption
https://www.cpdn.org/result.php?resultid=22246923 double free or corruption
The three jobs that were "aborted" look like the same three processes that are still running. I did not abort any manually.
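For anyone wanting to check for the same leftover-files situation, here is a minimal sketch. The default BOINC data directory path is an assumption (installs vary, e.g. ~/BOINC or /var/lib/boinc-client); pass your own path as the first argument. It flags slot directories that still contain files after all tasks have finished:

```shell
#!/bin/sh
# Sketch: flag BOINC slot directories that still hold files after the client
# reports all tasks finished. The default path below is an assumption;
# pass your own BOINC data directory as the first argument.
check_slots() {
    boinc_dir="${1:-/var/lib/boinc-client}"
    if [ -d "$boinc_dir/slots" ]; then
        for slot in "$boinc_dir"/slots/*/; do
            [ -d "$slot" ] || continue
            # Count regular files directly inside the slot directory.
            n=$(find "$slot" -maxdepth 1 -type f | wc -l)
            [ "$n" -gt 0 ] && echo "$slot: $n file(s) left behind"
        done
    else
        echo "no slots directory under $boinc_dir"
    fi
    return 0
}

check_slots "$@"
```

A non-empty slot after all tasks have reported is not proof of a problem by itself, but combined with orphaned oifs processes it points at cleanup not happening.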
Joined: 31 May 18 · Posts: 53 · Credit: 4,725,987 · RAC: 9,174

Three more. No idea what happened, as there's nothing in the stderr:
https://www.cpdn.org/result.php?resultid=22247956
https://www.cpdn.org/result.php?resultid=22246962
https://www.cpdn.org/result.php?resultid=22247713
I'm running three of these at a time alongside LHC ATLAS (5 off, each 4 cores) and SRBase on the GPU. There don't appear to be any extra master.exe instances running, and as far as I can see I'm not running out of RAM (64 GB).
Joined: 27 Mar 21 · Posts: 79 · Credit: 78,302,757 · RAC: 1,077

> xii5ku wrote: About OpenIFS failure modes:

I now got a single new error:
what(): stoi
Result ID 22248133, work unit name oifs_43r3_ps_3007_2021050100_123_947_12166096
Joined: 6 Jul 06 · Posts: 147 · Credit: 3,615,496 · RAC: 420

Just downloaded a resend of a work unit that had failed with an error. It completed successfully after 16 1/2 hours.

Conan
Joined: 31 May 18 · Posts: 53 · Credit: 4,725,987 · RAC: 9,174

An observation: I just sat here and watched a unit upload a trickle, and waited until it was well and truly done. I then checked all my work units listed as "in progress", and none of them have trickles registered.
Joined: 12 Apr 21 · Posts: 317 · Credit: 14,845,098 · RAC: 19,856

OpenIFS trickles don't show up on the website like they do for the Hadley models. Progress can be viewed in the stderr.txt file in the appropriate slots directory for a given task. That file gets uploaded and shows up on the website once the task reports as completed.
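A minimal sketch of that check, assuming a default BOINC data directory path (adjust it, or pass your own as the first argument); it prints the last few lines of each active slot's stderr.txt:

```shell
#!/bin/sh
# Sketch: show recent progress from each running task by tailing the
# stderr.txt in every BOINC slot directory. The default path is an
# assumption; pass your own BOINC data directory as the first argument.
show_progress() {
    boinc_dir="${1:-/var/lib/boinc-client}"
    found=0
    for f in "$boinc_dir"/slots/*/stderr.txt; do
        [ -f "$f" ] || continue
        found=1
        echo "== $f =="
        tail -n 3 "$f"
    done
    [ "$found" -eq 0 ] && echo "no running tasks under $boinc_dir"
    return 0
}

show_progress "$@"
```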
Joined: 1 Jan 07 · Posts: 1061 · Credit: 36,708,278 · RAC: 9,361

Technically, a 'trickle' is not the same as an 'upload'. Uploads are processed as soon as they're ready, but trickles are bunched and reported to the server at the end of the 'server backoff' delay, possibly up to an hour later. Both processes can be seen, with timestamps, in the Event Log.

Having said that, none of my completed (and valid) tasks are showing any trickles yet, either. I think trickles for other task types are processed by one of the innumerable scripts which run in the background on the server. That may have failed; it may need extending for the IFS tasks; or it may be scheduled to run tomorrow as part of the credit-updating process. Let's take another look on Monday.
Joined: 15 May 09 · Posts: 4540 · Credit: 19,022,240 · RAC: 20,762

> Technically, a 'trickle' is not the same as an 'upload'. Uploads are processed immediately they're ready, but trickles are bunched and reported to the server at the end of the 'server backoff' delay - possibly up to an hour later. Both processes can be seen, with timestamps, in the Event Log.

I had some HADSM4s running at the start of the week. They are showing trickles but, I assume, won't show credit till tomorrow. My last OIFS tasks on testing did show trickles. Often in the past, trickles have not shown up on the website but credit has still been awarded, so I suspect a script needs the new model types added before the trickles will show. Whether credit will show tomorrow, for those who care about such things, who knows? I will put a note on the Trello card for these batches tomorrow, when I will know whether credit is being awarded. Andy is on the Trello card so will see it there fairly quickly.
Joined: 9 Oct 04 · Posts: 82 · Credit: 69,923,532 · RAC: 8,011

> About OpenIFS failure modes:

I think I had the same error. The computer was shut down and restarted. Although BOINC View reported progress, no trickle / upload file was created: https://www.cpdn.org/result.php?resultid=22248984. As you can see, there is a difference between CPU time and execution time. Then there was a WU with a code 9 error, https://www.cpdn.org/result.php?resultid=22248970:
9 (0x00000009) Unknown error code
Hope this helps.
Joined: 29 Oct 17 · Posts: 1049 · Credit: 16,432,494 · RAC: 17,331

'Fraid I don't have time to read all these posts right now as I'm busy looking into the various problems. I think they have all been pretty much captured by posts here, and the detailed forensic reports are extremely helpful, so thanks very much for that. I have fed the list back to CPDN.

There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shut down and restarted. However, the controlling wrapper code then appears to miscalculate where the model is in the forecast, and this leads to the 'missing file' problem that's been reported. So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried the 'keep non-gpu tasks in memory' option; that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)
Joined: 29 Nov 17 · Posts: 82 · Credit: 14,559,938 · RAC: 89,548

Had a task fail after 12 hours with error:
free(): unaligned chunk detected in tcache 2
I have completed 5 okay so far, running the same additional projects the whole time, so no change in memory usage, and no suspending of work (manually or by BOINC). The master process for the aborted task was still running, so I terminated it; it showed the same slot as the new task that had started up.
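To check for the same leftover-process situation, here is a minimal sketch. The process names are taken from reports in this thread, not from any official CPDN tooling:

```shell
#!/bin/sh
# Sketch: list any model processes left running after the client thinks all
# tasks are done. Process names (oifs_43r3..., master.exe) come from reports
# in this thread; the bracketed first letter stops grep from matching its
# own entry in the ps output.
ps -ef | grep -E '[o]ifs_43r3|[m]aster\.exe' || echo "no leftover model processes"
```

Any lines it prints are candidates for manual termination, as described above; check the slot number in the command line against currently running tasks first.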
Joined: 6 Aug 04 · Posts: 195 · Credit: 28,347,450 · RAC: 10,508

> There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shutdown and restarted. However, the controlling wrapper code then appears to be miscalculating where the model is in the forecast and this leads to the 'missing file' problem that's been reported.

Thanks Glenn. After a lot of hit-and-miss tasks, the last six 'uninterrupted' OpenIFS tasks here have completed!
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,202,915 · RAC: 2,154

> All three of the hosts which I have active at OpenIFS have plenty of RAM, and are set to "leave non-GPU tasks in memory while suspended". That's possibly a factor why they run error-free.

Me too. All 21 of the OpenIFS tasks I have received have now completed successfully. They ran three at a time when I had enough tasks available (which was most of the time); by the time it completed, the last one was running all by itself. I, too, have plenty of RAM (64 gigabytes), and "leave non-GPU tasks in memory while suspended" is enabled.
Joined: 4 Oct 19 · Posts: 15 · Credit: 9,174,915 · RAC: 3,722

> So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)

I deliberately suspended a task for 8+ hours using the 'keep non-gpu tasks in memory' option, and it completed and returned seemingly cleanly and successfully.
Joined: 4 Oct 19 · Posts: 15 · Credit: 9,174,915 · RAC: 3,722

Credit granted to all my tasks - 2,353 each.
Joined: 1 Jan 07 · Posts: 1061 · Credit: 36,708,278 · RAC: 9,361

> Credit granted to all my tasks - 2,353 each

Yes, but still no trickles visible. They are visible on the dev site, so I think it's a scripting issue on the main server.
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,202,915 · RAC: 2,154

> Credit granted to all my tasks - 2,353 each

Yes, but too late. My boincmgr Statistics screen shows a big drop last night, and the Projects screen shows about a 50% drop since my previous report. Sad, since I completed 21 or so of the OpenIFS tasks successfully. Sigh!
Joined: 15 May 09 · Posts: 4540 · Credit: 19,022,240 · RAC: 20,762

> Credit granted to all my tasks - 2,353 each

> Yes, but still no trickles visible. They are visible on the dev site, so I think it's a scripting issue on the main server.

Message left for Andy on the Trello board.
Joined: 12 Apr 21 · Posts: 317 · Credit: 14,845,098 · RAC: 19,856

> There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shutdown and restarted. However, the controlling wrapper code then appears to be miscalculating where the model is in the forecast and this leads to the 'missing file' problem that's been reported.

I wonder if the Hadley models' restart issues are also due to a similar wrapper issue, or whether it's an issue in the models themselves?
©2024 cpdn.org