Thread 'OpenIFS Discussion'

Author	Message
Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 68462 - Posted: 25 Feb 2023, 9:27:51 UTC - in response to Message 68458. This task failed with a divide by zero error. Presumably this is one of those cases where the physics of the model get out of control? Yes. It blew up in the model's convection code, clouds etc. Ran for a long time though, 49 days, before the instability occurred. I would bet a beer that the rerun might be successful. --- CPDN Visiting Scientist

Richard Haselgrove Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533	Message 68463 - Posted: 25 Feb 2023, 9:35:46 UTC Task 22316388 failed with "process exited with code 9 (0x9, -247)". But there's no error in the portion of stderr.txt that we can see (from upload 97 to the end). I can only guess that there was a child process error earlier in the run: the restart succeeded, but the error flag wasn't cleared from the BOINC task status. The final task finish looks normal, with: ..The child process terminated with status: 0 ... Uploading the final file: upload_file_122.zip Uploading trickle at timestep: 10623600 07:35:35 (41942): called boinc_finish(0) That's going to be a tough one to debug.

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 68465 - Posted: 25 Feb 2023, 12:15:54 UTC - in response to Message 68463. Last modified: 25 Feb 2023, 12:16:59 UTC Task 22316388 failed with "process exited with code 9 (0x9, -247)". But there's no error in the portion of stderr.txt that we can see (from upload 97 to the end). I can only guess that there was a child process error earlier in the run: the restart succeeded, but the error flag wasn't cleared from the BOINC task status. The final task finish looks normal, with: ..The child process terminated with status: 0 ... Uploading the final file: upload_file_122.zip Uploading trickle at timestep: 10623600 07:35:35 (41942): called boinc_finish(0) That's going to be a tough one to debug. I am inclined to think this is a boinc issue, not ours. The output shows the model & task completed normally, all log files look ok, boinc_finish() was called... and then code 9 (EBADF: bad file number). I think it's to do with the final cleanup, but quite what, I am not sure. Or it may be that boinc expects us to be doing something we're not doing. Either way, it's file related and happens either inside boinc_finish or just as the task code exits after boinc_finish. I was going to look again at the way we clean up in the task to see if we missed anything. --- CPDN Visiting Scientist
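
As a side note on the exit-code line discussed above, here is a minimal Python sketch (not BOINC or CPDN code) of how the three numbers relate and which errno name corresponds to 9. The `code - 256` rule is an assumption inferred only from the two messages quoted in this thread ("code 9 (0x9, -247)" and "code 1 (0x1, -255)"), and the function names are made up for illustration. Whether an exit code of 9 really is an errno value is Glenn's reading, not something the sketch can confirm.

```python
import errno
import os


def describe_exit_code(code: int) -> str:
    """Format an exit code like the messages quoted in this thread.

    Assumption: the third field is the code minus 256, which matches
    both examples seen here (9 -> -247, 1 -> -255).
    """
    return f"process exited with code {code} ({hex(code)}, {code - 256})"


def errno_name(code: int) -> str:
    """Return the symbolic errno name for a number, or 'unknown'."""
    return errno.errorcode.get(code, "unknown")


if __name__ == "__main__":
    print(describe_exit_code(9))
    print(describe_exit_code(1))
    # Interpreting the exit code as an errno value is only a hypothesis.
    print(errno_name(9), "-", os.strerror(9))
```

The errno lookup only shows what 9 would mean if the exit code came from a failed file operation; a process can also exit with 9 for unrelated reasons.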

Richard Haselgrove Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533	Message 68466 - Posted: 25 Feb 2023, 13:08:05 UTC - in response to Message 68465. I've pulled the overnight event log from the system journal, but there are no signs of any errors in there - seemed to be a normal finish after the final zip.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 68467 - Posted: 25 Feb 2023, 16:34:59 UTC - in response to Message 68450. My fastest and best-working cruncher recently got the definitely dead tasks, which all errored out, and now it has a daily quota of 1. :-( OK: I now have three losers and two winners. No more in the hopper.
OpenIFS 43r3 1.21 x86_64-pc-linux-gnu
Number of tasks completed: 2
Max tasks per day: 5
Number of tasks today: 1
Consecutive valid tasks: 2
Average processing rate: 29.66 GFLOPS
Average turnaround time: 0.62 days
All OpenIFS 43r3 tasks for computer 1511241
22316084  12214703  24 Feb 2023, 22:24:03 UTC  25 Feb 2023, 13:43:22 UTC  Completed  53,538.60  52,734.92  0.00  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314976  12213630  24 Feb 2023, 12:25:29 UTC  25 Feb 2023, 3:03:19 UTC  Completed  52,615.79  51,784.63  0.00  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314676  12213385  22 Feb 2023, 6:23:59 UTC  22 Feb 2023, 7:24:41 UTC  Error while computing  66.16  1.15  ---  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314647  12213316  22 Feb 2023, 3:24:44 UTC  22 Feb 2023, 3:49:31 UTC  Error while computing  66.61  1.28  ---  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314608  12213345  22 Feb 2023, 0:25:23 UTC  22 Feb 2023, 1:23:20 UTC  Error while computing  66.38  1.15  ---  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 68468 - Posted: 25 Feb 2023, 17:00:06 UTC OpenIFS standalone high resolution (high memory) tests. Thanks to those who messaged me that they were interested in these. For family reasons I am not able to spend much time on CPDN voluntary work at the moment, but I will get to this as soon as I can. --- CPDN Visiting Scientist

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4563 Credit: 19,039,635 RAC: 18,944	Message 68471 - Posted: 25 Feb 2023, 21:59:49 UTC Last modified: 25 Feb 2023, 22:00:30 UTC Another failure, this time with "CNT0 not found; string returned was: 'STEPO'", here

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 68472 - Posted: 25 Feb 2023, 22:27:54 UTC - in response to Message 68471. Last modified: 25 Feb 2023, 22:28:50 UTC Dave, look further back in the stderr output. You'll see the model fell over, in the radiation code this time. The 'STEPO' message is just a check the control code does on the model output to see where it's got to and if it's still working (which it wasn't in this case). The extra printout at the bottom is just so we can figure out what went wrong. The model output will always be further back. On the plus side, I'm confident I've pinned down where the double free corruption is coming from. It's in the code that handles the trickles, which has now been rewritten. Another failure this time with, CNT0 not found; string returned was: 'STEPO' here --- CPDN Visiting Scientist

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 68473 - Posted: 26 Feb 2023, 6:54:11 UTC - in response to Message 68323. You're also the only person that I've seen who uses RHEL. I wonder if Glenn has seen any correlations between failure rates and Linux distros? I should mention that the main difference between RHEL distributions and Fedora distributions is that Fedora releases are quite a bit more recent in the sense of additions and enhancements, whereas the RHEL distributions are meant for stability and tend to have no enhancements at all other than Thunderbird and Firefox. Even those two are "extended support" releases; e.g., my Firefox is 102.8.0esr (64-bit). So new updates in Fedora are much more frequent than those for RHEL. RHEL tends to have a major release about every 18 months, and each release is supported for 10 years. I do not know what the support for Fedora releases is.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4563 Credit: 19,039,635 RAC: 18,944	Message 68474 - Posted: 26 Feb 2023, 7:30:03 UTC - in response to Message 68472. Dave, look further back in the stderr output. You'll see the model fell over, in the radiation code this time. Thanks, I did look but must have missed it.

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 68477 - Posted: 26 Feb 2023, 10:06:34 UTC - in response to Message 68462. Dave, as I suspected, the resend of your failed task (Ryzen) worked fine. It landed on an Intel Xeon and completed. Another example of what are probably single-bit differences in computation making a difference in parts of the model that are very sensitive to small changes when comparing two numbers. It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the Intel compiler instead of Intel+Intel and see what happens :) https://www.cpdn.org/workunit.php?wuid=12215285 This task failed with a divide by zero error. Presumably this is one of those cases where the physics of the model get out of control? Yes. It blew up in the model's convection code, clouds etc. Ran for a long time though, 49 days, before the instability occurred. I would bet a beer that the rerun might be successful. --- CPDN Visiting Scientist
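
To illustrate the point about single-bit differences flipping a comparison, here is a toy Python sketch. It is not OpenIFS code; the values and the `threshold` are hypothetical and chosen only because the effect is easy to reproduce with them. The same three numbers are summed in two different orders, which is the kind of reordering that different hardware or compiler optimisations may introduce.

```python
# Floating-point addition is not associative: the order of operations
# changes the rounding, so two "identical" sums can differ in the last bit.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one evaluation order
right = a + (b + c)  # the same sum, evaluated in a different order

threshold = 0.6      # hypothetical cut-off used by some branch in a model

print(left, right)
print("results equal:", left == right)
print("left exceeds threshold:", left > threshold)
print("right exceeds threshold:", right > threshold)
```

Here the two sums differ by a tiny amount, yet only one of them exceeds the threshold, so code that branches on the comparison takes a different path. In a chaotic model such a difference can grow over many timesteps.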

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4563 Credit: 19,039,635 RAC: 18,944	Message 68478 - Posted: 26 Feb 2023, 10:26:37 UTC - in response to Message 68477. Dave, as I suspected, the resend of your failed task (Ryzen) worked fine. It landed on an Intel Xeon and completed. Another example of probably single bit differences in computation making a difference in parts of the model that are very sensitive to small changes when comparing two numbers. It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the intel compiler instead of Intel+Intel and see what happens :) And this one completed for me on its final chance. Of the two others, one was a Ryzen and one an Intel, for which I suspect lack of memory accounts for the failure rate. The Ryzen failed with ABORT! 1 !! *** WAVE MODEL HAS ABORTED

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 68479 - Posted: 26 Feb 2023, 10:33:42 UTC - in response to Message 68478. Dave, as I suspected, the resend of your failed task (Ryzen) worked fine. It landed on an Intel Xeon and completed. Another example of probably single bit differences in computation making a difference in parts of the model that are very sensitive to small changes when comparing two numbers. It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the intel compiler instead of Intel+Intel and see what happens :) And this one completed for me on its final chance. Of the two others, one was a Ryzen and one an Intel, for which I suspect lack of memory accounts for the failure rate. The Ryzen failed with ABORT! 1 !! *** WAVE MODEL HAS ABORTED The wave model abort is an indication of an earlier failed task with memory corruption. --- CPDN Visiting Scientist

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4563 Credit: 19,039,635 RAC: 18,944	Message 68480 - Posted: 26 Feb 2023, 11:07:56 UTC - in response to Message 68479. My last one, which is currently 65% complete, failed after 16 hours of CPU time on another machine with only <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255)</message> <stderr_txt> </stderr_txt> ]]> in the stderr, which seems a bit odd. I am assuming that the machine running too many tasks for the amount of RAM is the culprit here.

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 68481 - Posted: 26 Feb 2023, 12:45:09 UTC - in response to Message 68480. Yep, I agree. That machine (https://www.cpdn.org/show_host_detail.php?hostid=1531276) has barely had any successful tasks complete. Utter waste of compute time. I really wish people would not make themselves anonymous, as then I could message them and help sort out any issues. I'm surprised they haven't tried to look into it more given the failure rate, but maybe they are running other projects and not interested. My last one, which is currently 65% complete, failed after 16 hours of CPU time on another machine with only <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255)</message> <stderr_txt> </stderr_txt> ]]> in the stderr, which seems a bit odd. I am assuming that the machine running too many tasks for the amount of RAM is the culprit here. --- CPDN Visiting Scientist

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 68482 - Posted: 26 Feb 2023, 13:24:11 UTC - in response to Message 68481. Yep, I agree. That machine (https://www.cpdn.org/show_host_detail.php?hostid=1531276) has barely had any successful tasks complete. Utter waste of compute time. I really wish people would not make themselves anonymous as then I could message them and help sort out any issues. I'm surprised they haven't tried to look into it more given the failure rate but maybe they are running other projects and not interested. I looked at a few of the tasks on that machine and have some impressions. Not that there is anything wrong with the machine, but ... 1.) It gets a fantastically large number of suspends and resumes .... 2.) It is running the CentOS 7 distribution. CentOS is much like RHEL, so it is like RHEL7. I happen to be running Red Hat Enterprise Linux release 8.7 (Ootpa), which is a whole new generation (i.e., about 1 1/2 years newer) than 7. And RHEL9 has been available for a while now. 3.) Similarly, I am running boinc-client 7.20.2, but that machine is running 7.16.something. 4.) It has only 32 GBytes of RAM. That is not necessarily bad, depending on what tasks are being run. If too many, that might account for all those suspends and resumes. Now none of these is necessarily a problem, but taken together ... ? I think #1 is very suspicious.

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 68484 - Posted: 26 Feb 2023, 15:13:50 UTC - in response to Message 68482. Yep, I agree. That machine (https://www.cpdn.org/show_host_detail.php?hostid=1531276) has barely had any successful tasks complete. Utter waste of compute time. I really wish people would not make themselves anonymous as then I could message them and help sort out any issues. I'm surprised they haven't tried to look into it more given the failure rate but maybe they are running other projects and not interested. I looked at a few of the tasks on that machine and have some impressions. Not that there is anything wrong with the machine, but ... 1.) It gets a fantastically large number of suspends and resumes .... I think #1 is very suspicious. A large number of suspends/resumes is not a problem unless 'leave non-gpu in memory' is not set, and it is set in this case; otherwise we'd see the model constantly restarting. It just means the percentage of CPU time is set to much less than 100%. But when I look at the task logs, almost always the model is being killed by signal 9 (kill -SIGKILL), in some cases when it's almost finished. I don't know if boinc is sending the signal. Maybe Richard might know. I would have thought boinc would send a different signal (SIGQUIT?), which should just stop the task but allow it to restart later. Maybe the operating system killed it because of memory oversubscription, as Dave mentioned (OOM killer). There's not much to go on in the task output. It's not anything to do with the OS flavour in use. It may not be the fault of the user, of course. We know that the boinc client doesn't keep very good track of memory used by OpenIFS. --- CPDN Visiting Scientist
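
For anyone wanting to see what "killed by signal 9" looks like from the parent's side, here is a small Python sketch under stated assumptions: it runs on a POSIX system (Linux, macOS), and the sleeping child is only a stand-in for a model process. The function name `kill_and_report` is made up for this example, and none of this is BOINC or OpenIFS code.

```python
import signal
import subprocess
import sys


def kill_and_report():
    """Start a stand-in child process, SIGKILL it, and decode its status."""
    # The child just sleeps; it stands in for a long-running model process.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)  # cannot be caught or ignored
    child.wait()
    # On POSIX, a negative returncode -N means "terminated by signal N".
    name = signal.Signals(-child.returncode).name
    return child.returncode, name


if __name__ == "__main__":
    returncode, name = kill_and_report()
    print("return code:", returncode)
    print("signal name:", name)
```

The status only says which signal ended the process, not who sent it. When the Linux OOM killer is responsible, the kernel log normally records the kill, so that is one place to check on an affected machine.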

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4563 Credit: 19,039,635 RAC: 18,944	Message 68485 - Posted: 26 Feb 2023, 16:49:51 UTC For the record, the task completed on my box. Now waiting for more work!

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 68487 - Posted: 26 Feb 2023, 19:13:07 UTC - in response to Message 68484. A large number of suspends/resumes is not a problem unless 'leave non-gpu in memory' is not set, and it is set in this case; otherwise we'd see the model constantly restarting. It just means the percentage of CPU time is set to much less than 100%. I am prepared to accept that in this case. But I still do not understand this faith that leave non-gpu in memory guarantees that the task(s) so marked will be kept in memory. Here is my boinc usage of my machine at the moment.
   PID  PPID USER  PR NI S    RES %MEM %CPU  P     TIME+ COMMAND
260110  2207 boinc 39 19 R   1.9g  1.5 98.9  3 173:48.59 ../../projects/einstein.phys.uwm.edu/einstein_O3MD1_1.03_x86_64-pc-linux+
269459  2207 boinc 39 19 R 326720  0.2 98.9  2  28:02.31 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
261928  2207 boinc 39 19 R 321496  0.2 99.0  4 143:13.60 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
263896  2207 boinc 39 19 R 317356  0.2 99.4  9 113:54.09 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
  2207     1 boinc 30 10 S  41324  0.0  0.1  8  44761:25 /usr/bin/boinc <---<<< boinc client
269447  2207 boinc 39 19 R   7020  0.0 99.2  1  28:23.18 ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_1.46_x86_64-pc-linu+
270386  2207 boinc 39 19 R   6996  0.0 99.2  7  14:32.60 ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_1.46_x86_64-pc-linu+
270663  2207 boinc 39 19 R   5884  0.0 99.0 13   9:51.22 ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_1.46_x86_64-pc-linu+
267477  2207 boinc 39 19 R   4984  0.0 99.1  8  60:56.19 ../../projects/universeathome.pl_universe/BHspin2_20_x86_64-pc-linux-gnu
268160  2207 boinc 39 19 R   4560  0.0 99.1  0  48:48.88 ../../projects/universeathome.pl_universe/BHspin2_20_x86_64-pc-linux-gnu
268791  2207 boinc 39 19 R   4480  0.0 99.3  6  39:17.70 ../../projects/universeathome.pl_universe/BHspin2_20_x86_64-pc-linux-gnu
The column labelled PR (Priority) shows that the task processes run at priority 39, and that the boinc-client runs at priority 30. Now in Linux (and UNIX), the higher the "priority" number, the less likely a process will run. Furthermore, a process not assigned a priority (i.e., most processes) runs at PR 20. So if someone wants to run a new process that will run at PR 20, it will force an existing process with a higher PR to lose control of its processor so that the PR 20 process can get one. The column labelled S (Status) is R for running and S for sleeping. There are other possibilities, but none apply here. Now if this process needs more RAM than is currently available, what happens? I have not studied the code of the Linux kernel ever, and I never looked at the code for UNIX since the 1970s (when it was written in assembler for the PDP-11), and even then I was no expert. But it would seem to me that it would normally just swap out the process with higher PR to make room for the new one. But I do not believe in those days one could lock a process to core. So what happens now? It seems to me the kernel has two choices: 1.) ignore the leave non-gpu in memory and swap the process out or 2.) refuse to start the new process. Is there a third option? If not, which is done? It seems to me that neither of the two options I suggested is acceptable.
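
The PR and NI columns in the listing above are related: for ordinary (non-real-time) processes, Linux top shows PR as 20 plus the nice value. The short Python sketch below just reproduces that arithmetic for the rows shown; the function name `top_priority` and the labels are made up for this illustration.

```python
def top_priority(nice: int) -> int:
    """Return the PR value top shows for a normal process with this nice value.

    Applies to ordinary time-sharing processes only; real-time
    scheduling classes are displayed differently.
    """
    if not -20 <= nice <= 19:
        raise ValueError("nice values range from -20 to 19")
    return 20 + nice


if __name__ == "__main__":
    # Nice values taken from the NI column of the listing in the post.
    rows = {
        "science tasks": 19,
        "boinc client": 10,
        "default process": 0,
    }
    for label, nice in rows.items():
        print(f"{label}: NI={nice} PR={top_priority(nice)}")
```

So the science tasks at NI 19 sit at the lowest CPU priority available to normal processes. Note that this governs CPU scheduling only; it says nothing by itself about which process's memory gets swapped out.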

SolarSyonyk Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 68490 - Posted: 26 Feb 2023, 20:30:44 UTC - in response to Message 68477. Last modified: 26 Feb 2023, 20:31:43 UTC It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the intel compiler instead of Intel+Intel and see what happens :) Is there a reason you can't? Send out a couple hundred otherwise identical WUs in a few batches and compare/contrast results? I don't think there's a shortage of willing CPU cores right now. Or even just have some people run the binaries manually and send you results somehow. I've got a range of AMD systems that are mostly bored!