Why do tasks crash on some machines but not others?

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
Message 61484 - Posted: 7 Nov 2019, 14:03:37 UTC

I was looking at some tasks that I have run that failed for others, typically two others. One of these was Workunit 11901525.

Now some of them crash after a large fraction of a second, or a few seconds. I am ignoring these. But some run a long time, such as Task 21754793. This one ran for a CPU time of 2 days 1 hours 32 min 7 sec, and the failure was:

<stderr_txt>
CPDN Monitor - Quit request from BOINC...
Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy
Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy
Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy
Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy
Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy
Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(

Now if the program had bugs, or if the initial data were bad, would it not have crashed for me too? But since I completed it correctly, it seems to me that the program probably had no bugs, and that the initial data were good too. So why did the model crash for the others?

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 61486 - Posted: 7 Nov 2019, 14:27:08 UTC

Research way back at the start of this project showed that starting from the same initial conditions on different computers could lead to different results. How different depended on the difference between the computers. So the machines you've looked at are probably different enough to push the model into an unstable condition in some cases, for example because they are slightly overclocked.

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275
Message 61487 - Posted: 7 Nov 2019, 14:36:04 UTC

I would also agree. Some of the negative pressure/negative theta errors are completely repeatable, and all tasks in a work unit that get that far will crash at virtually the same model progress point. These are due to the scientists testing the limits of the parameters used in the model, which sometimes leads to unrealistic atmospheres.

For others, it might be due to hardware that is not quite up to the task when working hard and/or overheats, or has bad memory that only fails under certain conditions.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
Message 61491 - Posted: 7 Nov 2019, 16:25:06 UTC - in response to Message 61487.

"Some of the negative pressure/negative theta errors are completely repeatable, and all tasks in a work unit that get that far will crash at virtually the same model progress point. These are due to the scientists testing the limits of the parameters used in the model, which sometimes leads to unrealistic atmospheres."

If that were the cause here, my task would have crashed too. But it did not crash. It went all the way to a successful completion.

"For others, it might be due to hardware that is not quite up to the task when working hard and/or overheats, or has bad memory that only fails under certain conditions."

Mine does not overheat even though it is currently running four hadam4h N216 processes; its fan is not even running fast. On the work unit I posted, two computers failed before I got mine, which was the one that completed. I realize the following is bad statistics, but I could conclude that my computer is better than 2/3 of those working on this. My guess is that it is not as bad as that, because I did not look at all the work units that completed on the very first try, or within the first two.
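The "bad statistics" point above can be made concrete with a small sketch. It rests on two assumptions that are not from this thread: every task fails independently with the same probability `p`, and a work unit is reissued until one task succeeds. Under those assumptions, looking only at work units that already had two failures always shows a 2/3 failure fraction, whatever the true rate is. The function name and parameters are hypothetical.

```python
def failure_fractions(p, k_selected=2, k_max=500):
    """Return (overall, selected) fractions of failed tasks.

    Assumptions (hypothetical, not from the project): each task fails
    independently with probability p, and a work unit is reissued
    until one task succeeds.
    """
    tasks = 0.0
    failures = 0.0
    for k in range(k_max + 1):
        # Probability that a work unit sees exactly k failures, then a success.
        weight = (p ** k) * (1 - p)
        failures += weight * k
        tasks += weight * (k + 1)
    overall = failures / tasks                 # fraction failed over all work units
    selected = k_selected / (k_selected + 1)   # fraction failed in the chosen subset
    return overall, selected


overall, selected = failure_fractions(p=0.05)
print(f"all work units: {overall:.3f}")
print(f"only work units with 2 prior failures: {selected:.3f}")
```

With a true failure rate of 5%, the overall fraction stays near 5% while the hand-picked subset shows two thirds. The independence assumption is itself doubtful here, since the replies in this thread suggest some failures are driven by the work unit's parameters rather than by the host.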

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275
Message 61492 - Posted: 7 Nov 2019, 18:49:52 UTC - in response to Message 61491.

"'Some of the negative pressure/negative theta errors are completely repeatable, and all tasks in a work unit that get that far will crash at virtually the same model progress point. These are due to the scientists testing the limits of the parameters used in the model, which sometimes leads to unrealistic atmospheres.' If that were the cause here, my task would have crashed too. But it did not crash. It went all the way to a successful completion."

"'For others, it might be due to hardware that is not quite up to the task when working hard and/or overheats, or has bad memory that only fails under certain conditions.' Mine does not overheat even though it is currently running four hadam4h N216 processes; its fan is not even running fast. On the work unit I posted, two computers failed before I got mine, which was the one that completed. I realize the following is bad statistics, but I could conclude that my computer is better than 2/3 of those working on this. My guess is that it is not as bad as that, because I did not look at all the work units that completed on the very first try, or within the first two."

I simply gave you a couple of reasons for negative pressure errors. I was not implying that the model crashed on your PC, which you had clearly stated it had not. I'm assuming this is the work unit with the negative pressure that you were writing about: https://www.cpdn.org/workunit.php?wuid=11901525

Both crashes on that work unit came after one trickle, and both were on Ryzens as opposed to any Intel or earlier AMD CPUs. So the thing Les mentioned, about different computers producing different results for the same parameters and initial conditions, would appear to be in play here.

Back in the day when we would see stuff like this, one of the project scientists said that if it happened on otherwise stable computers (like those Ryzens), it was because the parameters were pushing the boundaries of what would produce an unstable atmosphere. Small differences between CPU types in the calculations for the tasks in such a work unit accumulate over the computation time of the run, so one task may veer off into instability while another produces a stable (but who knows how realistic) atmosphere.

They conducted a study and produced a publication back in 2007 with information on the types of differences one might see between CPU types for given simulations: "Association of parameter, software and hardware variation with large scale behavior across 57,000 climate models". It doesn't answer your question directly, but it talks about the differences in output and what they mean for the ensembles.
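How a tiny calculation difference can grow until two runs no longer resemble each other can be shown with a toy example. This is only an illustrative sketch: it uses the logistic map, a standard textbook chaotic system, as a stand-in and has nothing to do with the actual climate model code. The starting value, the `1e-12` offset (standing in for a rounding difference between two CPUs) and the function name are all assumptions chosen for illustration.

```python
def logistic_run(x0, r=3.9, steps=300):
    """Iterate the logistic map x -> r * x * (1 - x) and return every value."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs whose starting points differ by one part in a trillion,
# a toy stand-in for a rounding difference between two CPU types.
run_a = logistic_run(0.4)
run_b = logistic_run(0.4 + 1e-12)

# Absolute gap between the two runs at every step.
gaps = [abs(a - b) for a, b in zip(run_a, run_b)]

# First step at which the two runs clearly disagree.
first_big = next(i for i, g in enumerate(gaps) if g > 0.1)
print(f"gap after 5 steps: {gaps[5]:.2e}")
print(f"runs differ by more than 0.1 from step {first_big}")
```

The two runs agree closely at first and then separate completely. In a model whose parameters sit near a stability boundary, that kind of separation could plausibly be the difference between a run that completes and one that produces a negative pressure.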

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,971,712 RAC: 21,921
Message 61493 - Posted: 7 Nov 2019, 21:30:55 UTC - in response to Message 61492.

Another question that is often asked in this context is: what does it mean for the reliability of the models? Certainly, reading the previous posts, I would be asking that if I were new here. The answer is that a statistical analysis program is used to assess the validity of the results, and it throws out the results that are deemed to be way off the mark. I am afraid the statistical tools used for this are almost certainly beyond the level to which I have studied statistics, so I can't comment further on that.
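For readers wondering what "throwing out results that are way off the mark" can look like in general, here is a generic sketch. It is not the project's method, which is not described in this thread; it is a common textbook screen based on the median and the median absolute deviation (MAD). The function name, the threshold of 3.5 and the sample numbers are all made up for illustration.

```python
from statistics import median


def screen_outliers(values, threshold=3.5):
    """Keep values whose modified z-score is within the threshold.

    Generic median/MAD screen, shown only as an illustration; it is an
    assumption, not the analysis the project scientists actually use.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values (or most of them) are identical; nothing to screen.
        return list(values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]


# Made-up summary numbers from six hypothetical runs; one is far off.
results = [3.1, 2.9, 3.4, 3.0, 3.2, 9.8]
kept = screen_outliers(results)
print(kept)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme run barely moves them, so the screen is not distorted by the very result it is meant to catch.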

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
Message 61494 - Posted: 8 Nov 2019, 0:08:13 UTC - in response to Message 61492.

"They conducted a study and produced a publication back in 2007 with information on the types of differences one might see between CPU types for given simulations: 'Association of parameter, software and hardware variation with large scale behavior across 57,000 climate models'. It doesn't answer your question directly, but it talks about the differences in output and what they mean for the ensembles."

Thank you. That is a really interesting paper.

KAMasud Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0
Message 61549 - Posted: 16 Nov 2019, 17:13:10 UTC

Do you have a GPU that is also crunching for BOINC, and have you set aside any CPU for it, or are you running on 100% of the CPUs? I had this problem, but now I only run tasks on 50% of the CPUs, which at least for me solved it. I think the GPU also needs its space. Anyway, even with BOINC running on 50% of the CPUs, the load in Task Manager sometimes reaches 70%. It is the GPU work consuming part of the remaining CPU capacity. A further benefit is that my CPU clock speed has increased.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,971,712 RAC: 21,921
Message 61550 - Posted: 16 Nov 2019, 17:30:13 UTC - in response to Message 61549.

"Do you have a GPU that is also crunching for BOINC, and have you set aside any CPU for it, or are you running on 100% of the CPUs? I had this problem, but now I only run tasks on 50% of the CPUs, which at least for me solved it. I think the GPU also needs its space. Anyway, even with BOINC running on 50% of the CPUs, the load in Task Manager sometimes reaches 70%. It is the GPU work consuming part of the remaining CPU capacity. A further benefit is that my CPU clock speed has increased."

I think it is more complex than that. On all operating systems, stopping and restarting BOINC can lead to problems, but more so with Linux tasks, at least in my experience. Overclocking certainly makes crashes more likely. I have never had a GPU that will work with BOINC, but I can believe that crunching on a GPU alongside CPDN work could increase the risk. On some recent machines, running on all 40 threads means there is not enough cache memory on the CPU, and there is a machine out there crashing everything with errors because of that. But the bottom line is that there are several causes, and some, I suspect, will always remain conjecture.