Why do tasks crash on some machines but not others?

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61484 - Posted: 7 Nov 2019, 14:03:37 UTC

I was looking at some tasks that I have run that failed for others, typically two others.

One of these was Workunit 11901525

Now some of them crash after a large fraction of a second, or a few seconds. I am ignoring these.

But some run a long time, such as Task 21754793

This one ran for a CPU time of 2 days 1 hour 32 min 7 sec

And the failure was
<stderr_txt>
CPDN Monitor - Quit request from BOINC...

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(


Now if the program had bugs, or if the initial data were bad, would it not have crashed for me too?
But since I completed it correctly, it seems to me that the program probably had no bugs, and that the initial data were good too. So why did the Model crash for the others?
ID: 61484
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61486 - Posted: 7 Nov 2019, 14:27:08 UTC

Research way back at the start of this project showed that starting from the same initial conditions on different computers could lead to different results.
How different the results were depended on how different the computers were.

So the computers you've looked at are probably different enough from yours to push the model into an unstable condition in some cases, for example through slight overclocking.
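
To illustrate how a tiny numerical difference can grow into a completely different result, here is a minimal sketch using the logistic map, a standard toy chaotic system. It is not the climate model; the function name, the parameter `r=3.9` and the starting values are assumptions chosen only for illustration.

```python
def logistic_trajectory(x0, r=3.9, steps=200):
    """Iterate the logistic map x -> r*x*(1-x), a toy chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two starting points that differ by roughly the size of a rounding error.
a = logistic_trajectory(0.4)
b = logistic_trajectory(0.4 + 1e-15)

# Gap between the two runs at each step.
gaps = [abs(x - y) for x, y in zip(a, b)]
for step in (0, 50, 100, 150, 200):
    print(step, gaps[step])
```

The gap starts near the rounding level and grows until the two runs are unrelated, which is the kind of behaviour that could let one machine's run stay stable while another's drifts into an unstable state.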
ID: 61486
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2185
Credit: 64,822,615
RAC: 5,275
Message 61487 - Posted: 7 Nov 2019, 14:36:04 UTC

I would also agree. Some of the negative pressure/negative theta errors are completely repeatable, and all tasks in a work unit that get that far will crash at virtually the same model progress point. These are due to the scientists testing the limits of the parameters used in the model, which sometimes leads to unrealistic atmospheres.

For others, it might be due to hardware that is not quite up to the task when working hard and/or overheats, or has bad memory that only fails under certain conditions.
ID: 61487
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61491 - Posted: 7 Nov 2019, 16:25:06 UTC - in response to Message 61487.  

"Some of the negative pressure/negative theta errors are completely repeatable, and all tasks in a work unit that get that far will crash at virtually the same model progress point. These are due to the scientists testing the limits of the parameters used in the model, which sometimes leads to unrealistic atmospheres."


To the extent that this is true, it would explain why my task crashed too. But it did not crash. It went all the way to a successful completion.

"For others, it might be due to hardware that is not quite up to the task when working hard and/or overheats, or has bad memory that only fails under certain conditions."


Mine does not overheat even though it is currently running four hadam4h N216 processes; its fan is not even running fast.

On the work unit I posted, two computers failed before I got mine, the one that completed. I realize the following is bad statistics, but I could conclude that my computer is better than 2/3 of those working on this. My guess is that it is not as bad as that, because I did not look at all the work units that completed on the very first try, or within the first two.
ID: 61491
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2185
Credit: 64,822,615
RAC: 5,275
Message 61492 - Posted: 7 Nov 2019, 18:49:52 UTC - in response to Message 61491.  

I simply gave you a couple of reasons for negative pressure errors; I was not implying that the model crashed on your PC, which you had clearly stated it had not.

I'm assuming this is the work unit with the negative pressure that you were writing about: https://www.cpdn.org/workunit.php?wuid=11901525

Both crashes on that work unit occurred after one trickle, and both were on Ryzens rather than Intel or earlier AMD CPUs. So what Les mentioned, different computers producing different results from the same parameters and initial conditions, would appear to be in play here. Back in the day, when we saw things like this, one of the project scientists said that if it happens on otherwise stable computers (like those Ryzens), it is because the parameters are pushing the boundaries toward an unstable atmosphere. Small differences between CPU types in the calculations for the tasks in such a work unit accumulate over the run, so one task may veer off into instability while another produces a stable (but who knows how realistic) atmosphere.
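
One plausible source of such small differences (an assumption here, since the thread does not say which mechanism applies) is that floating-point addition is not associative, so the same arithmetic carried out in a different order can differ in the last bits. A minimal sketch:

```python
# The same three numbers added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Whether the two orderings agree exactly, and by how much they differ.
same = left_to_right == right_to_left
difference = abs(left_to_right - right_to_left)
print(same, difference)
```

A difference this small is harmless in a single sum, but a model that repeats billions of such operations over a long run can amplify it when the parameters sit close to an unstable regime.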

They conducted a study and published a paper in 2007 with information on the types of differences one might see between CPU types for given simulations: "Association of parameter, software and hardware variation with large scale behavior across 57,000 climate models". It doesn't answer your question directly, but it discusses the differences in output and what they mean for the ensembles.
ID: 61492
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 61493 - Posted: 7 Nov 2019, 21:30:55 UTC - in response to Message 61492.  

Another question that is often asked about this is what it means for the reliability of the models. Reading the previous posts, I would certainly be asking that if I were new here.

The answer is that a statistical analysis program is used to assess the validity of the results, and it throws out results that are deemed to be way off the mark.

I am afraid the statistical tools used for this are almost certainly beyond the level to which I have studied statistics, so I can't comment further on that.
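
Purely as an illustration of what "throwing out results that are way off the mark" can look like, here is a minimal sketch of one common outlier test based on the median absolute deviation. This is not the project's actual method, which is not described in this thread; the function name, the threshold and the ensemble values are all invented for the example.

```python
from statistics import median

def reject_outliers(values, threshold=3.5):
    """Keep values whose modified z-score (median absolute deviation based) is within threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values are essentially identical; nothing to reject.
        return list(values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]

# Hypothetical ensemble results (invented numbers, arbitrary units).
ensemble = [2.9, 3.1, 3.0, 3.3, 2.8, 3.2, 11.5, 3.0, -4.0]
kept = reject_outliers(ensemble)
print(kept)
```

The median is used instead of the mean so that the extreme values being tested do not distort the yardstick they are measured against.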
ID: 61493
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61494 - Posted: 8 Nov 2019, 0:08:13 UTC - in response to Message 61492.  

"They conducted a study and published a paper in 2007 with information on the types of differences one might see between CPU types for given simulations: 'Association of parameter, software and hardware variation with large scale behavior across 57,000 climate models'. It doesn't answer your question directly, but it discusses the differences in output and what they mean for the ensembles."


Thank you. That is a really interesting paper.
ID: 61494
KAMasud

Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 61549 - Posted: 16 Nov 2019, 17:13:10 UTC

Do you have a GPU that is also crunching for BOINC, and have you set aside any CPU for it, or are you running on 100% of the CPUs? I had this problem, but now I run tasks on only 50% of the CPUs, which solved it, at least for me. I think the GPU also needs its space. Even with BOINC running on 50% of the CPUs, the load in Task Manager sometimes reaches 70%; it is the GPU work consuming the rest. Another benefit is that my CPU clock speed has increased.
ID: 61549
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,966,742
RAC: 21,869
Message 61550 - Posted: 16 Nov 2019, 17:30:13 UTC - in response to Message 61549.  

I think it is more complex than that. On all operating systems, stopping and restarting BOINC can lead to problems, but more so with Linux tasks, at least in my experience. Overclocking certainly makes crashes more likely. I have never had a GPU that will work with BOINC, but I can believe that crunching on a GPU alongside CPDN work could increase the risk. On some recent machines, running on all 40 threads means there is not enough CPU cache, and there is a machine out there crashing everything with errors because of that. The bottom line is that there are several causes, and some, I suspect, will always remain conjecture.
ID: 61550
