crash with code 251

old_user57798
Joined: 25 Feb 05
Posts: 6
Credit: 11,512
RAC: 0
Message 10220 - Posted: 2 Mar 2005, 15:52:23 UTC

I've got boinc/climate running on a couple of Windows OSs and a Linux box, but am having trouble with a second Linux box (GenuineIntel Intel(R) Pentium(R) 4 CPU 1700MHz). I've run mprime for 24 hrs on this machine without difficulties (as recommended elsewhere), but (with the same Debian version as the other one) I get "Model crash ....(process exited with code 251) .... Defering communication for 1 hr..." and no processing going on. Any suggestions?

kyle
ID: 10220
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 10247 - Posted: 2 Mar 2005, 20:17:17 UTC

Hi kyle
Are you using a network drive for the files?
If so, this is a bad move.

Les
ID: 10247
old_user57798
Joined: 25 Feb 05
Posts: 6
Credit: 11,512
RAC: 0
Message 10295 - Posted: 3 Mar 2005, 12:10:59 UTC - in response to Message 10247.  

> Hi kyle
> Are you using a network drive for the files?
> If so, this is a bad move.
>
> Les
>

No. And now, through no change on my part (except a restart), the model is running (and has been for about 24 hrs). I notice in the 'Nature' article that some parameter choices are unstable and cause the calculation to crash. Could the first set of model parameters on this machine have been unstable, and I now have a different set of parameters? How often does something like that happen? Is there a way to tell if that happened?

Thanks,

kyle
ID: 10295
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 10304 - Posted: 3 Mar 2005, 13:47:17 UTC
Last modified: 3 Mar 2005, 13:47:54 UTC

I've seen the 251 error code mentioned before, but I think the explanation was on the forum that is down.
You're right about unstable parameter sets. My 1st model crashed with -5 error, a general 'catch all' code.
Last year there was a spate of crashes that were called 'fast processing ice balls', because the earth,
in the vis, showed ice everywhere early in the crunching.

As it says here (http://www.climateprediction.net/science/strategy.php), under experiment 1, the current models are about finding out what is stable and what isn't.

So, keep crunching, and hopefully your new model will be successful. But if not, there's always someone here to help.

Les



ID: 10304
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 10305 - Posted: 3 Mar 2005, 13:55:19 UTC - in response to Message 10304.  

> I've seen the 251 error code mentioned before, but I think the explanation was
> on the forum that is down.
> You're right about unstable parameter sets. My 1st model crashed with -5
> error, a general 'catch all' code.

Just a thought to throw into this.

If you mask -5 down to a single byte and read it as unsigned you get ... 251
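
For anyone who wants to check that arithmetic, here is a minimal Python sketch. It is not taken from the project code, and the function name is made up for illustration. It assumes the reported code is simply the low 8 bits of the model's return value, which is how process exit statuses are truncated on Linux.

```python
import struct

def as_unsigned_byte(code: int) -> int:
    """Keep only the low 8 bits of a (possibly negative) status code."""
    return code & 0xFF

# Masking -5 down to one byte.
print(as_unsigned_byte(-5))  # prints 251

# Same result by packing -5 as a signed byte and reading it back unsigned.
print(struct.unpack("B", struct.pack("b", -5))[0])  # prints 251
```

So a "-5" reported on one platform and a "251" reported on another may well be the same underlying error.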
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 10305
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2183
Credit: 64,822,615
RAC: 5,275
Message 10306 - Posted: 3 Mar 2005, 14:37:12 UTC - in response to Message 10305.  

>
> If you mask -5 down to a single byte and read it as unsigned you get ... 251
>

That's right. Astro and I were having this error in sulphur alpha in Linux. The yabsd.out file had negative theta detected errors.
ID: 10306
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 10773 - Posted: 12 Mar 2005, 18:19:29 UTC
Last modified: 17 Mar 2005, 2:38:48 UTC

Two recent -251 errors (after 55 straight successfully-completed Models).

Bbox, SuSE 9.0, P4 3.0, ASUS P4P800 MB, 1 gig dual-channel CL2.0 mem.

Model #0m6u_n...nnn error -251 @ Phase 1, T.S.114,913

From zipped residual yabsd.out:
im,sm,ngroup,new_im,new_sm 1 1 48 T F
NEGATIVE THETA AT POINT 1 LEVEL 18
*********************************************************************************
Model aborted with error code - 1 Routine and message:-
ATM_DYN : NEGATIVE THETA DETECTED.
*********************************************************************************
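
If you have several residual yabsd.out files to compare, something like the following Python sketch could pull out the point/level diagnostics. It is only an illustration: the function name is hypothetical, and the pattern is based solely on the message formats visible in excerpts like the one above, so other message layouts may exist.

```python
import re

# Pattern derived from the excerpts in this thread only (an assumption);
# the LEVEL part is optional because pressure messages do not show one.
POINT_RE = re.compile(
    r"NEGATIVE (THETA|PRESSURE) AT POINT +(\d+)(?: +LEVEL +(\d+))?"
)

def find_negative_fields(text: str):
    """Return (field, point, level) tuples; level is None when absent."""
    hits = []
    for m in POINT_RE.finditer(text):
        level = int(m.group(3)) if m.group(3) else None
        hits.append((m.group(1), int(m.group(2)), level))
    return hits

sample = """im,sm,ngroup,new_im,new_sm 1 1 48 T F
NEGATIVE THETA AT POINT 1 LEVEL 18
ATM_DYN : NEGATIVE THETA DETECTED.
"""
print(find_negative_fields(sample))  # prints [('THETA', 1, 18)]
```

To use it on a real file, read the unzipped yabsd.out into a string and pass that in place of `sample`.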


Negative Theta error was experienced twice in Sulfur Alpha tests, as George noted. They were also logged at Point 1, Level 18. (On different machines [Abox & Ebox, SuSE 9.0 & 9.1, P4 2.8 & P4 3.4], hence I'm not jumping to blame Bbox. Yet.) Tolu diagnosed it as a compiler problem, fell back, and released 4.10 for Sulfur Alpha Linux. If it's the same 4.10 we're running here, it's especially distressing to see the beast raise its ugly head again.


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[EDIT: Another Negative Theta! This machine had 15 straight completed CPDNboinc runs, now three failures in a row -- all while the parallel run on the P4 CPU continued on its merry way. -- This failure was identical to the one above and at almost the same point in the run; this one at T.S. 114193 rather than 114913. Well... Enough of that! Prime95 is crunching away on its seventh set of Self-Test iterations without error. I'll let it run overnight, after restarting boinc so the two CPDN Models will bring the CPU up to its usual operating temperature/load for the continuing test.]
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


More on a Bbox Model crash reported earlier:
Model #08fk_n...nnn error -251 @ Phase 1, T.S. 9073

From zipped residual yabsd.out:
REPLANCA: UPDATE REQUIRED FOR FIELD 38
REPLANCA - time interpolation for field 38
time,time1,time2 1260.000 720.0000 1440.000
hours,int,period 1260 720 8640
Information used in checking ancillary data set: position of lookup table in dataset: 6
Position of first lookup table referring to data type 3
Interval between lookup tables referring to data type 3 Number of steps 1
STASH code in dataset 32 STASH code requested 32
'Start' position of lookup tables for dataset in overall lookup array
241
im,sm,ngroup,new_im,new_sm 1 1 48 T F
NEGATIVE PRESSURE AT POINT 1837
NEGATIVE PRESSURE AT POINT 1838
NEGATIVE PRESSURE AT POINT 1839
NEGATIVE PRESSURE AT POINT 1933
NEGATIVE PRESSURE AT P
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 10773
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 11032 - Posted: 17 Mar 2005, 15:05:53 UTC

Re. the Edit in my post, below:

Torture Test ran 13 hours, 56 minutes - 0 errors, 0 warnings.
jim@Bbox:~/Desktop/Prime95>

So, why the failures? A "bad" batch of parameter sets? Any other possibilities?

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 11032
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2183
Credit: 64,822,615
RAC: 5,275
Message 11036 - Posted: 17 Mar 2005, 16:08:49 UTC - in response to Message 11032.  

> Re. the Edit in my post, below:
>
> Torture Test ran 13 hours, 56 minutes - 0 errors, 0 warnings.
> jim@Bbox:~/Desktop/Prime95>
>
> So, why the failures? A "bad" batch of parameter sets? Any other
> possibilities?
>
>
Probably a bad batch of parameter sets.

Jim, on the Prime95 on Linux, were both processors running at 100%? On Windows, depending on what torture test I specify, I get a lot of writing to the page file (this was on the default torture test), resulting in not 100% CPU utilization. By choosing specific tests, I can max out the processor.
ID: 11036
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 11047 - Posted: 17 Mar 2005, 20:42:37 UTC - in response to Message 11036.  
Last modified: 17 Mar 2005, 20:44:37 UTC

> > Re. the Edit in my post, below:
> >
> > Torture Test ran 13 hours, 56 minutes - 0 errors, 0 warnings.
> > jim@Bbox:~/Desktop/Prime95>
> >
> > So, why the failures? A "bad" batch of parameter sets? Any other
> > possibilities?
> >
> >
> Probably a bad batch of parameter sets.
>
> Jim, on the Prime95 on Linux, were both processors running at 100%? On
> Windows, depending on what torture test I specify, I get a lot of writing to
> the page file (this was on the default torture test), resulting in not 100%
> CPU utilization. By choosing specific tests, I can max out the processor.

Hi, George,

An hour, at most ran stand-alone. I could have ginned-up a second copy of Prime95 but chose to restart CPDN and then stop one of the CPDN runs. So, a good 13 hours were with the CPU maxed-out. [Edit: No unusual disk activity noted.]

More evidence for a bad batch, I think: Ebox upchucked its first ever run (not counting Sulfur_Alpha) -- with a negative Theta.

Curious, after noting Ebox's Model-time of failure, I went back and looked at the date/time for the other three failures. Recall the Alpha problem where runs all failed at a 144-TS boundary? Well, all four of these runs failed on the first TS after a 144-TS boundary.

Negative Theta @ TS 111457=13/05/1817 0030Z, TS 114193=10/07/1817 0030Z, TS 114913=25/07/1817 0030Z
Negative Pressure @ TS 9073=10/06/1811 0030Z
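
A quick way to confirm the boundary pattern for the four timesteps listed above is to take each one modulo 144. This is just a sketch in Python; the dictionary and function names are made up for illustration, and it assumes a remainder of 1 is what "first TS after a 144-TS boundary" means.

```python
STEPS_PER_BLOCK = 144  # the boundary interval discussed in this thread

# Failing timesteps and error types as quoted in this post.
failures = {
    111457: "Negative Theta",
    114193: "Negative Theta",
    114913: "Negative Theta",
    9073: "Negative Pressure",
}

def steps_past_boundary(ts: int, block: int = STEPS_PER_BLOCK) -> int:
    """Number of timesteps since the last multiple of `block`."""
    return ts % block

for ts, kind in failures.items():
    # Every line prints a remainder of 1.
    print(ts, kind, steps_past_boundary(ts))
```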

Looks familiar, eh?

Jim
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 11047
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2183
Credit: 64,822,615
RAC: 5,275
Message 11054 - Posted: 18 Mar 2005, 1:52:36 UTC - in response to Message 11047.  

> Curious, after noting Ebox's Model-time of failure, I went back and looked at
> the date/time for the other three failures. Recall the Alpha problem where
> runs all failed at a 144-TS boundary? Well, all four of these runs failed on
> the first TS after a 144-TS boundary.
>
> Negative Theta @ TS 111457=13/05/1817 0030Z, TS 114193=10/07/1817 0030Z, TS
> 114913=25/07/1817 0030Z
> Negative Pressure @ TS 9073=10/06/1811 0030Z
>
> Looks familiar, eh?
>
> Jim
>
Hmmm. Did these download 4.11 slab? Something's going on. See this post (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2237) for additional suspicious problems. You may want to e-mail Tolu. I'd hate to think 4.11 is causing all this.
ID: 11054
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 11057 - Posted: 18 Mar 2005, 2:54:46 UTC - in response to Message 11054.  
Last modified: 18 Mar 2005, 16:12:48 UTC

> Hmmm. Did these download 4.11 slab? Something's going on. See this post
> (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2237) for additional suspicious problems. You may want to e-mail Tolu.
> I'd hate to think 4.11 is causing all this.

In the Slots in both Bbox & Ebox where the fatalities were run, the replacements have 4.11. (The remaining active Slot on Bbox is 4.04, running okay. On Ebox, it's Sulfur_Alpha.)

I don't know how to determine from the corpses' three residual files which version ran them. [To be sure, current Models activated from the same group were suspended -- so, Ebox runs a Sulfur Alpha Model stand-alone and Bbox runs a 4.04 Model stand-alone. That leaves wasted P4 resources but, at least, they aren't trying to pour more water up a rope.]

Much as I hate to add a brick to Tolu's hod, I'll send an email.


Edit: Tolu replied with thanks and said he's looking into it now.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 11057
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 11340 - Posted: 24 Mar 2005, 2:46:33 UTC
Last modified: 24 Mar 2005, 2:48:30 UTC

One more time -- fifth 4.11 crash, this time on Dbox, third of my machines to have a "hardware failure" with this version. Increasingly disgusting that there is no information about a potential solution.

[Edit: Corrected a typo.]

Is this affecting only the minority of us posting the errors? Or is it widespread and unnoticed, ignored, or silently driving people away from CPDN?

Once again, on a 144-TS boundary. (Yawn.) From residual zipped files:
{PH}1{/PH}
{TS}127153{/TS}
{DAY}10{/DAY}
{MTH}4{/MTH}
{YR}1818{/YR}
{HR}0{/HR}
{MIN}30{/MIN}


MODEL DUMP SUCCESSFULLY WRITTEN - 1292182 WORDS TO UNIT 22

Number of Words Written to Disk was 1292678
im,sm,ngroup,new_im,new_sm 1 1 48 T F
NEGATIVE THETA AT POINT 1 LEVEL 18
*********************************************************************************
Model aborted with error code - 1 Routine and message:-
ATM_DYN : NEGATIVE THETA DETECTED.
*********************************************************************************
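
For what it's worth, the tagged values in that dump can be cross-checked against the timestep with a short Python sketch. Everything about the calendar here is an assumption inferred from the timestamps quoted in this thread, not something confirmed by the project: 30-minute timesteps (48 per day), a 360-day year of twelve 30-day months, and a run starting at 0000Z on 01/12/1810. The function names are hypothetical.

```python
import re

# Matches tags of the form {NAME}123{/NAME}, as seen in the residual file.
TAG_RE = re.compile(r"\{(\w+)\}(-?\d+)\{/\1\}")

def parse_tags(text: str) -> dict:
    """Collect {NAME}value{/NAME} pairs into a dict of ints."""
    return {name: int(value) for name, value in TAG_RE.findall(text)}

def model_date(ts: int, steps_per_day: int = 48, start=(1810, 12, 1)):
    """Convert a timestep to (year, month, day, hour, minute).

    ASSUMED: 360-day calendar, 30-day months, and the start date above.
    """
    minutes = ts * (1440 // steps_per_day)
    days, rem = divmod(minutes, 1440)
    hour, minute = divmod(rem, 60)
    y0, m0, d0 = start
    day_index = y0 * 360 + (m0 - 1) * 30 + (d0 - 1) + days
    year, day_of_year = divmod(day_index, 360)
    return (year, day_of_year // 30 + 1, day_of_year % 30 + 1, hour, minute)

sample = "{PH}1{/PH}\n{TS}127153{/TS}\n{DAY}10{/DAY}\n{MTH}4{/MTH}\n{YR}1818{/YR}\n{HR}0{/HR}\n{MIN}30{/MIN}\n"
tags = parse_tags(sample)
print(tags["TS"] % 144)        # prints 1
print(model_date(tags["TS"]))  # prints (1818, 4, 10, 0, 30)
```

Under those assumptions the computed date agrees with the {DAY}/{MTH}/{YR}/{HR}/{MIN} tags, and the remainder of 1 again puts the crash on the first timestep after a 144-TS boundary.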
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 11340
cetus
Joined: 7 Aug 04
Posts: 10
Credit: 147,460,763
RAC: 38,189
Message 11462 - Posted: 27 Mar 2005, 3:46:46 UTC - in response to Message 11340.  

> One more time -- fifth 4.11 crash, this time on Dbox, third of my machines to
> have a "hardware failure" with this version. Increasingly disgusting that
> there is no information about a potential solution.
>
> [Edit: Corrected a typo.]
>
> Is this affecting only the minority of us posting the errors? Or is it
> widespread and unnoticed, ignored, or silently driving people away from CPDN?
>

I appear to have the same problem. I've had three 4.11 crashes in phase one on a P4 Linux box. The same machine is still successfully running a 4.04 model, so a hardware issue seems unlikely.
ID: 11462
