climateprediction.net (CPDN) home page
Thread 'Linux Sulphur 4.23 Unstable'

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 19144 - Posted: 10 Jan 2006, 15:56:31 UTC
Last modified: 26 May 2006, 4:39:36 UTC

Edit...** It appears that the sulphur 4.23 Linux model is completely unstable. The programmers know about it, but their time is consumed with finishing up development of the next experiment. Suggestion is to run other BOINC projects until this is fixed, or until the next experiment is released (Beta IS stable on Linux). Those running Linux sulphur 4.22 and previous should be relatively stable. **

Edit 2...** LS Diseño has figured out a way to revert from 4.23 to 4.21 and has instructions for doing so below. You may want to give this a try if you want to continue crunching sulphur with Linux. **

Edit 3...** The coupled model, hadcm3l, can now be downloaded from the climateprediction.net site. It is stable, at least compared to sulphur 4.23, but some users still have trouble with it. It is essentially the same model as the one being run at the BBC site, so you may want to peruse the BBC Linux board if you run into problems. **

A dual Xeon 2.8 GHz running FC3 was running two sulphur 4.23 models. Each of them crashed during June of 1818, shortly after the halfway point of phase 1. Nothing intelligible was found at the end of yabsd.out in either model to help determine why they crashed.

This PC has been very stable up until this point. The Results for this hostID are 1614429 and 1614592. Below is the terminal log of the last crash. There doesn't appear to be anything helpful in it.

This looks like a problem with 4.23 or the WUs.

sulphur_ilwx_000868209 - PH 1 TS 0130189 A - 13/06/1818 06:30 - H:M:S=0139:40:28 AVG= 3.86 DLT= 2.00
sulphur_ixod_000883453 - PH 1 TS 0008121 A - 20/05/1811 04:30 - H:M:S=0008:02:09 AVG= 3.56 DLT= 2.00
sulphur_ixod_000883453 - PH 1 TS 0008122 A - 20/05/1811 05:00 - H:M:S=0008:02:12 AVG= 3.56 DLT= 3.00
Preparing for restart...
Error: Restart files for not found
Giving up, this result exceeded crash count for available restart files.
deflating : restart.day
deflating : yabsd.out
sulphur_ixod_000883453 - PH 1 TS 0008123 A - 20/05/1811 05:30 - H:M:S=0008:02:14 AVG= 3.56 DLT= 2.00
2006-01-10 09:57:27 [---] request_reschedule_cpus: process exited
2006-01-10 09:57:27 [climateprediction.net] Computation for result sulphur_ilwx_000868209_0 finished
sulphur_ixod_000883453 - PH 1 TS 0008124 A - 20/05/1811 06:00 - H:M:S=0008:02:16 AVG= 3.56 DLT= 2.00
2006-01-10 09:57:28 [climateprediction.net] Unrecoverable error for result sulphur_ilwx_000868209_0 ({file_xfer_error}
{file_name}sulphur_ilwx_000868209_0_1.zip{/file_name}
{error_code}-161{/error_code}
{error_message}{/error_message}
{/file_xfer_error}
{file_xfer_error}
{file_name}sulphur_ilwx_000868209_0_2.zip{/file_name}
{error_code}-161{/error_code}
{error_message}{/error_message}
{/file_xfer_error}
{file_xfer_error}
{file_name}sulphur_ilwx_000868209_0_3.zip{/file_name}
{error_code}-161{/error_code}
{error_message}{/error_message}
{/file_xfer_error}
{file_xfer_error}
{file_name}sulphur_ilwx_000868209_0_4.zip{/file_name}
{error_code}-161{/error_code}
{error_message}{/error_message}
{/file_xfer_error}
{file_xfer_error}
{file_name}sulphur_ilwx_000868209_0_5.zip{/file_name}
{error_code}-161{/error_code}
{error_message}{/error_message}
{/file_xfer_error}
)
2006-01-10 09:57:28 [climateprediction.net] Unrecoverable error for result sulphur_ilwx_000868209_0 ({file_xfer_error}
{file_name}sulphur_ilwx_000868209_0_1.zip{/file_name}
{error_code}-161{/error_code}
{error_message}{/error_message}
{/file_xfer_error}
{file_xfer_error}
{file_name}sulphur_ilwx_000868209_0_2.zip{/file_name}
{error_code}-161{/error_code}
{error_message}{/error_message}
{/file_xfer_error}
{file_xfer_error}
{file_name}sulphur_ilwx_000868209_0_3.zip{/file_name}
{error_code}-161{/error_code}
{error_message}{/error_message}
{/file_xfer_error}
{file_xfer_error}
{file_name}sulphur_ilwx_000868209_0_4.zip{/file_name}
{error_code}-161{/error_code}
{error_message}{/error_message}
{/file_xfer_error}
{file_xfer_error}
{file_name}sulphur_ilwx_000868209_0_5.zip{/file_name}
{error_code}-161{/error_code}
{error_message}{/error_message}
{/file_xfer_error}
)
ID: 19144

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 19148 - Posted: 10 Jan 2006, 17:56:57 UTC
Last modified: 10 Jan 2006, 17:59:20 UTC

I notice this near the start of the log:

quote
Error: Restart files for not found
unquote

I wonder if this is just a grammatical mistake, or does it mean that some part of the error code can't find the model name?

It sure is a hard slog to get Linux to complete a sulphur model. :(

edit
It looks like one has to include the word quote oneself.

ID: 19148

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 19154 - Posted: 10 Jan 2006, 19:09:53 UTC - in response to Message 19148.  

I notice this near the start of the log:

quote
Error: Restart files for not found
unquote

I wonder if this is just a grammatical mistake, or does it mean that some part of the error code can't find the model name?

Typically it has something like "dataout/restart.year" after the "for". But it was obvious that it had already rewound previously because it was moving along at 3.86 s/TS, when yesterday it was doing 3.45 s/TS.
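
The reasoning about the average can be checked with a little arithmetic. This is only a sketch, and it assumes (the thread does not state it) that the AVG column is elapsed wall-clock time divided by the current timestep number. Under that assumption a rewind lowers the timestep while the elapsed time keeps growing, so AVG rises. The function name is made up for illustration; the figures come from the progress lines posted in this thread.

```python
def avg_seconds_per_timestep(hms: str, timestep: int) -> float:
    """Elapsed H:M:S divided by the timestep count (assumed meaning of AVG)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    elapsed = hours * 3600 + minutes * 60 + seconds
    return elapsed / timestep


# Last progress line of the crashed model: TS 0130189, H:M:S=0139:40:28
average = avg_seconds_per_timestep("0139:40:28", 130189)
print(f"computed AVG: {average:.2f} s/TS")

# Rough estimate of work redone, assuming a steady 3.45 s/TS without rewinds.
expected_elapsed = 130189 * 3.45
extra_seconds = (139 * 3600 + 40 * 60 + 28) - expected_elapsed
print(f"extra time spent: about {extra_seconds / 3600:.1f} hours")
```

The computed value matches the AVG printed in the log line, which supports the assumption but does not prove it.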
ID: 19154

Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 19181 - Posted: 11 Jan 2006, 20:17:32 UTC
Last modified: 11 Jan 2006, 20:57:47 UTC

My first two models on an AMD XP also crashed.
The first one gave up after TS 113528 26/6/1817 04:00. It also had crash&rewinds after trickle 8 (TS 86416). The terminal log looks similar to geophi's, and nothing looks special at the end of yabsd.out.
Model 1617143

The second one has had three crash&rewinds now but is still running.
It looked like this on the terminal:

sulphur_irud_000875893 - PH 1 TS 0113997 A - 05/07/1817 22:30 - H:M:S=0110:41:48 AVG= 3.50 DLT= 3.00
sulphur_irud_000875893 - PH 1 TS 0113998 A - 05/07/1817 23:00 - H:M:S=0110:41:50 AVG= 3.50 DLT= 2.00
Preparing for restart...
Rewinding a model-day...
Starting model ID sulphur_irud_000875893 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (66560000 bytes) - retval=0
Waiting for model startup, this may take a minute...
sulphur_irud_000875893 - PH 1 TS 0113905 A - 04/07/1817 00:30 - H:M:S=0110:41:52 AVG= 3.50 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0113906 A - 04/07/1817 01:00 - H:M:S=0110:42:02 AVG= 3.50 DLT= 9.94
.
.
sulphur_irud_000875893 - PH 1 TS 0113996 A - 05/07/1817 22:00 - H:M:S=0110:47:18 AVG= 3.50 DLT=10.00
sulphur_irud_000875893 - PH 1 TS 0113997 A - 05/07/1817 22:30 - H:M:S=0110:47:20 AVG= 3.50 DLT= 1.99
Preparing for restart...
Rewinding a model-month...
Copying restart files for model retry...
Starting model ID sulphur_irud_000875893 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (66560000 bytes) - retval=0
Waiting for model startup, this may take a minute...
sulphur_irud_000875893 - PH 1 TS 0113761 A - 01/07/1817 00:30 - H:M:S=0110:47:23 AVG= 3.51 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0113762 A - 01/07/1817 01:00 - H:M:S=0110:47:34 AVG= 3.51 DLT=10.50
.
.
sulphur_irud_000875893 - PH 1 TS 0113997 A - 05/07/1817 22:30 - H:M:S=0111:01:09 AVG= 3.51 DLT= 1.91
sulphur_irud_000875893 - PH 1 TS 0113998 A - 05/07/1817 23:00 - H:M:S=0111:01:11 AVG= 3.51 DLT= 1.91
Preparing for restart...
Rewinding a model-year...
Copying restart files for model retry...
Starting model ID sulphur_irud_000875893 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (66560000 bytes) - retval=0
Waiting for model startup, this may take a minute...
sulphur_irud_000875893 - PH 1 TS 0103681 A - 01/12/1816 00:30 - H:M:S=0111:01:14 AVG= 3.85 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0103682 A - 01/12/1816 01:00 - H:M:S=0111:01:24 AVG= 3.85 DLT=10.28

This is HostID=3880, nice and reliable.
I took a copy of yabsd.out, but I don't know if there's anything special in it.
Two warning messages look like this:
INITTIME: Warning- New STEP doesn't match old value
Internal model id 1 Old= 103536 New= 103680

So I guess this one will finish too in a couple of hours. Is there something special I should monitor?
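
One thing that can be monitored is how often the model rewinds. The sketch below scans terminal output for progress lines and reports every place where the timestep number goes backwards. The regular expression is an assumption based only on the line layout visible in this thread, and `find_rewinds` is a hypothetical helper, not part of BOINC or the CPDN client.

```python
import re

# Layout assumed from the progress lines posted in this thread, e.g.
# "sulphur_irud_000875893 - PH 1 TS 0113998 A - 05/07/1817 23:00 - H:M:S=..."
PROGRESS = re.compile(
    r"^(?P<model>\S+)\s+-\s+PH\s*(?P<phase>\d+)\s+TS\s*(?P<ts>\d+)\s+A\s+-\s+"
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2})"
)


def find_rewinds(lines):
    """Return (model, from_ts, to_ts) for every backwards jump in timestep."""
    last_ts = {}
    rewinds = []
    for line in lines:
        match = PROGRESS.match(line.strip())
        if not match:
            continue  # not a progress line (restart messages, BOINC log lines)
        model, ts = match["model"], int(match["ts"])
        previous = last_ts.get(model)
        if previous is not None and ts < previous:
            rewinds.append((model, previous, ts))
        last_ts[model] = ts
    return rewinds


# Lines copied from the terminal log above.
sample = """\
sulphur_irud_000875893 - PH 1 TS 0113998 A - 05/07/1817 23:00 - H:M:S=0110:41:50 AVG= 3.50 DLT= 2.00
Preparing for restart...
Rewinding a model-day...
sulphur_irud_000875893 - PH 1 TS 0113905 A - 04/07/1817 00:30 - H:M:S=0110:41:52 AVG= 3.50 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0113997 A - 05/07/1817 22:30 - H:M:S=0110:47:20 AVG= 3.50 DLT= 1.99
Rewinding a model-month...
sulphur_irud_000875893 - PH 1 TS 0113761 A - 01/07/1817 00:30 - H:M:S=0110:47:23 AVG= 3.51 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0113998 A - 05/07/1817 23:00 - H:M:S=0111:01:11 AVG= 3.51 DLT= 1.91
Rewinding a model-year...
sulphur_irud_000875893 - PH 1 TS 0103681 A - 01/12/1816 00:30 - H:M:S=0111:01:14 AVG= 3.85 DLT= 0.00
""".splitlines()

for model, before, after in find_rewinds(sample):
    print(f"{model}: rewound from TS {before} to TS {after} ({before - after} timesteps)")
```

A day, then a month, then a year rewind in quick succession, as in this log, suggests the model keeps hitting the same failure at the same point rather than suffering a one-off glitch.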


ID: 19181

Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 19202 - Posted: 12 Jan 2006, 11:27:18 UTC

The model on host 3880 gave up at TS 112422 03/06/1817 03:00, model 1622417.

The bottom of yabsd.out looks like this:
Model aborted with error code - 1 Routine and message:-
ATM_DYN : NEGATIVE THETA DETECTED.
ID: 19202

Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 19301 - Posted: 14 Jan 2006, 17:31:48 UTC

My third model also crashed.
ID: 19301

old_user21637
Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 19347 - Posted: 15 Jan 2006, 22:29:35 UTC - in response to Message 19301.  


I got the exact same error as you guys did. My model (running on Red Hat Linux) died around timestep 113,000, with the same error log as the ones you posted.

My last backup was before the crash. Now I have a backup with the crashed model. The crash seems to be reproducible, so running from the first backup you can reproduce the crash very easily.

So...

I am keeping both archives (from before and after the crash), and if the developers need them for reproducing and debugging the error, I would gladly upload the data. If it would help to take a look at them, just give me an FTP connection on some server or something similar, and I will upload the two backups. They are around 160 MB each.

Cheers,
Stefan.
ID: 19347

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 19362 - Posted: 16 Jan 2006, 14:25:31 UTC

Two more sulphur 4.23 models have failed on the dual Xeon between 10,000 and 14,000 timesteps into the first phase. Not good. I'm suspending sulphur on that PC now.
ID: 19362

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 19392 - Posted: 17 Jan 2006, 14:15:04 UTC

Another crash after about 10 trickles on one of my AMD PCs.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1632117
ID: 19392

old_user21637
Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 19399 - Posted: 17 Jan 2006, 22:35:33 UTC - in response to Message 19392.  
Last modified: 17 Jan 2006, 23:18:22 UTC

Judging by the crashes everyone seems to be getting, it seems that sulphur no longer works on Linux.

Thus I have suspended all of my Linux workstations until the developers find the problem. Switched these machines to LHC@Home in the meantime...

Hope the problem will be fixed soon. :((

Stefan.
ID: 19399

Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 19463 - Posted: 20 Jan 2006, 13:29:59 UTC
Last modified: 20 Jan 2006, 13:54:48 UTC

My fourth model, 1638004, ended on TS 114329 12/07/1817 20:30.
The fifth model, 1631205, gave up on TS 113686 29/06/1817 11:00. This was on a nice XP 2500+ at stock speed.

All of my misbehaving models were created 23 Dec.
I now have a newer one, created 16 Jan.
ID: 19463

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 19468 - Posted: 20 Jan 2006, 15:21:16 UTC
Last modified: 4 Feb 2006, 14:17:13 UTC

I've e-mailed Tolu about it and he knows there's a problem. When it will be fixed...given the upcoming coupled launch...??
ID: 19468

Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 19470 - Posted: 20 Jan 2006, 15:55:07 UTC

Oh..
I wish they could switch to slabs then, just for Linux?
I only have 1.5 old slabs here now to feed three machines until the HadCM3L launch. (No, I can't switch to Windows)
ID: 19470

Desti
Joined: 6 Aug 04
Posts: 124
Credit: 9,195,838
RAC: 0
Message 19473 - Posted: 20 Jan 2006, 18:20:21 UTC
Last modified: 20 Jan 2006, 18:22:25 UTC

My sulphur 4.23 crashed too :-( (A64 X2 x86_64 Gentoo)


sulphur_ioa2_000871274 - PH 1 TS 0129201 A - 22/05/1818 16:30 - H:M:S=0103:50:34 AVG= 2.89 DLT= 1.00
sulphur_ioa2_000871274 - PH 1 TS 0129202 A - 22/05/1818 17:00 - H:M:S=0103:50:36 AVG= 2.89 DLT= 2.00
sulphur_ioa2_000871274 - PH 1 TS 0129203 A - 22/05/1818 17:30 - H:M:S=0103:50:38 AVG= 2.89 DLT= 2.00
sulphur_ioa2_000871274 - PH 1 TS 0129204 A - 22/05/1818 18:00 - H:M:S=0103:50:40 AVG= 2.89 DLT= 1.98
sulphur_ioa2_000871274 - PH 1 TS 0129205 A - 22/05/1818 18:30 - H:M:S=0103:50:41 AVG= 2.89 DLT= 0.95
sulphur_ioa2_000871274 - PH 1 TS 0129206 A - 22/05/1818 19:00 - H:M:S=0103:50:49 AVG= 2.89 DLT= 7.99
Preparing for restart...
Error: Restart files for not found
Giving up, this result exceeded crash count for available restart files.
deflating : restart.day
deflating : yabsd.out
2006-01-20 15:56:06 [---] request_reschedule_cpus: process exited
2006-01-20 15:56:06 [climateprediction.net] Computation for result sulphur_ioa2_000871274_0 finished
2006-01-20 15:56:06 [LHC@home] Starting result wjan1A_v6s4hvnom_mqx_nc__17__64.269_59.279__4_6__6__85_1_sixvf_boinc81550_3 using sixtrack version 466
2006-01-20 15:56:07 [climateprediction.net] Unrecoverable error for result sulphur_ioa2_000871274_0 (<file_xfer_error>
<file_name>sulphur_ioa2_000871274_0_1.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>

2006-01-20 15:56:07 [climateprediction.net] Unrecoverable error for result sulphur_ioa2_000871274_0 (<file_xfer_error>





2006-01-20 16:42:37 [LHC@home] Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
2006-01-20 16:42:37 [LHC@home] Reason: To report results
2006-01-20 16:42:37 [LHC@home] Reporting 1 results
2006-01-20 16:42:42 [LHC@home] Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
2006-01-20 17:20:23 [LHC@home] Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
2006-01-20 17:20:23 [LHC@home] Reason: To fetch work
2006-01-20 17:20:23 [LHC@home] Requesting 12 seconds of new work
2006-01-20 17:20:28 [LHC@home] Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
2006-01-20 17:20:29 [LHC@home] Started download of woct1_v6s4hvnom_mqx-oct1__5__64.202_59.212__4_6__6__55_1_sixvf_boinc11300.zip
2006-01-20 17:20:31 [LHC@home] Finished download of woct1_v6s4hvnom_mqx-oct1__5__64.202_59.212__4_6__6__55_1_sixvf_boinc11300.zip
2006-01-20 17:20:31 [LHC@home] Throughput 33359 bytes/sec
2006-01-20 17:20:32 [---] request_reschedule_cpus: files downloaded
2006-01-20 17:20:32 [Einstein@Home] Pausing result z1_0361.0__48_S4R2a_2 (removed from memory)
2006-01-20 17:20:32 [LHC@home] Starting result woct1_v6s4hvnom_mqx-oct1__5__64.202_59.212__4_6__6__55_1_sixvf_boinc11300_1 using sixtrack version 466
2006-01-20 17:20:33 [---] request_reschedule_cpus: process exited
2006-01-20 18:20:10 [climateprediction.net] Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
2006-01-20 18:20:10 [climateprediction.net] Reason: To report results
2006-01-20 18:20:10 [climateprediction.net] Reporting 1 results
2006-01-20 18:20:15 [climateprediction.net] Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded


(doh, tags are filtered)
Linux Users Everywhere @ BOINC
ID: 19473

old_user105386
Joined: 1 Nov 05
Posts: 2
Credit: 80,395
RAC: 0
Message 19554 - Posted: 22 Jan 2006, 21:32:08 UTC

Same problem for the third time with Sulphur on my Opteron 244, SuSE 10.0 64-bit.

The problem arose after the 10th trickle the first two times and after the 9th the third time.

A fourth file is still running, but if it crashes as well, I'll stop calculating Climateprediction files for the moment.

Report of the last crash:

2006-01-22 21:13:44 [climateprediction.net] Unrecoverable error for result sulphur_i80w_100850208_0 (<file_xfer_error>
<file_name>sulphur_i80w_100850208_0_1.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_i80w_100850208_0_2.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_i80w_100850208_0_3.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_i80w_100850208_0_4.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_i80w_100850208_0_5.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
)

Beer for Linux Users Everywhere
ID: 19554

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 19555 - Posted: 22 Jan 2006, 21:45:00 UTC

Beer
The 161 errors are a 'red herring'. If there is any record of the REAL reason for the failure, it will be near the bottom of the file: yabsd.out, which is in the dataout folder of the model.
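
As a rough illustration of that advice, the sketch below reads the last lines of a yabsd.out file and picks out any "Model aborted" message together with the line that follows it. The relative path `dataout/yabsd.out` is an assumption about where the script is run from (the model's own folder), and both helper functions are hypothetical; the message text searched for is the one quoted earlier in this thread.

```python
from collections import deque
from pathlib import Path


def tail(path, count=20):
    """Return the last `count` lines of a text file."""
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def abort_messages(lines):
    """Join each 'Model aborted' line with the line that follows it."""
    found = []
    for index, line in enumerate(lines):
        if "Model aborted" in line:
            found.append(" ".join(part.strip() for part in lines[index:index + 2]))
    return found


# Assumed location: run this from inside the model's folder.
log_path = Path("dataout") / "yabsd.out"
if log_path.exists():
    for message in abort_messages(tail(log_path, 40)):
        print(message)
else:
    print(f"{log_path} not found; adjust the path to your model folder")
```

If nothing is printed for an existing file, the end of yabsd.out contains no abort message and the cause has to be sought elsewhere.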

ID: 19555

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 19560 - Posted: 23 Jan 2006, 0:22:21 UTC - in response to Message 19554.  

The problem arose after the 10th trickle the first two times and after the 9th the third time.

Yep, that sounds like the 4.23 problem. All five of my failures crashed between 9 and 13 trickles into the run.
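
The crash reports in this thread can be put on the same scale. A minimal sketch, assuming one trickle every 10,802 timesteps (a figure inferred from Helmer Bryd's note that trickle 8 fell at TS 86416, not taken from project documentation), converts the final timesteps quoted above into trickle counts:

```python
# Assumption inferred from "trickle 8 (TS 86416)" earlier in the thread.
TIMESTEPS_PER_TRICKLE = 86416 // 8

# Final timesteps reported in this thread before each model gave up.
crash_timesteps = {
    "geophi, sulphur_ilwx_000868209": 130189,
    "Helmer Bryd, model 1617143": 113528,
    "Helmer Bryd, model 1622417": 112422,
    "Desti, sulphur_ioa2_000871274": 129206,
}


def trickles_completed(timestep, per_trickle=TIMESTEPS_PER_TRICKLE):
    """Timesteps expressed as a (fractional) number of trickles."""
    return timestep / per_trickle


for label, timestep in crash_timesteps.items():
    print(f"{label}: TS {timestep} is about {trickles_completed(timestep):.1f} trickles")
```

Under that assumption, all four reports fall inside the 9 to 13 trickle window described above.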
ID: 19560

old_user147974
Joined: 10 Jan 06
Posts: 3
Credit: 259,190
RAC: 0
Message 19851 - Posted: 1 Feb 2006, 10:28:58 UTC - in response to Message 19555.  
Last modified: 1 Feb 2006, 10:31:26 UTC

Hi,

Same errors here on an Athlon 1500+ and a 3000+, both running Ubuntu Breezy 5.10.

I can't post the actual 161 error message, as the pasted text confuses the BBCode so only part of the error message gets displayed...

The yabsd.out file contains a lot of cryptic stuff, but what I can make out are warnings about computations giving negative values...
ID: 19851

Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 19974 - Posted: 4 Feb 2006, 13:41:25 UTC
Last modified: 4 Feb 2006, 13:50:30 UTC

Three more crashed models around trickle 10.
That's a total of eight now on four different machines.

Edit: got a new one created today, but this is the last try.
ID: 19974

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 19976 - Posted: 4 Feb 2006, 14:15:31 UTC

Although Tolu and Carl know about this problem, I have a feeling it won't be fixed in a new sulphur version until after the launch of the coupled model experiment. That is where their time is going now.
ID: 19976