Linux Sulphur 4.23 Unstable

Author	Message
geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 19144 - Posted: 10 Jan 2006, 15:56:31 UTC Last modified: 26 May 2006, 4:39:36 UTC
Edit... It appears that the sulphur 4.23 Linux model is completely unstable. The programmers know about it, but their time is consumed with finishing up development of the next experiment. Suggestion is to run other BOINC projects until this is fixed, or until the next experiment is released (Beta IS stable on Linux). Those running Linux sulphur 4.22 and previous should be relatively stable.
Edit 2... LS Diseño has figured out a way to revert from 4.23 to 4.21 and has instructions for doing so below. You may want to give this a try if you want to continue crunching sulphur with Linux.
Edit 3... The coupled model, hadcm3l, might now be downloaded from the climateprediction.net site. The model is stable, compared to sulphur 4.23 anyway, but some users still have troubles with it. It is essentially the same model as being run at the BBC site, and you may want to peruse the BBC Linux board if you run into problems.
A dual Xeon 2.8 GHz running FC3 was running two sulphur 4.23 models. Each of them crashed during June of 1818, shortly after the halfway point of phase 1. Nothing intelligible was found at the end of yabsd.out in either model to help determine why they crashed. This PC has been very stable up until this point. The Results for this hostID are 1614429 and 1614592.
Below is the terminal log of the last crash. Doesn't look like anything helpful in this logging. This looks like a problem with 4.23 or the WUs.
sulphur_ilwx_000868209 - PH 1 TS 0130189 A - 13/06/1818 06:30 - H:M:S=0139:40:28 AVG= 3.86 DLT= 2.00
sulphur_ixod_000883453 - PH 1 TS 0008121 A - 20/05/1811 04:30 - H:M:S=0008:02:09 AVG= 3.56 DLT= 2.00
sulphur_ixod_000883453 - PH 1 TS 0008122 A - 20/05/1811 05:00 - H:M:S=0008:02:12 AVG= 3.56 DLT= 3.00
Preparing for restart...
Error: Restart files for not found
Giving up, this result exceeded crash count for available restart files.
deflating : restart.day
deflating : yabsd.out
sulphur_ixod_000883453 - PH 1 TS 0008123 A - 20/05/1811 05:30 - H:M:S=0008:02:14 AVG= 3.56 DLT= 2.00
2006-01-10 09:57:27 [---] request_reschedule_cpus: process exited
2006-01-10 09:57:27 [climateprediction.net] Computation for result sulphur_ilwx_000868209_0 finished
sulphur_ixod_000883453 - PH 1 TS 0008124 A - 20/05/1811 06:00 - H:M:S=0008:02:16 AVG= 3.56 DLT= 2.00
2006-01-10 09:57:28 [climateprediction.net] Unrecoverable error for result sulphur_ilwx_000868209_0 ({file_xfer_error} {file_name}sulphur_ilwx_000868209_0_1.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error} {file_xfer_error} {file_name}sulphur_ilwx_000868209_0_2.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error} {file_xfer_error} {file_name}sulphur_ilwx_000868209_0_3.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error} {file_xfer_error} {file_name}sulphur_ilwx_000868209_0_4.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error} {file_xfer_error} {file_name}sulphur_ilwx_000868209_0_5.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error} )
2006-01-10 09:57:28 [climateprediction.net] Unrecoverable error for result sulphur_ilwx_000868209_0 ({file_xfer_error} {file_name}sulphur_ilwx_000868209_0_1.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error} {file_xfer_error} {file_name}sulphur_ilwx_000868209_0_2.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error} {file_xfer_error} {file_name}sulphur_ilwx_000868209_0_3.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error} {file_xfer_error} {file_name}sulphur_ilwx_000868209_0_4.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error} {file_xfer_error} {file_name}sulphur_ilwx_000868209_0_5.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error} )
ID: 19144
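The status lines in the terminal log follow a regular pattern, so anyone collecting crash data from several hosts could pull the fields apart with a short script. The following is only a minimal Python sketch: the regular expression is inferred from the lines quoted in this thread rather than from any official format description, the function name `parse_status` and the dictionary keys are made up for illustration, and the reading of AVG as seconds per timestep rests on how the figure is used elsewhere in the thread. DLT is kept under its own name because its exact meaning is not stated here.

```python
import re

# Pattern inferred from the status lines quoted in this thread (an assumption,
# not an official format description).
STATUS_RE = re.compile(
    r"(?P<model>sulphur_\w+) - PH\s*(?P<phase>\d+) TS\s*(?P<ts>\d+) \w - "
    r"(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}) - "
    r"H:M:S=(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2}) "
    r"AVG=\s*(?P<avg>[\d.]+) DLT=\s*(?P<dlt>[\d.]+)"
)


def parse_status(line):
    """Return the fields of one status line as a dict, or None if it does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return {
        "model": m["model"],
        "phase": int(m["phase"]),
        "timestep": int(m["ts"]),
        "model_date": m["date"],
        "model_time": m["time"],
        # Elapsed run time converted to seconds.
        "elapsed_s": int(m["h"]) * 3600 + int(m["m"]) * 60 + int(m["s"]),
        "avg_s_per_ts": float(m["avg"]),
        "dlt": float(m["dlt"]),
    }


sample = ("sulphur_ilwx_000868209 - PH 1 TS 0130189 A - 13/06/1818 06:30 - "
          "H:M:S=0139:40:28 AVG= 3.86 DLT= 2.00")
print(parse_status(sample))
```

Lines such as "Preparing for restart..." simply return None, so the same function can be run over a whole pasted log.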

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 19148 - Posted: 10 Jan 2006, 17:56:57 UTC Last modified: 10 Jan 2006, 17:59:20 UTC
I notice this near the start of the log:
quote Error: Restart files for not found unquote
I wonder if this is just a grammatical mistake, or does it mean that some part of the error code can't find the model name? It sure is a hard slog to get Linux to complete a sulphur model. :(
edit It looks like one has to include the word quote oneself.
ID: 19148

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 19154 - Posted: 10 Jan 2006, 19:09:53 UTC - in response to Message 19148.
I notice this near the start of the log: quote Error: Restart files for not found unquote I wonder if this is just a grammatical mistake, or does it mean that some part of the error code can't find the model name?
Typically it has something like "dataout/restart.year" after the "for". But it was obvious that it had already rewound previously because it was moving along at 3.86 s/TS, when yesterday it was doing 3.45 s/TS.
ID: 19154
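The s/TS reasoning can be checked against the last status line of the crashed model in the first post: dividing the elapsed time (H:M:S=0139:40:28) by the timestep number (TS 0130189) reproduces the AVG field. A rewind repeats timesteps that were already computed, so elapsed time grows while the timestep counter does not, and the average goes up. A small Python sketch of that arithmetic follows; the function name is made up for illustration, and the "extra hours" figure is only a rough estimate because the 3.45 s/TS value was read at an earlier, unspecified point in the run.

```python
def seconds_per_timestep(hours, minutes, seconds, timestep):
    """Average wall-clock seconds spent per model timestep."""
    return (hours * 3600 + minutes * 60 + seconds) / timestep


# Values from the last status line of sulphur_ilwx_000868209 in the first post.
avg = seconds_per_timestep(139, 40, 28, 130189)
print(round(avg, 2))  # 3.86, matching the AVG field in the log

# Rough estimate of the time spent beyond what 3.45 s/TS would have needed
# (assumes the earlier rate would otherwise have held for the whole run).
extra_hours = (avg - 3.45) * 130189 / 3600
print(round(extra_hours, 1))
```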

Helmer Bryd Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928	Message 19181 - Posted: 11 Jan 2006, 20:17:32 UTC Last modified: 11 Jan 2006, 20:57:47 UTC
My first two models on AMD XP also show crashes. The first one gave up after TS 113528 26/6/1817 04:00. It also had crash & rewinds after trickle 8 (TS 86416). Terminal log looks similar to geophi's and nothing looks special at the end of yabsd.out. Model 1617143
The second one has had three crash & rewinds now but is still running. It looked like this on the terminal:
sulphur_irud_000875893 - PH 1 TS 0113997 A - 05/07/1817 22:30 - H:M:S=0110:41:48 AVG= 3.50 DLT= 3.00
sulphur_irud_000875893 - PH 1 TS 0113998 A - 05/07/1817 23:00 - H:M:S=0110:41:50 AVG= 3.50 DLT= 2.00
Preparing for restart...
Rewinding a model-day...
Starting model ID sulphur_irud_000875893 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (66560000 bytes) - retval=0
Waiting for model startup, this may take a minute...
sulphur_irud_000875893 - PH 1 TS 0113905 A - 04/07/1817 00:30 - H:M:S=0110:41:52 AVG= 3.50 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0113906 A - 04/07/1817 01:00 - H:M:S=0110:42:02 AVG= 3.50 DLT= 9.94
.
.
sulphur_irud_000875893 - PH 1 TS 0113996 A - 05/07/1817 22:00 - H:M:S=0110:47:18 AVG= 3.50 DLT=10.00
sulphur_irud_000875893 - PH 1 TS 0113997 A - 05/07/1817 22:30 - H:M:S=0110:47:20 AVG= 3.50 DLT= 1.99
Preparing for restart...
Rewinding a model-month...
Copying restart files for model retry...
Starting model ID sulphur_irud_000875893 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (66560000 bytes) - retval=0
Waiting for model startup, this may take a minute...
sulphur_irud_000875893 - PH 1 TS 0113761 A - 01/07/1817 00:30 - H:M:S=0110:47:23 AVG= 3.51 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0113762 A - 01/07/1817 01:00 - H:M:S=0110:47:34 AVG= 3.51 DLT=10.50
.
.
sulphur_irud_000875893 - PH 1 TS 0113997 A - 05/07/1817 22:30 - H:M:S=0111:01:09 AVG= 3.51 DLT= 1.91
sulphur_irud_000875893 - PH 1 TS 0113998 A - 05/07/1817 23:00 - H:M:S=0111:01:11 AVG= 3.51 DLT= 1.91
Preparing for restart...
Rewinding a model-year...
Copying restart files for model retry...
Starting model ID sulphur_irud_000875893 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (66560000 bytes) - retval=0
Waiting for model startup, this may take a minute...
sulphur_irud_000875893 - PH 1 TS 0103681 A - 01/12/1816 00:30 - H:M:S=0111:01:14 AVG= 3.85 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0103682 A - 01/12/1816 01:00 - H:M:S=0111:01:24 AVG= 3.85 DLT=10.28
This is HostID=3880, nice and reliable. I took a copy of yabsd.out but I don't know if there's something special there. Two warning messages look like this:
INITTIME: Warning- New STEP doesn't match old value
Internal model id 1 Old= 103536 New= 103680
So I guess this one will finish too in a couple of hours. Is there something special I should monitor?
ID: 19181
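In the log above, the timestep counter jumps backwards three times, once each for the model-day, model-month and model-year rewind. A log watcher could spot those jumps without relying on the "Rewinding a ..." messages. Here is a minimal Python sketch of that idea; the function name `find_rewinds` is made up for illustration, and the sample list holds only the timestep numbers visible in the quoted log (the elided ". ." stretches are not filled in).

```python
def find_rewinds(timesteps):
    """Return (previous_ts, new_ts, steps_back) for every backward jump."""
    rewinds = []
    for prev, cur in zip(timesteps, timesteps[1:]):
        if cur <= prev:  # the counter should only ever increase
            rewinds.append((prev, cur, prev - cur))
    return rewinds


# Timestep numbers as they appear, in order, in the quoted terminal log.
seen = [113997, 113998, 113905, 113906, 113996, 113997,
        113761, 113762, 113997, 113998, 103681, 103682]

for prev, cur, back in find_rewinds(seen):
    print(f"TS {prev} -> TS {cur}: went back {back} timesteps")
```

The three jumps grow from under a hundred timesteps to more than ten thousand, which is why the AVG figure barely moves after the first two rewinds but climbs from 3.51 to 3.85 after the model-year rewind.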

Helmer Bryd Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928	Message 19202 - Posted: 12 Jan 2006, 11:27:18 UTC
The model on host 3880 gave up at TS 112422 03/06 1817 03:00, model 1622417. The bottom of yabsd.out looks like this:
Model aborted with error code - 1
Routine and message:- ATM_DYN : NEGATIVE THETA DETECTED.
ID: 19202

Helmer Bryd Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928	Message 19301 - Posted: 14 Jan 2006, 17:31:48 UTC
My third model also crashed.
ID: 19301

old_user21637 Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0	Message 19347 - Posted: 15 Jan 2006, 22:29:35 UTC - in response to Message 19301.
Got the exact same error as you guys did. My model (running on Red Hat Linux) died around timestep 113,000. Same error log as mentioned by you. My last backup was made before the crash. Now I also have a backup of the crashed model. The crash seems to be reproducible: running from the first backup reproduces it very easily.
So... I would like to announce that I am keeping both archives (from before and after the crash), and if the developers need them for reproducing and debugging the error, I would gladly upload the data. If it could help to take a look at them, just give me an FTP connection on some server or something similar, and I shall upload the two backups. They are around 160 MB each.
Cheers, Stefan.
ID: 19347

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 19362 - Posted: 16 Jan 2006, 14:25:31 UTC
Two more sulphur 4.23 models have failed on the dual Xeon between 10,000 and 14,000 timesteps into the first phase. Not good. I'm suspending sulphur on that PC now.
ID: 19362

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 19392 - Posted: 17 Jan 2006, 14:15:04 UTC
Another crash after about 10 trickles on one of my AMD PCs. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1632117
ID: 19392

old_user21637 Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0	Message 19399 - Posted: 17 Jan 2006, 22:35:33 UTC - in response to Message 19392. Last modified: 17 Jan 2006, 23:18:22 UTC
Judging by the crashes everyone seems to be getting, it seems that sulphur no longer works on Linux. So I have suspended all of my Linux workstations until the developers find the problem, and switched these machines to LHC@Home in the meantime... Hope the problem will be fixed soon. :((
Stefan.
ID: 19399

Helmer Bryd Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928	Message 19463 - Posted: 20 Jan 2006, 13:29:59 UTC Last modified: 20 Jan 2006, 13:54:48 UTC
My fourth model, 1638004, ended on TS 114329 12/07/1817 20:30. The fifth model, 1631205, gave up on TS 113686 29/06/1817 11:00. This was on a nice XP 2500+ at stock speed. All of my misbehaving models were created 23 Dec. Got me a newer one now from 16 Jan.
ID: 19463

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 19468 - Posted: 20 Jan 2006, 15:21:16 UTC Last modified: 4 Feb 2006, 14:17:13 UTC
I've e-mailed Tolu about it and he knows there's a problem. When it will be fixed...given the upcoming coupled launch...??
ID: 19468

Helmer Bryd Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928	Message 19470 - Posted: 20 Jan 2006, 15:55:07 UTC
Oh.. I wish they could switch to slabs then, just for Linux? I only have 1.5 old slabs here now to feed three machines until the HadCM3L launch. (No, I can't switch to Windows)
ID: 19470

Desti Joined: 6 Aug 04 Posts: 124 Credit: 9,195,838 RAC: 0	Message 19473 - Posted: 20 Jan 2006, 18:20:21 UTC Last modified: 20 Jan 2006, 18:22:25 UTC
My sulphur 4.23 crashed too :-( (A64 X2 x86_64 Gentoo)
sulphur_ioa2_000871274 - PH 1 TS 0129201 A - 22/05/1818 16:30 - H:M:S=0103:50:34 AVG= 2.89 DLT= 1.00
sulphur_ioa2_000871274 - PH 1 TS 0129202 A - 22/05/1818 17:00 - H:M:S=0103:50:36 AVG= 2.89 DLT= 2.00
sulphur_ioa2_000871274 - PH 1 TS 0129203 A - 22/05/1818 17:30 - H:M:S=0103:50:38 AVG= 2.89 DLT= 2.00
sulphur_ioa2_000871274 - PH 1 TS 0129204 A - 22/05/1818 18:00 - H:M:S=0103:50:40 AVG= 2.89 DLT= 1.98
sulphur_ioa2_000871274 - PH 1 TS 0129205 A - 22/05/1818 18:30 - H:M:S=0103:50:41 AVG= 2.89 DLT= 0.95
sulphur_ioa2_000871274 - PH 1 TS 0129206 A - 22/05/1818 19:00 - H:M:S=0103:50:49 AVG= 2.89 DLT= 7.99
Preparing for restart...
Error: Restart files for not found
Giving up, this result exceeded crash count for available restart files.
deflating : restart.day
deflating : yabsd.out
2006-01-20 15:56:06 [---] request_reschedule_cpus: process exited
2006-01-20 15:56:06 [climateprediction.net] Computation for result sulphur_ioa2_000871274_0 finished
2006-01-20 15:56:06 [LHC@home] Starting result wjan1A_v6s4hvnom_mqx_nc__17__64.269_59.279__4_6__6__85_1_sixvf_boinc81550_3 using sixtrack version 466
2006-01-20 15:56:07 [climateprediction.net] Unrecoverable error for result sulphur_ioa2_000871274_0 (<file_xfer_error> <file_name>sulphur_ioa2_000871274_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
2006-01-20 15:56:07 [climateprediction.net] Unrecoverable error for result sulphur_ioa2_000871274_0 (<file_xfer_error>
2006-01-20 16:42:37 [LHC@home] Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
2006-01-20 16:42:37 [LHC@home] Reason: To report results
2006-01-20 16:42:37 [LHC@home] Reporting 1 results
2006-01-20 16:42:42 [LHC@home] Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
2006-01-20 17:20:23 [LHC@home] Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
2006-01-20 17:20:23 [LHC@home] Reason: To fetch work
2006-01-20 17:20:23 [LHC@home] Requesting 12 seconds of new work
2006-01-20 17:20:28 [LHC@home] Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
2006-01-20 17:20:29 [LHC@home] Started download of woct1_v6s4hvnom_mqx-oct1__5__64.202_59.212__4_6__6__55_1_sixvf_boinc11300.zip
2006-01-20 17:20:31 [LHC@home] Finished download of woct1_v6s4hvnom_mqx-oct1__5__64.202_59.212__4_6__6__55_1_sixvf_boinc11300.zip
2006-01-20 17:20:31 [LHC@home] Throughput 33359 bytes/sec
2006-01-20 17:20:32 [---] request_reschedule_cpus: files downloaded
2006-01-20 17:20:32 [Einstein@Home] Pausing result z1_0361.0__48_S4R2a_2 (removed from memory)
2006-01-20 17:20:32 [LHC@home] Starting result woct1_v6s4hvnom_mqx-oct1__5__64.202_59.212__4_6__6__55_1_sixvf_boinc11300_1 using sixtrack version 466
2006-01-20 17:20:33 [---] request_reschedule_cpus: process exited
2006-01-20 18:20:10 [climateprediction.net] Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
2006-01-20 18:20:10 [climateprediction.net] Reason: To report results
2006-01-20 18:20:10 [climateprediction.net] Reporting 1 results
2006-01-20 18:20:15 [climateprediction.net] Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
(doh, tags are filtered)
Linux Users Everywhere @ BOINC
ID: 19473

old_user105386 Joined: 1 Nov 05 Posts: 2 Credit: 80,395 RAC: 0	Message 19554 - Posted: 22 Jan 2006, 21:32:08 UTC
Same problem for the third time with Sulphur on my Opteron244 SuSE 10.0 64-bit. The problem arose after the 10-th trickle the first 2 times and after the 9-th the third time. A fourth file is still running, but if it crashes as well, I'll stop calculating on Climateprediction files for the moment.
Report of the last crash:
2006-01-22 21:13:44 [climateprediction.net] Unrecoverable error for result sulphur_i80w_100850208_0 (<file_xfer_error> <file_name>sulphur_i80w_100850208_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_i80w_100850208_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_i80w_100850208_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_i80w_100850208_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_i80w_100850208_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> )
Beer for Linux Users Everywhere
ID: 19554

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 19555 - Posted: 22 Jan 2006, 21:45:00 UTC
Beer
The 161 errors are a 'red herring'. If there is any record of the REAL reason for the failure, it will be near the bottom of the file: yabsd.out, which is in the dataout folder of the model.
ID: 19555
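Since yabsd.out can be large, it may be easier to print only its last lines than to open the whole file. A minimal Python sketch follows, assuming nothing beyond what is said above (the file is called yabsd.out and sits in the model's dataout folder). The function name `tail`, the default of 40 lines and the relative path in the usage lines are illustrative assumptions, so adjust the path to wherever the model folder lives on your machine.

```python
from collections import deque
from pathlib import Path


def tail(path, n=40):
    """Return the last n lines of a text file without keeping the whole file in memory."""
    # errors="replace" keeps going even if the file contains odd bytes.
    with Path(path).open(errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


# Hypothetical location, relative to the model's own folder.
log = Path("dataout/yabsd.out")
if log.exists():
    print("\n".join(tail(log)))
```

An abort message such as the "NEGATIVE THETA DETECTED" one reported earlier in the thread would show up in this tail if the model wrote one.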

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 19560 - Posted: 23 Jan 2006, 0:22:21 UTC - in response to Message 19554.
The problem arose after the 10-th trickle the first 2 times and after the 9-th the third time
Yep, that sounds like the 4.23 problem. All five of my failures crashed between 9 and 13 trickles into the run.
ID: 19560
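The trickle counts can be compared with the crash timesteps posted earlier in the thread. Helmer Bryd's remark about "trickle 8 (TS 86416)" suggests one trickle every 86416 / 8 = 10802 timesteps, but that spacing is an inference from a single data point and assumes trickles are evenly spaced. Under that assumption, a short Python sketch (names made up for illustration) converts a timestep into the number of trickles completed:

```python
# Inferred from "trickle 8 (TS 86416)" in this thread; an assumption, not a documented constant.
TS_PER_TRICKLE = 86416 // 8


def trickles_completed(timestep):
    """Number of whole trickles sent by the given timestep, assuming even spacing."""
    return timestep // TS_PER_TRICKLE


# Final timesteps of crashed models as reported in this thread.
crash_timesteps = [130189, 113528, 112422, 114329, 113686, 129206]
for ts in crash_timesteps:
    print(ts, trickles_completed(ts))
```

If the assumed spacing is right, every one of these crashes falls after trickle 10 to 12, which is consistent with the range of 9 to 13 trickles mentioned above.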

old_user147974 Joined: 10 Jan 06 Posts: 3 Credit: 259,190 RAC: 0	Message 19851 - Posted: 1 Feb 2006, 10:28:58 UTC - in response to Message 19555. Last modified: 1 Feb 2006, 10:31:26 UTC
Hi, same errors here on an Athlon 1500+ and 3000+ both running Ubuntu Breezy 5.10. Can't post the actual 161 error message, as the pasted text confuses the BBcode so only part of the error message gets displayed... The yabsd.out file contains a lot of cryptic stuff, but what I can make out are warnings about computations giving negative values...
ID: 19851

Helmer Bryd Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928	Message 19974 - Posted: 4 Feb 2006, 13:41:25 UTC Last modified: 4 Feb 2006, 13:50:30 UTC
Three more crashed models around trickle 10. That's a total of eight now on four different machines.
Edit: got a new one created today, but this is the last try.
ID: 19974

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 19976 - Posted: 4 Feb 2006, 14:15:31 UTC
Although Tolu and Carl know about this problem, I have a feeling it won't be fixed in a new sulphur version until after the launch of the coupled model experiment. That is where their time is going now.
ID: 19976