Questions and Answers : Unix/Linux : Linux Sulphur 4.23 Unstable
Author | Message |
---|---|
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Edit: It appears that the sulphur 4.23 Linux model is completely unstable. The programmers know about it, but their time is consumed with finishing development of the next experiment. The suggestion is to run other BOINC projects until this is fixed, or until the next experiment is released (the Beta IS stable on Linux). Those running Linux sulphur 4.22 and earlier should be relatively stable.
Edit 2: LS Diseño has figured out a way to revert from 4.23 to 4.21 and has instructions for doing so below. You may want to give this a try if you want to continue crunching sulphur with Linux.
Edit 3: The coupled model, hadcm3l, can now be downloaded from the climateprediction.net site. The model is stable, compared to sulphur 4.23 anyway, but some users still have trouble with it. It is essentially the same model as the one being run at the BBC site, and you may want to peruse the BBC Linux board if you run into problems.
A dual Xeon 2.8 GHz running FC3 was running two sulphur 4.23 models. Each of them crashed during June of 1818, shortly after the halfway point of phase 1. Nothing intelligible was found at the end of yabsd.out in either model to help determine why they crashed. This PC has been very stable up until this point. The Results for this hostID are 1614429 and 1614592. Below is the terminal log of the last crash. There doesn't appear to be anything helpful in it. This looks like a problem with 4.23 or the WUs.
sulphur_ilwx_000868209 - PH 1 TS 0130189 A - 13/06/1818 06:30 - H:M:S=0139:40:28 AVG= 3.86 DLT= 2.00
sulphur_ixod_000883453 - PH 1 TS 0008121 A - 20/05/1811 04:30 - H:M:S=0008:02:09 AVG= 3.56 DLT= 2.00
sulphur_ixod_000883453 - PH 1 TS 0008122 A - 20/05/1811 05:00 - H:M:S=0008:02:12 AVG= 3.56 DLT= 3.00
Preparing for restart...
Error: Restart files for not found
Giving up, this result exceeded crash count for available restart files.
deflating : restart.day
deflating : yabsd.out
sulphur_ixod_000883453 - PH 1 TS 0008123 A - 20/05/1811 05:30 - H:M:S=0008:02:14 AVG= 3.56 DLT= 2.00
2006-01-10 09:57:27 [---] request_reschedule_cpus: process exited
2006-01-10 09:57:27 [climateprediction.net] Computation for result sulphur_ilwx_000868209_0 finished
sulphur_ixod_000883453 - PH 1 TS 0008124 A - 20/05/1811 06:00 - H:M:S=0008:02:16 AVG= 3.56 DLT= 2.00
2006-01-10 09:57:28 [climateprediction.net] Unrecoverable error for result sulphur_ilwx_000868209_0 (
{file_xfer_error} {file_name}sulphur_ilwx_000868209_0_1.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error}
{file_xfer_error} {file_name}sulphur_ilwx_000868209_0_2.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error}
{file_xfer_error} {file_name}sulphur_ilwx_000868209_0_3.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error}
{file_xfer_error} {file_name}sulphur_ilwx_000868209_0_4.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error}
{file_xfer_error} {file_name}sulphur_ilwx_000868209_0_5.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error}
)
2006-01-10 09:57:28 [climateprediction.net] Unrecoverable error for result sulphur_ilwx_000868209_0 (
{file_xfer_error} {file_name}sulphur_ilwx_000868209_0_1.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error}
{file_xfer_error} {file_name}sulphur_ilwx_000868209_0_2.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error}
{file_xfer_error} {file_name}sulphur_ilwx_000868209_0_3.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error}
{file_xfer_error} {file_name}sulphur_ilwx_000868209_0_4.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error}
{file_xfer_error} {file_name}sulphur_ilwx_000868209_0_5.zip{/file_name} {error_code}-161{/error_code} {error_message}{/error_message} {/file_xfer_error}
) |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I notice this near the start of the log: quote Error: Restart files for not found unquote I wonder if this is just a grammatical mistake, or does it mean that some part of the error code can't find the model name? It sure is a hard slog to get Linux to complete a sulphur model. :( Edit: It looks like one has to include the word quote oneself. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"I notice this near the start of the log:" Typically it has something like "dataout/restart.year" after the "for". But it was obvious that it had already rewound previously, because it was moving along at 3.86 s/TS when yesterday it was doing 3.45 s/TS. |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
My first two models on an AMD XP also show crashes. The first one gave up after TS 113528 (26/6/1817 04:00). It also had crash-and-rewinds after trickle 8 (TS 86416). The terminal log looks similar to geophi's and nothing looks special at the end of yabsd.out. Model 1617143.
The second one has had three crash-and-rewinds now but is still running. It looked like this on the terminal:
sulphur_irud_000875893 - PH 1 TS 0113997 A - 05/07/1817 22:30 - H:M:S=0110:41:48 AVG= 3.50 DLT= 3.00
sulphur_irud_000875893 - PH 1 TS 0113998 A - 05/07/1817 23:00 - H:M:S=0110:41:50 AVG= 3.50 DLT= 2.00
Preparing for restart...
Rewinding a model-day...
Starting model ID sulphur_irud_000875893 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (66560000 bytes) - retval=0
Waiting for model startup, this may take a minute...
sulphur_irud_000875893 - PH 1 TS 0113905 A - 04/07/1817 00:30 - H:M:S=0110:41:52 AVG= 3.50 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0113906 A - 04/07/1817 01:00 - H:M:S=0110:42:02 AVG= 3.50 DLT= 9.94
.
.
sulphur_irud_000875893 - PH 1 TS 0113996 A - 05/07/1817 22:00 - H:M:S=0110:47:18 AVG= 3.50 DLT=10.00
sulphur_irud_000875893 - PH 1 TS 0113997 A - 05/07/1817 22:30 - H:M:S=0110:47:20 AVG= 3.50 DLT= 1.99
Preparing for restart...
Rewinding a model-month...
Copying restart files for model retry...
Starting model ID sulphur_irud_000875893 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (66560000 bytes) - retval=0
Waiting for model startup, this may take a minute...
sulphur_irud_000875893 - PH 1 TS 0113761 A - 01/07/1817 00:30 - H:M:S=0110:47:23 AVG= 3.51 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0113762 A - 01/07/1817 01:00 - H:M:S=0110:47:34 AVG= 3.51 DLT=10.50
.
.
sulphur_irud_000875893 - PH 1 TS 0113997 A - 05/07/1817 22:30 - H:M:S=0111:01:09 AVG= 3.51 DLT= 1.91
sulphur_irud_000875893 - PH 1 TS 0113998 A - 05/07/1817 23:00 - H:M:S=0111:01:11 AVG= 3.51 DLT= 1.91
Preparing for restart...
Rewinding a model-year...
Copying restart files for model retry...
Starting model ID sulphur_irud_000875893 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (66560000 bytes) - retval=0
Waiting for model startup, this may take a minute...
sulphur_irud_000875893 - PH 1 TS 0103681 A - 01/12/1816 00:30 - H:M:S=0111:01:14 AVG= 3.85 DLT= 0.00
sulphur_irud_000875893 - PH 1 TS 0103682 A - 01/12/1816 01:00 - H:M:S=0111:01:24 AVG= 3.85 DLT=10.28
This is HostID=3880, nice and reliable. I took a copy of yabsd.out but I don't know if there's something special there. Two warning messages look like this:
INITTIME: Warning- New STEP doesn't match old value
Internal model id 1 Old= 103536 New= 103680
So I guess this one will finish too in a couple of hours. Is there something special I should monitor? |
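For anyone who wants to count rewinds without reading the terminal log by eye, here is a minimal Python sketch. It assumes the progress lines have the shape pasted in this thread (`PH`, `TS`, date, `H:M:S`, `AVG`, `DLT`); the function names and the regular expression are illustrative only and are not part of the CPDN or BOINC software.

```python
import re

# Assumed shape of one progress line, taken from the logs pasted in this thread.
LINE_RE = re.compile(
    r"(?P<model>sulphur_\w+) - PH\s+(?P<phase>\d+) TS (?P<ts>\d+) A - "
    r"(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}) - "
    r"H:M:S=(?P<elapsed>\d+:\d{2}:\d{2}) AVG=\s*(?P<avg>[\d.]+) DLT=\s*(?P<dlt>[\d.]+)"
)


def parse_progress(text):
    """Return one dict per progress line found in the pasted log text."""
    rows = []
    for match in LINE_RE.finditer(text):
        rows.append({
            "model": match["model"],
            "ts": int(match["ts"]),
            "date": match["date"],
            "avg": float(match["avg"]),
            "dlt": float(match["dlt"]),
        })
    return rows


def find_rewinds(rows):
    """Return (previous_ts, new_ts) pairs where a model's timestep went backwards."""
    rewinds = []
    last_ts = {}
    for row in rows:
        previous = last_ts.get(row["model"])
        if previous is not None and row["ts"] < previous:
            rewinds.append((previous, row["ts"]))
        last_ts[row["model"]] = row["ts"]
    return rewinds


# Two lines from the log above, on either side of the "Rewinding a model-day..." message.
SAMPLE_LOG = (
    "sulphur_irud_000875893 - PH 1 TS 0113998 A - 05/07/1817 23:00 - "
    "H:M:S=0110:41:50 AVG= 3.50 DLT= 2.00\n"
    "Preparing for restart...\nRewinding a model-day...\n"
    "sulphur_irud_000875893 - PH 1 TS 0113905 A - 04/07/1817 00:30 - "
    "H:M:S=0110:41:52 AVG= 3.50 DLT= 0.00\n"
)

print(find_rewinds(parse_progress(SAMPLE_LOG)))
```

Timesteps are tracked per model name, so interleaved output from two models on a dual-CPU machine should not be mistaken for a rewind.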
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
The model on host 3880 gave up at TS 112422 03/06/1817 03:00, model 1622417. The bottom of yabsd.out looks like this: Model aborted with error code - 1 Routine and message:- ATM_DYN : NEGATIVE THETA DETECTED. |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
My third model also crashed. |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Got the exact same error as you guys did. My model (running on Red Hat Linux) died around timestep 113,000. Same error log as mentioned by you. My last backup was before the crash. Now I also have a backup with the crashed model. The crash seems to be reproducible, so running from the first backup you can reproduce the crash very easily. So... I would like to announce that I am keeping both archives (from before and after the crash), and if the developers need them for reproducing and debugging the error, I would gladly upload the data. If it could help to take a look at them, just give me an FTP connection on some server or something similar, and I shall upload the two backups. They are around 160 MB each. Cheers, Stefan. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Two more sulphur 4.23 models have failed on the dual Xeon between 10,000 and 14,000 timesteps into the first phase. Not good. I'm suspending sulphur on that PC now. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Another crash after about 10 trickles on one of my AMD PCs. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1632117 |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Judging by the crashes everyone seems to be getting, it seems that sulphur no longer works on Linux. Thus I have suspended all of my Linux workstations until the developers find the problem. Switched these machines to LHC@Home in the meantime... Hope the problem will be fixed soon. :(( Stefan. |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
|
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
I've e-mailed Tolu about it and he knows there's a problem. When it will be fixed...given the upcoming coupled launch...?? |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Oh.. I wish they could switch to slabs then, just for Linux? I only have 1.5 old slabs here now to feed three machines until the HadCM3L launch. (No, I can't switch to Windows) |
Send message Joined: 6 Aug 04 Posts: 124 Credit: 9,195,838 RAC: 0 |
My sulphur 4.23 crashed too :-( (A64 X2 x86_64 Gentoo)
sulphur_ioa2_000871274 - PH 1 TS 0129201 A - 22/05/1818 16:30 - H:M:S=0103:50:34 AVG= 2.89 DLT= 1.00
sulphur_ioa2_000871274 - PH 1 TS 0129202 A - 22/05/1818 17:00 - H:M:S=0103:50:36 AVG= 2.89 DLT= 2.00
sulphur_ioa2_000871274 - PH 1 TS 0129203 A - 22/05/1818 17:30 - H:M:S=0103:50:38 AVG= 2.89 DLT= 2.00
sulphur_ioa2_000871274 - PH 1 TS 0129204 A - 22/05/1818 18:00 - H:M:S=0103:50:40 AVG= 2.89 DLT= 1.98
sulphur_ioa2_000871274 - PH 1 TS 0129205 A - 22/05/1818 18:30 - H:M:S=0103:50:41 AVG= 2.89 DLT= 0.95
sulphur_ioa2_000871274 - PH 1 TS 0129206 A - 22/05/1818 19:00 - H:M:S=0103:50:49 AVG= 2.89 DLT= 7.99
Preparing for restart...
Error: Restart files for not found
Giving up, this result exceeded crash count for available restart files.
deflating : restart.day
deflating : yabsd.out
2006-01-20 15:56:06 [---] request_reschedule_cpus: process exited
2006-01-20 15:56:06 [climateprediction.net] Computation for result sulphur_ioa2_000871274_0 finished
2006-01-20 15:56:06 [LHC@home] Starting result wjan1A_v6s4hvnom_mqx_nc__17__64.269_59.279__4_6__6__85_1_sixvf_boinc81550_3 using sixtrack version 466
2006-01-20 15:56:07 [climateprediction.net] Unrecoverable error for result sulphur_ioa2_000871274_0 (<file_xfer_error> <file_name>sulphur_ioa2_000871274_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
2006-01-20 15:56:07 [climateprediction.net] Unrecoverable error for result sulphur_ioa2_000871274_0 (<file_xfer_error>
2006-01-20 16:42:37 [LHC@home] Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
2006-01-20 16:42:37 [LHC@home] Reason: To report results
2006-01-20 16:42:37 [LHC@home] Reporting 1 results
2006-01-20 16:42:42 [LHC@home] Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
2006-01-20 17:20:23 [LHC@home] Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
2006-01-20 17:20:23 [LHC@home] Reason: To fetch work
2006-01-20 17:20:23 [LHC@home] Requesting 12 seconds of new work
2006-01-20 17:20:28 [LHC@home] Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
2006-01-20 17:20:29 [LHC@home] Started download of woct1_v6s4hvnom_mqx-oct1__5__64.202_59.212__4_6__6__55_1_sixvf_boinc11300.zip
2006-01-20 17:20:31 [LHC@home] Finished download of woct1_v6s4hvnom_mqx-oct1__5__64.202_59.212__4_6__6__55_1_sixvf_boinc11300.zip
2006-01-20 17:20:31 [LHC@home] Throughput 33359 bytes/sec
2006-01-20 17:20:32 [---] request_reschedule_cpus: files downloaded
2006-01-20 17:20:32 [Einstein@Home] Pausing result z1_0361.0__48_S4R2a_2 (removed from memory)
2006-01-20 17:20:32 [LHC@home] Starting result woct1_v6s4hvnom_mqx-oct1__5__64.202_59.212__4_6__6__55_1_sixvf_boinc11300_1 using sixtrack version 466
2006-01-20 17:20:33 [---] request_reschedule_cpus: process exited
2006-01-20 18:20:10 [climateprediction.net] Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
2006-01-20 18:20:10 [climateprediction.net] Reason: To report results
2006-01-20 18:20:10 [climateprediction.net] Reporting 1 results
2006-01-20 18:20:15 [climateprediction.net] Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
(doh, tags are filtered)
Linux Users Everywhere @ BOINC |
Send message Joined: 1 Nov 05 Posts: 2 Credit: 80,395 RAC: 0 |
Same problem for the third time with Sulphur on my Opteron 244, SuSE 10.0 64-bit. The problem arose after the 10-th trickle the first 2 times and after the 9-th the third time. A fourth file is still running, but if it crashes as well, I'll stop calculating on Climateprediction files for the moment.
Report of the last crash:
2006-01-22 21:13:44 [climateprediction.net] Unrecoverable error for result sulphur_i80w_100850208_0 (
<file_xfer_error> <file_name>sulphur_i80w_100850208_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>sulphur_i80w_100850208_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>sulphur_i80w_100850208_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>sulphur_i80w_100850208_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>sulphur_i80w_100850208_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
)
Beer for Linux Users Everywhere |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Beer, the 161 errors are a 'red herring'. If there is any record of the REAL reason for the failure, it will be near the bottom of the file yabsd.out, which is in the dataout folder of the model. |
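As a rough aid for checking the bottom of yabsd.out, the following Python sketch prints the last lines of a file and pulls out an abort message if one is present. It assumes the abort text starts with "Model aborted", as in the excerpt posted earlier in this thread; the function names are hypothetical, and the path to your model's dataout folder has to be supplied by you.

```python
from collections import deque


def tail_lines(path, count=40):
    """Return the last `count` lines of a text file.

    deque(maxlen=count) keeps only the most recent lines while reading,
    so a large yabsd.out does not need to fit in memory.
    """
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def find_abort_message(lines):
    """Return the lines from the first 'Model aborted' marker onward, or []."""
    for index, line in enumerate(lines):
        if "Model aborted" in line:
            return lines[index:]
    return []
```

Usage would look something like `find_abort_message(tail_lines("dataout/yabsd.out"))`, run from inside the model's own directory. If the result is an empty list, the file probably holds no abort record and the cause has to be looked for elsewhere.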
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"The problem arose after the 10-th trickle the first 2 times and after the 9-th the third time" Yep, that sounds like the 4.23 problem. All five of my failures crashed between 9 and 13 trickles into the run. |
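To compare crash points across machines, the timesteps reported in this thread can be converted to trickle counts. The sketch below is in Python and rests on one assumption: that a trickle is sent every 10,802 timesteps, which is inferred only from the earlier remark that trickle 8 coincided with TS 86416 (86416 / 8 = 10802). The dictionary labels are informal descriptions, not official identifiers.

```python
# Assumption: one trickle per 10,802 timesteps, inferred from the report
# in this thread that trickle 8 fell at TS 86416.
TIMESTEPS_PER_TRICKLE = 86416 // 8


def trickles_completed(timestep):
    """Return how many whole trickles fit into the given timestep."""
    return timestep // TIMESTEPS_PER_TRICKLE


# Last timesteps reported before each model gave up, copied from the posts above.
crash_timesteps = {
    "sulphur_ilwx_000868209 (dual Xeon)": 130189,
    "first AMD XP model": 113528,
    "host 3880 model": 112422,
    "sulphur_ioa2_000871274 (A64 X2)": 129206,
}

for label, timestep in crash_timesteps.items():
    print(f"{label}: TS {timestep} -> {trickles_completed(timestep)} trickles")
```

Under that assumption, these four crashes fall at 10 to 12 completed trickles, which is in line with the 9 to 13 range mentioned above.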
Send message Joined: 10 Jan 06 Posts: 3 Credit: 259,190 RAC: 0 |
Hi, same errors here on an Athlon 1500+ and 3000+ both running Ubuntu Breezy 5.10. Can't post the actual 161 error message, as the pasted text confuses the BBcode so only part of the error message gets displayed... The yabsd.out file contains a lot of cryptic stuff, but what I can make out are warnings about computations giving negative values... |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Three more crashed models around trickle 10. That's a total of eight now on four different machines. Edit: got a new one created today, but this is the last try. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Although Tolu and Carl know about this problem, I have a feeling it won't be fixed in a new sulphur version until after the launch of the coupled model experiment. That is where their time is going now. |