Questions and Answers : Unix/Linux : New HADCM3L model
Author | Message |
---|---|
Send message Joined: 12 Sep 04 Posts: 7 Credit: 515,736 RAC: 0 |
Just wondering if anyone has experienced seg faults with the new model experiment. The seg fault occurs at application start; the model never gets to the startup phase. Running BOINC core 5.2.13 on RH8, libc 3.2.2. Regards, Steve R |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Guess it could be the same thing as with the sulphur model, since it's compiled with Debian Woody too. Maybe Stefan Mathes's patch will work here too? http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=3822 My Fedora C2 and C4 machines work all right. |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Hi guys! Yes, unfortunately the HADCM3L model suffers from the same problem as the sulphur model: it does not work on Red Hat Linux distributions with GLIBC versions greater than 2.2.5 (exclusive). My previous patch for the sulphur model was based on linking the model executables to a local copy of GLIBC 2.2.5, so that it "thinks" GLIBC 2.2.5 is installed on the machine. I tried the same approach with HADCM3L but, unfortunately, it does not work. This is because HADCM3L, unlike sulphur, is compiled with GLIBC 2.3. Thus, it requires the functionality of the later versions of the libraries to work properly. Alas, until (and if) the developers look into this, Red Hat Linux machines will not be able to run CPDN. :( Best regards, Stefan. |
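For anyone unsure which GLIBC their machine has, here is a minimal Python sketch of the version comparison described in the post above. The 2.2.5 threshold comes from the post; the function names are made up for illustration, and the check only compares version numbers. It cannot tell whether a given distribution is actually affected.

```python
import os

# Threshold taken from the post above: versions greater than 2.2.5 are
# reported as problematic on Red Hat distributions.
THRESHOLD = (2, 2, 5)


def parse_version(text):
    """Turn '2.3.2' or 'glibc 2.3.2' into a tuple of ints.

    Assumes a purely numeric dotted version; suffixes such as '-101'
    are not handled in this sketch.
    """
    number = text.split()[-1]
    return tuple(int(part) for part in number.split("."))


def newer_than_threshold(text, threshold=THRESHOLD):
    """True if the given version is strictly greater than the threshold."""
    return parse_version(text) > threshold


def local_glibc_version():
    """Return a string like 'glibc 2.3.2', or None if it cannot be read."""
    try:
        return os.confstr("CS_GNU_LIBC_VERSION")
    except (ValueError, OSError, AttributeError):
        # Name not recognised or platform without confstr (non-glibc systems).
        return None


if __name__ == "__main__":
    version = local_glibc_version()
    if version is None:
        print("Could not determine the GLIBC version on this system.")
    else:
        print(version, "-> newer than 2.2.5:", newer_than_threshold(version))
```

Running the script prints the detected version and whether it is above the threshold; on systems that do not use GLIBC it simply reports that the version could not be determined.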
Send message Joined: 4 Feb 05 Posts: 10 Credit: 779,835 RAC: 0 |
FYI: HADCM3L is running on a P4 1.6 GHz, MKD 10.2 2005LE system, slowly. I took a chance and re-enabled 'get new work', downloaded a dataset, and BOINC went into 'earliest deadline first' mode. It ran through the cached Seti and Einstein WUs, then started the CP WU. Downloaded 3/17, deadline of 2/27/2007; Boincview indicated that it would finish in 373 days. It has been running for 6 days, and now should finish in 376 days. Boincview indicates that the Est. Speed = 674 MFlops, and the trickle report shows 5.02 s/TS. I'm going to resist the temptation to 'micromanage', and just let it run. :-) Claude [Edit for some spelling] |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Hi Cortega! As far as I know, the segmentation fault bug affects only Red Hat Linux distributions. Other distributions (like Mandriva) should work OK... Cheers, Stefan. |
Send message Joined: 30 Aug 04 Posts: 24 Credit: 250,709 RAC: 0 |
Hi guys! Suse 9.2 has the same problem by the looks of it; mine just seg faults out as well. Most annoying, as it's the most productive box as well. Pete. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
There is discussion of the issue on the CPDN php Board. http://www.climateprediction.net/board/viewtopic.php?t=3659&highlight=sigsegv Searching for 'sigsegv' on that board brings up more. Edit: An old thread of this forum shows that the solution was a new CPDN release: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2078 "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 31 Dec 04 Posts: 5 Credit: 37,523 RAC: 0 |
Thanks folks - I have this problem on the Mandriva 2006 distribution. The thing is, it seems to start off OK with a work unit but then falls over at some point. Dave |
Send message Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0 |
Suse 9.2 has the same problem by the looks of it mine just seg faults out as well most annoying it's the most productive box as well .. That's strange. I'm also mainly running SuSE 9.2 (64-bit), with the occasional detour to 9.1 (32-bit) because some things just seem to run better in a 32-bit OS than in a 64-bit OS, and it started up just fine. My BOINC 4.43 core seems to misunderstand its EDF policy though: it's EDF-ing a deadline on Nov 27 before one on Nov 22, but since deadlines don't matter on CPDN, I don't mind. Last log entry: hadcm3lbm_9m1r_05217746 - PH 1 TS 0014689 A - 25/06/1921 00:30 - H:M:S=0009:47:09 AVG= 2.40 DLT= 1.23 |
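The log entry quoted above can be checked by hand: 9 h 47 m 9 s is 35,229 seconds, and dividing by 14,689 timesteps gives about 2.40 s/TS, which matches the AVG field. Reading AVG as elapsed time divided by timesteps is an inference from that arithmetic, not something the post states. A small Python sketch of the same check, with a regular expression and function name invented for illustration:

```python
import re

# Pattern written for the one log line quoted in the post above; the
# field layout of other model versions is not known here.
LOG_PATTERN = re.compile(
    r"TS\s+(?P<ts>\d+).*?H:M:S=(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)"
    r"\s+AVG=\s*(?P<avg>[\d.]+)"
)


def seconds_per_timestep(line):
    """Return (computed s/TS, reported AVG) for a progress log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a model progress entry")
    elapsed = int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
    timesteps = int(match["ts"])
    return elapsed / timesteps, float(match["avg"])


if __name__ == "__main__":
    sample = (
        "hadcm3lbm_9m1r_05217746 - PH 1 TS 0014689 A - 25/06/1921 00:30 - "
        "H:M:S=0009:47:09 AVG= 2.40 DLT= 1.23"
    )
    computed, reported = seconds_per_timestep(sample)
    print(f"computed {computed:.2f} s/TS, reported {reported:.2f} s/TS")
```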
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Suse 9.2 has the same problem by the looks of it mine just seg faults out as well most annoying it's the most productive box as well .. BOINC versions 4.* are obsolete. Your machine has a string of errors of various sorts which might be related to the core-client vs. client mismatch. I suggest you download a current 5.n version of BOINC. The machine is also light on memory for running a pair of these Coupled Models. 512 MB is the recommended minimum for a single Model; the box shows 497 MB total... FWIW, I have CPDN Seasonal running on BOINC 5.2.13, on an A64 X2 4400+, with 64-bit SuSE 10.0; no problems. Also on P4 2.8 & 3.4, BOINC 5.2.13, with 32-bit SuSE 10.0, also no problems (if you don't count a hardware crash on the 3.4!) [Edited for typo.] "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0 |
boinc versions 4.* are obsolete. Your machine has a string of errors of various sorts which might be related to the core-client vs. client mismatch. Suggest you download a current 5.n version of boinc. Of that string of errors, at most one could be related to a core-client vs. client mismatch. All the other errors are due to a faulty memory module. I already downloaded a 5.2.8 version a while ago; installing it is another thing though... |
Send message Joined: 4 Sep 04 Posts: 61 Credit: 80,585 RAC: 0 |
Is there still a problem with this model? I just awoke to find that I've downloaded two workunits that both finished with a "Computation error". The hadcm3l version was 5.15. Both projects show ~49-50 min. of runtime in BOINC, but BOINC Manager indicates zero CPU time used. Subsequently, errors are reported, presumably due to the lack of result files being created: 2006-08-17 08:02:54 [climateprediction.net] Unrecoverable error for result hadcm3lbm_9hxy_25212425_1 (<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_1.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_2.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_3.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_4.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_5.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_6.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_7.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_8.zip</file_name> <error Running Slackware 10.2 here. |
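A long message like the one above is easier to read once the file names and error codes are pulled out. The following Python sketch does that with a regular expression written against the pasted text only. It is not BOINC's own parser, and the function name is hypothetical. Truncated entries, such as the last one in the paste, are simply skipped.

```python
import re

# Matches one complete <file_xfer_error> entry as it appears in the
# pasted message; incomplete (cut-off) entries will not match.
ENTRY = re.compile(
    r"<file_xfer_error>\s*<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*</file_xfer_error>"
)


def failed_uploads(message):
    """Return a list of (file name, error code) pairs found in the message."""
    return [(m["name"], int(m["code"])) for m in ENTRY.finditer(message)]


if __name__ == "__main__":
    sample = (
        "<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_1.zip</file_name> "
        "<error_code>-161</error_code> </file_xfer_error> "
        "<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_8.zip</file_name> <error"
    )
    for name, code in failed_uploads(sample):
        print(name, code)
```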
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Hi, Trane, -161 masks the real problem. Anything significant under the Messages Tab? We've found some commonality underlying -161, whether with machines new to the Project or formerly stable CPDN machines, and some general suggestions for dealing with it: Mike's post suggests ways to avoid crashes (Solutions to models crashing: -161 error, or -1073741819 (0xc0000005)): http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=4231 [Edited for typo.] "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 4 Sep 04 Posts: 61 Credit: 80,585 RAC: 0 |
-161 masks the real problem. Anything significant under the Messages Tab? Nothing in the messages tab, but I did see something in my stdout log file: Model timeout at 720.00 seconds Model restart required... Preparing for restart... Rewinding a model-day... Starting model ID hadcm3lbm_9hxy_25212425 Phase 1 Climate model starting - use graphics to monitor progress. Or visit the website to see the graphs for this run. That's all I could find. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
If the program finds something impossible in the model ('Negative pressure in a cell' is/was a common one), then it will rewind the model by a day and try again. If the problem is still there, then it will rewind one month; and as a last resort, one year. After this, it gives up, issues an error message, and aborts the model. BUT. The model has to have been actually RUNNING for this long for there to be a restart file to go back to. Which doesn't seem to be the case for your model, so it 'crashed into the wall behind it'. |
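The rewind sequence described above can be sketched roughly as follows. This is an illustration of the described behaviour only, not the model's actual code: the names `recover`, `available_restarts` and `retry` are hypothetical, and treating a missing restart file as an immediate abort is an assumption based on the last part of the post.

```python
# Order of fallbacks as described in the post: a day, then a month, then a year.
REWIND_STEPS = ("day", "month", "year")


def recover(available_restarts, retry):
    """Try each rewind in turn; return the step that worked, or None on abort.

    available_restarts: set of steps for which a restart file exists.
    retry: callable taking a step name and returning True if the model
           runs cleanly after rewinding by that step.
    """
    for step in REWIND_STEPS:
        if step not in available_restarts:
            # Assumption: with no restart file to go back to, the run fails.
            return None
        if retry(step):
            return step
    # Day, month and year rewinds all failed: give up and abort the model.
    return None


if __name__ == "__main__":
    # A model that has not run long enough to have any restart files.
    print(recover(set(), lambda step: True))
    # A model whose problem clears only after rewinding one month.
    print(recover({"day", "month", "year"}, lambda step: step == "month"))
```

The first call mirrors the situation in this thread: a model that fails before any restart file exists has nothing to rewind to.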
Send message Joined: 4 Sep 04 Posts: 61 Credit: 80,585 RAC: 0 |
Since it happened to two successive workunits, I have to think that it's an interaction between the client app and my distribution. Especially since SAP is happily crunching away at the moment. :) |
©2024 cpdn.org