Thread 'New HADCM3L model'

Author	Message
old_user16868 Joined: 12 Sep 04 Posts: 7 Credit: 515,736 RAC: 0	Message 21334 - Posted: 16 Mar 2006, 12:43:45 UTC Just wondering if anyone has experienced seg faults with the new model experiment. The seg fault occurs at application start; the model never gets to the startup phase. Running BOINC core 5.2.13 on RH8, libc 3.2.2. Regards, Steve R

Helmer Bryd Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928	Message 21340 - Posted: 16 Mar 2006, 16:50:12 UTC Last modified: 16 Mar 2006, 16:58:58 UTC I guess it could be the same thing as with the sulphur model, since it's compiled on Debian Woody too. Maybe Stefan Mathes' patch will work here too? http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=3822 My Fedora C2 and C4 machines work all right.

old_user21637 Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0	Message 21544 - Posted: 24 Mar 2006, 6:30:28 UTC - in response to Message 21340. Last modified: 24 Mar 2006, 6:31:13 UTC Hi guys! Yes, unfortunately the HADCM3L model suffers from the same problem as the sulphur model: it does not work on Red Hat Linux distributions with GLIBC versions greater than 2.2.5 (exclusive). My previous patch for the sulphur model was based on linking the model executables to a local copy of GLIBC 2.2.5, so that the model "thinks" GLIBC 2.2.5 is installed on the machine. I tried the same approach with HADCM3L but, unfortunately, it does not work. This is because HADCM3L, unlike sulphur, is compiled against GLIBC 2.3, so it requires the functionality of the later library versions to work properly. Alas, until (and if) the developers look into this, Red Hat Linux machines will not be able to run CPDN. :( Best regards, Stefan.
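
To check whether a machine falls in the range Stefan describes (GLIBC strictly newer than 2.2.5), something like the following Python sketch could be used. The names `parse_version` and `is_affected` are illustrative only and are not part of BOINC or CPDN; `platform.libc_ver()` is the standard-library call for reading the C library version, and it may return empty strings on systems that do not use glibc. The version test is only half of the condition reported in this thread, since the seg fault is described as specific to certain distributions.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.3.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def is_affected(glibc_version, last_good="2.2.5"):
    """Return True when glibc_version is strictly newer than last_good.

    The 2.2.5 threshold comes from the post above; everything else here
    is an assumption made for illustration.
    """
    return parse_version(glibc_version) > parse_version(last_good)


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        state = "newer than" if is_affected(version) else "not newer than"
        print(f"glibc {version} is {state} 2.2.5")
    else:
        print("Could not determine a glibc version on this system.")
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as "2.10" sorting before "2.9".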

old_user52163 Joined: 4 Feb 05 Posts: 10 Credit: 779,835 RAC: 0	Message 21547 - Posted: 24 Mar 2006, 14:20:41 UTC Last modified: 24 Mar 2006, 14:22:58 UTC FYI: HADCM3L is running, slowly, on a 1.6 GHz P4 system with MKD 10.2 2005LE. I took a chance and re-enabled 'get new work', downloaded a dataset, and BOINC went into 'earliest deadline first' mode. It ran through the cached SETI and Einstein WUs, then started the CPDN WU. It was downloaded on 3/17 with a deadline of 2/27/2007, and Boincview indicated that it would finish in 373 days. It has now been running for 6 days and should finish in 376 days. Boincview indicates that the Est. Speed = 674 MFIOps, and the trickle report shows 5.02 s/TS. I'm going to resist the temptation to 'micromanage', and just let it run. :-) Claude [Edit for some spelling]

old_user21637 Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0	Message 21553 - Posted: 24 Mar 2006, 17:28:45 UTC - in response to Message 21547. Hi Cortega! As far as I know, the segmentation fault bug affects only Red Hat Linux distributions. Other distributions (like Mandriva) should work OK... Cheers, Stefan.

old_user3682 Joined: 30 Aug 04 Posts: 24 Credit: 250,709 RAC: 0	Message 21559 - Posted: 24 Mar 2006, 19:30:55 UTC - in response to Message 21544. SuSE 9.2 has the same problem by the looks of it; mine just seg faults out as well. Most annoying, since it's my most productive box too. Pete.

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 21561 - Posted: 24 Mar 2006, 20:28:58 UTC Last modified: 24 Mar 2006, 20:34:14 UTC There is discussion of the issue on the CPDN php Board. http://www.climateprediction.net/board/viewtopic.php?t=3659&highlight=sigsegv Searching for 'sigsegv' on that board brings up more. Edit: An old thread on this forum shows that the solution was a new CPDN release: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2078 "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest.

old_user34451 Joined: 31 Dec 04 Posts: 5 Credit: 37,523 RAC: 0	Message 22502 - Posted: 30 Apr 2006, 13:54:45 UTC - in response to Message 21559. Thanks, folks. I have this problem on the Mandriva 2006 distribution. The thing is, it seems to start off OK with a work unit but then falls over at some point. Dave

copycat Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0	Message 22774 - Posted: 15 May 2006, 21:03:57 UTC - in response to Message 21559. Quoting Message 21559: "SuSE 9.2 has the same problem by the looks of it; mine just seg faults out as well." That's strange. I'm also mainly running SuSE 9.2 (64-bit), with the occasional switch to 9.1 (32-bit) because some things just seem to run better on a 32-bit OS than on a 64-bit OS, and the model started up just fine. My BOINC 4.43 core seems to misunderstand its EDF policy though: it's EDF-ing a Nov 27 deadline before a Nov 22 one. Since deadlines don't matter on CPDN, I don't mind. Last log entry: hadcm3lbm_9m1r_05217746 - PH 1 TS 0014689 A - 25/06/1921 00:30 - H:M:S=0009:47:09 AVG= 2.40 DLT= 1.23

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 22778 - Posted: 16 May 2006, 17:59:33 UTC - in response to Message 22774. Last modified: 16 May 2006, 18:11:59 UTC BOINC versions 4.* are obsolete. Your machine has a string of errors of various sorts which might be related to the core-client vs. client mismatch. I suggest you download a current 5.n version of BOINC. The machine is also light on memory for running a pair of these Coupled Models: 512 MB is the recommended minimum for a single model, and the box shows 497 MB total... FWIW, I have CPDN Seasonal running on BOINC 5.2.13 on an A64 X2 4400+ with 64-bit SuSE 10.0; no problems. It also runs on a P4 2.8 and a P4 3.4, BOINC 5.2.13, with 32-bit SuSE 10.0, again with no problems (if you don't count a hardware crash on the 3.4!) [Edited for typo.]

copycat Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0	Message 22806 - Posted: 18 May 2006, 21:53:36 UTC - in response to Message 22778. Quoting Message 22778: "BOINC versions 4.* are obsolete. Your machine has a string of errors of various sorts which might be related to the core-client vs. client mismatch. I suggest you download a current 5.n version of BOINC." Of that string of errors, at most one could be related to a core-client vs. client mismatch. All the other errors are due to a faulty memory module. I already downloaded version 5.2.8 a while ago; installing it is another thing, though...

old_user11965 Joined: 4 Sep 04 Posts: 61 Credit: 80,585 RAC: 0	Message 23966 - Posted: 17 Aug 2006, 1:06:18 UTC Is there still a problem with this model? I just awoke to find that I've downloaded two workunits that both finished with a "Computation error". The hadcm3l version was 5.15. Both workunits show ~49-50 min. of runtime in BOINC, but BOINC Manager indicates zero CPU time used. Subsequently, errors are reported, presumably because no result files were created:
2006-08-17 08:02:54 [climateprediction.net] Unrecoverable error for result hadcm3lbm_9hxy_25212425_1 (
<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_1.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_2.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_3.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_4.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_5.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_6.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_7.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_8.zip</file_name> <error
Running Slackware 10.2 here.
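
A message like the one above is easier to scan once the file names and error codes are pulled out of it. The following is a rough Python sketch, assuming the message text has been copied into a string; `parse_xfer_errors` is a hypothetical helper, not a BOINC tool, and the regular expression simply mirrors the tag layout visible in the post. Entries that are cut off, like the last one above, are skipped rather than guessed at.

```python
import re

# Matches one complete <file_xfer_error> entry as it appears in the post.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"</file_xfer_error>"
)


def parse_xfer_errors(message):
    """Return (file_name, error_code) pairs for every complete entry."""
    return [
        (match.group("name"), int(match.group("code")))
        for match in XFER_ERROR.finditer(message)
    ]


if __name__ == "__main__":
    sample = (
        "<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_1.zip</file_name> "
        "<error_code>-161</error_code> </file_xfer_error> "
        "<file_xfer_error> <file_name>hadcm3lbm_9hxy_25212425_1_8.zip</file_name> <error"
    )
    for name, code in parse_xfer_errors(sample):
        print(name, code)
```

If every entry carries the same code, as here, the upload failures are probably a symptom of one underlying problem rather than eight separate ones.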

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 23993 - Posted: 17 Aug 2006, 18:10:47 UTC Last modified: 17 Aug 2006, 18:11:56 UTC Hi, Trane, -161 masks the real problem. Is there anything significant under the Messages tab? We've found some commonality underlying -161, whether on machines new to the project or on formerly stable CPDN machines, and there are some general suggestions for dealing with it. Mike's post suggests ways to avoid crashes (Solutions to models crashing: -161 error, or -1073741819 (0xc0000005)): http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=4231 [Edited for typo.]

old_user11965 Joined: 4 Sep 04 Posts: 61 Credit: 80,585 RAC: 0	Message 24002 - Posted: 17 Aug 2006, 22:10:45 UTC - in response to Message 23993. Quoting Message 23993: "-161 masks the real problem. Is there anything significant under the Messages tab?" Nothing in the Messages tab, but I did see something in my stdout log file:
Model timeout at 720.00 seconds
Model restart required...
Preparing for restart...
Rewinding a model-day...
Starting model ID hadcm3lbm_9hxy_25212425 Phase 1
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
That's all I could find.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 24008 - Posted: 17 Aug 2006, 23:49:52 UTC If the program finds something impossible in the model ('Negative pressure in a cell' is/was a common one), it will rewind the model by a day and try again. If the problem is still there, it will rewind one month and, as a last resort, one year. After this, it gives up, issues an error message, and aborts the model. BUT the model has to have actually been RUNNING for that long for there to be a restart file to go back to. That doesn't seem to be the case for your model, so it 'crashed into the wall behind it'.
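
The escalation Les describes can be sketched in a few lines of Python. This is only an illustration of the day, month, year order from his post, not the model's actual code: `recover`, `restart_files`, and `try_run` are hypothetical names, and treating a missing restart file as an immediate abort is an assumption based on his 'wall behind it' remark.

```python
# Rewind steps in the order given in the post above.
REWIND_ORDER = ("day", "month", "year")


def recover(restart_files, try_run):
    """Try each rewind in turn and return the step that worked.

    restart_files maps a step name to a restart file, or omits the step
    when the model has not run long enough to have written one.
    try_run(restart_file) returns True when the model gets past the
    problem after restarting from that file.
    """
    for step in REWIND_ORDER:
        restart = restart_files.get(step)
        if restart is None:
            # Nothing to rewind to: the model has not run this long.
            raise RuntimeError(f"no restart file for a one-{step} rewind")
        if try_run(restart):
            return step
    raise RuntimeError("all rewinds failed; aborting the model")


if __name__ == "__main__":
    # A model that crashed before writing any restart file.
    try:
        recover({}, lambda restart: True)
    except RuntimeError as err:
        print("aborted:", err)
```

A freshly started workunit corresponds to the empty mapping in the usage example: the first rewind has nothing to go back to, so the run ends in an error even though the recovery logic itself is sound.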

old_user11965 Joined: 4 Sep 04 Posts: 61 Credit: 80,585 RAC: 0	Message 24038 - Posted: 19 Aug 2006, 10:53:31 UTC Since it happened to two successive workunits, I have to think that it's an interaction between the client app and my distribution, especially since SAP is happily crunching away at the moment. :)