Thread 'New HADCM3L model'


Questions and Answers : Unix/Linux : New HADCM3L model

old_user16868
Joined: 12 Sep 04
Posts: 7
Credit: 515,736
RAC: 0
Message 21334 - Posted: 16 Mar 2006, 12:43:45 UTC

Just wondering if anyone has experienced seg faults with the new model experiment. The seg fault occurs at application start; the model never gets to the startup phase.

running boinc core 5.2.13
RH8 libc 3.2.2.

regards

Steve R
ID: 21334
Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 21340 - Posted: 16 Mar 2006, 16:50:12 UTC
Last modified: 16 Mar 2006, 16:58:58 UTC

Guess it could be the same thing as with the sulphur model, since it's compiled on Debian Woody too.
Maybe Stefan Mathes' patch will work here too?
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=3822


My Fedora Core 2 and Core 4 machines work all right.
ID: 21340
old_user21637
Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 21544 - Posted: 24 Mar 2006, 6:30:28 UTC - in response to Message 21340.  
Last modified: 24 Mar 2006, 6:31:13 UTC

Hi guys!

Yes, unfortunately the HADCM3L model suffers from the same problem as the sulphur model: it does not work on Red Hat Linux distributions with GLIBC versions strictly later than 2.2.5.

My previous patch for the sulphur model was based on linking the model executables to a local copy of GLIBC 2.2.5, so that the model "thinks" GLIBC 2.2.5 is installed on the machine.

I tried the same approach with HADCM3L but, unfortunately, it does not work. This is because HADCM3L, unlike sulphur, is compiled with GLIBC 2.3, so it requires the functionality of the later library versions to work properly. Alas, until (and unless) the developers look into this, Red Hat Linux machines will not be able to run CPDN. :(
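
For anyone who wants to check which side of the 2.2.5 line a machine falls on, here is a minimal Python sketch. It assumes a Linux host where the standard library's platform.libc_ver() can detect glibc. The helper names and the threshold constant are made up for illustration and are not part of any CPDN or BOINC tool.

```python
import platform

# Threshold taken from the post above: versions later than 2.2.5 are reported to fail.
LAST_KNOWN_GOOD = (2, 2, 5)


def parse_version(text):
    """Turn a dotted version string such as '2.3.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def later_than_known_good(version_text):
    """Return True if the version is strictly later than 2.2.5."""
    return parse_version(version_text) > LAST_KNOWN_GOOD


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(f"glibc {version}: later than 2.2.5 = {later_than_known_good(version)}")
    else:
        print("Could not detect a glibc version on this system.")
```

A version later than 2.2.5 only means the machine is in the range described above; whether the model actually seg faults may also depend on the distribution.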

Best regards,
Stefan.
ID: 21544
old_user52163
Joined: 4 Feb 05
Posts: 10
Credit: 779,835
RAC: 0
Message 21547 - Posted: 24 Mar 2006, 14:20:41 UTC
Last modified: 24 Mar 2006, 14:22:58 UTC

FYI:

HADCM3L is running on a P4 1.6 GHz, MDK 10.2 2005LE system, slowly.

I took a chance and re-enabled 'get new work', downloaded a dataset, and BOINC went into 'earliest deadline first' mode. It ran through the cached SETI and Einstein WUs, then started the CPDN WU.

Downloaded 3/17, with a deadline of 2/27/2007. BoincView initially indicated that it would finish in 373 days. It has been running for 6 days, and the estimate is now 376 days.

BoincView indicates that the Est. Speed = 674 MFLOPS, and the trickle report shows 5.02 s/TS.

I'm going to resist the temptation to 'micromanage', and just let it run.

:-)

Claude
[Edit for some spelling]

ID: 21547
old_user21637
Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 21553 - Posted: 24 Mar 2006, 17:28:45 UTC - in response to Message 21547.  

Hi Cortega!

As far as I know, the segmentation fault bug affects only Red Hat Linux distributions. Other distributions (like Mandriva) should work ok...

Cheers,
Stefan.

ID: 21553
old_user3682
Joined: 30 Aug 04
Posts: 24
Credit: 250,709
RAC: 0
Message 21559 - Posted: 24 Mar 2006, 19:30:55 UTC - in response to Message 21544.  

SuSE 9.2 has the same problem, by the looks of it. Mine just seg faults out as well. Most annoying, since it's the most productive box too.

Pete .


ID: 21559
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 21561 - Posted: 24 Mar 2006, 20:28:58 UTC
Last modified: 24 Mar 2006, 20:34:14 UTC

There is discussion of the issue on the CPDN php Board. http://www.climateprediction.net/board/viewtopic.php?t=3659&highlight=sigsegv

Searching for 'sigsegv' on that board brings up more.

Edit: An old Thread of this Forum shows that the solution was a new CPDN release:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2078
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 21561
old_user34451
Joined: 31 Dec 04
Posts: 5
Credit: 37,523
RAC: 0
Message 22502 - Posted: 30 Apr 2006, 13:54:45 UTC - in response to Message 21559.  

Thanks, folks. I have this problem on the Mandriva 2006 distribution. The thing is, it seems to start off OK with a work unit but then falls over at some point.

Dave

ID: 22502
copycat
Joined: 24 Feb 05
Posts: 28
Credit: 121,749
RAC: 0
Message 22774 - Posted: 15 May 2006, 21:03:57 UTC - in response to Message 21559.  

That's strange.

I'm also mainly running SuSE 9.2 (64-bit), with the occasional detour to 9.1 (32-bit) because some things just seem to run better on a 32-bit OS than on a 64-bit one, and it started up just fine. My BOINC 4.43 core client seems to misunderstand its EDF policy, though: it's running a Nov 27 deadline ahead of a Nov 22 one. Since deadlines don't matter on CPDN, I don't mind.

Last log entry: hadcm3lbm_9m1r_05217746 - PH 1 TS 0014689 A - 25/06/1921 00:30 - H:M:S=0009:47:09 AVG= 2.40 DLT= 1.23
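
For reading these progress lines without doing the arithmetic by hand, here is a small Python sketch that pulls out the phase, timestep, and elapsed time. It assumes that AVG is elapsed seconds divided by timesteps completed; the numbers in this entry are consistent with that, but the field layout is inferred from this single line, not from any documented format. The function name is hypothetical.

```python
import re

# The log entry quoted above, copied verbatim.
LINE = ("hadcm3lbm_9m1r_05217746 - PH 1 TS 0014689 A - 25/06/1921 00:30 - "
        "H:M:S=0009:47:09 AVG= 2.40 DLT= 1.23")

# Field layout inferred from the single entry above (an assumption).
PATTERN = re.compile(
    r"PH\s+(?P<phase>\d+)\s+TS\s+(?P<timestep>\d+).*?"
    r"H:M:S=(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\s+AVG=\s*(?P<avg>[\d.]+)"
)


def summarize(line):
    """Extract phase, timestep, elapsed seconds and seconds per timestep."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a model progress entry")
    elapsed = int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
    timestep = int(match["timestep"])
    return {
        "phase": int(match["phase"]),
        "timestep": timestep,
        "elapsed_seconds": elapsed,
        "reported_avg": float(match["avg"]),
        # Assumed meaning of AVG: elapsed seconds per completed timestep.
        "computed_avg": round(elapsed / timestep, 2),
    }


if __name__ == "__main__":
    print(summarize(LINE))
```

For this entry, 9:47:09 is 35,229 seconds over 14,689 timesteps, which rounds to the reported 2.40 s/TS.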
ID: 22774
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 22778 - Posted: 16 May 2006, 17:59:33 UTC - in response to Message 22774.  
Last modified: 16 May 2006, 18:11:59 UTC

BOINC versions 4.* are obsolete. Your machine has a string of errors of various sorts, which might be related to the core-client vs. client mismatch. I suggest you download a current 5.n version of BOINC.

The machine is also light on memory for running a pair of these Coupled Models. 512 MB is the recommended minimum for a single Model; the box shows 497 MB total...

FWIW, I have CPDN Seasonal running on BOINC 5.2.13, on an A64 X2 4400+, with 64-bit SuSE 10.0; no problems. Also on P4 2.8 & 3.4, BOINC 5.2.13, with 32-bit SuSE 10.0, also no problems (if you don't count a hardware crash on the 3.4!)

[Edited for typo.]
ID: 22778
copycat
Joined: 24 Feb 05
Posts: 28
Credit: 121,749
RAC: 0
Message 22806 - Posted: 18 May 2006, 21:53:36 UTC - in response to Message 22778.  

Of that string of errors, at most one could be related to a core-client vs. client mismatch. All the other errors are due to a faulty memory module. I already downloaded a 5.2.8 version a while ago; installing it is another thing, though...
ID: 22806
old_user11965
Joined: 4 Sep 04
Posts: 61
Credit: 80,585
RAC: 0
Message 23966 - Posted: 17 Aug 2006, 1:06:18 UTC

Is there still a problem with this model? I just awoke to find that I've downloaded two workunits that both finished with a "Computation error". The hadcm3l version was 5.15. Both workunits show ~49-50 min. of runtime in BOINC, but BOINC Manager indicates zero CPU time used. Subsequently, errors are reported, presumably because no result files were created:

2006-08-17 08:02:54 [climateprediction.net] Unrecoverable error for result hadcm3lbm_9hxy_25212425_1 (<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_1.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_2.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_3.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_4.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_5.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_6.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_7.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_8.zip</file_name>
  <error


Running Slackware 10.2 here.
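
For summarizing a message like the one above, here is a hedged Python sketch that lists each file name with its error code. It uses a regular expression, not an XML parser, because the message is truncated partway through its last entry. The sample text is abbreviated from the log above, and the function name is made up for illustration.

```python
import re

# Abbreviated from the message above: two complete entries plus the
# truncated tail exactly as it appears in the log.
SAMPLE = """<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_1.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_2.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>hadcm3lbm_9hxy_25212425_1_8.zip</file_name>
  <error"""

# Matches only entries that have both a file name and a complete error code.
ENTRY = re.compile(
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)


def failed_uploads(text):
    """Return (file name, error code) pairs for every complete entry."""
    return [(m["name"], int(m["code"])) for m in ENTRY.finditer(text)]


if __name__ == "__main__":
    for name, code in failed_uploads(SAMPLE):
        print(f"{name}: {code}")
```

The truncated final entry is skipped and no code is guessed for it.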
ID: 23966
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 23993 - Posted: 17 Aug 2006, 18:10:47 UTC
Last modified: 17 Aug 2006, 18:11:56 UTC

Hi, Trane,

-161 masks the real problem. Anything significant under the Messages Tab?

We've found some commonality underlying -161, whether with machines new to the Project or formerly stable CPDN machines, and some general suggestions for dealing with it.
Mike's post suggests ways to avoid crashes (Solutions to models crashing: -161 error, or -1073741819 (0xc0000005)):
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=4231


[Edited for typo.]
ID: 23993
old_user11965
Joined: 4 Sep 04
Posts: 61
Credit: 80,585
RAC: 0
Message 24002 - Posted: 17 Aug 2006, 22:10:45 UTC - in response to Message 23993.  

Nothing in the messages tab, but I did see something in my stdout log file:

Model timeout at 720.00 seconds
Model restart required...
Preparing for restart...
Rewinding a model-day...
Starting model ID hadcm3lbm_9hxy_25212425   Phase 1
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.


That's all I could find.
ID: 24002
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 24008 - Posted: 17 Aug 2006, 23:49:52 UTC

If the program finds something impossible in the model ('negative pressure in a cell' is/was a common one), it will rewind the model by a day and try again.
If the problem is still there, it will rewind one month; and as a last resort, one year.
After this, it gives up, issues an error message, and aborts the model.

BUT.
The model has to have actually been RUNNING for that long for there to be a restart file to go back to. That doesn't seem to be the case for your model, so it 'crashed into the wall behind it'.
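
The day/month/year fallback described above can be sketched in Python roughly as follows. This is only an illustration of the escalation order, not the model's actual code: the function, its arguments, and the returned messages are all hypothetical.

```python
# Escalation order as described above: one day, then one month, then one year.
REWIND_STEPS = ("day", "month", "year")


def recover(runs_cleanly_after, available_restarts):
    """Walk the rewind ladder and report the outcome.

    runs_cleanly_after: hypothetical callable that takes a step name and
        returns True if the model runs cleanly after rewinding by that step.
    available_restarts: set of step names for which a restart file exists.
    """
    for step in REWIND_STEPS:
        if step not in available_restarts:
            # Nothing to rewind to: the "wall behind it" case.
            return f"abort: no restart file to rewind one {step}"
        if runs_cleanly_after(step):
            return f"recovered after rewinding one {step}"
    return "abort: problem persists after rewinding one year"


if __name__ == "__main__":
    # A model that has barely started has no restart files at all.
    print(recover(lambda step: False, set()))
```

With an empty set of restart files, the sketch aborts at the first step, which matches the case of a model that never ran long enough to write one.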

ID: 24008
old_user11965
Joined: 4 Sep 04
Posts: 61
Credit: 80,585
RAC: 0
Message 24038 - Posted: 19 Aug 2006, 10:53:31 UTC

Since it happened to two successive workunits, I have to think that it's an interaction between the client app and my distribution. Especially since SAP is happily crunching away at the moment. :)
ID: 24038