Thread 'Where do all the errors come from?'

Message boards : Number crunching : Where do all the errors come from?
ProfileMikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31565 - Posted: 2 Dec 2007, 11:37:02 UTC


Did you install IA32 support in Ubuntu? (I gather that Ubuntu 64-bit does NOT support 32-bit apps by default).

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 31565
ProfileastroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 31568 - Posted: 2 Dec 2007, 18:15:21 UTC

If the answer to Mike's question is positive, have you recently run stability checks on the machine? Unless I looked at the wrong entries, the machine had 10 Models in WinXP SP1, none of which were successful. (Yes, three show 'Success' but they, too, failed; BOINC entries must be taken with a grain of salt.)

24 hours of dual Prime95 wouldn't hurt. Just to be sure.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 31568
ProfileSkip Da Shu
Joined: 31 Aug 04
Posts: 42
Credit: 15,308,708
RAC: 298
Message 31570 - Posted: 3 Dec 2007, 0:13:06 UTC - in response to Message 31565.  
Last modified: 3 Dec 2007, 0:42:37 UTC


Did you install IA32 support in Ubuntu? (I gather that Ubuntu 64-bit does NOT support 32-bit apps by default).

I used a package manager and found an IA32 library described as 'shared 32bit libs for AMD64 system'. Installing it resolved the code 22 errors on QMC and E@H WUs. I assume it'll do the same for CPDN, but it will be a couple more hours before I can get another one to verify.

Thank you VERY much.

UPDATE: A CPDN HadSM3 Slab WU has now been running for about 5 minutes. :-)
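
For anyone hitting the same thing: the fix fits the symptom if the project applications are 32-bit binaries, as the question above implies. Here is a minimal sketch for checking whether a given Linux executable is 32-bit or 64-bit. The function names are made up for illustration; the only facts it relies on are the ELF magic bytes and the class byte at offset 4 of the file header (1 means 32-bit, 2 means 64-bit).

```python
ELF_MAGIC = b"\x7fELF"   # first four bytes of every ELF file
EI_CLASS_OFFSET = 4      # header byte that holds the word size
CLASS_NAMES = {1: "32-bit", 2: "64-bit"}


def elf_class(header: bytes) -> str:
    """Classify an executable from the first bytes of its file header."""
    if len(header) <= EI_CLASS_OFFSET or header[:4] != ELF_MAGIC:
        return "not ELF"
    return CLASS_NAMES.get(header[EI_CLASS_OFFSET], "unknown")


def elf_class_of_file(path: str) -> str:
    """Read just enough of the file to classify it (hypothetical helper)."""
    with open(path, "rb") as handle:
        return elf_class(handle.read(EI_CLASS_OFFSET + 1))
```

If a science application in the BOINC project directory reports "32-bit" on a 64-bit install, it will need the 32-bit compatibility libraries before it can start.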
ID: 31570
ProfileSkip Da Shu
Joined: 31 Aug 04
Posts: 42
Credit: 15,308,708
RAC: 298
Message 31571 - Posted: 3 Dec 2007, 0:19:43 UTC - in response to Message 31568.  
Last modified: 3 Dec 2007, 0:28:00 UTC

If the answer to Mike's question is positive, have you recently run stability checks on the machine? Unless I looked at the wrong entries, the machine had 10 Models in WinXP SP1, none of which were successful. (Yes, three show 'Success' but they, too, failed; BOINC entries must be taken with a grain of salt.)

24 hours of dual Prime95 wouldn't hurt. Just to be sure.


The problem appears to be the IA32 libs. However, it appears I tested stable under WinXP SP1 with the FSB set at 216 and default vcore. Then, at some unknown point, I bumped the vcore a notch and the FSB to 218 and didn't run a full test (OCCT or Prime95). I haven't got Prime95 installed yet, but I backed the FSB down to 215 and left vcore up one notch just to be safe until I can find and install Prime95 or another stress tester. Thanx

- da shu @ HeliOS,
"Free software is a matter of liberty, not price. To understand the concept, you should think of free as in free speech, not as in free beer"
ID: 31571
ProfileMikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31574 - Posted: 3 Dec 2007, 10:30:19 UTC


Glad to know that my wild guess was right :-)


The Linux version of Prime95 is called 'mprime', and can be downloaded from the usual place (http://www.mersenne.org/). I'd recommend the statically linked version. Note that you'll need to run one copy per processor core, using the 'affinity' command-line flag (-A0 / -A1 / etc).
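
As a rough sketch of the "one copy per core" idea, this builds one command line per core using the -A flag mentioned above. The binary path `./mprime` and the function name are assumptions for illustration, and the script only prints the commands rather than starting anything.

```python
import os


def mprime_commands(cores: int, binary: str = "./mprime") -> list[list[str]]:
    """Return one argv list per core: -A0 for the first core, -A1 for the next, and so on."""
    return [[binary, f"-A{core}"] for core in range(cores)]


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to a single copy.
    for argv in mprime_commands(os.cpu_count() or 1):
        print(" ".join(argv))
```

On a dual-core machine this lists `./mprime -A0` and `./mprime -A1`, each of which would be started in its own terminal.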

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 31574
Profileold_user211396
Joined: 2 Dec 06
Posts: 3
Credit: 894,841
RAC: 0
Message 31577 - Posted: 3 Dec 2007, 18:47:35 UTC

Hi,

I also have a question about a crashed model. Can anyone explain what the exit status -197 (0xffffff3b) means?

thx and greets

jan
ID: 31577
ProfileIain Inglis

Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 31581 - Posted: 3 Dec 2007, 23:31:58 UTC - in response to Message 31577.  

Hi,

I also have a question about a crashed model. Can anyone explain what the exit status -197 (0xffffff3b) means?

thx and greets

jan


It means 'user abort'. This is what happens to a model when the 'Abort' button is pressed in the 'Tasks' tab of BOINC Manager.
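
In case the two numbers look unrelated: -197 and 0xffffff3b are the same 32-bit value, shown once as a signed integer and once as unsigned hexadecimal (two's complement). A small sketch of the conversion in both directions, with helper names invented for illustration:

```python
def to_unsigned32(status: int) -> int:
    """Reinterpret a signed exit status as an unsigned 32-bit value."""
    return status & 0xFFFFFFFF


def to_signed32(raw: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed exit status."""
    raw &= 0xFFFFFFFF
    # If the top bit is set, the value is negative in two's complement.
    return raw - 0x100000000 if raw & 0x80000000 else raw


print(hex(to_unsigned32(-197)))  # 0xffffff3b
print(to_signed32(0xFFFFFF3B))   # -197
```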
ID: 31581
AM

Joined: 3 Dec 05
Posts: 3
Credit: 1,671,830
RAC: 272
Message 31772 - Posted: 18 Dec 2007, 15:41:05 UTC

Just wanted to air how important I think this project is, the most important issue of the 21st Century by a long shot. I am also extremely frustrated that I have never completed an entire model, with multiple "computation errors." Now BOINC doesn't even give the reason for the error.

I have searched the forums and asked for help and all I receive is techno-speak.

I sometimes feel this project is clearly not meant for those without a programming background.

Therefore, it pains me to say, I am detaching from this project.
ID: 31772
ProfileMikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31773 - Posted: 18 Dec 2007, 15:45:09 UTC
Last modified: 18 Dec 2007, 15:46:09 UTC

Which query did you have a problem with? The only one I could find on this forum was :
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=5860 (is it on a different forum?).



I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 31773
Profilemo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31774 - Posted: 18 Dec 2007, 16:32:10 UTC

Hi Andrew

I can see that your more recent crashed models don't indicate any real reasons, but the older crashes give us clues about probable specific reasons for which we already have an advice post written in, we think, normal language. If you post to say you're interested in diagnoses and likely cures, we can talk about the details.

In my view this would be worth while for you because the same problems could well beset your tasks on other projects, though to a lesser extent because the tasks will be shorter.
Cpdn news
ID: 31774
Profilemo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32016 - Posted: 5 Jan 2008, 15:38:53 UTC

Andrew has sent me a PM asking for clear advice about avoiding model crashes, so here goes. Most members who aren't computer experts crash quite a few models while they learn how to keep their models going.

Andrew, your computer specs with 3/4Gb RAM are well up to the task of crunching HADCM and HADSM models but not the HADAM type, so in your CPDN project preferences don't check the HADAM option. When you need a new model, if you want a shorter type, select HADSM.

Here are the models you've had:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=286186

Most of the crashes were with 107, -107 and 1 errors. The most likely cause would be closing down the computer without exiting from BOINC first. If you do this, sooner or later a shutdown will catch the model doing something important and crash it. Shutting down the model through the Windows Task Manager (not BOINC Manager) will have the same effect. So it's a good idea to

* go into BOINC Manager and suspend activity (in the Activity menu)
* then close down BOINC Manager by clicking the X
* then exit from BOINC by right-clicking on the icon, bottom right of screen, and selecting Exit
* wait till the icon disappears
* then begin to shut down the computer
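
For members who are comfortable with a command line, the same order (suspend first, then exit) can be sketched with the `boinccmd` tool. This assumes a BOINC client that ships `boinccmd` and has it on the PATH, which may not be true of every install; the function names are made up, and by default the script only prints the commands.

```python
import subprocess


def shutdown_sequence(boinccmd: str = "boinccmd") -> list[list[str]]:
    """Commands mirroring the manual steps: suspend computation, then ask the client to exit."""
    return [
        [boinccmd, "--set_run_mode", "never"],  # suspend activity
        [boinccmd, "--quit"],                   # tell the client to exit cleanly
    ]


def run_sequence(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print the commands, or run them in order when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_sequence(shutdown_sequence())
```

A run mode set this way stays in force, so it would need setting back (for example with `--set_run_mode auto`) once the computer is up again. The manual steps above remain the safer route if any of this is unfamiliar.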

There are a lot of other handy tips about what to do and what not to do in the CPDN README posts on the independent forum (where members have to register separately to post) 3rd section from the top:
http://www.climateprediction.net/board/index.php

Here are all the READMEs

http://www.climateprediction.net/board/viewforum.php?f=44&sid=d434c4d477dcc2799bdc2ff37672f942

Everyone suffering from crashed models would do well to look at the README about crashes and problems. Item #5 by MikeMars explains all the common problems and useful precautions.

Same README - item #1 by Les explains click-by-click a really easy backup method. By making regular backups, if your model does crash you can restore it and continue crunching the same model. Backups have rescued hundreds if not thousands of crashed models.

Same README, item #6 by Thyme Lawn explains how to update a computer's graphics drivers. If they're out of date, models can crash with -107 and 1 errors.
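
To pull the error codes mentioned in this thread together, here is a small lookup sketch. It only records what posters in this thread have said about each code, it is not an official list of BOINC exit codes, and the names are invented for illustration.

```python
# Likely causes as described by posters in this thread (not an official table).
UNCLEAN_SHUTDOWN = "computer shut down or task killed without exiting BOINC first"
OLD_GRAPHICS_DRIVERS = "out-of-date graphics drivers"

LIKELY_CAUSES = {
    22: ["missing 32-bit (IA32) libraries on 64-bit Linux"],
    107: [UNCLEAN_SHUTDOWN],
    -107: [UNCLEAN_SHUTDOWN, OLD_GRAPHICS_DRIVERS],
    1: [UNCLEAN_SHUTDOWN, OLD_GRAPHICS_DRIVERS],
    -197: ["user abort (Abort button in BOINC Manager)"],
}


def likely_causes(code: int) -> list[str]:
    """Return the causes suggested in this thread for an exit code."""
    return LIKELY_CAUSES.get(code, ["not covered in this thread"])


if __name__ == "__main__":
    for code in (107, -107, 1, -197, 22):
        print(code, "->", "; ".join(likely_causes(code)))
```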

Hope that's useful. Andrew, if you want anything explained in more detail, just post back.
Cpdn news
ID: 32016
©2024 cpdn.org