Thread 'Computing error'

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,828,627
RAC: 4,993
Message 41090 - Posted: 19 Nov 2010, 10:43:32 UTC

Tullio,

That's a classic FAMOUS 'INVALID THETA DETECTED' crash: the model has six attempts at the end to get past the block and then gives up. FAMOUS models cover a long period of time and it seems that it's quite hard to find a set of parameters that will get a model through an entire run. The parameter set that your model had wasn't viable in the long term. Another one might be.

Iain
ID: 41090
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 41150 - Posted: 24 Nov 2010, 9:30:22 UTC - in response to Message 41090.  

I seem to be getting a lot of models crashing with "Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED." at the moment, running Linux on a dual-core Intel. I had a break from it with 2 models completing, but the past two or three have all crashed with this error.

Dave
ID: 41150
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41151 - Posted: 24 Nov 2010, 9:44:04 UTC - in response to Message 41150.  

It's all part of the experiment - start them off and see how far they get.

FAMOUS is a bit different to previous experiments, in that some batches of them have extreme values for some of the variables.

And those with start year of 599 are spinups; what they'll do is anyone's guess. But those that survive for the full run are then used to start a new series stretching off into the future.


ID: 41151
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 41153 - Posted: 24 Nov 2010, 13:57:13 UTC - in response to Message 41151.  

Thanks Les, I had read that somewhere but had forgotten it when I posted.
Dave
ID: 41153
tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 41167 - Posted: 26 Nov 2010, 19:54:04 UTC

My third Famous unit is still alive after 81 hours.
Tullio
ID: 41167
old_user588361
Joined: 10 Sep 09
Posts: 9
Credit: 253,090
RAC: 0
Message 41173 - Posted: 28 Nov 2010, 7:35:12 UTC

Just had one crash with a "computational error". Log says:

11/27/2010 10:31:25 PM climateprediction.net Output file famous_w3o9_599_200_006751449_0_7.zip for task famous_w3o9_599_200_006751449_0 absent

Nothing locked, it eventually cleared with the trickle up msg.
ID: 41173
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 41184 - Posted: 29 Nov 2010, 11:11:02 UTC

Yesterday I had two units crash at more or less the same time, seemingly on restarting BOINC. Both had a segmentation error. Presumably this is a problem with my system; I assume it is to do with having shut the system down without closing BOINC down first.
Dave
ID: 41184
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41189 - Posted: 30 Nov 2010, 9:04:16 UTC

If a computer is shut down or restarted without first exiting completely from BOINC, sooner or later models will crash. It may depend on whether they were writing to disk at the precise moment of the shutdown. Another danger moment is when models are interrupted while they're post-processing, i.e. generating a file for upload.

Fortunately it only takes a moment to exit from BOINC (not just from the BOINC Manager).
ID: 41189
WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 41287 - Posted: 15 Dec 2010, 0:58:29 UTC

Anyone know what causes this error? Or more specifically, is there anything I can do to prevent it?

I get this error when I have to restart my computer. I shut down BOINC before rebooting.


<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Signal 15 received, exiting...
(1485): called boinc_finish
SIGSEGV: segmentation violation
Stack trace (7 frames):
../../projects/climateprediction.net/famous_6.11_i686-pc-linux-gnu(boinc_catch_signal+0x58)[0x809e59c]
[0xb78cd400]
../../projects/climateprediction.net/famous_6.11_i686-pc-linux-gnu[0x804f906]
../../projects/climateprediction.net/famous_6.11_i686-pc-linux-gnu[0x805085a]
../../projects/climateprediction.net/famous_6.11_i686-pc-linux-gnu[0x8050ad6]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6)[0xb7620bd6]
../../projects/climateprediction.net/famous_6.11_i686-pc-linux-gnu(__gxx_personality_v0+0xe1)[0x804c449]

Exiting...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Signal 3 received, exiting...
(1562): called boinc_finish

</stderr_txt>
]]>
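For anyone puzzled by the three numbers in the `process exited with code 193 (0xc1, -63)` line above: they look like one value shown three ways. A minimal Python sketch of the arithmetic follows. The function name `exit_code_views` is made up for illustration, and reading the last number as a signed 8-bit value is an assumption that happens to match this log line, not something confirmed from the BOINC source.

```python
def exit_code_views(code):
    """Return (decimal, hex string, signed 8-bit reading) for an exit status."""
    byte = code & 0xFF                            # keep the low 8 bits
    signed = byte - 256 if byte >= 128 else byte  # two's-complement reading (assumed)
    return byte, hex(byte), signed


print(exit_code_views(193))  # (193, '0xc1', -63)
```

This only shows that the three figures are consistent with each other; it says nothing about why the task exited.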

ID: 41287
DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 41292 - Posted: 15 Dec 2010, 15:15:58 UTC - in response to Message 41287.  

Which computer is having this problem? It appears BOINC is not exiting properly when the system is shutting down.
ID: 41292
WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 41295 - Posted: 15 Dec 2010, 23:52:43 UTC - in response to Message 41292.  

The computer number is 1110459.
The name is AMD4.
It has an AMD 9750 quad processor.
I am running Ubuntu LINUX.

I have had this problem before when I shut down (re-booted) the computer without terminating BOINC. So now I terminate BOINC first, then reboot. However, this does not seem to make any difference. When I reboot the CP models get a computational error.

Thanks for helping.

Bob
ID: 41295
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41296 - Posted: 16 Dec 2010, 9:22:11 UTC - in response to Message 41295.  

Are you shutting down both the gui (manager), AND the client?

I like to suspend each model in turn, after checking that it's not just about to checkpoint, then suspend BOINC, and only then exit from BOINC.


ID: 41296
DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 41298 - Posted: 16 Dec 2010, 14:14:59 UTC - in response to Message 41295.  

For some reason, Ubuntu 10 does not stop BOINC when you reboot. I looked into it for a minute, but I can't figure out why. Other distros do. If you used the package manager to install BOINC, then you have to use this command to stop it before rebooting:

sudo /etc/init.d/boinc-client stop
ID: 41298
WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 41301 - Posted: 16 Dec 2010, 17:58:18 UTC - in response to Message 41296.  

When I want to reboot, I shut down only the Manager. I just tried shutting down the Manager, and according to the System Monitor all of the tasks were stopped as well. I restarted BOINC and everything is fine. However, I did not reboot.

I am not sure, but I suspect not all CP models get a Computational Error when I reboot.

Maybe there is a difference between stopping the models and then stopping the Manager, as opposed to just shutting down the Manager and, by default, the models.

Maybe on some models the information isn't being checkpointed correctly to enable a restart.

Maybe there is a defect in the checkpoint restart logic on some models.

Or maybe ???
ID: 41301
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41302 - Posted: 16 Dec 2010, 18:56:51 UTC - in response to Message 41301.  

All of the climate models checkpoint at a fixed point in time. (Where "time" is during the model's life.)
They do NOT checkpoint when shut down. When re-started, they do so from the last saved checkpoint, re-processing all of the data which may have been done previously but discarded.
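To put numbers on that re-processing cost, here is a small Python sketch. The step counts and the name `restart_point` are made up for illustration; the real checkpoint interval depends on the model.

```python
def restart_point(current_step, checkpoint_interval):
    """Return (step the model resumes from, steps that must be redone)."""
    if checkpoint_interval <= 0:
        raise ValueError("checkpoint_interval must be positive")
    # The last checkpoint is the largest multiple of the interval
    # that is not beyond the step where the model was stopped.
    last_checkpoint = (current_step // checkpoint_interval) * checkpoint_interval
    return last_checkpoint, current_step - last_checkpoint


# Hypothetical: checkpoints every 144 steps, shut down at step 1000.
print(restart_point(1000, 144))  # (864, 136)
```

In this made-up case the model resumes from step 864 and has to redo 136 steps, which is why stopping just after a checkpoint wastes the least work.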

Shutdown failures are due to the large number of files that the models have open while running. All of these need to be closed cleanly, otherwise they can be corrupted, leaving the checkpoint/restart mechanism without clean data to start up again.
Sometimes you get away with it, sometimes you don't.

As you're having occasional failures, try at least menu-Suspending BOINC, waiting until it stops all of the models and SAYS Suspended, and THEN menu-Exit from BOINC.
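For those who prefer the command line, the same suspend-then-exit order can be sketched in Python around `boinccmd`. This is only a sketch under assumptions: `boinccmd` must be installed and run somewhere it can authenticate to the client (for example from the BOINC data directory), the function names are made up, and the fixed pause is a stand-in for watching until the Manager actually says Suspended.

```python
import subprocess
import time


def shutdown_plan():
    """Commands for 'suspend first, then exit', in order."""
    return [
        ["boinccmd", "--set_run_mode", "never"],  # suspend all computation
        ["boinccmd", "--quit"],                   # ask the client itself to exit
    ]


def run_plan(plan, settle_seconds=60, runner=subprocess.run, sleep=time.sleep):
    """Run the first command, pause so the models can stop, then run the rest."""
    for index, command in enumerate(plan):
        runner(command, check=True)
        if index == 0:
            sleep(settle_seconds)  # assumption: a fixed wait, not a status check


# Usage (needs a running BOINC client):
# run_plan(shutdown_plan())
```

A run mode set this way is expected to persist, so after the reboot something like `boinccmd --set_run_mode auto` would be needed to let the models resume.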


ID: 41302
WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 41303 - Posted: 16 Dec 2010, 19:07:14 UTC - in response to Message 41302.  

I will try the "Suspend" route next time I re-boot.

Thanks for the help.
ID: 41303
tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 41345 - Posted: 23 Dec 2010, 15:46:10 UTC

My fourth Famous model crashed after 300 hours, but my wingman had crashed much earlier.
Tullio
ID: 41345

©2024 cpdn.org