Message boards : Number crunching : Computing error
Author | Message |
---|---|
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,829,455 RAC: 5,056 |
Tullio, that's a classic FAMOUS 'INVALID THETA DETECTED' crash: the model makes six attempts at the end to get past the block and then gives up. FAMOUS models cover a long period of time, and it seems to be quite hard to find a set of parameters that will get a model through an entire run. The parameter set your model had wasn't viable in the long term. Another one might be. Iain |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I seem to be getting a lot of models crashing with "Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED." at the moment, running Linux on a dual-core Intel. I had a break from it with 2 models completing, but the past 2 or 3 have all crashed with this error. Dave |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
It's all part of the experiment - start them off and see how far they get. FAMOUS is a bit different to previous experiments, in that some batches of them have extreme values for some of the variables. And those with start year of 599 are spinups; what they'll do is anyone's guess. But those that survive for the full run are then used to start a new series stretching off into the future. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Thanks Les, I had read that somewhere but had forgotten it when I posted. Dave |
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
My third Famous unit is still alive after 81 hours. Tullio |
Send message Joined: 10 Sep 09 Posts: 9 Credit: 253,090 RAC: 0 |
Just had one crash with a "computational error". The log says: "11/27/2010 10:31:25 PM climateprediction.net Output file famous_w3o9_599_200_006751449_0_7.zip for task famous_w3o9_599_200_006751449_0 absent". Nothing locked; it eventually cleared with the trickle-up message. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Yesterday I had two units crash at more or less the same time, seemingly on restarting BOINC. Both had a segmentation error. Presumably this is a problem with my system. I am assuming this is to do with having shut the system down without closing BOINC down first. Dave |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
If a computer is shut down or restarted without first exiting completely from BOINC, sooner or later models will crash. It may depend on whether they were writing to disk at the precise moment of the shutdown. Another danger moment is if models are interrupted while they're post-processing, i.e. generating a file for upload. Fortunately it only takes a moment to exit from BOINC (not just from the BOINC Manager). |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Does anyone know what causes this error? Or more specifically, is there anything I can do to prevent it? I get this error when I have to restart my computer. I shut down BOINC before rebooting. <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... Signal 15 received, exiting... (1485): called boinc_finish SIGSEGV: segmentation violation Stack trace (7 frames): ../../projects/climateprediction.net/famous_6.11_i686-pc-linux-gnu(boinc_catch_signal+0x58)[0x809e59c] [0xb78cd400] ../../projects/climateprediction.net/famous_6.11_i686-pc-linux-gnu[0x804f906] ../../projects/climateprediction.net/famous_6.11_i686-pc-linux-gnu[0x805085a] ../../projects/climateprediction.net/famous_6.11_i686-pc-linux-gnu[0x8050ad6] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6)[0xb7620bd6] ../../projects/climateprediction.net/famous_6.11_i686-pc-linux-gnu(__gxx_personality_v0+0xe1)[0x804c449] Exiting... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Signal 3 received, exiting... (1562): called boinc_finish </stderr_txt> ]]> |
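For anyone puzzling over the three numbers in "process exited with code 193 (0xc1, -63)": they appear to be the same one-byte value written as unsigned decimal, hex, and signed decimal. A minimal Python sketch of that arithmetic follows; the function name `describe_exit_code` is made up for illustration and is not part of BOINC.

```python
def describe_exit_code(code: int) -> str:
    """Show a one-byte exit code as unsigned decimal, hex and signed decimal."""
    unsigned = code & 0xFF  # keep only the low byte
    # Values of 128 and above wrap round to negative numbers as a signed byte.
    signed = unsigned - 256 if unsigned >= 128 else unsigned
    return f"{unsigned} (0x{unsigned:x}, {signed})"


if __name__ == "__main__":
    print(describe_exit_code(193))  # 193 (0xc1, -63)
```

This only shows how the three figures relate to each other; it says nothing about why the model exited.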
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
Which computer is having this problem? It appears BOINC is not exiting properly when the system is shutting down. |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
The computer number is 1110459. The name is AMD4. It has an AMD 9750 quad processor. I am running Ubuntu Linux. I have had this problem before when I shut down (re-booted) the computer without terminating BOINC. So now I terminate BOINC first, then reboot. However, this does not seem to make any difference. When I reboot, the CP models get a computational error. Thanks for helping. Bob |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Are you shutting down both the GUI (manager) AND the client? I like to suspend each model in turn, after checking that it's not just about to checkpoint, then suspend BOINC, and only then exit from BOINC. |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
For some reason, Ubuntu 10 does not stop BOINC when you reboot. I looked into it for a minute, but I can't figure out why. Other distros do. If you used the package manager to install BOINC, then you have to use this command to stop it before rebooting: sudo /etc/init.d/boinc-client stop |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
When I want to reboot, I shut down only the manager. I just tried shutting down the Manager and according to the System Monitor all of the tasks were stopped also. I restarted BOINC and everything is fine. However, I did not reboot. I am not sure, but I suspect not all CP models get a Computational Error when I reboot. Maybe there is a difference between stopping the models and then stopping the Manager as opposed to just shutting down the Manager and by default the models. Maybe on some models the information isn't being checkpointed correctly to enable a restart. Maybe there is a defect in the checkpoint restart logic on some models. Or maybe ??? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
All of the climate models checkpoint at a fixed point in time (where "time" means time within the model's run). They do NOT checkpoint when shut down. When re-started, they resume from the last saved checkpoint, re-processing any work that was done after that checkpoint and then discarded. Shutdown failures are due to the large number of files that the models have open while running. All of these need to be closed cleanly, otherwise they can be corrupted, leaving the checkpoint/restart mechanism without clean data to start up again. Sometimes you get away with it, sometimes you don't. As you're having occasional failures, try at least menu-Suspending BOINC, waiting until it stops all of the models and SAYS Suspended, and THEN menu-Exit from BOINC. |
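The suspend-then-exit order described above could also be scripted before a reboot. The Python sketch below is only an illustration under stated assumptions: it assumes the `boinccmd` command-line tool is installed, on the PATH, and allowed to talk to the running client; `shutdown_plan`, `run_plan` and the 60-second pause are hypothetical choices, and the fixed pause stands in for actually checking that every model says Suspended.

```python
import subprocess
import time


def shutdown_plan() -> list[list[str]]:
    """Return the commands for a suspend-then-exit shutdown, in order."""
    return [
        ["boinccmd", "--set_run_mode", "never"],  # suspend all computation
        ["boinccmd", "--quit"],                   # then ask the client to exit
    ]


def run_plan(plan, settle_seconds=60.0, dry_run=True):
    """Run the plan, pausing after the suspend step so the models can stop.

    With dry_run=True nothing is executed; the commands are only printed.
    """
    executed = []
    for index, cmd in enumerate(plan):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
            if index == 0:
                # Assumed pause; checking the task states would be safer.
                time.sleep(settle_seconds)
        executed.append(" ".join(cmd))
    return executed


if __name__ == "__main__":
    run_plan(shutdown_plan(), dry_run=True)
```

Run as shown, it only prints the two commands. Calling `run_plan(shutdown_plan(), dry_run=False)` would actually issue them, so it is worth trying the dry run first.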
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
I will try the "Suspend" route next time I re-boot. Thanks for the help. |
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
My fourth Famous model crashed after 300 hours, but my wingman had crashed much earlier. Tullio |
©2024 cpdn.org