Model crashed with 1 trickle to go

Questions and Answers : Unix/Linux : Model crashed with 1 trickle to go

dajashby
Joined: 1 Sep 04
Posts: 55
Credit: 17,223,688
RAC: 967
Message 14314 - Posted: 12 Jul 2005, 22:03:28 UTC

The result ended with this stderr-out.

4.13
process got signal 11

3
11


I believe this occurred while the project was down on the 10th or 11th of July. I don't know how to tell whether the model actually completed. Is there any way we can rescue this? The computer concerned is currently running a newly downloaded model.

Running boinc 4.13 (yes, I know, I should upgrade ...)
Host no is 73339
Model name is 27cq_300123887_0 (Result 720418)


dajashby
Joined: 1 Sep 04
Posts: 55
Credit: 17,223,688
RAC: 967
Message 14315 - Posted: 12 Jul 2005, 22:08:20 UTC

When I posted, the markup codes got left out:

"4.13"
"process got signal 11"
""
"3"
"11"
""

""

I hope this works...

dajashby
Joined: 1 Sep 04
Posts: 55
Credit: 17,223,688
RAC: 967
Message 14316 - Posted: 12 Jul 2005, 22:13:25 UTC

Um, no it didn't...
Do I have to retype this, or can you work it out?

core_client_version - 4.13
message - process got signal 11

active_task_state - 3
signal - 11

Bloody computers, I hate them...

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 14317 - Posted: 12 Jul 2005, 22:54:23 UTC

To "save" the crashed model, you'll need a backup of the BOINC folder from before the crash.
If you have one, and you also want to save the current model: suspend BOINC, back up the current BOINC folder by moving it somewhere else, copy the old folder back to where BOINC normally resides (possibly under Programs), then reboot. BOINC should then restart, along with hadsm etc.

BEFORE this model finishes, you will need to prevent the download of yet another parameter set.
So go into your General preferences and set "Leave at least" to a number way in excess of your hard disk size. Then do an Update so that the server tells your computer about the changes.

After you have finished the model and uploaded the 5 zips (assuming that it doesn't fail again), save the remaining model data (to a CD/DVD if necessary), delete the BOINC folder, and copy back the model on which you are currently working. Then reboot.

At some point after this, you can reset the "Leave at least" to what it is now.
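
The folder swap described above (park the current folder, then copy the old backup into place) could be scripted roughly as follows. This is only a sketch: the function name and all three paths are hypothetical, and it assumes BOINC has already been suspended or shut down before it runs.

```python
import shutil
from pathlib import Path


def swap_in_backup(live: Path, old_backup: Path, parked: Path) -> None:
    """Park the live BOINC folder and put a copy of an older backup in its place.

    All three paths are hypothetical. BOINC must be stopped before calling this.
    """
    if parked.exists():
        raise FileExistsError(f"refusing to overwrite {parked}")
    if not old_backup.is_dir():
        raise FileNotFoundError(f"no backup folder at {old_backup}")
    # Step 1: move the current folder aside so the newly downloaded model is kept.
    shutil.move(str(live), str(parked))
    # Step 2: copy (not move) the old backup into place, so the backup itself
    # survives if the restored model crashes again.
    shutil.copytree(old_backup, live)
```

Copying rather than moving the backup means a further retry is still possible. Swapping back afterwards is the same operation with the parked folder as the source.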


dajashby
Joined: 1 Sep 04
Posts: 55
Credit: 17,223,688
RAC: 967
Message 14318 - Posted: 12 Jul 2005, 23:47:06 UTC - in response to Message 14317.  

Les,

Thanks for the advice. As luck would have it, I do have a backup of the folder on that machine - it's nearly two weeks old, but that's better than nothing, I guess. (I recently switched Linux distros on the computer, which is why I have the backup.)

It occurs to me to wonder exactly what I need to back up to capture the current state of my climate prediction project, if I were to do this on a regular basis? I mean, I have 5 computers running boinc, amounting to gigabytes on gigabytes of data...

Derrick

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 14319 - Posted: 13 Jul 2005, 0:26:53 UTC

After a model completes, SOME of the data (about 7 Megs, I think) is uploaded for the researchers to examine. About 330 Megs remains on your computer, where you can examine it in detail with cpview or the Advanced Viz. If the researchers find your data interesting, they can get you to send them the rest. Not sure how.
But there is no need for completed models to remain in the BOINC folder. If you move them outside it, there is a lot less to back up each time. You can also move them to external storage, which is what I do.

Because of the complex way that bits of data about a model are stored (xml files, zip files, slots), if you back up at all, you should do so for the ENTIRE BOINC folder and ALL its sub-folders.

Life gets interesting if you are running multiple projects on ALL of your computers. I only run CPDN, so I can't advise on the best way to handle it.
Except, perhaps, to have one machine just for CPDN.

As to when to back up, perhaps a trickle or two before phase change, which is when some people seem to be getting crashes.
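
For a regular backup of the entire folder, something along these lines might do. It is a minimal sketch, not a tested procedure for CPDN: the function name and paths are made up for illustration, and it assumes BOINC is suspended while the copy runs so that the files are in a consistent state.

```python
import shutil
import time
from pathlib import Path


def backup_boinc(boinc_dir: Path, backup_root: Path) -> Path:
    """Copy the entire BOINC folder, sub-folders included, to a timestamped folder.

    Both paths are hypothetical examples supplied by the caller.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{boinc_dir.name}-{stamp}"
    # copytree recurses through every sub-folder (slots included) and fails if
    # the target already exists, so an earlier backup is never overwritten.
    shutil.copytree(boinc_dir, target)
    return target
```

Moving completed models out of the BOINC folder first, as suggested above, keeps each copy small.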

Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 14320 - Posted: 13 Jul 2005, 0:34:26 UTC

segmentation violation :-/

You could try to finish it on a different computer. That worked for me on one Windows PC that didn't get past trickle 24.

Maybe booting into Windows and finishing the last few timesteps with Windows BOINC/CPDN would work too.

If there is a coredump of the crash, that might be helpful for the developers.
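
For anyone wondering how "signal 11" in the stderr output maps to a segmentation violation: the number can be looked up with Python's standard `signal` module. The helper name below is just for illustration, and the mapping is platform-dependent, although 11 is SIGSEGV on Linux.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name for a signal number on the current platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a signal known on this platform.
        return f"unknown signal {number}"


print(describe_signal(11))
```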

dajashby
Joined: 1 Sep 04
Posts: 55
Credit: 17,223,688
RAC: 967
Message 14322 - Posted: 13 Jul 2005, 2:03:41 UTC - in response to Message 14319.  

Les,

Thanks again. I backed up the old models to CD, backed up the newly downloaded model to a safe place, and restored my backup of the crashed one. I have to redo about 100,000 timesteps, which is about a week on this PIII. Still, I think this is better than losing the whole model. At the same time I installed the BOINC 4.43 client, and it is telling me that the computer is "overcommitted", followed by "nearly overcommitted". I saw a post on the forums to the effect that this is due to the client miscalculating time to completion, but is only a problem if you run multiple projects. Like you, I only run CPDN. Is this a problem, I wonder?

Derrick
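
As a rough check on that estimate, 100,000 timesteps in about a week is about six seconds per timestep. The sketch below only restates the two figures from the post. The function names are illustrative, and the result is an average that ignores any time the machine spends switched off or busy with other work.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_timestep(timesteps: int, days: float) -> float:
    """Average wall-clock seconds spent on each timestep."""
    return days * SECONDS_PER_DAY / timesteps


def days_to_redo(timesteps: int, secs_per_step: float) -> float:
    """Days needed to redo a number of timesteps at a given average rate."""
    return timesteps * secs_per_step / SECONDS_PER_DAY


rate = seconds_per_timestep(100_000, 7)  # figures quoted in the post
print(f"{rate:.2f} s per timestep")
```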

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 14323 - Posted: 13 Jul 2005, 4:23:29 UTC

The newer versions of BOINC have become complicated. This page from the Berkeley site explains scheduling: http://boinc.berkeley.edu/sched.php
I wouldn't worry too much. It will settle down when it's been running for a bit.
The message is mainly for multi-project computers, and just means that it's got its hands full and won't be downloading any more work units for a while.

dajashby
Joined: 1 Sep 04
Posts: 55
Credit: 17,223,688
RAC: 967
Message 14324 - Posted: 13 Jul 2005, 6:48:06 UTC - in response to Message 14323.  

> The newer versions of BOINC have become complicated. This page from the Berkeley
> site explains about scheduling.

Yeah, that explained things really clearly. I'm actually running the Sulphur model beta as well as standard CPDN on my P4 Windows box. I think I almost understand what's going on.

> I wouldn't worry too much. It will settle down when it's been running for a
> bit.

Les, I'll take your word for it.


Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 14572 - Posted: 21 Jul 2005, 16:59:14 UTC

I had the same problem on my A64, with "signal 11" at the phase shift between 2 and 3, and did four retries from backups without any luck.
Since then the machine has done a successful run, so I think the box is OK.
