Model crashed with 1 trickle to go

Author	Message
dajashby Joined: 1 Sep 04 Posts: 55 Credit: 17,223,688 RAC: 967	Message 14314 - Posted: 12 Jul 2005, 22:03:28 UTC The result ended with this stderr-out: 4.13 process got signal 11 3 11. I believe this occurred while the project was down on the 10th or 11th of July. I don't know how to tell whether the model actually completed. Is there any way we can rescue this? The computer concerned is currently running a newly downloaded model. It is running BOINC 4.13 (yes, I know, I should upgrade ...). Host number is 73339. Model name is 27cq_300123887_0 (Result 720418).

dajashby Joined: 1 Sep 04 Posts: 55 Credit: 17,223,688 RAC: 967	Message 14315 - Posted: 12 Jul 2005, 22:08:20 UTC When I posted, the markup codes got left out: "4.13" "process got signal 11" "" "3" "11" "" "" I hope this works...

dajashby Joined: 1 Sep 04 Posts: 55 Credit: 17,223,688 RAC: 967	Message 14316 - Posted: 12 Jul 2005, 22:13:25 UTC Um, no it didn't... Do I have to retype this, or can you work it out?
core_client_version - 4.13
message - process got signal 11
active_task_state - 3
signal - 11
Bloody computers, I hate them...
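The four fields quoted above were originally wrapped in markup that the forum stripped. As a minimal sketch of how such a fragment might be read, assuming the tag names match the field names given in the post (the `stderr_out` wrapper element, the sample string, and both function names are hypothetical, reconstructed only for illustration):

```python
import signal
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the stripped markup. Only the four field
# names and their values come from the post; the <stderr_out> wrapper is
# an assumption made so the fragment parses as a single XML document.
SAMPLE = """<stderr_out>
<core_client_version>4.13</core_client_version>
<message>process got signal 11</message>
<active_task_state>3</active_task_state>
<signal>11</signal>
</stderr_out>"""


def parse_stderr(xml_text):
    """Return the child elements of the fragment as a {tag: text} dict."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def signal_name(number):
    """Map a signal number to its symbolic name on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "unknown"


if __name__ == "__main__":
    fields = parse_stderr(SAMPLE)
    print(fields["core_client_version"], signal_name(int(fields["signal"])))
```

On Linux, signal 11 resolves to SIGSEGV, a segmentation violation, which says the model process crashed rather than finished.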

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 14317 - Posted: 12 Jul 2005, 22:54:23 UTC To "save" the crashed model, you'll need a backup of the BOINC folder from before the crash. If you have one, and you also want to save the current model, suspend BOINC, back up the current BOINC folder by moving it somewhere else, copy the old folder back to where BOINC resides (possibly under Programs), then reboot. BOINC should then restart, along with hadsm etc. BEFORE this model finishes, you will need to prevent the download of yet another parameter set. So go into your General preferences and set "Leave at least" to a number way in excess of your hard disk size. Then do an Update so that the server tells your computer about the changes. After you have finished the model and uploaded the 5 zips, assuming that it doesn't fail again, save the remaining model data (to a CD/DVD if necessary), delete the BOINC folder, and copy back the model on which you are currently working. Then reboot. At some point after this, you can reset "Leave at least" to what it is now.
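The folder swap described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `swap_in_backup` and all three paths are hypothetical, and BOINC is assumed to be suspended or stopped before the function runs.

```python
import shutil
from pathlib import Path


def swap_in_backup(live_dir, backup_dir, parked_dir):
    """Park the live BOINC folder, then copy the older backup into its place.

    Assumption: BOINC is already suspended or stopped. This sketch does not
    check for a running client and does not touch any preferences.
    """
    live, backup, parked = Path(live_dir), Path(backup_dir), Path(parked_dir)
    if parked.exists():
        raise FileExistsError(f"{parked} already exists; refusing to overwrite")
    if not backup.is_dir():
        raise FileNotFoundError(f"no backup folder at {backup}")
    shutil.move(str(live), str(parked))  # set the current model aside
    shutil.copytree(backup, live)        # restore the pre-crash folder
    return parked
```

Once the rescued model has finished and uploaded, the reverse step is to delete the restored folder and move the parked one back. The sketch leaves the backup itself in place, so a failed attempt can be retried from the same copy.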

dajashby Joined: 1 Sep 04 Posts: 55 Credit: 17,223,688 RAC: 967	Message 14318 - Posted: 12 Jul 2005, 23:47:06 UTC - in response to Message 14317. Les, thanks for the advice. As luck would have it, I do have a backup of the folder on that machine. It's nearly two weeks old, but that's better than nothing, I guess. (I recently switched Linux distros on the computer, which is why I have the backup.) It occurs to me to wonder exactly what I would need to back up to capture the current state of my climate prediction project, if I were to do this on a regular basis? I mean, I have 5 computers running BOINC, amounting to gigabytes on gigabytes of data... Derrick

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 14319 - Posted: 13 Jul 2005, 0:26:53 UTC After a model completes, SOME of the data (about 7 Megs, I think) is uploaded for the researchers to examine. About 330 Megs remains on your computer, where you can examine it in detail with cpview or the Advanced Viz. If the researchers find your data interesting, they can get you to send them the rest. Not sure how. But there is no need for completed models to remain in the BOINC folder. If you move them outside it, there is a lot less to back up each time. You can also move them to external storage, which is what I do. Because of the complex way that bits of data about a model are stored (xml files, zip files, slots), if you back up at all, you should do so for the ENTIRE BOINC folder and ALL its sub-folders. Life gets interesting if you are running multiple projects on ALL of your computers. I only run CPDN, so I can't advise on the best way to handle it. Except, perhaps, to have one machine just for CPDN. As to when to back up, perhaps a trickle or two before phase change, which is when some people seem to be getting crashes.
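A regular backup of the ENTIRE folder, as recommended above, could be sketched like this. The function name `archive_boinc_folder`, the archive naming scheme, and the paths are all hypothetical; the sketch assumes BOINC is suspended while the archive is written and that the destination lies outside the BOINC folder.

```python
import shutil
import time
from pathlib import Path


def archive_boinc_folder(boinc_dir, dest_dir, stamp=None):
    """Write the whole BOINC folder, sub-folders included, to one .tar.gz.

    Assumptions: dest_dir is outside boinc_dir, and BOINC is not writing
    to the folder while the archive is being created.
    """
    boinc = Path(boinc_dir).resolve()
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    base = dest / f"boinc-backup-{stamp}"
    # root_dir/base_dir keep the folder name as the top-level archive entry.
    archive = shutil.make_archive(
        str(base), "gztar", root_dir=str(boinc.parent), base_dir=boinc.name
    )
    return Path(archive)
```

Moving completed models out of the folder first, as the post suggests, keeps each archive small; the timestamp in the file name keeps older backups from being overwritten.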

Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 14320 - Posted: 13 Jul 2005, 0:34:26 UTC Signal 11 is a segmentation violation :-/ You could try to finish it on a different computer; that worked for me on one Windows PC that didn't get past trickle 24. Maybe a Windows boot and finishing the last few timesteps with Windows BOINC/CPDN would work too. If there is a core dump of the crash, that might be helpful for the developers.

dajashby Joined: 1 Sep 04 Posts: 55 Credit: 17,223,688 RAC: 967	Message 14322 - Posted: 13 Jul 2005, 2:03:41 UTC - in response to Message 14319. Les, thanks again. I backed up the old models to CD, backed up the newly downloaded model to a safe place, and restored my backup of the crashed one. I have to redo about 100,000 timesteps, which is about a week on this PIII. Still, I think this is better than losing the whole model. At the same time I installed the 4.43 BOINC client, and it is telling me that the computer is "overcommitted", followed by "nearly overcommitted". I saw a post on the forums to the effect that this is due to the client miscalculating time to completion, but is only a problem if you run multiple projects. Like you, I only run CPDN. Is this a problem, I wonder? Derrick

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 14323 - Posted: 13 Jul 2005, 4:23:29 UTC The newer versions of BOINC have become complicated. This page from the Berkeley site (http://boinc.berkeley.edu/sched.php) explains about scheduling. I wouldn't worry too much. It will settle down when it's been running for a bit. The message is mainly for multi-project computers, and just means that it's got its hands full and won't be downloading any more work units for a while.

dajashby Joined: 1 Sep 04 Posts: 55 Credit: 17,223,688 RAC: 967	Message 14324 - Posted: 13 Jul 2005, 6:48:06 UTC - in response to Message 14323.
> The newer versions of BOINC have become complicated. This page from the Berkeley site (http://boinc.berkeley.edu/sched.php) explains about scheduling.
Yeah, that explained things really clearly. I'm actually running the Sulphur model beta as well as standard CPDN on my P4 Windows box. I think I almost understand what's going on.
> I wouldn't worry too much. It will settle down when it's been running for a bit.
Les, I'll take your word for it.

Helmer Bryd Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928	Message 14572 - Posted: 21 Jul 2005, 16:59:14 UTC I had the same problem on my A64, with "signal 11" at the phase shift between 2 and 3, and did four retries from backups without any luck. Since then the machine has done a successful run, so I think the box is OK.