Message boards : Number crunching : Exited with zero status- reason to worry?
Author | Message |
---|---|
Send message Joined: 13 Oct 06 Posts: 60 Credit: 7,893 RAC: 0 |
Well, turned out it wasn't half as bad as I had expected, just a little copy and paste... didn't know it was that easy, my compliments to the developers who made the app so stable and easy to use, and lucky my backup was really brand-new so I didn't have problems with WUs from my other projects... |
Send message Joined: 3 Mar 06 Posts: 96 Credit: 353,185 RAC: 0 |
Glad that turned out OK, Annika :) Also, thanks to Richard Haselgrove for info on the NTS time updates. Question: how many backups should we keep? I mean how far back in time should they go? And if a model crashes, how far back should one go to find a backup that carries on past the crash point? I understand that some models are destined to crash because the initial model parameters eventually lead to ... what?... a divide by zero? an untenable situation? Whatever it's called, it makes some sense to drop back to a backup that is 1 model year old and see if it will carry on. 1 model year doesn't take long to crunch, so if it crashes again then you haven't wasted much time. Grab a new model and start fresh. Or do you? What if you're at 80% complete, it crashes, you drop back 1 model year and it crashes too? And you're not entirely convinced it was one of those "destined to die" models? Then you're tempted to drop back yet another model year and try again because it isn't easy to give up when you're about 80% done. Maybe it crashed because of a one-time glitch in your 'puter and if you just go back far enough you'll get behind the glitch and be able to carry forward to 100%. But every year you drop back and try again is also time you could be crunching a new model. So how far back should one go in an effort to recover? At some point you have to kiss it goodbye and start a new model, but where is that point? |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I keep a couple - just before I copy-and-paste the BOINC directory, I rename the old backup to BOINC_OLD. There's rarely a reason to go back more than a model decade; you'd often be better off running a new model in order to make the best use of processing time. Keep in mind the restart dumps uploaded at 1960, 2000, and 2040 - if you got your model past one of those, the project itself now has a backup which can one day be restarted by someone else. Often it comes down to a personal question - how attached are you to THAT model, and how much time are you willing to invest in attempting to get it going again? For me, a week at most. I won't personally bother with a looper, the odds are too low, but if I thought it was recoverable I'd put in the effort. (No partial models have been reissued yet, but the potential is there since the project has the dumps. When they have enough samples they may run a few variants with different environmental forcings from 2000-2080, for example.) I'm a volunteer and my views are my own. News and Announcements and FAQ |
Send message Joined: 13 Oct 06 Posts: 60 Credit: 7,893 RAC: 0 |
I've understood that the error message the model crashes with is important, so before deciding whether to use a backup I guess it's always a good idea to ask the experienced users or someone from the staff what the chances are. They probably have a good general idea which models are still worth the crunching time and which aren't... Personally, I don't think doing the work of weeks a second time sounds like a good idea, though I of course agree that it's frustrating to lose a model. Of course a few days are a different matter... but only if the model has a chance of getting through alright on the second attempt. But as I said, I'd always ask for a second (and, if possible, 3rd and 4th...) opinion on that. Of course those concerns didn't apply in my case, since BOINC or CPDN were in no way the reason that I had to use the backup (not sure what was, though, I suspect remains of an old antivirus program, perhaps combined with generally too many software changes and/or a tricky graphics card driver... anyway, Windows completely died on me...), so the chances that the model would be OK after using the backup were no smaller than with any other model. Luckily the same procedures for creating and restoring the backups applied :-) and luckily there is such a thing as Linux Live CDs, or I would have had to use a three-day-old backup, don't even wanna think about all the other data on that HD... I recommend those to everyone here ;-) at least one for emergencies... Really nice for troubleshooting and making backups. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,698,338 RAC: 10,100 |
I suppose it depends on the circumstances surrounding the crash. Personally, if the machine had kept running smoothly, other BOINC projects were still OK, and there was nothing strange in the event log, then I'd assume it was the model parameters and start a new one straight away. If, on the other hand, I was doing some hardware installation or a major software upgrade (my CPDN box is due to be my Vista testbed, so that's on the horizon), then I hope I would remember to take a special backup first. If CPDN died during the upgrade, I'd assume it was my fault, and give it another try. Which leaves the situation where you get up in the morning and find that it's gone eerily quiet overnight... then guidance from the admins about which models are worth resurrecting would be helpful. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Models not worth resurrecting: * Loopers after they've looped 3 times so you're sure they really are loopers - see link #4 in the Crashes README http://www.climateprediction.net/board/viewforum.php?f=36&sid=5776557775d3aac2d2aee6a39588b452 * Models that crash with error messages including 'negative value created', 'negative pressure created' or similar. This can be assumed to mean that the model's initial parameter values were unviable. If you restore your backup, they'll crash at or about the same point again. But other -1, -107 and -161 error codes usually mean that there was some event or condition on the computer that caused the model to crash, and the cruncher should be able to avoid most of these events. Links #5 and #6 in the same README are full of useful ideas about keeping models safe. Cpdn news |
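
For anyone who would rather script the advice in this thread than do it by hand, here is a rough Python sketch of two things discussed above: the two-generation backup rotation (rename the previous copy to BOINC_OLD, then copy the BOINC directory) and the "is it worth restoring?" rule of thumb. It is only an illustration, not anything the project provides. The `Crash` class, the function names and the `BOINC_BACKUP` folder name are made up for the example, the phrase list only covers the two messages quoted above (not the "or similar" ones), and it assumes BOINC has been shut down before the copy starts.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path

# Error phrases quoted in the thread as signs of unviable initial parameters.
UNVIABLE_PHRASES = ("negative value created", "negative pressure created")
# The thread suggests letting a suspected looper loop 3 times before giving up.
LOOP_LIMIT = 3


@dataclass
class Crash:
    """Hypothetical record of a crash, filled in by hand from the logs."""
    message: str
    loop_count: int = 0


def worth_restoring(crash: Crash) -> bool:
    """Rule of thumb from the thread: skip confirmed loopers and models
    whose error message points at unviable initial parameters."""
    if crash.loop_count >= LOOP_LIMIT:
        return False
    text = crash.message.lower()
    return not any(phrase in text for phrase in UNVIABLE_PHRASES)


def rotate_backup(boinc_dir: Path, backup_root: Path) -> Path:
    """Keep two generations: BOINC_BACKUP (newest) and BOINC_OLD (previous).

    Assumes BOINC is not running while the copy is made.
    """
    backup_root.mkdir(parents=True, exist_ok=True)
    newest = backup_root / "BOINC_BACKUP"  # hypothetical folder name
    old = backup_root / "BOINC_OLD"        # name used earlier in the thread
    if newest.exists():
        if old.exists():
            shutil.rmtree(old)             # drop the oldest generation
        newest.rename(old)                 # previous backup becomes BOINC_OLD
    shutil.copytree(boinc_dir, newest)     # fresh copy of the BOINC directory
    return newest
```

Calling `rotate_backup(Path("BOINC"), Path("backups"))` before risky maintenance leaves the two most recent copies on disk, and `worth_restoring(Crash("negative pressure created"))` returns `False`, matching the advice that such a model will crash again at about the same point. The other error codes (-1, -107, -161) are deliberately not treated as fatal here, since the thread says those usually come from the computer rather than the model.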