Message boards : Number crunching : MAKING BACKUPS???
Author | Message |
---|---|
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I would just like to revisit the subject of making backups of the Hadcm3n WUs. Several people have stated that it is best to download the WUs and just let them run, never exiting them to make backups. I think this is wrong. The reason is that 3 days ago (with more than 400 hours invested in one of the Hadcm3n WUs) it reached the 50% mark and crashed. The computer was running unattended at the time and doesn't seem to have been doing anything other than crunching the 2 models on the machine. I have taken the usual precautions, such as exempting BOINC from the AV scans. According to the "messages", the last thing the WU did before crashing was send the second decadal zip file. I am sure that the WU crashed as it tried to restart the crunching of the third decade. Fortunately, I still make backups. I was able to restore the WU from a backup that was only 2 days old, and the WU is now past the point at which it crashed. That backup saved me from losing 400 hours of crunching. I for one will continue to make backups. P.S. Someone might want to talk to the programmers about doing something about this tendency of the models to crash as they restart after sending the decadal zip files. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There appear to be several failure modes for that model type, but not all apply to each model. Not all of the 'whys and wherefores' are known. But one of the ways to cause a failure is exiting to make a backup. Been there, done that. And the way around that is to watch as the computer is asking for work. If it gets a hadcm3, then Suspend the model in the Tasks tab before all the files download. That way it can't start. Then Exit and make a backup. Start the model. The backup from this process will take a loooong time to re-do if it's needed, but it's fairly guaranteed to work. As for the programmers, THEY KNOW ABOUT THIS. Some of the moderators have done extensive studies of the failures, but the real fix involves source code for files that the Met Office won't part with. Workarounds have been thought up, and I think that tests are/were done in beta. PS Remember when I said that 'things were going on'? And no, I'm NOT going to start blabbing. Backups: Here |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Can't remember who but I read recently from one of the moderators that they run these tasks in a machine not connected to the internet until they are completed at which point all the zips can be uploaded to the server at once. I don't have this computer running 24/7, so it makes sense to take backups, and I have on one occasion had a hadam3cn carry on successfully past the point where it failed at either the first or second quartile. Others have gone on to fail at the same point. Edit: And just realised Les has covered everything else I said much better. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
Can't remember who but I read recently from one of the moderators that they run these tasks in a machine not connected to the internet until they are completed at which point all the zips can be uploaded to the server at once. ... That may have been me. The reason for running off-line is precisely so that the model can be re-run from the 't=0' backup Les describes. Having originally taken backups the normal way and realised that the backups were causing crashes, I convinced myself that the restored model was not producing the same results as if the model had been allowed to continue. (I checked this extensively.) The method does not guarantee complete immunity from crashes, but does reduce those crashes to an irreducible minimum. I do not for a moment recommend this method to anyone as it is wasteful when running multiple projects or cores. I only run CPDN, don't mind the waste, and only now run one HADCM3N on machines where 'one at a time for a long time' is possible. Oh yes: it also makes getting work in the current queue-less project state almost impossible as you need to watch for a model download. Horses for courses. |
Send message Joined: 3 Sep 04 Posts: 126 Credit: 26,610,380 RAC: 3,377 |
My tasks only crash when they crash on other people's computers as well. Since I found that restoring a backup creates duplicate computer ID's on all projects on my computer, I don't make backups anymore. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
My tasks only crash when they crash on other people's computers as well. Since I found that restoring a backup creates duplicate computer ID's on all projects on my computer, I don't make backups anymore. The model type being considered here is HADCM3N, which is particularly vulnerable at the decade points, of which there are four given that it is a 40 year model. The computer linked to the account from which you posted has run only one of those models: happily it succeeded, though two other computers failed to finish in that work unit, including one failure at the 30 year point. Restoring a backup requires care if a clean record is to be preserved, whether attached to one project or more than one. However, it can be done with a little editing of the project files. BOINC do not themselves support backup in the client, since the projects should write software that can recover from a saved checkpoint. That checkpointing process, present in CPDN, is not always enough protection from errors that may occur. Though it is frustrating for a project such as CPDN that has long runs, I suspect the BOINC position is a sensible one: work unit management on the server side would be impossible if backups could be restored at any time. PS The queue crept up to six HADCM3N and I managed to get one ... |
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
I've discovered that a Solaris Virtual Machine running on my Linux box can be configured to make automatic backups of my home directory. Unfortunately, I found only a BOINC client and a SETI@home app by a developer called Dotsch to run on my Solaris 11.1. Tullio |
Send message Joined: 10 Dec 11 Posts: 11 Credit: 253,758 RAC: 3 |
Reading through this thread, I think there would likely be some problems trying to back up BOINC while it's running, similar to the issues with trying to back up an actively running database. Has anyone tried to stop BOINC (when you get to the point that you are going to back up the BOINC directories), then back up, then restart BOINC (after you've backed up the BOINC directories)? ...Reading through the rest of the thread, I guess it would still have the problem with multiple IDs appearing if you're connected to the internet, so you still have that problem there. Like Jim indicates, it's a bit of a shame to have 400 hrs of time go to waste, but then again, I lean more towards accepting that it's wasted and gone, versus making myself extra work to back up and restore, then redo a WU just to get back that wasted 400 hrs. Thanks for the warnings about HADCM3N around the decade mark. Joe |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I still think backups are very important. I just recently finished a CM that had more than 600 hours invested in it and only 62 hours to go when it crashed. The reason was that the power failed for about 5 seconds due to a storm and the computer went down. That's all it took, the lights flickering on and off for a few seconds. When I rebooted the computer, the hadam3p WU running on the machine was just fine, but the CM had crashed. Fortunately, I had made a backup only about 8 hours earlier. After doing a restore both WUs finished just fine. 600 hours saved. I would not run the hadcm3n models without making backups. There are too many things that can go wrong. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Like Jim, I have backed up CM3 tasks and on one occasion finished after restarting with the backup. I can understand however that for many users it is more trouble than it is worth, especially as more often than not if the backup is used to try and get a model that has crashed at a decade point to run to completion it crashes at the same decade point. On the other hand it can be useful for a power outage crash. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Like Jim, I have backed up CM3 tasks and on one occasion finished after restarting with the backup. I can understand however that for many users it is more trouble than it is worth, especially as more often than not if the backup is used to try and get a model that has crashed at a decade point to run to completion it crashes at the same decade point. On the other hand it can be useful for a power outage crash. Good point. The CM3s that crash at a decade point are "usually" not worth the trouble to restart. The other good point is that if you are going to do a backup you need a REALLY clean shutdown: suspend, then stop the manager, then stop the service, and check that no files are open after each step. Even if your WU has reported failure, after a restart they usually go on to complete and give credit with some odd status. It is worth doing iff you think the restart might complete; if it doesn't, no problem. These CPDN WUs are so huge, totally different from most BOINC projects, but worth the trouble I think. |
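
For anyone who wants to script the "clean shutdown, then copy" routine described in this thread, here is a minimal sketch in Python. It is not a BOINC or CPDN tool, and it does not stop the client for you: it assumes you have already suspended the tasks and shut down the manager and the service by hand. As a rough stand-in for "check no files open", it takes two snapshots of the data directory a few seconds apart and refuses to copy if anything changed in between. The function names (`snapshot`, `is_quiescent`, `backup`) and the directory layout are hypothetical, chosen only for illustration.

```python
import shutil
import time
from datetime import datetime
from pathlib import Path


def snapshot(data_dir: Path) -> dict:
    """Map each file's relative path to its (size, modification time)."""
    snap = {}
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            st = path.stat()
            snap[str(path.relative_to(data_dir))] = (st.st_size, st.st_mtime_ns)
    return snap


def is_quiescent(data_dir: Path, wait_seconds: float = 5.0) -> bool:
    """Return True if no file changed between two snapshots.

    This is only a proxy for "nothing is writing here"; a process that
    holds files open without writing to them will not be detected.
    """
    before = snapshot(data_dir)
    time.sleep(wait_seconds)
    return before == snapshot(data_dir)


def backup(data_dir: Path, backup_root: Path, wait_seconds: float = 5.0) -> Path:
    """Copy data_dir into a new timestamped folder under backup_root."""
    if not is_quiescent(data_dir, wait_seconds):
        raise RuntimeError(
            "Files are still changing; is the client really stopped?"
        )
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{data_dir.name}-{stamp}"
    # copytree creates the target (and any missing parents) itself.
    shutil.copytree(data_dir, target)
    return target
```

Usage would be something like `backup(Path("/path/to/boinc/data"), Path("/path/to/backups"))`, where both paths are placeholders for your own setup and the backup folder sits outside the data directory. Keeping each backup in its own timestamped folder means an older known-good copy is never overwritten by a newer one taken in a bad state. Restoring is a separate matter, and as noted above it needs care to avoid duplicate computer IDs.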
©2024 cpdn.org