climateprediction.net (CPDN) home page
Thread 'MAKING BACKUPS???'

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 45324 - Posted: 15 Dec 2012, 5:10:12 UTC

I would just like to revisit the subject of making backups of the Hadcm3n WUs. Several people have stated that it is best to download the WUs and just let them run, never exiting them to make backups. I think this is wrong.

The reason for this is that 3 days ago one of the Hadcm3n WUs (with more than 400 hours invested in it) reached the 50% mark and crashed. The computer was running unattended at that time and doesn't seem to have been doing anything other than crunching the 2 models on the machine. I have taken the usual precautions, such as exempting BOINC from the AV scans.

According to the "messages", the last thing the WU did before crashing was send the second decadal zip file. I am sure that the WU crashed as it tried to restart the crunching of the third decade.

Fortunately, I still make backups. I was able to restore the WU from a backup that was only 2 days old and the WU is now past the point at which it crashed. That backup saved me from losing 400 hours of crunching. I for one will continue to make backups.

P.S. Someone might want to talk to the programmers about doing something about this tendency of the models to crash as they restart after sending the decadal zip files.

ID: 45324
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45325 - Posted: 15 Dec 2012, 7:06:51 UTC - in response to Message 45324.  

There appear to be several failure modes for that model type, but not all apply to each model. Not all of the 'whys and wherefores' are known.

But one of the ways to cause a failure is exiting to make a backup.
Been there, done that.

And the way around that is to watch as the computer is asking for work.
If it gets a hadcm3, then Suspend the model in the Tasks tab before all the files download. That way it can't start.
Then Exit and make a backup.
Start the model.

The backup from this process will take a loooong time to re-do if it's needed, but it is fairly well guaranteed to work.
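A minimal sketch of the "Exit and make a backup" step above, assuming Python is available and BOINC has already been exited so nothing is writing to its files. The function name `backup_boinc_dir` and both paths are illustrative only and are not part of BOINC; the location of the BOINC data directory differs between installations.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_dir(data_dir, backup_root):
    """Copy the whole BOINC data directory into a new timestamped folder.

    Assumes the BOINC client has already been exited, so no file is
    being written while the copy runs.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"BOINC data directory not found: {data_dir}")

    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"boinc-backup-{stamp}"
    # copytree refuses to write into an existing folder, so an older
    # backup cannot be overwritten by accident.
    shutil.copytree(data_dir, target)
    return target
```

Called as, say, `backup_boinc_dir("/path/to/boinc/data", "/path/to/backups")`, it returns the path of the new backup folder. Keeping each backup in its own timestamped folder means the 't=0' copy taken before the model starts is never replaced by a later one.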

As for the programmers, THEY KNOW ABOUT THIS.
Some of the moderators have done extensive studies of the failures, but the real fix involves source code for files that the Met Office won't part with.
Workarounds have been thought up, and I think that tests are/were done in beta.

PS Remember when I said that 'things were going on'?
And no, I'm NOT going to start blabbing.


ID: 45325
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 45326 - Posted: 15 Dec 2012, 8:00:55 UTC - in response to Message 45324.  
Last modified: 15 Dec 2012, 8:04:26 UTC

Can't remember who but I read recently from one of the moderators that they run these tasks in a machine not connected to the internet until they are completed, at which point all the zips can be uploaded to the server at once. I don't have this computer running 24/7, so it makes sense to take backups, and I have on one occasion had a hadam3cn carry on successfully past the point where it failed at either the first or second quartile. Others have gone on to fail at the same point.

Edit: And just realised Les has covered everything else I said much better.
ID: 45326
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 45329 - Posted: 15 Dec 2012, 18:43:19 UTC - in response to Message 45326.  

Can't remember who but I read recently from one of the moderators that they run these tasks in a machine not connected to the internet until they are completed at which point all the zips can be uploaded to the server at once. ...

That may have been me. The reason for running off-line is precisely so that the model can be re-run from the 't=0' backup Les describes. Having originally taken backups the normal way and realised that the backups were causing crashes, I convinced myself that the restored model was not producing the same results as if the model had been allowed to continue. (I checked this extensively.) The method does not guarantee complete immunity from crashes, but does reduce those crashes to an irreducible minimum.

I do not for a moment recommend this method to anyone as it is wasteful when running multiple projects or cores. I only run CPDN, don't mind the waste, and only now run one HADCM3N on machines where 'one at a time for a long time' is possible.

Oh yes: it also makes getting work in the current queue-less project state almost impossible as you need to watch for a model download. Horses for courses.
ID: 45329
Alex Plantema
Joined: 3 Sep 04
Posts: 126
Credit: 26,610,380
RAC: 3,377
Message 45338 - Posted: 16 Dec 2012, 22:25:38 UTC

My tasks only crash when they crash on other people's computers as well. Since I found that restoring a backup creates duplicate computer IDs on all projects on my computer, I don't make backups anymore.
ID: 45338
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 45339 - Posted: 17 Dec 2012, 0:46:41 UTC - in response to Message 45338.  

My tasks only crash when they crash on other people's computers as well. Since I found that restoring a backup creates duplicate computer IDs on all projects on my computer, I don't make backups anymore.

The model type being considered here is HADCM3N, which is particularly vulnerable at the decade points, of which there are four given that it is a 40-year model. The computer linked to the account from which you posted has run only one of those models: happily it succeeded, though two other computers failed to finish in that work unit, including one failure at the 30-year point.

Restoring a backup requires care if a clean record is to be preserved, whether attached to one project or more than one. However, it can be done with a little editing of the project files. BOINC do not themselves support backup in the client, since the projects should write software that can recover from a saved checkpoint. That checkpointing process, present in CPDN, is not always enough protection from errors that may occur. Though it is frustrating for a project such as CPDN that has long runs, I suspect the BOINC position is a sensible one: work unit management on the server side would be impossible if backups could be restored at any time.

PS The queue crept up to six HADCM3N and I managed to get one ...
ID: 45339
tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 45340 - Posted: 17 Dec 2012, 2:37:21 UTC - in response to Message 45339.  

I've discovered that a Solaris Virtual Machine running on my Linux box can be configured to make automatic backups of my home directory. Unfortunately, I found only a BOINC client and a SETI@home app by a developer called Dotsch to run on my Solaris 11.1.
Tullio
ID: 45340
Joe's Climate
Joined: 10 Dec 11
Posts: 11
Credit: 253,758
RAC: 3
Message 45414 - Posted: 6 Jan 2013, 22:59:07 UTC - in response to Message 45340.  

Reading through this thread, I think there would likely be some problems trying to back up BOINC while it's running, similar to the issues with trying to back up an actively running database. Has anyone tried stopping BOINC (when you get to the point that you are going to back up the BOINC directories), then backing up, then restarting BOINC (after you've backed up the BOINC directories)?

...reading through the rest of the thread, I guess it would still have the problem with multiple IDs appearing if you're connected to the internet, so you still have that problem there.

Like JIM indicates, it's a bit of a shame to have 400hrs of time go to waste, but then again, I lean more towards accepting that it's wasted and gone, rather than making myself extra work to back up and restore, then redo a WU just to get back those wasted 400hrs.

Thanks for the warnings about HADCM3N around the decade mark.
Joe
ID: 45414
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 45415 - Posted: 7 Jan 2013, 0:46:59 UTC

I still think backups are very important. I just recently finished a CM that had more than 600 hours invested in it and only 62 hours to go when it crashed. The reason was that the power failed for about 5 seconds due to a storm and the computer went down. That's all it took, the lights flickering on and off for a few seconds.

When I rebooted the computer, the hadam3p WU running on the machine was just fine, but the CM had crashed. Fortunately, I had made a backup only about 8 hours earlier. After doing a restore both WUs finished just fine. 600 hours saved. I would not run the hadcm3n models without making backups. There are too many things that can go wrong.

ID: 45415
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 45417 - Posted: 7 Jan 2013, 9:06:24 UTC

Like Jim, I have backed up CM3 tasks and on one occasion finished after restarting with the backup. I can understand, however, that for many users it is more trouble than it is worth, especially as, more often than not, if the backup is used to try and get a model that has crashed at a decade point to run to completion, it crashes at the same decade point. On the other hand, it can be useful for a power outage crash.
ID: 45417
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 45422 - Posted: 8 Jan 2013, 6:40:36 UTC - in response to Message 45417.  

Like Jim, I have backed up CM3 tasks and on one occasion finished after restarting with the backup. I can understand, however, that for many users it is more trouble than it is worth, especially as, more often than not, if the backup is used to try and get a model that has crashed at a decade point to run to completion, it crashes at the same decade point. On the other hand, it can be useful for a power outage crash.


Good point. The CM3s that crash at a decade point are "usually" not worth the trouble to restart.

The other good point is that if you are going to do a backup you need a REALLY clean shutdown. Like suspend - then stop the manager - then stop the service - and check no files open after each step.
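As a rough stand-in for the "check no files open" step, here is a small sketch that watches the BOINC directory for a few seconds and reports whether anything changed. It is an assumption of mine rather than anything BOINC provides: the names `snapshot` and `is_quiet` are made up, and the check only detects files being written, created, or deleted during the wait, not open file handles as such.

```python
import os
import time


def snapshot(root):
    """Map every file under root to its (size, modification time)."""
    state = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                info = os.stat(path)
            except OSError:
                # File vanished between listing and stat: record that.
                state[path] = None
                continue
            state[path] = (info.st_size, info.st_mtime_ns)
    return state


def is_quiet(root, wait_seconds=5.0):
    """Return True if no file under root changed during the wait."""
    before = snapshot(root)
    time.sleep(wait_seconds)
    return snapshot(root) == before
```

If `is_quiet` returns False after each shutdown step, something is probably still writing to the directory and it is safer to wait before copying. A True result is only evidence, not proof, that the shutdown was clean.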

Even if your WU has reported failure, after a restart it will usually go on to complete and give credit with some odd status. It is worth doing if you think the restart might complete; if it doesn't, no problem.

These CPDN WUs are so huge, totally different from most BOINC projects, but worth the trouble I think.


ID: 45422
