climateprediction.net (CPDN) home page
Thread 'TIME FOR BACKUPS AGAIN?'

Message boards : Number crunching : TIME FOR BACKUPS AGAIN?

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 59139 - Posted: 7 Dec 2018, 18:10:55 UTC

Now that we are seeing the return of longer (90+ days) models, I think it is time to revisit the subject of making periodic backups. Three months is a long time. Running a model for that long is a big commitment of time and resources. Losing a model that you have been running for several weeks or months is hard.

Things happen. Power failures, freeze-ups, and unexpected reboots (Win10) are only some of the things that can cause a good WU to fail. A recent backup gives you a second chance. Backups are easy and quick to make. If you don’t know how to make one, there are instructions in a thread in the Number crunching section.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 59140 - Posted: 7 Dec 2018, 21:17:18 UTC - in response to Message 59139.  

In days of yore, we ran Coupled Models of the same kind on boxes with, typically, two cores. Model types could be managed, some allowed, others not allowed. We do not have the capability to micro-manage among the several types and lengths of most work offered -- except setting 'No New Tasks' after one is caught. That is a severe limitation on machines with 4+ cores.

I'm open to correction if I've overlooked the obvious.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59141 - Posted: 8 Dec 2018, 1:38:32 UTC

The longest model that I've had recently was about 3 weeks.
(My current models are going to run for a whole 2 days. :) )

My computers appear to be very stable, with no model failures, even from a power failure.
The only problem recently has been failing hard disks, and that was gradual.
Except for one, where the HP system insisted that the HD be replaced at once.
And then I needed a disk with the OS (Win 7) on it, which I didn't have, so the backup/recovery disk that I'd made was useless.

The last time that I made/used backups was when we ran the "Slab models".
As Astro said, things were simpler then.

If anyone does go for backups, perhaps they could make some posts about how it's working out.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 59142 - Posted: 8 Dec 2018, 9:02:51 UTC - in response to Message 59141.  

If I get any really long models on Linux I will think about backups, as in the past I have often seen tasks fail after a computer restart combined with a kernel update. It has happened often enough to make me think it is significant, so going back to a backup would, I suspect, have to be combined with running the older kernel until the task(s) in question finish. With shorter tasks I just wait until I have no work and then perform the update, which seems to happen regularly enough with Linux.

I might try making a backup, letting the models finish, and then, when out of work and with interweb access disabled, restoring the backup just for interest. Then I can see if there are any problems before I need it.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59143 - Posted: 8 Dec 2018, 15:22:20 UTC - in response to Message 59142.  

[quote]If I get any really long models on Linux I will think about backups, as in the past I have often seen tasks fail after a computer restart combined with a kernel update. It has happened often enough to make me think it is significant, so going back to a backup would, I suspect, have to be combined with running the older kernel until the task(s) in question finish. With shorter tasks I just wait until I have no work and then perform the update, which seems to happen regularly enough with Linux.[/quote]


My machine does backups of "everything" every morning (at 3 AM): daily to a 1-terabyte external hard drive, and weekly to magnetic tape. I have been running ClimatePrediction since 5 Aug 2004 on four different machines; at one point I was running three machines at once, but now just one. The oldest one was a Pentium with two hard drives. The next one had two Pentium III chips and two 10,000 rpm SCSI hard drives. The next one had two 3.06 GHz hyperthreaded Xeon chips and six 10,000 rpm SCSI hard drives. These machines all ran one version or another of Red Hat Linux. They ran 24/7 for about 10 years each. They have all had APC uninterruptible power supplies.

In all that time, I have had only one hard drive fail. One machine failed when power came back on after the cleanup from storm Sandy was completed. The power flicked on and off in very rapid succession and the UPS could not manage that. I think I lost some work units then because I had to buy a new computer and I installed a more up-to-date version of Linux then.

Since I seldom get CPDN work units anymore, and those I do get are short, it no longer matters.

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 59144 - Posted: 8 Dec 2018, 16:33:15 UTC - in response to Message 59141.  

[quote]The longest model that I've had recently was about 3 weeks.
(My current models are going to run for a whole 2 days. :) )[/quote]

Right now I have 5 models running on 2 different machines (all batch 764) that are estimated to run more than 100 days each. At least both are running Win7. That is a pretty stable OS and I can control the update times and reboots.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59145 - Posted: 8 Dec 2018, 19:53:41 UTC

Hi Jim

If the 100 days is what BOINC is saying, what do you get with a manual calc using Progress %, and Elapsed time?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 59146 - Posted: 9 Dec 2018, 9:47:16 UTC - in response to Message 59145.  

[quote]If the 100 days is what BOINC is saying, what do you get with a manual calc using Progress %, and Elapsed time?[/quote]

My biggest discrepancy is on a sam25 at the moment: 69.5% complete in 18 days 14 hours, with BOINC estimating 31 days 6 hours to go. I can't remember now what the initial estimate was exactly, but I seem to remember one being well over 100 days.

That is on Wine, which misreports the GFLOPS of the computer for some reason that I have probably not learned enough to understand yet, so for a new model type I halve the estimate to get something closer to reality.

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 59147 - Posted: 9 Dec 2018, 15:55:55 UTC - in response to Message 59145.  

[quote]Hi Jim

If the 100 days is what BOINC is saying, what do you get with a manual calc using Progress %, and Elapsed time?[/quote]


The models seem to be progressing at about 0.67% per day. At 19d 18h one of them is 13.283% complete, which works out to roughly 149 days in total. That means that the initial estimate of 102 days is actually low.
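
For anyone wanting to repeat that manual calculation, here is a minimal Python sketch. It assumes progress is roughly linear over the whole run (which may not hold for every model type), and the function name is made up for illustration. It uses the figures above: 19d 18h elapsed and 13.283% done.

```python
def project_run_time(elapsed_days, percent_done):
    """Linear projection: return (total_days, remaining_days).

    Assumes the model progresses at a constant rate, which is only
    an approximation.
    """
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in the range (0, 100]")
    total_days = elapsed_days * 100.0 / percent_done
    return total_days, total_days - elapsed_days


# Figures from the post: 19 days 18 hours elapsed, 13.283% complete.
elapsed = 19 + 18 / 24
total, remaining = project_run_time(elapsed, 13.283)
print(f"projected total: {total:.1f} days, remaining: {remaining:.1f} days")
```

That gives a projected total of roughly 149 days, with about 129 still to go.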

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59148 - Posted: 9 Dec 2018, 19:44:27 UTC

5 months. Not good.
I've just raised the issue with management.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59149 - Posted: 9 Dec 2018, 22:06:39 UTC

OK, it'll be looked into in the next day or so.
For now, there was no intention of making run times this long.

For Linux + Wine, it looks like just a matter of digging down in the C:\ drive to the BOINC folder, and copying 4+ Gigs to an external drive.
In theory. :)
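
A minimal Python sketch of that copy step, under some assumptions: the BOINC client has been shut down first so nothing changes mid-copy, and both paths in the usage comment are placeholders rather than real locations (the actual BOINC data folder depends on the install). The function name is hypothetical too.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_dir(data_dir, backup_root):
    """Copy the whole BOINC data folder into a new timestamped folder.

    Assumes BOINC is not running while the copy is made.
    Returns the path of the new backup folder.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"no such folder: {data_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"boinc-backup-{stamp}"
    # copytree creates dest (and any missing parents) and fails if dest exists.
    shutil.copytree(data_dir, dest, symlinks=True)
    return dest


# Example call (placeholder paths, adjust to your own setup):
# backup_boinc_dir("/path/to/BOINC/data", "/path/to/external/drive")
```

Each run lands in its own dated folder, so an older backup is never overwritten by a newer one.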

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59150 - Posted: 9 Dec 2018, 22:39:06 UTC - in response to Message 59149.  
Last modified: 9 Dec 2018, 22:41:31 UTC

[quote]For now, there was no intention of making run times this long.[/quote]

I was just thinking that an i7-9700K (Coffee Lake, with 8 full cores and no hyperthreading) would be a very nice CPU for this project.
But they should hopefully tell us what they have in mind, so we can plan accordingly. Not all projects are for all machines anyway.

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 59151 - Posted: 10 Dec 2018, 5:04:10 UTC - in response to Message 59148.  

These very long WUs might be fine for desktop machines with 4.0 GHz processors, but for 2.66 GHz laptops they are a problem. Most of the people running this project these days didn’t sign up for models that take nearly half a year. I’m an old timer running CP since the days of the BBC experiment, so I do know about long models. One thing I do know is that with models that long it is a very good idea to make backups.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 59152 - Posted: 10 Dec 2018, 5:09:46 UTC

Back in the old days, one could choose which model types a PC could run under preferences. But since almost everything is under the wah2 app now, you might get a 2 CPU day or a 60 CPU day model depending on the region and model months run. And you have no control over it. Of course running full up on a hyperthreaded/SMT PC really slows down individual model completion times compared to not running that way. On short models that's not such a big deal. On long models it really stretches it out.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 59153 - Posted: 10 Dec 2018, 8:16:31 UTC - in response to Message 59149.  

[quote]For Linux + Wine, it looks like just a matter of digging down in the C:\ drive to the BOINC folder, and copying 4+ Gigs to an external drive.[/quote]

As you say, "In Theory." The interesting bit, which I can't remember, is how to restore just the crashed task to a pre-crash state if one out of four or more tasks crashes. I vaguely remember also that there can be problems if the restore point is prior to a trickle up.

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 59154 - Posted: 10 Dec 2018, 20:12:14 UTC

When running UNIX Version 5 on an Onyx Computer in the years 1980-85 we made incremental backups every week because Winchester disks were very fragile. We made backups on tape with the "dump" and "restore" commands from Berkeley UNIX. All this is just history.
Tullio

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59155 - Posted: 10 Dec 2018, 20:44:39 UTC - in response to Message 59154.  

[quote]We made backups on tape with the "dump" and "restore" commands from Berkeley UNIX. All this is just history.[/quote]


Almost as old is what I do now. I still have a VXA-2 tape drive (and a spare on the shelf) on my desktop. I back up with this in my cron.weekly file. You can guess what is in the variables.

# Let us pick block size.
$MT -f $TAPE_DRIVE setblk 0

$FIND $FILES -xdev -print | $CPIO $CPIO_O_OPTIONS > $TAPE_DRIVE 2>> $REPORT

I do something similar daily onto an external 1-terabyte hard drive.
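
For anyone without cpio or a tape drive, roughly the same idea can be sketched in Python. This is not the script above: it swaps cpio for a tar archive written to an ordinary file, and the function name and any paths passed to it are made up for illustration. Like find -xdev, it stays on one filesystem by comparing device numbers.

```python
import os
import tarfile


def archive_one_filesystem(root, archive_path):
    """Write everything under root to a tar file without crossing into
    other mounted filesystems. Returns the number of entries added below
    root.

    Unlike find -xdev, directories that are mount points are skipped
    entirely rather than listed without being descended into.
    """
    root_dev = os.lstat(root).st_dev
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        tar.add(root, recursive=False)  # the top-level directory entry itself
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune in place so os.walk never descends into another device.
            dirnames[:] = [
                d for d in dirnames
                if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
            ]
            for name in dirnames + filenames:
                tar.add(os.path.join(dirpath, name), recursive=False)
                count += 1
    return count
```

The archive should be written somewhere outside the tree being backed up, for example onto the external drive.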

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 59156 - Posted: 11 Dec 2018, 5:03:19 UTC - in response to Message 59153.  

[quote][quote]For Linux + Wine, it looks like just a matter of digging down in the C:\ drive to the BOINC folder, and copying 4+ Gigs to an external drive.[/quote]

As you say, "In Theory." The interesting bit, which I can't remember, is how to restore just the crashed task to a pre-crash state if one out of four or more tasks crashes. I vaguely remember also that there can be problems if the restore point is prior to a trickle up.[/quote]

I don’t think that in Windows it is practical to restore just one model from a backup. You just have to bite the bullet and accept that all running models will be put back to the point they were at when the backup was made. That’s why it is important to make frequent backups. I make one every two or three days. Trickles have never been a problem. Duplicate trickles are just rejected.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 59157 - Posted: 11 Dec 2018, 7:44:01 UTC - in response to Message 59156.  
Last modified: 11 Dec 2018, 7:47:53 UTC

[quote]I don’t think that in Windows it is practical to restore just one model from a backup.[/quote]


I thought it could be done with some editing of client_state.xml but I don't remember ever seeing the details of how to do it posted.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59199 - Posted: 18 Dec 2018, 21:15:14 UTC - in response to Message 59157.  

[quote][quote]I don’t think that in Windows it is practical to restore just one model from a backup.[/quote]

I thought it could be done with some editing of client_state.xml but I don't remember ever seeing the details of how to do it posted.[/quote]


I think that was way back when the BBC experiment was running, and that it was on the now defunct PHP server.

There are two parts to restoring:

1) Get the last contact number from the current client_state.xml, and put it into the old backup. (This stops BOINC from thinking that this is coming from a new computer.)

2) Do a massive amount of client_state.xml editing to isolate the task that's being restored.

I never had to do the second part way back when the long runs were around, so I don't know what would be needed.
Especially as now a lot of people run multiple projects on computers with large numbers of processors.
A lot of recent sign-ups have 32 processor machines, and there is a 64 processor machine that's been crashing tasks for some time.

And if a procedure was going to be posted here, it would have to allow for these people.
Unless the advice was "only for experienced people", in which case I think that it would be simpler to just restore everything, and only do point 1) above.
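
Point 1) above might be sketched in Python roughly as follows. Big caveat: the post does not name the field that holds the "last contact number". This sketch assumes it is the per-project rpc_seqno value in client_state.xml, and that each project sits in its own <project> block containing its URL. Both are assumptions to check against your own file, and the function name is made up. It works on text rather than an XML parser so that the rest of the file is left exactly as it was. Only try this with BOINC stopped and on a copy of the backup.

```python
import re


def copy_project_field(current_xml, backup_xml, project_url, tag="rpc_seqno"):
    """Return backup_xml with <tag> in the matching <project> block set to
    the value found in current_xml.

    Assumption: tag="rpc_seqno" is the "last contact number"; the post does
    not confirm this.
    """
    def find_project(text):
        for match in re.finditer(r"<project>.*?</project>", text, flags=re.DOTALL):
            if project_url in match.group(0):
                return match
        raise ValueError(f"no <project> block mentions {project_url}")

    field = re.compile(
        rf"(<{re.escape(tag)}>)(.*?)(</{re.escape(tag)}>)", flags=re.DOTALL
    )

    current_field = field.search(find_project(current_xml).group(0))
    if current_field is None:
        raise ValueError(f"<{tag}> not found in the current file")
    value = current_field.group(2)

    old_block = find_project(backup_xml)
    if field.search(old_block.group(0)) is None:
        raise ValueError(f"<{tag}> not found in the backup file")
    new_block = field.sub(
        lambda m: m.group(1) + value + m.group(3), old_block.group(0), count=1
    )
    # Splice the edited block back in; everything else stays byte-for-byte.
    return backup_xml[:old_block.start()] + new_block + backup_xml[old_block.end():]
```

Read both files into strings, call the function with the project's URL as it appears in the file, and write the result over the backup copy of client_state.xml before restoring.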

And if there ARE lots of failures due to the long run times, then perhaps it will get the researchers thinking about ways to make future models shorter.
So that works too.

©2024 cpdn.org