Message boards : Number crunching : TIME FOR BACKUPS AGAIN?
Author | Message |
---|---|
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Now that we are seeing the return of longer (90+ days) models, I think it is time to revisit the subject of making periodic backups. Three months is a long time. Running a model for that long is a big commitment of time and resources. Losing a model that you have been running for several weeks or months is hard. Things happen. Power failures, freeze-ups, and unexpected reboots (Win10) are only some of the things that can cause a good WU to fail. A recent backup gives you a second chance. Backups are easy and quick to make. If you don’t know how to make one, there are instructions in a thread in the number crunching section. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
In days of yore, we ran Coupled Models of the same kind on boxes with, typically, two cores. Model types could be managed, some allowed, others not allowed. We do not have the capability to micro-manage among the several types and lengths of most work offered -- except setting 'No New Tasks' after one is caught. That is a severe limitation on machines with 4+ cores. I'm open to correction if I've overlooked the obvious. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The longest model that I've had recently was about 3 weeks. (My current models are going to run for a whole 2 days. :) ) My computers appear to be very stable, with no model failures, even from a power failure. The only problem recently has been failing hard disks, and that was gradual. Except for one, where the HP system insisted that the HD be replaced at once. And then I needed a disk with the OS (Win 7) on it, which I didn't have, so the backup/recovery disk that I'd made was useless. The last time that I made/used backups was when we ran the "Slab models". As Astro said, things were simpler then. If anyone does go for backups, perhaps they could make some posts about how it's working out. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
If I get any really long models on Linux, I will think about backups, as in the past I have often seen tasks fail after a computer restart that is combined with a kernel update. It has happened often enough to make me think it is significant, so going back would, I suspect, have to be combined with my choosing to run the older kernel until the task(s) in question finish. With shorter tasks I just wait until I have no work and then perform the update, which seems to happen regularly enough with Linux. I might try doing a backup, letting the models finish, and then, with internet access disabled, trying the backup when out of work, just for interest. Then I can see if there are any problems before I need it. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
If I get any really long models on Linux I will think about backups as I have in the past often seen tasks fail after a computer restart which is combined with a kernel update. It has happened often enough to make me think it is significant so going back would I suspect have to be combined with my choosing to run the older kernel till the task(s) in question finish. With shorter tasks I just wait till I have no work and perform the update which seems to happen regularly enough with Linux. My machine does backups every morning (as in 3AM) of "everything." Daily to a 1 terabyte external hard drive, and weekly to magnetic tape. I have been running ClimatePrediction since 5 Aug 2004 on four different machines; at one point I was running three machines at once, but now just one. The oldest one was a Pentium with two hard drives. The next one had two Pentium III chips and two 10,000 rpm SCSI hard drives. The next one had two 3.06GHz hyperthreaded Xeon chips and six 10,000 rpm SCSI hard drives. These machines all ran one version or another of Red Hat Linux. They ran 24/7 for about 10 years each. They have all had APC uninterruptible power supplies. In all that time, I have had only one hard drive fail. One machine failed when power came back on after the cleanup from storm Sandy was completed. The power flicked on and off in very rapid succession and the UPS could not manage that. I think I lost some work units then because I had to buy a new computer and I installed a more up-to-date version of Linux then. Since I seldom get cpdn work units anymore, and those I do get are short, it no longer matters. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
"The longest model that I've had recently was about 3 weeks. (My current models are going to run for a whole 2 days. :) )" Right now I have 5 models running on 2 different machines (all batch 764) that are estimated to run more than 100 days each. At least both are running Win7. That is a pretty stable OS and I can control the update times and reboots. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Jim, if the 100 days is what BOINC is saying, what do you get with a manual calc using Progress % and Elapsed time? |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
If the 100 days is what BOINC is saying, what do you get with a manual calc using Progress %, and Elapsed time? My biggest discrepancy is on a sam25 at the moment: 69.5% complete in 18 days 14 hours, and estimating 31 days 6 hours to go. Can't remember now what the initial estimate was exactly, but I seem to remember one being well over 100 days. That is on WINE, which misreports the gflops of the computer for some reason that I have probably not learned enough to understand yet, so for a new model type I halve the estimate to get something closer to reality. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Hi Jim The models seem to be progressing at about 0.9% per day. At 19d 18h it is 13.283% complete. That means that the initial estimate of 102 days is actually low. |
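For anyone wanting to repeat the manual calculation from Progress % and Elapsed time, here is a minimal Python sketch. It assumes progress is linear in elapsed time, which may not hold exactly for every model type, and the function names are made up for illustration.

```python
from __future__ import annotations


def to_days(days: int, hours: int) -> float:
    """Convert a 'days + hours' elapsed time into fractional days."""
    return days + hours / 24.0


def estimate_runtime(progress_percent: float, elapsed_days: float) -> tuple[float, float]:
    """Return (total_days, remaining_days), assuming progress is linear in time."""
    if not 0 < progress_percent <= 100:
        raise ValueError("progress_percent must be greater than 0 and at most 100")
    total_days = elapsed_days * 100.0 / progress_percent
    return total_days, total_days - elapsed_days


if __name__ == "__main__":
    # Figures quoted in the posts above.
    examples = [
        ("sam25 on WINE", 69.5, to_days(18, 14)),
        ("batch 764", 13.283, to_days(19, 18)),
    ]
    for label, percent, elapsed in examples:
        total, remaining = estimate_runtime(percent, elapsed)
        print(f"{label}: total ~{total:.1f} days, remaining ~{remaining:.1f} days")
```

With the figures quoted above, this works out to roughly 27 days in total for the sam25 (about 8 days left, against BOINC's estimate of 31) and roughly 149 days in total for the batch 764 models, which is in line with a run time of about five months.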
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
5 months. Not good. I've just raised the issue with management. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
OK, it'll be looked into in the next day or so. For now, there was no intention of making run times this long. For Linux + Wine, it looks like just a matter of digging down in the C:\ drive to the BOINC folder, and copying 4+ Gigs to an external drive. In theory. :) |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
For now, there was no intention of making run times this long. I was just thinking that an i7-9700K (Coffee Lake 8-Core, and being full cores) would be a very nice CPU for this project. But they should hopefully tell us what they have in mind, so we can plan accordingly. Not all projects are for all machines anyway. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
These very long WUs might be fine for desktop machines with 4.0 GHz processors, but for 2.66 GHz laptops they are a problem. Most of the people running this project these days didn’t sign up for models that take nearly half a year. I’m an old timer running CP since the days of the BBC experiment, so I do know about long models. One thing I do know is that with models that long it is a very good idea to make backups. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Back in the old days, one could choose which model types a PC could run under preferences. But since almost everything is under the wah2 app now, you might get a 2 CPU day or a 60 CPU day model depending on the region and model months run. And you have no control over it. Of course running full up on a hyperthreaded/SMT PC really slows down individual model completion times compared to not running that way. On short models that's not such a big deal. On long models it really stretches it out. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
For Linux + Wine, it looks like just a matter of digging down in the C:\ drive to the BOINC folder, and copying 4+ Gigs to an external drive. As you say, "In Theory." The interesting bit, which I can't remember, is how to restore just the crashed task to a pre-crash state if one out of four or more tasks crashes. I vaguely remember also that there can be problems if the restore point is prior to a trickle up. |
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
When running UNIX Version 5 on an Onyx Computer in the years 1980-85 we made incremental backups every week because Winchester disks were very fragile. We made backups on tape with the "dump" and "restore" commands from Berkeley UNIX. All this is just history. Tullio |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
We made backups on tape with the "dump" and "restore" commands from Berkeley UNIX. All this is just history. Almost as old is what I do now. I still have a VXA-2 tape drive (and a spare on the shelf) on my desktop. I back up with this in my cron.weekly file. You can guess what is in the variables. # Let us pick block size. $MT -f $TAPE_DRIVE setblk 0 $FIND $FILES -xdev -print | $CPIO $CPIO_O_OPTIONS > $TAPE_DRIVE 2>> $REPORT I do something similar daily onto an external 1 terabyte hard drive. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
For Linux + Wine, it looks like just a matter of digging down in the C:\ drive to the BOINC folder, and copying 4+ Gigs to an external drive. I don’t think that in Windows it is practical to restore just one model from a backup. You just have to bite the bullet and accept the fact that all running models will be put back to the point that they were when the backup was made. That’s why it is important to make frequent backups. I make one every two or three days. Trickles have never been a problem. Duplicate trickles are just rejected. |
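The "copy the whole BOINC folder" approach described above can be sketched in Python. This is only an illustration under assumptions: the function name and both paths are hypothetical, and it assumes BOINC has been shut down first so the models are not writing to disk while the copy runs. It backs up everything at once, matching the point that a restore puts all running models back to the time of the backup.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir: Path, backup_root: Path) -> Path:
    """Copy the whole data directory into a new timestamped folder.

    Assumes BOINC is not running while the copy is made.
    Returns the path of the folder that was created.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"no such data directory: {data_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"
    # copytree creates the target (and missing parents) and refuses to
    # overwrite an existing folder, so older backups are never clobbered.
    shutil.copytree(data_dir, target)
    return target


# Example call (both paths are placeholders, adjust to your own machine):
# backup_boinc_data(Path("C:/ProgramData/BOINC"), Path("E:/boinc-backups"))
```

Keeping each backup in its own timestamped folder makes it easy to hold on to the last two or three and delete older ones by hand.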
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
I don’t think that in Windows it is practical to restore just one model from a backup. I thought it could be done with some editing of client_state.xml but I don't remember ever seeing the details of how to do it posted. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I don’t think that in Windows it is practical to restore just one model from a backup. I think that was way back when the BBC experiment was running, and that it was on the now defunct PHP server. There are two parts to restoring: 1) Get the last contact number from the current client_state.xml, and put it into the old backup. (This stops BOINC from thinking that this is coming from a new computer.) 2) Do a massive amount of client_state.xml editing to isolate the task that's being restored. I never had to do the second part way back when the long runs were around, so I don't know what would be needed. Especially as now a lot of people run multiple projects on computers with large numbers of processors. A lot of recent sign-ups have 32 processor machines, and there is a 64 processor machine that's been crashing tasks for some time. And if a procedure was going to be posted here, it would have to allow for these people. Unless the advice was "only for experienced people", in which case I think that it would be simpler to just restore everything, and only do point 1) above. And if there ARE lots of failures due to the long run times, then perhaps it will get the researchers thinking about ways to make future models shorter. So that works too. |
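Point 1) above could be sketched roughly as follows. The post does not name the field, so this is an assumption: the sketch treats the "last contact number" as the per-project `<rpc_seqno>` element, found inside a `<project>` block that is identified by its `<master_url>`. Check those tag names against your own client_state.xml before relying on it. The function names are hypothetical, the code works on the file contents as text, and reading and writing the files (with BOINC shut down, and a copy of both files kept) is left to the user.

```python
from __future__ import annotations

import re

# Assumed layout: one <project>...</project> block per attached project,
# each holding a <master_url> and an <rpc_seqno>.
PROJECT_RE = re.compile(r"<project>.*?</project>", re.DOTALL)
SEQNO_RE = re.compile(r"(<rpc_seqno>)(\d+)(</rpc_seqno>)")


def find_project(state_text: str, master_url: str) -> re.Match[str]:
    """Return the match for the <project> block with the given master URL."""
    wanted = f"<master_url>{master_url}</master_url>"
    for match in PROJECT_RE.finditer(state_text):
        if wanted in match.group(0):
            return match
    raise ValueError(f"no project block found for {master_url}")


def read_seqno(state_text: str, master_url: str) -> int:
    """Read the contact sequence number recorded for one project."""
    found = SEQNO_RE.search(find_project(state_text, master_url).group(0))
    if found is None:
        raise ValueError("project block has no <rpc_seqno> element")
    return int(found.group(2))


def carry_over_seqno(current_text: str, backup_text: str, master_url: str) -> str:
    """Return the backup text with the project's number taken from the current file."""
    seqno = read_seqno(current_text, master_url)
    block = find_project(backup_text, master_url)
    patched, count = SEQNO_RE.subn(
        lambda m: f"{m.group(1)}{seqno}{m.group(3)}", block.group(0), count=1
    )
    if count != 1:
        raise ValueError("backup project block has no <rpc_seqno> element")
    return backup_text[: block.start()] + patched + backup_text[block.end():]
```

Only the one project's number is touched, so other projects in the backup keep whatever values they had. This covers point 1) only; nothing here attempts the per-task editing described in point 2).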
©2024 cpdn.org