Thread 'TIME FOR BACKUPS AGAIN?'

Author	Message
JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 59139 - Posted: 7 Dec 2018, 18:10:55 UTC Now that we are seeing the return of longer (90+ days) models I think it is time to revisit the subject of making periodic backups. Three months is a long time. Running a model for that long is a big commitment of time and resources. Losing a model that you have been running for several weeks or months is hard. Things happen. Power failures, freeze-ups, and unexpected reboots (Win10) are only some of the things that can cause a good WU to fail. A recent backup gives you a second chance. Backups are easy and quick to make. If you don't know how to make one, there are instructions in a thread in the number crunching section. ID: 59139 · Reply Quote

astroWX Volunteer moderator Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 59140 - Posted: 7 Dec 2018, 21:17:18 UTC - in response to Message 59139. In days of yore, we ran Coupled Models of the same kind on boxes with, typically, two cores. Model types could be managed, some allowed, others not allowed. We do not have the capability to micro-manage among the several types and lengths of most work offered -- except setting 'No New Tasks' after one is caught. That is a severe limitation on machines with 4+ cores. I'm open to correction if I've overlooked the obvious. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 59140 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 59141 - Posted: 8 Dec 2018, 1:38:32 UTC The longest model that I've had recently was about 3 weeks. (My current models are going to run for a whole 2 days. :) ) My computers appear to be very stable, with no model failures, even from a power failure. The only problem recently has been failing hard disks, and that was gradual. Except for one, where the HP system insisted that the HD be replaced at once. And then I needed a disk with the OS (Win 7), on it, which I didn't have, so the backup/recovery disk that I'd made was useless. The last time that I made/used backups, was when we ran the "Slab models". As Astro said, things were simpler then. If anyone does go for backups, perhaps they could make some posts about how it's working out. ID: 59141 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4563 Credit: 19,039,635 RAC: 18,944	Message 59142 - Posted: 8 Dec 2018, 9:02:51 UTC - in response to Message 59141. If I get any really long models on Linux I will think about backups as I have in the past often seen tasks fail after a computer restart which is combined with a kernel update. It has happened often enough to make me think it is significant so going back would I suspect have to be combined with my choosing to run the older kernel till the task(s) in question finish. With shorter tasks I just wait till I have no work and perform the update which seems to happen regularly enough with Linux. Might try doing a backup, let models finish and then with interweb access disabled try the backup when out of work just for interest. Then I can see if there are any problems before I need it? ID: 59142 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 59143 - Posted: 8 Dec 2018, 15:22:20 UTC - in response to Message 59142. If I get any really long models on Linux I will think about backups as I have in the past often seen tasks fail after a computer restart which is combined with a kernel update. It has happened often enough to make me think it is significant so going back would I suspect have to be combined with my choosing to run the older kernel till the task(s) in question finish. With shorter tasks I just wait till I have no work and perform the update which seems to happen regularly enough with Linux. My machine does backups every morning (as in 3AM) of "everything." Daily to a 1 Terabyte external hard drive, and weekly to magnetic tape. I have been running ClimatePrediction since 5 Aug 2004 on four different machines; at one point I was running three machines at once, but now just one. The oldest one was a Pentium with two hard drives. The next one had two Pentium III chips and two 10,000 rpm SCSI hard drives. The next one had two 3.06GHz hyperthreaded Xeon chips and six 10,000 rpm SCSI hard drives. These machines all ran one version or another of Red Hat Linux. They ran 24/7 for about 10 years each. They have all had APC uninterruptible power supplies. In all that time, I have had only one hard drive fail. One machine failed when power came back on after the cleanup from storm Sandy was completed. The power flicked on and off in very rapid succession and the UPS could not manage that. I think I lost some work units then because I had to buy a new computer and I installed a more up-to-date version of Linux then. Since I seldom get cpdn work units anymore, and those I do get are short, it no longer matters. ID: 59143 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 59144 - Posted: 8 Dec 2018, 16:33:15 UTC - in response to Message 59141. The longest model that I've had recently was about 3 weeks. (My current models are going to run for a whole 2 days. :) ) Right now I have 5 models running on 2 different machines (all batch 764) that are estimated to run more than 100 days each. At least both are running Win7. That is a pretty stable OS and I can control the update times and reboots. ID: 59144 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 59145 - Posted: 8 Dec 2018, 19:53:41 UTC Hi Jim If the 100 hours is what BOINC is saying, what do you get with a manual calc using Progress %, and Elapsed time? ID: 59145 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4563 Credit: 19,039,635 RAC: 18,944	Message 59146 - Posted: 9 Dec 2018, 9:47:16 UTC - in response to Message 59145. If the 100 hours is what BOINC is saying, what do you get with a manual calc using Progress %, and Elapsed time? My biggest discrepancy at the moment is on a sam25: 69.5% complete in 18 days 14 hours, with an estimated 31 days 6 hours to go. I can't remember exactly what the initial estimate was, but I seem to remember one being well over 100 days. That is on WINE, which misreports the gflops of the computer for some reason that I have probably not learned enough to understand yet, so for a new model type I halve the estimate to get something closer to reality. ID: 59146 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 59147 - Posted: 9 Dec 2018, 15:55:55 UTC - in response to Message 59145. Hi Jim If the 100 hours is what BOINC is saying, what do you get with a manual calc using Progress %, and Elapsed time? The models seem to be progressing at about 0.9% per day. At 19d 18h it is 13.283% complete. That means that the initial estimate of 102 days is actually low. ID: 59147 · Reply Quote
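
The manual calculation asked about above (Progress % and Elapsed time) can be sketched in a few lines of Python. This is only an illustration, assuming the model progresses at a constant rate for the whole run; the function name `estimate_runtime` is hypothetical and not part of BOINC.

```python
from datetime import timedelta
from typing import Tuple


def estimate_runtime(elapsed: timedelta, progress_percent: float) -> Tuple[timedelta, timedelta]:
    """Return (estimated total run time, estimated time remaining).

    Assumes a constant progress rate, which is only an approximation.
    """
    if not 0 < progress_percent <= 100:
        raise ValueError("progress_percent must be greater than 0 and at most 100")
    # Scale the elapsed time up to 100% progress.
    total = elapsed * (100.0 / progress_percent)
    return total, total - elapsed


if __name__ == "__main__":
    # Figures quoted in the posts above.
    examples = {
        "sam25 under WINE": (timedelta(days=18, hours=14), 69.5),
        "batch 764 model": (timedelta(days=19, hours=18), 13.283),
    }
    for name, (elapsed, percent) in examples.items():
        total, remaining = estimate_runtime(elapsed, percent)
        print(f"{name}: total about {total.days} days, remaining about {remaining.days} days")
```

With the figures quoted in the thread, the sam25 works out to a total of under 27 days, and the batch 764 model to roughly 149 days, which is consistent with an initial estimate of 102 days being low.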

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 59148 - Posted: 9 Dec 2018, 19:44:27 UTC 5 months. Not good. I've just raised the issue with management. ID: 59148 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 59149 - Posted: 9 Dec 2018, 22:06:39 UTC OK, it'll be looked into in the next day or so. For now, there was no intention of making run times this long. For Linux + Wine, it looks like just a matter of digging down in the C:\ drive to the BOINC folder, and copying 4+ Gigs to an external drive. In theory. :) ID: 59149 · Reply Quote
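
The copy step described above (copying the whole BOINC folder to an external drive) could be scripted along these lines. This is a minimal sketch, not a tested procedure for this project: it assumes the BOINC client has been shut down first so that no files change during the copy, and the function name `backup_boinc_data` and any paths passed to it are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def backup_boinc_data(data_dir: PathLike, backup_root: PathLike,
                      now: Optional[datetime] = None) -> Path:
    """Copy the whole BOINC data folder into a new timestamped folder.

    Assumption: BOINC is not running while this executes.
    """
    source = Path(data_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"BOINC data folder not found: {source}")
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"boinc-backup-{stamp}"
    # copytree creates the target folder and refuses to overwrite an existing one.
    shutil.copytree(source, target)
    return target


# Example call (both paths are placeholders, adjust them to your own machine):
# backup_boinc_data("/path/to/BOINC/data", "/path/to/external/drive")
```

Each run produces a separate timestamped folder, so older backups stay available until they are deleted by hand.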

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 59150 - Posted: 9 Dec 2018, 22:39:06 UTC - in response to Message 59149. Last modified: 9 Dec 2018, 22:41:31 UTC For now, there was no intention of making run times this long. I was just thinking that an i7-9700K (Coffee Lake 8-Core, and being full cores) would be a very nice CPU for this project. But they should hopefully tell us what they have in mind, so we can plan accordingly. Not all projects are for all machines anyway. ID: 59150 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 59151 - Posted: 10 Dec 2018, 5:04:10 UTC - in response to Message 59148. These very long WU's might be fine for the desktop machine with 4.0 GHz processors, but for 2.66 GHz laptops they are a problem. Most of the people running this project these days didn't sign up for models that take nearly half a year. I'm an old timer running CP since the days of the BBC experiment so I do know about long models. One thing I do know is that with models that long it is a very good idea to make backups. ID: 59151 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 59152 - Posted: 10 Dec 2018, 5:09:46 UTC Back in the old days, one could choose which model types a PC could run under preferences. But since almost everything is under the wah2 app now, you might get a 2 CPU day or a 60 CPU day model depending on the region and model months run. And you have no control over it. Of course running full up on a hyperthreaded/SMT PC really slows down individual model completion times compared to not running that way. On short models that's not such a big deal. On long models it really stretches it out. ID: 59152 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4563 Credit: 19,039,635 RAC: 18,944	Message 59153 - Posted: 10 Dec 2018, 8:16:31 UTC - in response to Message 59149. For Linux + Wine, it looks like just a matter of digging down in the C:\ drive to the BOINC folder, and copying 4+ Gigs to an external drive. As you say, "In Theory." The interesting bit which I can't remember is if one out of four plus tasks crashes is how to restore just the crashed one to a pre-crash state. I vaguely remember also that there can be problems if the restore point is prior to a trickle up. ID: 59153 · Reply Quote

tullio Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0	Message 59154 - Posted: 10 Dec 2018, 20:12:14 UTC When running UNIX Version 5 on an Onyx Computer in the years 1980-85 we made incremental backups every week because Winchester disks were very fragile. We made backups on tape with the "dump" and "restore" commands from Berkeley UNIX. All this is just history. Tullio ID: 59154 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 59155 - Posted: 10 Dec 2018, 20:44:39 UTC - in response to Message 59154. We made backups on tape with the "dump" and "restore" commands from Berkeley UNIX. All this is just history. Almost as old is what I do now. I still have a VXA-2 tape drive (and a spare on the shelf) on my desktop. I back up with this in my cron.weekly file. You can guess what is in the variables.
# Let us pick block size.
$MT -f $TAPE_DRIVE setblk 0
$FIND $FILES -xdev -print | $CPIO $CPIO_O_OPTIONS > $TAPE_DRIVE 2>> $REPORT
I do something similar daily onto an external 1 TeraByte hard drive. ID: 59155 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 59156 - Posted: 11 Dec 2018, 5:03:19 UTC - in response to Message 59153. For Linux + Wine, it looks like just a matter of digging down in the C:\ drive to the BOINC folder, and copying 4+ Gigs to an external drive. As you say, "In Theory." The interesting bit which I can't remember is if one out of four plus tasks crashes is how to restore just the crashed one to a pre-crash state. I vaguely remember also that there can be problems if the restore point is prior to a trickle up. I don't think that in Windows it is practical to restore just one model from a backup. You just have to bite the bullet and accept the fact that all running models will be put back to the point that they were when the backup was made. That's why it is important to make frequent backups. I make one every two or three days. Trickles have never been a problem. Duplicate trickles are just rejected. ID: 59156 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4563 Credit: 19,039,635 RAC: 18,944	Message 59157 - Posted: 11 Dec 2018, 7:44:01 UTC - in response to Message 59156. Last modified: 11 Dec 2018, 7:47:53 UTC I don't think that in Windows it is practical to restore just one model from a backup. I thought it could be done with some editing of client_state.xml but I don't remember ever seeing the details of how to do it posted. ID: 59157 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 59199 - Posted: 18 Dec 2018, 21:15:14 UTC - in response to Message 59157. I don't think that in Windows it is practical to restore just one model from a backup. I thought it could be done with some editing of client_state.xml but I don't remember ever seeing the details of how to do it posted. I think that was way back when the BBC experiment was running, and that it was on the now defunct PHP server. There are two parts to restoring: 1) Get the last contact number from the current client_state.xml, and put it into the old backup. (This stops BOINC from thinking that this is coming from a new computer.) 2) Do a massive amount of client_state.xml editing to isolate the task that's being restored. I never had to do the second part way back when the long runs were around, so I don't know what would be needed. Especially as now a lot of people run multiple projects on computers with large numbers of processors. A lot of recent sign-ups have 32 processor machines, and there is a 64 processor machine that's been crashing tasks for some time. And if a procedure was going to be posted here, it would have to allow for these people. Unless the advice was "only for experienced people", in which case I think that it would be simpler to just restore everything, and only do point 1) above. And if there ARE lots of failures due to the long run times, then perhaps it will get the researchers thinking about ways to make future models shorter. So that works too. ID: 59199 · Reply Quote
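
Point 1) above, carrying the last contact number from the current client_state.xml into the backup copy, might look something like the sketch below. It rests on several assumptions that are not confirmed anywhere in this thread: that the number is stored in an `<rpc_seqno>` element, that each project sits in its own `<project>` block, and that the block is identified by `<master_url>`. The function names and the example URL are hypothetical. Work on copies of the files, with BOINC shut down, and check the result by eye before using it.

```python
import re


def _find_project(xml_text, master_url):
    """Return the regex match for the <project> block with the given master URL."""
    url_re = re.compile(rf"<master_url>\s*{re.escape(master_url)}\s*</master_url>")
    for block in re.finditer(r"<project>.*?</project>", xml_text, re.DOTALL):
        if url_re.search(block.group(0)):
            return block
    raise ValueError(f"no <project> block found for {master_url}")


def copy_contact_number(current_xml, backup_xml, master_url, tag="rpc_seqno"):
    """Return backup_xml with the tag value for one project taken from current_xml.

    The default tag name is an assumption, not something confirmed in the thread.
    """
    tag_re = re.compile(rf"<{re.escape(tag)}>\s*(.*?)\s*</{re.escape(tag)}>", re.DOTALL)
    current_block = _find_project(current_xml, master_url).group(0)
    backup_match = _find_project(backup_xml, master_url)
    found = tag_re.search(current_block)
    if found is None:
        raise ValueError(f"<{tag}> not found in the current file")
    if tag_re.search(backup_match.group(0)) is None:
        raise ValueError(f"<{tag}> not found in the backup file")
    replacement = f"<{tag}>{found.group(1)}</{tag}>"
    # Only the selected project's block is changed; other projects are left alone.
    new_block = tag_re.sub(lambda _m: replacement, backup_match.group(0), count=1)
    return backup_xml[:backup_match.start()] + new_block + backup_xml[backup_match.end():]


# Example call on file contents read into strings (the URL is a placeholder):
# patched = copy_contact_number(current_text, backup_text, "http://example.org/project/")
```

Restricting the change to one project's block is meant to allow for machines that run several projects at once; it does nothing towards point 2), isolating a single task.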