Time for another moan about w/u not restarting
Author | Message |
---|---|
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
So, I found one machine had stopped responding over a VNC connection. I went and checked the machine and found its keyboard unresponsive as well, so there was no way to shut down BOINC. The mouse was still working, so the machine was shut down. Well, we all know what this might mean for those weak and feeble climate w/u. On startup, 2 w/u had fainted and died. Some 11 days of processing lost between them. But a grain of sand in the history of climate processing. I have a bucket of ARP w/u to do instead. They love a restart, in case there is a fault with the machine and it needs a restart. I really do get frustrated at just how easily these climate w/u pass out and die for any reason. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"So, I found one machine had stopped responding over a VNC connection." Do you think it was the VNC that failed? It has been a long time since I used one. Whenever I have had a frozen machine, it has usually been hardware. That may mean memory, though back when I had a lot of hard drives to store video files, it could be a drive. Even a drive with just video on it and not the OS could freeze up the machine. It was made worse by trying to record a video while watching another, and also by certain antivirus programs. The use of a write cache almost certainly meant trouble. BOINC just did not like that combination. Fortunately, I am long past that grief. With streaming video my PC is out of the loop. I haven't had a freeze in a couple of years. |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
"So, I found one machine had stopped responding over a VNC connection. I went and checked the machine and found its keyboard unresponsive as well, so there was no way to shut down BOINC. The mouse was still working, so the machine was shut down." Has everyone totally forgotten about making backups? I used to make them regularly. It’s fast and easy (at least in Windows; I never tried it in Linux). That way, if you lose a WU due to an external problem (outside the model itself), it can be restored. |
Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
"So, I found one machine had stopped responding over a VNC connection. I went and checked the machine and found its keyboard unresponsive as well, so there was no way to shut down BOINC. The mouse was still working, so the machine was shut down." Backups made a lot of sense when I had one or at most two tasks running and they would take over six months to complete. Now, when I use up to eight out of sixteen threads and tasks take a maximum of just over ten days to complete, getting back one or two tasks might mean losing more work than I save. (Yes, I know it is possible to selectively restore tasks by editing client_state.xml, but unless you are uber-careful it is very easy to make mistakes doing this.) I have used backups in the past year, but only when I have been following my work closely enough to know I won't lose more than I gain. Now that I run more tasks at once, I don't bother trying to sort it out for just one task. |
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
I thought you would need to stop all processing in order to make a copy/backup of working w/u's, which could be fatal to a climate w/u anyway. If there is only 1 project running, this also helps. But if there are multiple w/u from several projects running, then it all becomes very messy, I believe. Can a climate w/u that's done a trickle file upload be restored to a point before the trickle up? I don't know. For most projects it's get a w/u and don't worry; restarts are not an issue. Climate seems to like its own dedicated machine, left alone: just feed it electricity. |
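For anyone who wants to try the backup route discussed above, here is a minimal sketch of the copy step in Python. It is not an official BOINC or CPDN tool. It assumes the BOINC client has already been stopped cleanly, and the function name `backup_data_dir` and both directory arguments are made up for illustration. It copies the whole data directory and makes no attempt at the selective, per-task restore mentioned earlier.

```python
import shutil
import time
from pathlib import Path


def backup_data_dir(data_dir: Path, backup_root: Path) -> Path:
    """Copy data_dir into a new timestamped folder under backup_root.

    Assumption: the BOINC client is not running while this copy is made,
    so the files on disk are not changing underneath it.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"no such directory: {data_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"
    # copytree raises an error if the target already exists, so an
    # earlier backup is never silently overwritten.
    shutil.copytree(data_dir, target)
    return target
```

Call it with your own locations, for example `backup_data_dir(Path("/path/to/boinc/data"), Path("/path/to/backups"))`. Restoring is the reverse copy, again with the client stopped, and it rolls every task on that machine back to the time of the backup.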
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've had power failures on infrequent occasions over the years, and most times both BOINC and the climate models restart without a problem. The further away you get from a plain, simple computer, and into lots of fancy ways of using a computer, the more likely it is that the climate models will react badly to some of that fancy stuff. |
Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
Some of the "short" models were really, really bad about restarting - I tried running a few on a machine that gets shut down regularly (it stops BOINC, then shuts down 30s later, and has an SSD that can take an awful lot of write traffic), and they just weren't happy with it - it looked like they were complaining about not being able to find checkpoint files. I decided that, short of lots of troubleshooting of code that isn't mine to poke at, it's easier to just leave the CPDN workunits on the boxes that suspend properly. That one will suspend, but still pulls 150W or something when "sleeping." Old dual-socket Xeon with a lot more RAM than it really needs. |
©2024 cpdn.org