Thread 'Time for another moan about w/u not restarting'

Message boards : Number crunching : Time for another moan about w/u not restarting

nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 63720 - Posted: 18 Mar 2021, 20:30:58 UTC

So, I found one machine had stopped responding over a VNC connection. I went and checked the machine and found its keyboard unresponsive as well, so there was no way to shut down BOINC. The mouse was still working, so the machine was shut down.

Well, we all know what this might mean for those weak and feeble climate w/u. On startup, 2 w/u had fainted and died. Some 11 days of processing lost between them. But a grain of sand in the history of climate processing.
I have a bucket of ARP w/u to do instead. They love a restart, in case there is a fault with the machine and it needs a restart.
I really do get frustrated at just how easily these climate w/u pass out and die for any reason.
ID: 63720
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 63721 - Posted: 18 Mar 2021, 22:43:58 UTC - in response to Message 63720.  

"So, I found one machine had stopped responding over a VNC connection."

Do you think it was the VNC that failed? It has been a long time since I used one. Whenever I have had a frozen machine, it has usually been hardware. That may mean memory, though back when I had a lot of hard drives to store video files, it could be a drive. Even a drive with just video on it and not the OS could freeze up the machine. It was made worse by trying to record a video while watching another, and also by certain anti-viruses. The use of a write cache almost certainly meant trouble. BOINC just did not like that combination.

I am long past that grief fortunately. With streaming video my PC is out of the loop. I haven't had a freeze in a couple of years.
ID: 63721
JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 63723 - Posted: 19 Mar 2021, 2:24:26 UTC - in response to Message 63720.  

Has everyone totally forgotten about making backups? I used to make them regularly. It’s fast and easy (at least in Windows; I never tried it in Linux). That way, if you lose a WU due to an external problem (outside the model itself), it can be restored.
ID: 63723
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,004,017
RAC: 21,574
Message 63724 - Posted: 19 Mar 2021, 5:59:52 UTC - in response to Message 63723.  

Backups made a lot of sense when I had one or at most two tasks running and they would take over six months to complete. Now, when I use up to eight out of sixteen threads and tasks take a maximum of just over ten days to complete, restoring a backup to get back one or two tasks might mean losing more work than I can save. (Yes, I know it is possible to selectively restore tasks by editing client_state.xml, but unless you are uber-careful it is very easy to make mistakes doing this.) I have used backups in the past year, but only when I have been following my work closely enough to know I won't lose more than I gain. Now that I run more tasks at once, I don't bother trying to sort it out for just one task.
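
To put rough numbers on that trade-off, here is a minimal sketch. It assumes a full restore rolls every running task back to its state at backup time, and that each task's progress can be measured in simple hours of processing. The function name and the example figures (other than the 11 days lost between two tasks mentioned at the top of the thread) are hypothetical.

```python
def restore_net_gain_hours(backup_age_h, failed_task_hours, surviving_task_hours):
    """Estimate hours gained (positive) or lost (negative) by a full restore.

    backup_age_h: hours since the backup was taken.
    failed_task_hours: hours of work each dead task had done when it failed.
    surviving_task_hours: hours of work each still-healthy task has done so far.
    """
    # A dead task only gets back the work it had already done at backup time.
    regained = sum(max(h - backup_age_h, 0) for h in failed_task_hours)
    # A healthy task loses everything it has done since the backup.
    lost = sum(min(h, backup_age_h) for h in surviving_task_hours)
    return regained - lost


# Two dead tasks totalling 11 days (264 h), six healthy tasks (hypothetical).
failed = [120, 144]
surviving = [100] * 6

print(restore_net_gain_hours(24, failed, surviving))  # 72: restore is worth it
print(restore_net_gain_hours(48, failed, surviving))  # -120: restore costs more than it saves
```

With only one or two long tasks running, the "lost" side is small and a backup nearly always pays off. With many short tasks it only pays off if the backup is very fresh.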
ID: 63724
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 63725 - Posted: 19 Mar 2021, 16:41:32 UTC

I thought you would need to stop all processing in order to make a copy/backup of working w/u's, which could be fatal to a climate w/u anyway. If there is only 1 project running, this also helps. But if there are multiple w/u from several projects running, then it all becomes very messy, I believe. Can a climate w/u that's done a trickle file upload be restored to a point before the trickle up? I don't know.

For most projects it's get a w/u and don't worry; restarts are not an issue. Climate seems to like its own dedicated machine and to be left alone - just feed it electricity.
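
For anyone who wants to try the stop-then-copy approach, a minimal sketch follows. It assumes the BOINC client has already been stopped by whatever means you normally use, and that you know where your BOINC data directory lives; both paths are supplied by you. The function name and the backup folder naming are hypothetical. It copies the whole directory, so every project's tasks are captured together.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_boinc_data(data_dir: Path, backup_root: Path) -> Path:
    """Copy the whole data directory into a new timestamped folder.

    Assumes the BOINC client is NOT running, so no files change mid-copy.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"No such data directory: {data_dir}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"
    # copytree creates the target (and any missing parents) and fails
    # if the target already exists, so an old backup is never overwritten.
    shutil.copytree(data_dir, target)
    return target
```

Restoring would be the reverse: stop the client, swap the data directory for the copy, and start the client again. Whether a climate w/u survives that is a separate question.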
ID: 63725
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63726 - Posted: 19 Mar 2021, 21:42:36 UTC - in response to Message 63725.  

I've had power failures on infrequent occasions over the years, and most times, both BOINC and the climate models restart without a problem.

The further away you get from a plain, simple computer, and into lots of fancy ways of using a computer, the more likely it is that the climate models will react badly with some of that fancy stuff.
ID: 63726
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 63727 - Posted: 19 Mar 2021, 22:40:14 UTC

Some of the "short" models were really, really bad about restarting - I tried running a few on a machine that gets shut down regularly (it stops BOINC, then shuts down 30s later, and has an SSD that can take an awful lot of write traffic), and they just weren't happy with it - it looked like they were complaining about not being able to find checkpoint files.

I decided that, short of lots of troubleshooting of code that isn't mine to poke at, it's easier to just leave the CPDN workunits on the boxes that suspend properly. That one will suspend, but still pulls 150W or something when "sleeping." Old dual socket Xeon with a lot more RAM than it really needs.
ID: 63727

©2024 cpdn.org