Thread 'Safe shutdown'

kay

Joined: 3 Jun 16
Posts: 4
Credit: 441,907
RAC: 0
Message 54635 - Posted: 11 Aug 2016, 16:31:36 UTC

Hi,

I'm running climateprediction.net on multiple machines now, most of them doing nothing else, all linux machines.
They are usually running 24/7, but since they are in my room sometimes I shut them down at night when it's too noisy.
I do this by remotely logging in to each of them, pausing the project in BOINC, waiting several minutes and then shutting them down, to prevent errors. Even so, I recently had almost all tasks crash after a restart, many of them having run for over a million seconds - a considerable investment of energy and money - just gone!
So what I'd like to know: is there a -completely- safe way of shutting down a system without risking crashing everything? I have even watched the HDD I/O sometimes.
If it would involve manually starting the WUs without BOINC and manually shutting them down (both with scripts or something), that would be okay too.
Is there a manual on this?

Kind regards,
Kay
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 54636 - Posted: 11 Aug 2016, 21:51:39 UTC - in response to Message 54635.  

The way that you're shutting down now should be OK.
But there are some model types that don't like being stopped at all, and, I think for all of them, there are certain times when they'll fail anyway.
This is when they've reached the point where they're gathering the data from files to zip it and send it back to the project.

Also, it's better to describe the time taken in percent, because it's at certain percentages that failures can occur, as per the last sentence above.

To see what has actually happened, you can look in the page for each model, at the Stderr list. Click the + to expand it.

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,020,584
RAC: 20,684
Message 54637 - Posted: 12 Aug 2016, 11:33:46 UTC - in response to Message 54636.  

I would add that, for whatever reason, failures on shutdown seem to be more common when running tasks natively under Linux than when I use WINE. I started using WINE when there was a dearth of native Linux tasks, and the two machines I use most still have it running, because swapping back without losing anything would mean letting all the tasks complete. Perhaps the loss of time isn't significant if I do end up running just one task on each of these machines for a while, but given the lower number of failures I have seen, I am for the moment sticking with WINE on the machines that get shut down regularly.
kay

Joined: 3 Jun 16
Posts: 4
Credit: 441,907
RAC: 0
Message 54638 - Posted: 15 Aug 2016, 16:28:15 UTC
Last modified: 15 Aug 2016, 16:40:07 UTC

Thank you for your answers. Although I shut the machines down in the way I explained, I just had another 4 tasks throwing errors instantly on restart - all with Error code 193.

For me this is unacceptable, and I also do not want to run only one task per machine, as they are all capable of running 2 or 4 simultaneously. I would rather stop the project altogether to save the energy, sad as that would be (for me).

Isn't there any manual which explains the behavior of the executables? I watched htop when suspending the tasks and they disappeared instantly, so I don't know where to start: I had expected them to persist for a bit until they had written everything to the hard disk, which might take a while.

I would then have tried to substitute the regular shutdown routine with a custom script that first waits for all the BOINC tasks to disappear and only then continues with the actual shutdown. My thinking was that during shutdown the OS sends a SIGTERM, and if the tasks have not disappeared a certain time after that they might receive a SIGABRT or something, and so would not have completely written everything to the hard disk.
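
The wait-then-shutdown idea could be sketched roughly as follows. This is only an illustration, not a tested fix for the 193 errors: `wait_until_gone` is a made-up helper name, and the commented-out example assumes that the BOINC client and its tasks run under a dedicated user called `boinc`, which is common with distribution packages but not confirmed anywhere in this thread.

```shell
#!/bin/sh
# wait_until_gone TIMEOUT CHECK [ARGS...]
# Runs CHECK once per second for as long as it succeeds (exit status 0 is
# taken to mean "tasks are still running"). Returns 0 as soon as CHECK
# fails, or 1 if TIMEOUT seconds pass first.
wait_until_gone() {
    timeout=$1
    shift
    waited=0
    while "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example use in a shutdown wrapper (commented out so nothing runs here).
# Assumption: BOINC and its tasks run as a user named "boinc".
# wait_until_gone 600 pgrep -u boinc && /sbin/shutdown -h now
```

The check command is passed in as arguments so that the polling loop does not depend on how the tasks are identified; any command that succeeds while tasks are alive would do.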



...
But there are some model types that don't like being stopped at all, and, I think for all of them, there are certain times when they'll fail anyway.
This is when they've reached the point where they're gathering the data from files to zip it and send it back to the project.
...

If so, I think they should not be sent out for BOINC at all. Besides, it shouldn't be too hard to make the executables more stable, for example by ensuring everything written into files is marked as complete when done and, if a file is read and there are incomplete sections, just throwing those out and recalculating (assuming sequential writing).
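
As a toy illustration of the "mark as complete" idea (this is not how the Met Office models actually write their files), each record could end with a marker, and the reader could stop at the first record that lacks one. The names `append_record` and `complete_records` and the tab-plus-`END` marker are invented for this sketch.

```shell
#!/bin/sh
# append_record FILE TEXT
# Appends TEXT as one line, followed by a tab and the marker END.
append_record() {
    printf '%s\tEND\n' "$2" >> "$1"
}

# complete_records FILE
# Prints the records (without the marker) up to, but not including, the
# first line that does not end with the marker. Assuming sequential
# writing, everything after that point is treated as incomplete.
complete_records() {
    awk '!/\tEND$/ { exit } { sub(/\tEND$/, ""); print }' "$1"
}
```

A restart would then resume from the last record that `complete_records` prints and recalculate anything after it.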
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 54640 - Posted: 15 Aug 2016, 18:23:33 UTC

kay -

All is not lost. I checked some of your computers and I see that you are getting credit for the "work" you have done even when the tasks get a 193 error.

It has been written many times over the years on this message board that the scientists can still use the results from the uploaded zip files even if the task didn't run to completion.

You are not wasting your computer time.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 54643 - Posted: 15 Aug 2016, 23:30:43 UTC - in response to Message 54638.  

Kay

The programs used here were created by, are maintained by, and are owned by the UK Met Office, where they run on supercomputers to produce weather and climate data and for research into these.
It's said that the main ones are close to a million lines of FORTRAN, plus there are some C programs and several dozen auxiliary data files.

Finding where the one I mentioned is failing on desktop computers is not something that is, or will be, done by the people at Oxford.
And the specific problem reported by you isn't being reported by others, so it's most likely specific to your computer(s).
Also, that model type doesn't get used often these days. Now, it's mostly the w@h2 model types that are favoured by the researchers.


Manual? Only the Internet I guess.
And Error code 193 seems to be a Windows problem. For Linux, I don't have a clue.

Also, I regret to have to tell everyone that uncompleted models aren't of much use these days, now that the research has shifted to professional climate physicists outside of Oxford.

The days when partial results were looked at by students at Oxford to find out how to avoid them, etc, are long gone.
But people will still get credit for each trickle received from a model that later fails.

WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 54644 - Posted: 16 Aug 2016, 0:46:29 UTC

Thanks for the update Les - especially on failed models being more or less useless.

I have not reported 193 errors but they are common for me on my Linux machines. I would guess 1/3 to 1/2 of my tasks fail on a LINUX (UBUNTU) reboot even when I follow the recommended "safe" shutdown. UBUNTU seems to have just as many updates as Windows that require a reboot.

193 errors under LINUX have something to do with segment violations and it seems the stack trace usually ends at libc.so.6




Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 54645 - Posted: 16 Aug 2016, 4:08:44 UTC - in response to Message 54644.  

Ah. Thanks for that.
That lib is the one that causes problems with trying to get the 32 bit stuff to work.
If I've got it right, it's there/used for compatibility with CentOS.

I'll pass this on. Perhaps the database can be scanned to see how many computers are still running that OS, and hopefully the old lib can be dumped in favour of the 8.nn version.

DJStarfox

Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 54687 - Posted: 22 Aug 2016, 13:13:12 UTC

I would recommend shutting down BOINC remotely, then shutting down the system. BOINC will close all the tasks as cleanly as possible.

For example:
/sbin/service boinc-client stop
/sbin/shutdown -h 1 &
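
A slightly more defensive variant of the two commands above, as a sketch only: it schedules the halt only if stopping the client succeeded, and it can print the commands instead of running them. The helper names `run` and `stop_boinc_then_halt` and the `DRY_RUN` variable are made up for this example; the service name `boinc-client` is taken from the commands above and may differ between distributions.

```shell
#!/bin/sh
# run CMD [ARGS...]
# Executes the command, or only prints it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stops the BOINC client first; schedules the halt only if that worked.
stop_boinc_then_halt() {
    run /sbin/service boinc-client stop || {
        echo "BOINC did not stop cleanly; not shutting down" >&2
        return 1
    }
    run /sbin/shutdown -h 1
}
```

Setting `DRY_RUN=1` before calling `stop_boinc_then_halt` is a cheap way to check the sequence on a machine that should stay up.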
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,020,584
RAC: 20,684
Message 54699 - Posted: 24 Aug 2016, 9:41:40 UTC

Even when I stop the tasks, then BOINC, and then shut down, I lose more than just the occasional task when I restart. Using Hibernate (suspend to disk) instead, I don't seem to lose any, and I don't have to stop tasks, wait, stop BOINC etc. While I have never played with remote access, I am sure this can be done remotely.
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 54775 - Posted: 12 Sep 2016, 6:45:36 UTC - in response to Message 54699.  

Most of my machines use Ubuntu.
Back in June a few wu's failed.
Since June, never a compute fail. Like zero. Look at my name history.
Have heat-related shutdowns.
What I do?
I don't let the Ubuntu broken start-script restart my work-units. UBU-start-scripts are broken, and UBU BOINC version is way ahead of Berkeley Linux BOINC

I think that the Ubuntu start-scripts are broken.

Manually starting after backup works all the time.
Don't trust the UBU start-scripts, disable all that.
Works for me.

Not to boast or nothing -- but I do make UBU work for CPDN.

Questions welcome.

James Jadesword

Joined: 11 Jul 15
Posts: 10
Credit: 763,922
RAC: 0
Message 55049 - Posted: 30 Oct 2016, 16:21:46 UTC

My problem is a bit different. I was having a problem with a program and I paused the one task and paused the project and exited the BOINC Manager as I was required to reboot before testing the solution. When the solution tested good, I started the BOINC Manager and nothing was there. The project and the task were gone. What, if anything, do I need to do to reconnect the project and recover the task? Do I need to report to someone if I am unable to recover the task?
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 55050 - Posted: 30 Oct 2016, 16:53:26 UTC - in response to Message 55049.  

My problem is a bit different. I was having a problem with a program and I paused the one task and paused the project and exited the BOINC Manager as I was required to reboot before testing the solution. When the solution tested good, I started the BOINC Manager and nothing was there. The project and the task were gone. What, if anything, do I need to do to reconnect the project and recover the task? Do I need to report to someone if I am unable to recover the task?


Don't worry about the lost task. It's likely there is no way to recover it, and if they need to rerun it for some purpose, it will be sent again under another work unit.
James Jadesword

Joined: 11 Jul 15
Posts: 10
Credit: 763,922
RAC: 0
Message 55051 - Posted: 30 Oct 2016, 19:43:28 UTC - in response to Message 55050.  

It would seem that whatever changed in the OS upgrade that caused LibreOffice to malfunction is also causing a malfunction in the BOINC Manager. I am unable to add a project.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55052 - Posted: 30 Oct 2016, 20:41:04 UTC

Hi James

On the bottom line of the manager, at the right hand end, does it say
Connected to localhost?

James Jadesword

Joined: 11 Jul 15
Posts: 10
Credit: 763,922
RAC: 0
Message 55055 - Posted: 31 Oct 2016, 7:35:49 UTC - in response to Message 55052.  
Last modified: 31 Oct 2016, 7:36:23 UTC

Got it in one. The only problem is that when I attempt to connect to localhost, an error message says that the BOINC Manager crashed three times in two minutes and the connection fails. I checked on the Berkeley site, the Oxford site, and Synaptic, and all the versions are the same.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55056 - Posted: 31 Oct 2016, 7:45:36 UTC - in response to Message 55055.  
Last modified: 31 Oct 2016, 7:55:44 UTC

The two parts, Client and Manager, communicate with each other several times a second on one of the internal ports. I think it's "pi" - port 31416.
So, perhaps it's a permissions problem. Which would also explain why it happened after an upgrade.
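
One way to check whether anything is listening on that port, as a rough sketch: `listening_on` is a made-up helper that reads `ss -ltn` (or `netstat -ltn`) style output on standard input, and 31416 is simply the port number mentioned above. It only shows whether a listener exists; it says nothing about permissions.

```shell
#!/bin/sh
# listening_on PORT
# Reads `ss -ltn` or `netstat -ltn` style output on stdin and returns 0
# if some socket is listening on PORT, 1 otherwise.
listening_on() {
    awk -v port="$1" '
        /LISTEN/ {
            # Look at every address:port field; the port is the part
            # after the last colon (this also covers IPv6 addresses).
            for (i = 1; i <= NF; i++) {
                n = split($i, parts, ":")
                if (n > 1 && parts[n] == port) { found = 1 }
            }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

It could be used as `ss -ltn | listening_on 31416 && echo "something is listening"`. If nothing is listening, the client itself is probably not running, which is a different problem from the Manager being refused a connection.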

edit
Forgot to say:
Your tasks are probably still there, except that the Manager can't show you what's happening with the behind-the-scenes client.
James Jadesword

Joined: 11 Jul 15
Posts: 10
Credit: 763,922
RAC: 0
Message 55059 - Posted: 31 Oct 2016, 12:04:36 UTC - in response to Message 55056.  

I am totally lost. How do I reconnect? What command? Where and how do I apply the command? That which is available in the graphical user interface does not have an option for a port.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55063 - Posted: 31 Oct 2016, 19:51:59 UTC - in response to Message 55059.  
Last modified: 31 Oct 2016, 19:52:27 UTC

Ah, bad news here I'm afraid.
Fixing this is DIY - you have to roll up your sleeves and get down among the plumbing.
i.e. become superuser and check the permissions of various bits to find out what is blocking the connection.

Perhaps the best idea would be to move this to the BOINC/dev message board, under Questions and problems. It's not the first time this has cropped up.

Having mentioned BOINC/dev reminds me that there have been some recent posts there about problems with the latest version of Ubuntu, although they may have been in relation to GPU cards.
Some searching may turn up something useful.
James Jadesword

Joined: 11 Jul 15
Posts: 10
Credit: 763,922
RAC: 0
Message 55064 - Posted: 31 Oct 2016, 23:03:22 UTC - in response to Message 55063.  

Fixing this is DIY


The problem fixed itself. I turned off the computer while I was away and when I returned and booted it up, there were a couple of boot time errors. Seems that plymouthd was the problem. The upside is that when the BOINC Manager started, it was connected to localhost. Yes, the project and the task were still there. The task is running as I type this. I hate problems that solve themselves as they tend to recur. Now off to the Mint site to find out if I can remove plymouth as, in my not so humble opinion, it is not essential.