Questions and Answers :
Unix/Linux :
Shutting down for re-boot.
Author | Message |
---|---|
Joined: 15 May 09 Posts: 4529 Credit: 18,661,594 RAC: 14,529 |
I was hoping that this had ceased to be an issue, and for tasks running under WINE that seems to be the case. But with native Linux tasks, even if I suspend computation, wait a few minutes and then use File > Exit, I seem to lose a task about one time in three. I will revert to waiting until no tasks are present before updating the kernel. I'd be interested to know whether others are still experiencing this. Is it more of a problem during kernel updates, or does it happen on any shutdown and restart? I have not lost any when I hibernate. |
Joined: 6 Aug 04 Posts: 124 Credit: 9,195,838 RAC: 0 |
How do you suspend computation? Per task, per project, or globally? Suspending each task seems to work better for me. Linux Users Everywhere @ BOINC |
Joined: 15 May 09 Posts: 4529 Credit: 18,661,594 RAC: 14,529 |
I usually suspend per task and then globally, resuming in reverse order. It seems to be a particular issue when the kernel is updated. Restarting at other times seems to drop the failure rate to about one in ten, but the data are a bit sparse because I don't restart that often except when the kernel needs updating. |
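The suspend order described in this post (per task, then globally, resuming in reverse) can be scripted with `boinccmd`. This is only a minimal sketch: `PROJECT_URL` and the task names are placeholders to be replaced with values from `boinccmd --get_tasks`, and the `run`, `pause_boinc` and `resume_boinc` helpers are hypothetical names, not part of BOINC. `boinccmd` may also need to be run from the BOINC data directory, or given `--passwd`, to authenticate with the client.

```bash
# Sketch: suspend named tasks, then all computation; resume in reverse order.
# PROJECT_URL is a placeholder, not the real project URL.
PROJECT_URL="${PROJECT_URL:-https://example.invalid/}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Suspend each task given as an argument, then suspend computation globally.
pause_boinc() {
    local task
    for task in "$@"; do
        run boinccmd --task "$PROJECT_URL" "$task" suspend
    done
    run boinccmd --set_run_mode never
}

# Reverse order: lift the global suspension first, then resume each task.
resume_boinc() {
    local task
    run boinccmd --set_run_mode auto
    for task in "$@"; do
        run boinccmd --task "$PROJECT_URL" "$task" resume
    done
}
```

Calling `DRY_RUN=1 pause_boinc task_a task_b` prints the commands without touching the client. Note that `auto` returns the client to preference-based running; if the machine normally runs in `always` mode, that value would be the one to restore.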
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
Dave, I've had better luck with the following global compute preference checked: "Leave non-GPU tasks in memory while suspended" = yes. If that is disabled on your account, try enabling it and see if that improves things. |
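One way to confirm what the client actually has on disk for this preference is to look for the `leave_apps_in_memory` element in a BOINC preferences file. The sketch below assumes the Debian/Ubuntu package layout for the default path; the helper name `leave_in_memory_enabled` is hypothetical. Preferences set on the project website normally arrive in `global_prefs.xml` in the same directory, so that file may be the one to check instead.

```bash
# Sketch: succeed if "leave non-GPU tasks in memory" is switched on in a
# BOINC preferences file. The default path is an assumption; pass another
# file as the first argument if your installation differs.
leave_in_memory_enabled() {
    local prefs_file="${1:-/var/lib/boinc-client/global_prefs_override.xml}"
    # The option is stored as <leave_apps_in_memory>1</leave_apps_in_memory>.
    grep -Eq '<leave_apps_in_memory>[[:space:]]*1(\.0+)?[[:space:]]*</leave_apps_in_memory>' "$prefs_file"
}
```

For example, `leave_in_memory_enabled /var/lib/boinc-client/global_prefs.xml && echo enabled` prints `enabled` only when the setting is present and set to 1; a missing file counts as not enabled.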
Joined: 15 May 09 Posts: 4529 Credit: 18,661,594 RAC: 14,529 |
Thanks for adding that. I have in fact had "Leave non-GPU tasks in memory while suspended" enabled on my boxes for many years. The measures I have outlined are in addition to that. I'm not sure whether something has changed in the tasks or in more recent incarnations of BOINC, but of late, even when I have had restarts due to power failure (an electrician turning the mains off without warning), I haven't lost tasks to it. Something has improved but I don't know what. (At a guess, the change happened over the last 9 months to a year.) |
Joined: 12 Dec 14 Posts: 5 Credit: 14,162,005 RAC: 5,698 |
Hi. I've recently started running CPDN tasks on my Linux box again. Am I correct in remembering that HadAM4/N216 tasks crash if I reboot the system? It would be nice to know. Thanks. Richard |
Joined: 15 May 09 Posts: 4529 Credit: 18,661,594 RAC: 14,529 |
Hi. On my last reboot I lost one task out of eight, but that was an Ubuntu version upgrade, which includes a kernel upgrade and so, in my experience, increases the chance of problems. I would say there is still a risk, but assessing its level needs more systematic research than my impressions, since I have not been making a note of it every time. |
Joined: 6 Aug 04 Posts: 195 Credit: 28,192,402 RAC: 10,436 |
Hi. I'm running Ubuntu 20.04 under Oracle VM VirtualBox on a Windoze10 machine. Before shutting down the Ubuntu VM or Windoze, I always suspend the CPDN/BOINC activity. When Windoze decided on its last unexpected update and reboot, I lost one Hadam4h (RIP) of the four CPDN tasks running on the Ubuntu VM. |
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
(Quoting an earlier post: "Thanks for adding that. I have in fact had 'Leave non-GPU tasks in memory while suspended' enabled on my boxes for many years. The measures I have outlined are in addition to that.") My experience is similar. I do check "leave non-GPU tasks in memory", and I always suspend tasks before a reboot when the reboot is optional (for a kernel upgrade or such). Yup, "Something has improved but I don't know what." e |
Joined: 7 Sep 16 Posts: 262 Credit: 34,896,361 RAC: 16,952 |
I've sure not figured out how to do it. Even if I suspend tasks and stop BOINC before shutting down and rebooting, I still lose many over time. I've given up on running the Linux CPDN tasks on anything that reboots regularly. A one-off glitch is OK, but since I'm doing most of the work in a solar powered office, I just suspend the machines overnight and resume them in the morning - that doesn't bother anything. If I do need to reboot for updates or such, I drain the tasks out first and let them all finish before I deliberately reboot. |
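The "drain the tasks first" approach can be sketched as a small polling loop around `boinccmd`. This assumes that each task block in `boinccmd --get_tasks` output contains a `project URL:` line, and that the project URL passed in matches the one the client reports; the function names `count_project_tasks` and `drain_project` are hypothetical. Tasks that have finished but not yet been reported still count, so the loop errs on the cautious side.

```bash
# Sketch: count tasks for one project in `boinccmd --get_tasks` output,
# which is read from standard input.
count_project_tasks() {
    local url="$1"
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -cF "project URL: $url" || true
}

# Ask the client to fetch no new work for the project, then poll until
# no tasks for that project are listed any more.
drain_project() {
    local url="$1" tasks remaining
    boinccmd --project "$url" nomorework || return 1
    while :; do
        # Bail out rather than report "drained" if the client is unreachable.
        tasks="$(boinccmd --get_tasks)" || return 1
        remaining="$(printf '%s\n' "$tasks" | count_project_tasks "$url")"
        [ "$remaining" -eq 0 ] && break
        echo "$remaining task(s) still present; checking again in 10 minutes"
        sleep 600
    done
    echo "No tasks left for $url"
}
```

Once `drain_project` returns, the machine holds no work for that project and can be rebooted; `boinccmd --project URL allowmorework` re-enables work fetch afterwards.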
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Weather is a "chaotic system" over a short period, and climate is an expansion of that short period into months, years and millennia. Most of the models that this project runs were developed by and for the UK Met Office, where they run on supercomputers non-stop. Because of the "chaotic system" part, they are very sensitive to being interrupted, and attempting to run these desktop versions on anything other than a plain, simple, vanilla system can lead to trouble. Anyone constantly having crashes is using a computer that just isn't stable enough for this work, no matter how wonderful it is at doing everything else. Doing lots of other things at the same time may also be part of the problem. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,193,804 RAC: 2,852 |
(Quoting an earlier post: "Weather is a 'chaotic system' over a short period, and climate is an expansion of the short period into months / years / millennia.") I am unconvinced. However chaotic the system being simulated by these models may be, the computer they run on is deterministic, so results should always be repeatable. If they are not, either the hardware the model runs on is faulty (i.e., non-deterministic) or the software has bugs in it (e.g., it reads parts of memory that hold unassigned values). In either of these two cases, it is not the model that is non-deterministic. On my main machine it takes about 8 days to run an N216 model and I reboot about once a week. So, since I run 3 or 4 models at a time, I always reboot while some are nominally running. My drill is that I set BOINC to no new tasks, suspend the running tasks, and then reboot, updating the system, sometimes including the kernel. As far as I remember, I have never lost a task since I got my current system (I believe last September). |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Jean-David, if you don't lose tasks, then your set-up is stable, so this doesn't apply to you. It's people who DO keep crashing tasks that have a problem, and only they can sort out why. A lot of it is Windows, and the way it takes over and reboots when IT wants to. Admittedly, I haven't run Windows for years, so I'm only going by what I've read. Still, in the long run, the work does eventually get completed by someone, and that's what matters. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,193,804 RAC: 2,852 |
(Quoting an earlier post: "Admittedly, I haven't run Windows for years, so I'm only going by what I've read.") Me neither. There was a bad batch some years ago that suffered from this but, IIRC, they fixed that problem in about a week. I got this machine about last September, and I have had no problems with it once I got some SELinux problems sorted out. I also got a little Dell PC running Windows 10 so I could do my taxes on it and download new maps for my Garmin GPS. Since it is sitting there with nothing else to do, I installed BOINC on it and it runs Climateprediction, WCG, Rosetta, and Universe. Computer 1512658: domain name DESKTOP-K1UQGC4; created 19 Dec 2020, 22:21:58 UTC; CPU type GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]; number of processors 8; operating system Microsoft Windows 10 Core x64 Edition, (10.00.19043.00). It is currently running five WeatherAtHome tasks. All tasks for computer 1512658, by state: All (6), In progress (5), Validation pending (0), Validation inconclusive (0), Valid (1), Invalid (0), Error (0). By application: all 6 are Weather At Home 2 (wah2); every other application listed (OpenIFS, UK Met Office Coupled Model Full Resolution Ocean, HadAM4 at N144, HadAM4 at N216, HadCM3 short, HadSM4 at N144, and wah2 region independent) shows 0. |
Joined: 1 Sep 04 Posts: 161 Credit: 81,512,201 RAC: 928 |
I have had very good luck NOT crashing models when re-booting by first shutting down the boinc-client. I am using Ubuntu, so I open a Terminal and enter "sudo service boinc-client stop", then Restart from the Desktop drop-down menu. I have NOT been able to find any documentation explaining the difference between: a) sudo service boinc-client stop, then Restart from the Desktop drop-down menu; b) clicking Restart Now in the Software Updater window (after an update); c) reboot from the Terminal. Does either b) or c) do an orderly termination of the boinc-client? When a computer crashes (locks up or freezes, or there is a power interruption, etc.) I expect model crashes and feel fortunate if they don't happen. |
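Option a) can be written as a short script that stops the client, confirms the client process has really gone, and only then reboots. This is a sketch under assumptions: `boinc-client` as the service name comes from the post, while `boinc` as the process name and the 120-second limit are guesses to adjust; `wait_until_gone` and `stop_boinc_then_reboot` are hypothetical helper names. It does not settle whether b) or c) are equally orderly.

```bash
# Wait until no process with the given exact name is left, checking once a
# second. Returns 0 when the process is gone, 1 if it is still there after
# the given number of checks.
wait_until_gone() {
    local name="$1" tries="${2:-120}"
    while pgrep -x "$name" >/dev/null 2>&1; do
        tries=$((tries - 1))
        [ "$tries" -le 0 ] && return 1
        sleep 1
    done
    return 0
}

# Stop the client service, confirm the client has exited, then reboot.
# Service name "boinc-client" and process name "boinc" are assumptions.
stop_boinc_then_reboot() {
    sudo service boinc-client stop || return 1
    if wait_until_gone boinc 120; then
        sudo reboot
    else
        echo "BOINC client still running after 120 s; not rebooting" >&2
        return 1
    fi
}
```

Refusing to reboot while the client is still running is a deliberate choice here: it trades convenience for a chance to investigate why the client did not exit cleanly.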
Joined: 5 Aug 04 Posts: 1120 Credit: 17,193,804 RAC: 2,852 |
(Quoting an earlier post: "I have had very good luck NOT crashing models when re-booting by first shutting down the boinc-client.") I am not saying there is anything wrong with what you do. I do not have all the answers. On my RHEL8 system, starting and stopping background tasks (actually all tasks) is done, and in the correct order, by the systemd system. I do not even need to know how that works. I know that if I boot the system, the boinc-client is automatically started, after anything it requires to be running has already been started. And when I reboot the system, even implicitly, it takes them down in the correct order too. What I normally do is, about a day before, set no new tasks for all my projects. On the day I am going to do the system updates and reboot, I suspend all tasks that have not been started. I then have lunch or something. Then I suspend all the non-Climateprediction tasks, and then the Climateprediction tasks. I then log out as me and log in as root. I run the software program that checks for updates (I already know there will be some). If it finds them, it downloads them, installs them, and reboots. |
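For anyone who does want to see how systemd will take the client down, the unit's stop settings can be inspected. systemd stops a service by sending it a stop signal and, if processes remain after the stop timeout, killing them; whether that timeout is long enough for a climate model to exit cleanly is not established in this thread. The sketch below assumes the unit is called `boinc-client`, and `show_stop_settings` is a hypothetical helper name.

```bash
# Sketch: show how systemd is configured to stop a service at shutdown.
# The default unit name "boinc-client" is an assumption.
show_stop_settings() {
    local unit="${1:-boinc-client}"
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not found; this host does not appear to use systemd" >&2
        return 1
    fi
    # TimeoutStopUSec: how long systemd waits after the stop signal before
    # killing whatever is left. KillSignal and KillMode: which signal is
    # sent, and to which processes of the unit.
    systemctl show --property=TimeoutStopUSec,KillSignal,KillMode "$unit"
}
```

Running `show_stop_settings` on the machine in question prints the three properties, one per line, which gives a starting point for comparing a manual service stop with a plain reboot.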
Joined: 7 Aug 04 Posts: 2183 Credit: 64,822,615 RAC: 5,275 |
(Quoting an earlier post: "I've sure not figured out how to do it.") From looking at the tasks on your PCs, you have an exceptional record of completing tasks successfully. There was only one PC that had a significant number of failures, and those were the hadcm3s models, which are more finicky to begin with. There were some errors with the recent hadam4 (not hadam4h) batch, but everyone had those as there was a batch configuration problem. |
Joined: 7 Sep 16 Posts: 262 Credit: 34,896,361 RAC: 16,952 |
Good to know, though I'm not sure what that would look like if I were actually rebooting systems regularly instead of sleeping them... Production is down now with the heat, but will be back up this winter. I've been trying to figure out how to get the cheap preemptible cloud instances to hibernate cleanly so they can run CPDN units without having to actually stop/start them, but it's been tricky making it work reliably. |
©2024 cpdn.org