Questions and Answers : Unix/Linux : Trying to get tasks to not crash Linux client, now not receiving tasks
Author | Message |
---|---|
Joined: 19 Sep 17 Posts: 9 Credit: 5,688,114 RAC: 1,074 |
I recently started using BOINC again (previously used it in 2003-2004, other programs until 2013) to contribute to climate modeling. Unfortunately, I've been having a lot of trouble getting CPDN tasks to work properly on my main PC running Linux Mint 18.2. Here's the task list. I installed the BOINC client and manager from the Mint/Ubuntu repository and moved /var/lib/boinc-client to another partition with much more space, making sure to change the BOINC_DIR line in /etc/default/boinc-client. That seemed to work with up to 8 CPDN tasks running, but in the morning I found that the BOINC client had crashed and would crash again immediately every time I started it. This was what appeared in syslog when it crashed initially: "Sep 19 05:29:38 mark-main systemd[1]: boinc-client.service: Main process exited, code=exited, status=193/n/a"; "Sep 19 05:29:40 mark-main systemd[1]: boinc-client.service: Unit entered failed state."; "Sep 19 05:29:40 mark-main systemd[1]: boinc-client.service: Failed with result 'exit-code'." The exact same errors occurred each time I tried restarting the client. I tried modifying client_state.xml and deleting files to clear the problem task, but nothing I did helped, and it appeared to cause the remaining tasks to go into an error state at the next client start/crash. I then removed all references to the project I could find, moved the data directory back to /var/lib/boinc-client, and reverted BOINC_DIR, thinking maybe it didn't like that I moved the directory. The client started and I started one task which appeared to get farther along, but that ran out of space as I was doing something else that used a lot of /tmp and caused a computation error in the task. I moved the /var/lib/boinc-client directory back to the other partition as before, but this time just used a symbolic link without changing BOINC_DIR. I also made sure to chown boinc:boinc on the moved directory. I started one task again, but again it reached around 10% and crashed the client. 
I found that it appeared to be trying to send a result around the same time, so I deleted just the result part from client_state.xml and that allowed the client to restart. However, even though just about every other reference to the task was automatically removed and I even reset the project, I wasn't receiving more tasks and the website kept the failed task as 'In Progress' until I removed and re-added the project. I was going to try suspending network activity to see if that would prevent it from crashing before completion, but CPDN hasn't sent any new tasks for about a day now; I just keep getting this in the log: "Fri 22 Sep 2017 03:16:29 PM EDT | climateprediction.net | Sending scheduler request: To fetch work."; "Fri 22 Sep 2017 03:16:29 PM EDT | climateprediction.net | Requesting new tasks for CPU"; "Fri 22 Sep 2017 03:16:31 PM EDT | climateprediction.net | Scheduler request completed: got 0 new tasks"; "Fri 22 Sep 2017 03:16:31 PM EDT | climateprediction.net | No tasks sent". I can see in the server status that there are still plenty of unsent tasks in wah2, the same application that I was receiving before. Because of these issues and since the request_delay (communication deferred) time is so long, I've started contributing to WCG to fill the time, but I'd really prefer my resources go toward helping us understand the climate. I have not had a single issue with WCG after about 100 tasks. Fortunately, all is not lost for my CPDN efforts. I have an Intel NUC server running Debian that has so far been crunching without issue on 3 of its 4 cores, currently 23-39% between the tasks. Any help in resolving this would be appreciated. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Welcome back. Unfortunately it is a rough time for Linux and Mac users. The problem detailed in this thread https://www.cpdn.org/cpdnboinc/forum_thread.php?id=8474 is occurring with some tasks that have a lot of months in them. It is a combination of a BOINC limitation/bug and a CPDN task problem that affects both Mac and Linux on some tasks. The result is the inability to get BOINC to start back up and continue processing tasks. My advice is to continue on with WCG or other projects. Hopefully some way around this problem can be found. I think the developers are going to do something next week. It might be deprecating Mac and Linux apps until the CPDN problem is found, or creating a Windows-only app for the problem task sets. Hopefully we'll have some news up next week on what the path forward will be. |
Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Pilot_51, if you are interested in doing some experimentation, I might suggest you re-install BOINC completely using the default settings. I don't use Mint so I am no help there, but I do use Ubuntu. To eliminate the possibility that your move of the boinc-client directory caused other problems, I would suggest leaving everything in the default locations. I know you wrote that it was space limited. In my experience, installing from the Ubuntu repository is an excellent way to go. I tried moving BOINC files under Ubuntu once and had so many problems that I threw in the towel. I don't remember the details. If BOINC works with the "default" locations, you can try moving it and see if it works. If it doesn't work, you know why. |
Joined: 19 Sep 17 Posts: 9 Credit: 5,688,114 RAC: 1,074 |
I completely wiped and reinstalled boinc-client and kept the data folder in the default location, making a backup copy of the fresh directory just in case. I also managed to free up about 7GB of space by uninstalling some software I hadn't used in a while, giving BOINC 9GB to work with and a 1GB margin. Unfortunately, that didn't fix the 0 tasks issue. "I think the developers are going to do something next week. It might be deprecating Mac and Linux apps until the CPDN problem is found, or creating a Windows-only app for the problem task sets. Hopefully we'll have some news up next week on what the path forward will be." That sounds very plausible and I hope the lack of tasks is intentional in an effort to prevent and ultimately fix the crash issue. Can any other Linux/Mac users confirm whether they've received new WUs since a day or two ago? I suppose it's possible the server just didn't like how all 17 tasks it sent this computer failed, 16 of which were abandoned. For now, I'll stick with WCG and continue checking CPDN for tasks, as well as keeping an eye out for any news on the crash issue. |
Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Hi Pilot_51, I also haven't received Linux WUs in the last few days, and after some info exchange it seems very likely there are none in the hopper. In such cases I usually use WINE, and I haven't had any issues on my two 14.04 LTS machines (there are a few WINE-related threads). I recently launched an i7-4790 Ubuntu 16.04 LTS machine with 10GB of BOINC data space and usage goes up to 5-7 GB, so I guess you will be fine. This time I set up a separate /var partition (during the Ubuntu install) as I did not want to move the CPDN data folder around after install. I do have it moved to another HDD on one of my Linux boxes, but finding the instructions for how to do it took a while, so I went for a /var partition. |
Joined: 19 Sep 17 Posts: 9 Credit: 5,688,114 RAC: 1,074 |
Yeah, I noticed fewer than 300 unsent WUs this morning and now it's 0, so now it's a wait for more to become available. I know I could use WINE and do use it for the occasional Windows-only game, but I'd rather not make that compromise and reduce the importance of them making things work correctly on Linux. If there's one thing I like less than running Windows-only software in WINE, it's running cross-platform software in WINE because the native build is broken or buggy, so I'd either deal with the bugs or not use it at all. I know, I'm weird. Once things get going again and I'm receiving WUs, I'll make sure it completes a task on the main partition and then see if simply moving the data dir breaks the next task. I'm still quite determined to find a stable solution that lets me store the CPDN data on another drive, though probably won't go as far as reinstalling the OS or changing the location of /var. |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
There is a new post in the news section on this topic. |
Joined: 19 Sep 17 Posts: 9 Credit: 5,688,114 RAC: 1,074 |
Thanks for the heads-up, that helps clear up what was going on. Interestingly, all 3 tasks given to my Debian server are still going great, currently at 41%, 58%, and 70%. It would appear that at least one bad batch was pnw25, since that is not running on my server and it was always running on my main system when the client crashed, including the very last task which was running alone. The second-to-last task that got further along and ran out of storage was cam25. All the earlier tasks were running alongside several others including two pnw25 tasks. So, I think it's safe to say that the location of the data dir had nothing to do with the crashes, and I honestly don't know how it could have. Without knowing exactly what was causing the crash in pnw25, I doubt there was anything that could be done short of using WINE to prevent it from crashing. If I were to receive more WUs with what I know now, assuming the issue wasn't fixed, I'd just abort any pnw25 tasks. |
Joined: 15 May 09 Posts: 4540 Credit: 19,022,240 RAC: 20,762 |
"If I were to receive more WUs with what I know now, assuming the issue wasn't fixed, I'd just abort any pnw25 tasks." The tasks that caused the crashes have been deprecated for both Linux and Macs until a fix can be found. This will mean fewer tasks for us, however. Work is still going on to try to identify the cause of the problem and resolve it, but there has been no recent update on how far this has got. |
Joined: 19 Sep 17 Posts: 9 Credit: 5,688,114 RAC: 1,074 |
Doh! A bit off topic, but I'll use this opportunity for a reminder. I lost one on my server because it didn't have libz.so.1. I think it happened at the very end as it was wrapping up. I made sure to install dependencies on my main system and forgot to do it on my server. I just did (lib32ncurses5 and lib32z1) and verified with ldd, so that should prevent the same thing occurring to the remaining two tasks with about 1.5 and 8 days remaining. For anyone getting started, remember to check/install dependencies on all systems! |
Joined: 19 Sep 17 Posts: 9 Credit: 5,688,114 RAC: 1,074 |
Unfortunately, the remaining two tasks failed with the same error. Once the second of the three tasks failed, I restarted boinc-client to reload everything in hopes of saving the last task, but it didn't work. It would be great if BOINC checked dependencies before starting a task, displaying a warning if they aren't satisfied and requiring the user to resolve it before the task starts. It's a waste of resources to spend 15 days on a task that was doomed to fail from the beginning. |
Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Regarding missing libz.so.1 - I had that error on one computer a few days ago (Ubuntu 16.04 LTS 64-bit). I forgot to check my notes when I installed BOINC. I was missing lib32z1. The following libraries MIGHT have something to do with the missing libz.so.1 library. Depending on the Ubuntu version they may not be available. But, in any case, I have installed all of them (if available): lib32z1, zlib1g, zlib1g:i386, lib64z1, lib64z1:i386, libx32z1, libzadc1. Does anyone think this was a dumb idea? |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"Regarding missing libz.so.1" - On any recent version of Ubuntu, I just run this and it takes care of everything: "sudo apt-get install lib32ncurses5 lib32z1 gcc-4.7-multilib". I'm sure it installs some items that aren't strictly necessary for getting CPDN running on 64-bit distributions, but it works. |
Joined: 19 Sep 17 Posts: 9 Credit: 5,688,114 RAC: 1,074 |
"On any recent version of Ubuntu, I just run this and it takes care of everything." My server is Debian and gcc-4.7-multilib isn't available in the repo. I would think that if all dependencies are satisfied according to ldd, as accomplished by installing lib32z1 in this case, nothing more would be needed. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Indeed. That should be fine. |
Joined: 7 May 17 Posts: 16 Credit: 3,480,030 RAC: 2,845 |
Any news on Linux/Mac tasks? |
Joined: 15 May 09 Posts: 4540 Credit: 19,022,240 RAC: 20,762 |
"Any news on Linux/Mac tasks?" I would guess that, at least until BOINC 7.8.3 becomes widespread or the issue crashing some of the WAH2 tasks is resolved, there won't be a lot. The last I got were two hadcm3 tasks that had already failed on one or two other computers and promptly failed on mine also. |
Joined: 7 May 17 Posts: 16 Credit: 3,480,030 RAC: 2,845 |
Any news on Linux/Mac tasks? (Do we have any idea how much compute capacity is idled by the lack of Linux/Mac workers? Hopefully not that much? Though the backlog of tasks seems pretty high now...) |
Joined: 15 May 09 Posts: 4540 Credit: 19,022,240 RAC: 20,762 |
"Any news on Linux/Mac tasks?" Afraid not. At some point there will probably be some more hadcm3s tasks, but it is down to the researchers giving Oxford the work to send out. |
Joined: 19 Sep 17 Posts: 9 Credit: 5,688,114 RAC: 1,074 |
|
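
Editor's note: several posts in this thread describe tasks failing at the very end because libz.so.1 was missing, and verifying the fix with ldd. A minimal sketch of that check follows, assuming a glibc-style ldd that marks unresolved libraries with "=> not found". The function names missing_libs and check_binary are hypothetical helpers, not part of BOINC or CPDN.

```shell
#!/bin/sh
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd output on stdin, so it can be exercised without a real binary.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Check one executable (hypothetical helper).
# Returns non-zero and lists the libraries if any are unresolved.
check_binary() {
    missing=$(ldd "$1" 2>/dev/null | missing_libs)
    if [ -n "$missing" ]; then
        printf 'Missing libraries for %s:\n%s\n' "$1" "$missing"
        return 1
    fi
    return 0
}
```

It could be run as check_binary /path/to/cpdn_app_binary (a placeholder path) on each machine before letting a multi-day task start. The sketch only reports what ldd itself marks as not found; it does not install anything and does not cover cases where ldd cannot inspect the binary at all.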