Questions and Answers : Unix/Linux : BOINC crashes when running CPDN
Author | Message |
---|---|
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I have been getting BOINC crashes when running CPDN on an Ubuntu 17.04 machine, and even after a reboot BOINC is not able to connect to its client. At first, it appeared to be a problem with the new BOINC 7.8.2, but after wiping the machine and reverting to 7.6.33 it occurred again. That is a dedicated BOINC machine that runs 24/7, and the only other project running was WCG (usually MCM and FAHV are all they send me these days). I have been running CPDN on that machine for about a month, after using the single-line install procedure for the 32-bit libraries, and had no problems until recently. Since WCG is always very reliable for me, I have to suspect that CPDN is the cause of the crashes, but I do not know for certain. The problem is that I then can't get BOINC working again, even after attempting an uninstall/reinstall, though most recently I have found a fix for that: resetting the permissions for BOINC, which appear to be messed up by the crashes. I don't know how to proceed further, except to not run CPDN on Ubuntu machines and stick to my Windows machine. If anyone is interested in more details, the problem is outlined on the BOINC forum: http://boinc.berkeley.edu/dev/forum_thread.php?id=11853 (That discussion on BOINC 7.8.2 started with an OS X problem, which appeared to be related at first, but is apparently different.) EDIT: This is the Ubuntu machine, but the host number changed after the first crash. https://www.cpdn.org/cpdnboinc/show_host_detail.php?hostid=1442709 https://www.cpdn.org/cpdnboinc/show_host_detail.php?hostid=1443282 |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
I'm working in that BOINC thread 11853 that Jim mentions, and a couple of people have sent me their files for investigation - both Mac users, as it happens. Both users have a failed CPDN task in their logs - WAH2 PNW. The task record shows a crash dump and 51 upload files - a total of about 14 KB for the <result> section in client_state.xml. There is a growing suspicion that the BOINC client's buffers can't handle that many upload files, and that these tasks may be causing the problems. Has anyone successfully run one of these tasks, and - more to the point - has anyone completed one and uploaded the results successfully? |
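To see whether a client_state.xml contains one of these oversized task records, something like the following rough sketch could be used. It assumes each task record is a `<result>...</result>` block containing a `<name>` element and one `<file_ref>` per output file; those tag names and the helper `result_sizes` are assumptions made for illustration and are not taken from the BOINC source. It works on plain text rather than an XML parser, since a crash dump embedded in the file may not be well-formed XML.

```python
import re

# Assumption: a task record is a <result>...</result> block with a <name>
# element and one <file_ref> element per upload file.
RESULT_RE = re.compile(r"<result>.*?</result>", re.DOTALL)
NAME_RE = re.compile(r"<name>(.*?)</name>", re.DOTALL)


def result_sizes(state_text):
    """Return (task name, block size in bytes, <file_ref> count), largest first."""
    rows = []
    for match in RESULT_RE.finditer(state_text):
        block = match.group(0)
        name = NAME_RE.search(block)
        rows.append((
            name.group(1).strip() if name else "?",
            len(block.encode("utf-8")),
            block.count("<file_ref>"),
        ))
    # Largest records first, so a suspect task shows up at the top.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Example use (the path is an assumption for an Ubuntu package install):
#   with open("/var/lib/boinc-client/client_state.xml", errors="replace") as f:
#       for name, size, uploads in result_sizes(f.read())[:5]:
#           print(name, size, uploads)
```

A record of around 14 KB with 51 file references, as described above, would stand out at the top of that list.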
Joined: 16 Jan 10 Posts: 1084 Credit: 7,815,352 RAC: 5,242 |
Richard: I've done PNW models with 51 uploads from an older batch (588) on Windows and Mac. I would have run them off-line, if that makes a difference. Or is this only a recent BOINC Manager update problem? Lots of errors on the Mac: wah2_pnw25_c2es_200312_49_588_011084451_0. Peak swap size looks large, but I don't usually follow those variables ... Actually, similar NetCDF conversion errors on Windows: wah2_pnw25_c2ok_200312_49_588_011084803_0. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
Haven't narrowed it down closely enough - it may be a combination of a crashed model for one of the familiar reasons, with a full dump/trace in stderr_txt, plus all those error -161 failed uploads. We don't think it's down to the new v7.8.2 alone, because people are reverting to previous versions and BOINC still won't start. There is a proposed fix in the BOINC code-base, but unfortunately many fixes weren't deployed in what was announced as a new public release. There's talk of another version soon, this time with all the fixes, this one included - so fingers crossed. No timetable yet. Unfortunately, task 20729988, which started this discussion, probably won't ever be reported: we had to delete it to get the BOINC client to start running again. Might be worth keeping an eye on reissues, or other batch 658 tasks, to see if a pattern emerges. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Edited after fully reading the BOINC message board post. Certain wah2 science app batches of CPDN tasks crash on otherwise stable Mac and Linux PCs. These batches mostly crash after 1 model month, on Jan 1 of the next model year, as the regional worker takes over after the global worker finishes that day, so something goes wrong in these batches at that point. These batches run okay on Windows PCs. The common theme for these batches is "naturalized" parameter sets, and the problem dates back to April. However, in the last couple of months, when this type of crash occurs, the BOINC client can sometimes no longer communicate, and restarting BOINC results in errors similar to the ones posted in this thread. The only way I have found to recover is to edit client_state.xml and remove all entries related to the crashed task. The CPDN programmers know of the crash problem with the naturalized parameter sets, but have not been able to isolate why it only occurs on Linux and Mac. The input files should be the same as for the Windows app. Edited: Apparently the problem with BOINC losing communication and then not being able to start properly is due to a long stderr that BOINC can't handle. Batches that have many months in them, and that crash early, will have numerous failed uploads listed in stderr. This may be the source of the failure to restart BOINC. |
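The manual clean-up of client_state.xml described above could be approximated with a short script, given here only as a hedged sketch. It assumes the crashed task appears in `<result>` blocks (matched by `<name>`) and `<active_task>` blocks (matched by `<result_name>`); those tag names and the helper `remove_task_entries` are assumptions for illustration. It does not touch workunit or file entries, which may also need removing by hand. BOINC should be stopped and the file backed up before anything is written.

```python
import re


def remove_task_entries(state_text, task_name):
    """Drop <result> and <active_task> blocks that mention task_name.

    Returns the cleaned text and the number of blocks removed.
    Assumption: the name is written as <name>TASK</name> or
    <result_name>TASK</result_name> with no extra whitespace inside the tag.
    """
    removed = 0
    for tag, name_tag in (("result", "name"), ("active_task", "result_name")):
        # Match one whole block, including its indentation and trailing newline.
        block_re = re.compile(
            r"[ \t]*<%s>.*?</%s>[ \t]*\n?" % (tag, tag), re.DOTALL
        )
        needle = "<%s>%s</%s>" % (name_tag, task_name, name_tag)

        def drop(match):
            nonlocal removed
            if needle in match.group(0):
                removed += 1
                return ""  # delete the block that belongs to the crashed task
            return match.group(0)  # keep every other block unchanged

        state_text = block_re.sub(drop, state_text)
    return state_text, removed


# Example use, with BOINC stopped (path and task name are placeholders):
#   import shutil
#   path = "/var/lib/boinc-client/client_state.xml"
#   shutil.copy2(path, path + ".bak")
#   with open(path, errors="replace") as f:
#       cleaned, count = remove_task_entries(f.read(), "TASK_NAME_0")
#   with open(path, "w") as f:
#       f.write(cleaned)
```

Working on text rather than parsed XML is a deliberate choice here, because a long crash dump inside the file may not parse cleanly.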
Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
It seems I'm there as well. Unpleasant surprise: I had one of these PNW 49 tasks that was supposed to start earlier today, and now BOINC does not communicate and it looks like I have no projects attached. I will look into client_state.xml and hopefully will manage to get BOINC working again. EDIT: All lines deleted and BOINC now works. |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,815,352 RAC: 5,242 |
I've now completed one of the batch #658 49-month models on Windows to check whether the NetCDF conversion errors are still there - they are. |
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
My laptop has fallen prey to one of these tasks that I missed aborting. I have re-installed BOINC but, when I have time, will go through the client_state.xml, which I renamed, to see if I can resurrect things. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
There is a new BOINC 7.8.3 development version just released. http://boinc.berkeley.edu/download_all.php EDIT: Here are the release notes: http://boinc.berkeley.edu/dev/forum_thread.php?id=11539#81802 I expect that this addresses the problem we have been having, due to the heroic efforts of Richard Haselgrove and others. But CPDN may no longer be sending out the Linux work units, and the new BOINC is not available in a repository yet, so not much can be done at the moment to check it out. But for the people who compile their own stuff and don't mind jumping into the fire, they might want to check it out. I think that Rosetta, if I read their forum correctly, had its own problem with 7.8.2, though not as bad as CPDN. And they issued new work units to avoid their particular bug anyway. Have fun. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I have now installed BOINC 7.8.3 from the LocutusOfBorg PPA on an Ubuntu 16.04 machine, but of course no CPDN tasks yet. They will presumably need to figure out some way of sending tasks to the BOINC versions that can do them, while preventing them from going to the earlier BOINC versions. I don't recall having ever heard whether BOINC has an easy way to do that. |
Joined: 19 Sep 17 Posts: 9 Credit: 5,688,114 RAC: 1,074 |
I kind of doubt they can restrict WUs by BOINC version, though that would be an ideal solution. If I understand correctly, only the PNW models were crashing. Couldn't they isolate those to Windows and let Mac/Linux get the rest of the models that aren't known to crash? Or are OS restrictions an application-wide thing? |
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
"I have now installed BOINC 7.8.3 from the LocutusOfBorg PPA on an Ubuntu 16.04 machine" Hi Jim, do you have a link for that? When I googled LocutusOfBorg, I could only find versions earlier than the one I got through the standard Ubuntu repositories. Thanks, Dave |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"Hi Jim, do you have a link for that." This works for me: sudo add-apt-repository ppa:costamagnagianfranco/boinc followed by sudo apt-get update (PPA page: https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc). Then you need to do a software update, search for BOINC in the Ubuntu Software Center, and install it. It will be the latest (i.e., 7.8.3) version. |
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
Many thanks Jim. There were a couple of niggles, which I am sure were my fault, but it is now running. I had tried building from source but kept being asked at the ./configure stage to tell it where the openssl libraries were located. I did that and still got the same message, but as I have the packaged version up and running, I will leave investigating that problem for another day/month/year.... |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Well, since you are coming from a different version, you might find this interesting. I manage my Ubuntu machine over the LAN using BoincTasks running on my main Windows 7 PC. To install the necessary files on the Ubuntu PC, I have had to learn a little bit about permissions (I am not a Linux expert), and here is what works for this way of installing BOINC (where USER is your user name): BOINC: Search for BOINC in Ubuntu Software and install • Join the root group: sudo adduser USER root • Join the BOINC group: sudo adduser USER boinc • Allow group to read, write and execute in /etc/boinc-client folder: sudo chmod -R g+rwx /etc/boinc-client • Allow group to read, write and execute in /var/lib/boinc-client: sudo chmod -R g+rwx /var/lib/boinc-client • Reboot • Copy “gui_rpc_auth.cfg” to /etc/boinc-client folder • Copy “remote_hosts.cfg” to /etc/boinc-client folder • Copy “cc_config.xml” (if needed) to /etc/boinc-client folder • Copy "gui_rpc_auth.cfg" to the home directory and reboot • Copy the "app_config.xml" files (if needed) to the project folders in /var/lib/boinc/ |
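As a quick way to confirm that the chmod steps above took effect, a small check like the following might help. It is only a sketch: the helper names `group_has_rwx` and `check_dirs` are made up for illustration, and it inspects only the top-level directory, not every file that chmod -R would have changed.

```python
import os
import stat


def group_has_rwx(path):
    """Return True if the group has read, write and execute bits on path."""
    mode = os.stat(path).st_mode
    wanted = stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP
    return (mode & wanted) == wanted


def check_dirs(paths):
    """Map each path to True/False for group rwx, or None if it does not exist."""
    report = {}
    for path in paths:
        try:
            report[path] = group_has_rwx(path)
        except FileNotFoundError:
            report[path] = None
    return report


# Example use with the two folders named in the post:
#   print(check_dirs(["/etc/boinc-client", "/var/lib/boinc-client"]))
```

A value of False for either folder would suggest that the group permissions have been reset, for example after a crash or a reinstall.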
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
While the latest BOINC (7.8.3) doesn't crash everything, there are still problems with the Mac and Linux builds. This is the latest update: "Sorry, but not at the moment. We have discovered another rather intractable and not easily identified error with the latest Linux and Mac builds, where results get messed up when the model transitions a year. It appears to be a memory issue related to building 32-bit on a 64-bit machine. I am currently trying to resolve it, but it is proving rather messy. I can probably build a Linux app on a 32-bit machine (and have done so locally) and deploy it so that Linux users can crunch. I will look into that on Monday for you." |
©2024 cpdn.org