climateprediction.net (CPDN) home page
Thread 'BOINC crashes when running CPDN'

Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 56862 - Posted: 18 Sep 2017, 13:26:51 UTC
Last modified: 18 Sep 2017, 14:04:38 UTC

I have been getting BOINC crashes when running CPDN on an Ubuntu 17.04 machine, and even after a reboot BOINC is not able to connect to its client. At first it appeared to be a problem with the new BOINC 7.8.2, but after wiping the machine and reverting to 7.6.33 it occurred again. This is a dedicated BOINC machine that runs 24/7, and the only other project running was WCG (usually MCM and FAHV are all they send me these days). I have been running CPDN on that machine for about a month, after using the single-line install procedure for the 32-bit libraries, and had no problems until recently.

Since WCG is always very reliable for me, I have to suspect that CPDN is the cause of the crashes, but I do not know for certain. The real problem is that I then can't get BOINC working again, even after an uninstall/reinstall. Most recently I have found a fix for that: resetting the permissions for BOINC, which appear to be messed up by the crashes.

I don't know how to proceed further, except to not run CPDN on Ubuntu machines and stick to my Windows machine. If anyone is interested in some more details, the problem is outlined on the BOINC forum.
http://boinc.berkeley.edu/dev/forum_thread.php?id=11853
(That discussion on BOINC 7.8.2 started with an OS X problem, which appeared to be related at first, but is apparently different.)

EDIT: This is the Ubuntu machine, but the number changed after the first crash.
https://www.cpdn.org/cpdnboinc/show_host_detail.php?hostid=1442709
https://www.cpdn.org/cpdnboinc/show_host_detail.php?hostid=1443282

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 56863 - Posted: 18 Sep 2017, 15:45:49 UTC

I'm working in that BOINC thread 11853 that Jim mentions, and a couple of people have sent me their files for investigation - both Mac users, as it happens.

Both users have a failed CPDN task in their logs - WAH2 PNW. The task record is showing a crash dump, and 51 upload files - a total of about 14 KB for the <result> section in client_state.xml

There is a growing suspicion that the BOINC client's buffers can't handle that many upload files, and these tasks may be causing the problems. Has anyone successfully run one of these tasks, and - more to the point - has anyone completed one and uploaded the results successfully?
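To see which task carries an unusually large <result> section, something like the following Python sketch could be used. It is only an illustration: the function name result_sizes is made up here, and it assumes each task appears as a <result>...</result> block whose first <name> element is the task name, which should be checked against your own client_state.xml. It works on the raw text, so it does not need the file to be well-formed XML.

```python
import re

# Assumed layout: each task is a <result>...</result> block whose first
# <name> element holds the task name. Verify against your own file.
RESULT_RE = re.compile(r"<result>.*?</result>", re.DOTALL)
NAME_RE = re.compile(r"<name>(.*?)</name>", re.DOTALL)


def result_sizes(state_text):
    """Return [(size_in_bytes, task_name), ...] sorted largest first."""
    sizes = []
    for match in RESULT_RE.finditer(state_text):
        block = match.group(0)
        name = NAME_RE.search(block)
        task = name.group(1).strip() if name else "?"
        sizes.append((len(block.encode("utf-8")), task))
    return sorted(sizes, reverse=True)
```

Reading the file with open(path, encoding="utf-8", errors="replace").read() and printing the first few entries should show whether one task stands out, such as the roughly 14 KB section mentioned above.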

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,815,352
RAC: 5,242
Message 56869 - Posted: 18 Sep 2017, 17:32:09 UTC
Last modified: 18 Sep 2017, 17:43:28 UTC

Richard: I've done PNW models with 51 uploads from an older batch (588) on Windows and Mac. I would have run them off-line, if that makes a difference. Or is this only a recent BOINC Manager update problem?

Lots of errors on the Mac: wah2_pnw25_c2es_200312_49_588_011084451_0. Peak swap size looks large, but I don't usually follow those variables ...

Actually, similar NetCDF conversion errors on Windows: wah2_pnw25_c2ok_200312_49_588_011084803_0.

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 56870 - Posted: 18 Sep 2017, 18:18:12 UTC - in response to Message 56869.  

Haven't narrowed it down closely enough - it may be a combination of a crashed model for one of the familiar reasons, with a full dump/trace in stderr_txt, plus all those error -161 failed uploads. We don't think it's down to the new v7.8.2 alone, because people are reverting to previous versions and BOINC still won't start.

There is a proposed fix in the BOINC code-base, but unfortunately many fixes weren't deployed in what was announced as a new public release. There's talk of another version soon, this time including all fixes including this one - so fingers crossed. No timetable yet.

Unfortunately, task 20729988 - that started this discussion - probably won't ever be reported: we had to delete it to get the BOINC client to start running again. Might be worth keeping an eye on reissues, or other batch 658 tasks, to see if a pattern emerges.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 56871 - Posted: 18 Sep 2017, 19:33:49 UTC
Last modified: 19 Sep 2017, 1:47:47 UTC

Edited after fully reading the boinc message board post.

====================================================================

Certain wah2 science app batches of cpdn tasks crash on otherwise stable Mac and Linux PCs. These batches will mostly crash after 1 model month, on Jan 1 of the next model year as the regional worker takes over after the global worker finishes that day. These batches run okay on Windows PCs. There is some problem that happens on these batches at that point. The common theme for these batches is "naturalized" parameter sets. This dates back to April.

However, in the last couple months, sometimes when this type of crash occurs, the boinc client can no longer communicate, and restarting boinc results in errors similar to the ones posted in this thread. The only way to recover from this for me has been to edit client_state.xml and remove all entries related to the crashed task.

I'm thinking some OS update occurred in the last couple months that changed some files in how boinc works with the science app, or writes something, or who knows... I'm not a programmer or system person, just an IT enthusiast.

The cpdn programmers know of the crash problem with the naturalized parameter sets, but have not been able to isolate the cause as to why it only occurs on Linux and Mac. The input files should be the same for the Windows app.

Whether the boinc "corruption" problem occurs may depend on the OS distribution, version, and what updates have been run on it.

Edited...Apparently the problem with boinc losing communication and then not being able to start properly is due to a long stderr that boinc can't handle. Batches that have many months in them, that crash early, will have numerous failed uploads listed in stderr. This may be the source of the failure to restart boinc problem.
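For anyone who would rather script the clean-up than edit by hand, here is a minimal Python sketch of the "remove all entries related to the crashed task" step. It is an assumption-laden illustration, not an official BOINC tool: strip_task is a made-up name, and the tag names it looks at (name, wu_name, result_name) are what a typical client_state.xml appears to use, so check them against your own file first. It also relies on the file being well-formed XML; if it is not, the parser raises an error and manual editing is the fallback.

```python
import xml.etree.ElementTree as ET

# Child tags assumed to identify which task an element belongs to.
NAME_TAGS = ("name", "wu_name", "result_name")


def strip_task(xml_text, wu_name):
    """Remove every element tied to wu_name; return (new_xml, removed_count)."""
    if not wu_name:
        raise ValueError("wu_name must not be empty")
    root = ET.fromstring(xml_text)
    removed = 0
    # Snapshot the elements first, because children are removed while walking.
    for parent in list(root.iter()):
        for child in list(parent):
            ids = [(child.findtext(tag) or "").strip() for tag in NAME_TAGS]
            if any(value.startswith(wu_name) for value in ids):
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed
```

Stop the BOINC client and keep a backup copy of client_state.xml before writing the result back. On the Ubuntu package install discussed in this thread the file would be expected under /var/lib/boinc-client, but that location is an assumption for other setups.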

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 56934 - Posted: 23 Sep 2017, 23:11:27 UTC
Last modified: 23 Sep 2017, 23:18:34 UTC

It seems I'm there as well, which is an unpleasant surprise.
I had one of these PNW 49 tasks that was supposed to start earlier today, and now BOINC does not communicate and it looks like I have no projects attached. I will look into the client_state.xml and hopefully will manage to get BOINC working again.

EDIT: All lines deleted and boinc now works

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,815,352
RAC: 5,242
Message 56959 - Posted: 25 Sep 2017, 11:43:47 UTC

I've now completed one of the batch #658 49-month models on Windows to check whether the NetCDF conversion errors are still there - they are.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,024,725
RAC: 20,592
Message 56999 - Posted: 29 Sep 2017, 9:21:36 UTC - in response to Message 56959.  

My laptop has fallen prey to one of these tasks that I missed aborting. I have re-installed boinc, but when I have time I will go through the client_state.xml, which I renamed, to see if I can resurrect things.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 57034 - Posted: 4 Oct 2017, 15:58:51 UTC
Last modified: 4 Oct 2017, 16:19:55 UTC

There is a new BOINC 7.8.3 development version just released.
http://boinc.berkeley.edu/download_all.php
EDIT: Here are the release notes:
http://boinc.berkeley.edu/dev/forum_thread.php?id=11539#81802

I expect that this addresses the problem we have been having, due to the heroic efforts of Richard Haselgrove and others. But CPDN may no longer be sending out the Linux work units, and the new BOINC is not available in a repository yet, so not much can be done at the moment to check it out. But for the people who compile their own stuff and don't mind jumping into the fire, they might want to check it out. I think that Rosetta, if I read their forum correctly, had its own problem with 7.8.2, though not as bad as CPDN. And they issued new work units to avoid their particular bug anyway. Have fun.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 57047 - Posted: 5 Oct 2017, 14:14:23 UTC - in response to Message 57034.  

I have now installed BOINC 7.8.3 from the LocutusOfBorg PPA on an Ubuntu 16.04 machine, but of course no CPDN tasks yet. They will presumably need to figure out some way of sending tasks to the BOINC versions that can do them, while preventing them from going to the earlier BOINC versions. I don't recall having ever heard whether BOINC has an easy way to do that.
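As a rough illustration of the kind of check such gating would need, here is a small Python sketch that compares client version strings numerically. It is not BOINC's actual scheduler logic. The function names are invented for this example, and using 7.8.3 as the minimum is only an assumption taken from this thread.

```python
def parse_version(text):
    """Turn '7.8.3' into (7, 8, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def client_can_run(client_version, minimum="7.8.3"):
    """True if client_version is at least the (assumed) minimum version."""
    return parse_version(client_version) >= parse_version(minimum)
```

Comparing tuples of integers rather than raw strings matters because, as plain text, "7.10.2" would sort before "7.8.3".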

Pilot_51
Joined: 19 Sep 17
Posts: 9
Credit: 5,688,114
RAC: 1,074
Message 57054 - Posted: 5 Oct 2017, 19:06:11 UTC - in response to Message 57047.  

I kind of doubt they can restrict WUs by BOINC version, though that would be an ideal solution.

If I understand correctly, only the PNW models were crashing. Couldn't they restrict those to Windows and let Mac/Linux get the rest of the models that aren't known to crash? Or are OS restrictions an application-wide thing?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,024,725
RAC: 20,592
Message 57056 - Posted: 5 Oct 2017, 20:47:01 UTC - in response to Message 57047.  

I have now installed BOINC 7.8.3 from the LocutusOfBorg PPA on an Ubuntu 16.04 machine


Hi Jim, do you have a link for that? I could only find earlier versions than the one I got through the standard Ubuntu repositories when I googled LocutusOfBorg.

Thanks,

Dave

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 57058 - Posted: 5 Oct 2017, 23:10:12 UTC - in response to Message 57056.  

Hi Jim, do you have a link for that?

This works for me:
sudo add-apt-repository ppa:costamagnagianfranco/boinc
sudo apt-get update

https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc

Then, you need to do a software update, then search for BOINC in the Ubuntu Software Center, and install it. It will be the latest (i.e., 7.8.3) version.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,024,725
RAC: 20,592
Message 57059 - Posted: 6 Oct 2017, 8:12:13 UTC - in response to Message 57058.  

Many thanks Jim,
a couple of niggles, which I am sure were my fault, but it is now running. I had tried building from source but kept being asked at the ./configure stage to tell it where the openssl libraries were located. I did that and still got the same message, but as I have the packaged version up and running, I will leave investigating that problem for another day/month/year....

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 57060 - Posted: 6 Oct 2017, 10:10:17 UTC - in response to Message 57059.  

Well, since you are coming from a different version, you might find this interesting. I manage my Ubuntu machine over the LAN using BoincTasks running on my main Windows 7 PC. To install the necessary files on the Ubuntu PC, I have had to learn a little bit about permissions (I am not a Linux expert), and here is what works for this way of installing BOINC (where USER is your user name):

BOINC: Search for BOINC in Ubuntu Software and install
• Join the root group: sudo adduser USER root
• Join the BOINC group: sudo adduser USER boinc
• Allow group to read, write and execute in /etc/boinc-client folder:
sudo chmod -R g+rwx /etc/boinc-client
• Allow group to read, write and execute in /var/lib/boinc-client:
sudo chmod -R g+rwx /var/lib/boinc-client
• Reboot
• Copy “gui_rpc_auth.cfg” to /etc/boinc-client folder
• Copy “remote_hosts.cfg” to /etc/boinc-client folder
• Copy “cc_config.xml” (if needed) to /etc/boinc-client folder
• Copy "gui_rpc_auth.cfg" to the home directory and reboot
• Copy the "app_config.xml" files (if needed) to the project folders in /var/lib/boinc/
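To confirm that the two chmod steps above took effect, a sketch along these lines could list anything still lacking group read, write and execute. The helper names are made up for illustration; the check simply mirrors what "chmod -R g+rwx" is meant to set and skips symbolic links.

```python
import os
import stat

GROUP_RWX = stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP


def group_rwx(mode):
    """True if the mode bits grant the group read, write and execute."""
    return mode & GROUP_RWX == GROUP_RWX


def missing_group_rwx(top):
    """Return paths under top (including top) that lack group rwx."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(top):
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            st = os.lstat(path)
            if stat.S_ISLNK(st.st_mode):
                continue  # symbolic links have no meaningful mode of their own
            if not group_rwx(st.st_mode):
                bad.append(path)
    return bad
```

Calling missing_group_rwx("/etc/boinc-client") and missing_group_rwx("/var/lib/boinc-client") should both return an empty list once the permissions are as described; you may need sufficient rights to read those folders in the first place.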

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,024,725
RAC: 20,592
Message 57491 - Posted: 16 Dec 2017, 12:56:39 UTC

While the latest BOINC (7.8.3) doesn't crash everything, there are still problems with the Mac and Linux builds. This is the latest update.


Sorry, but not at the moment. We have discovered another rather intractable and not easily identified error with the latest Linux and Mac builds, where results get messed up when the model transitions a year. It appears to be a memory issue related to building 32-bit on a 64-bit machine. I am currently trying to resolve it, but it is proving rather messy. I can probably build (and have built locally) a Linux app on a 32-bit machine and deploy it so that Linux users can crunch. I will look into that on Monday for you.
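For the curious, whether a given Linux executable is a 32-bit or 64-bit build can be read from its ELF header: the fifth byte is 1 for 32-bit and 2 for 64-bit. The Python sketch below shows only that check, with made-up function names. It says nothing about how the CPDN apps are actually built, and it does not apply to Mac binaries, which are not ELF files.

```python
ELF_MAGIC = b"\x7fELF"


def elf_class(header):
    """Return 32 or 64 for an ELF header, or None if it is not ELF."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    # Byte 4 is EI_CLASS: 1 means 32-bit, 2 means 64-bit.
    return {1: 32, 2: 64}.get(header[4])


def file_elf_class(path):
    """Read the first five bytes of a file and classify it."""
    with open(path, "rb") as fh:
        return elf_class(fh.read(5))
```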