Questions and Answers : Unix/Linux : Work units failing
Author | Message |
---|---|
Send message Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
My host recently got some work units. All of them failed within seconds with a segmentation violation. Looking at the other hosts for the same work units shows that most tasks are failing there as well, although the other hosts seem to fail immediately with various other errors. Is there a problem with the recently released work units?
https://www.cpdn.org/cpdnboinc/results.php?hostid=1417684
<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (13 frames):
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x83b4d8f]
[0x2aa03cb0]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8163769]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x81696b4]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x81606cd]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x816add4]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x815f531]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8084b03]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x809404a]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8314d85]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8316ddf]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8341332]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf6)[0x2a7bb276]
Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=5818, selfPID=5738, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
16:20:34 (5738): called boinc_finish(0)
</stderr_txt> |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Knorr - I don't guarantee this, but I think you are missing the 32-bit version of libc.so.6. Check to see if you have the package libc6-i386 installed. If not, do a sudo apt-get install libc6-i386 or use some other method (Synaptic) to install it. Let us know if this solves your issue. |
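A quick way to check the package state before reinstalling anything, sketched under the assumption of a Debian or Ubuntu host where dpkg-query is available. The helper name pkg_installed is made up for this example. On multiarch installs the 32-bit C library may come from libc6:i386 rather than libc6-i386, so the sketch checks both names; which one applies to a given host is an assumption.

```shell
#!/bin/sh
# pkg_installed is a hypothetical helper, not part of BOINC or dpkg.
# It succeeds only when dpkg reports the package as fully installed.
pkg_installed() {
    dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q 'install ok installed'
}

# Package names are assumptions; which one applies depends on the distribution.
for pkg in libc6-i386 libc6:i386; do
    if pkg_installed "$pkg"; then
        echo "$pkg: installed"
    else
        echo "$pkg: not installed"
    fi
done
```

If neither name is reported as installed, the apt-get command suggested above is the next thing to try.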
Send message Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
32-bit libraries are installed. I checked all the climate@home binaries with ldd, and none of them are missing libraries. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Yes, that's a different fault. This one has a counterpart in Windows, but with different wording. I'm not sure of the cause; perhaps a corrupt file, or something locking or using a resource at the very moment that it's needed.
Edit: I've just had a look at your Tasks list; ALL of them have failed, so something is wrong at your end.
1st step: Reset the project to get rid of all of the climate binaries/files, then start again by downloading new copies.
2nd step: If the above doesn't work, then look into permissions for the 2 locations of BOINC. |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Knorr - Do what Les wrote. He knows way more about this stuff than I do. Doing an ldd is not the same as what I wrote in my post for you to do. An ldd shows that, yes, libc.so.6 is installed. But it doesn't show that the correct VERSION (32 bit) is installed. I have been down this path myself. I can't find the solution in my notes, so I am going from memory. That is why I didn't guarantee my suggestion. |
Send message Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
Tried resetting the project. Verified that all files in /var/lib/boinc-client/projects/climateprediction.net were deleted. Requested a new batch of work, same problem.
I'm not sure which second location you are referring to. The permissions of the /var/lib/boinc-client/projects/climateprediction.net files look fine. I have einstein and rosetta running as well, and they have the same owner. |
Send message Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
ldd shows me which file is actually being used by the linker, not only whether it is installed. All libraries reported by ldd are in /lib/i386-linux-gnu:
ldd wah2am3m2_um_8.12_i686-pc-linux-gnu
linux-gate.so.1 => (0xf7743000)
libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xf7711000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xf76bb000)
libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xf769c000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xf74e2000)
/lib/ld-linux.so.2 (0x565c1000)
A "file" on the actual libc shows that it is indeed 32 bit:
/lib/i386-linux-gnu/libc-2.24.so: ELF 32-bit LSB shared object, Intel 80386, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux.so.2, BuildID[sha1]=005209e623ca3b594b1c902c191b148ff2036623, for GNU/Linux 2.6.32, stripped |
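The two checks in this post (does the linker resolve every library, and is the resolved file really 32 bit) can be scripted roughly as follows. This is only a sketch: missing_libs and elf_class are hypothetical helper names, and elf_class assumes its argument is an ELF file without verifying the magic bytes.

```shell
#!/bin/sh
# missing_libs: read ldd output on stdin and print the name of every
# library the dynamic linker could not resolve ("=> not found").
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# elf_class: report the word size of an ELF file.
# The byte at offset 4 (EI_CLASS) is 1 for 32-bit and 2 for 64-bit.
# Assumes the file is ELF; the magic bytes are not checked.
elf_class() {
    case "$(od -An -tu1 -j4 -N1 "$1" | tr -d ' ')" in
        1) echo 32-bit ;;
        2) echo 64-bit ;;
        *) echo unknown ;;
    esac
}

# Example use (paths as quoted in the thread; adjust for your host):
#   cd /var/lib/boinc-client/projects/climateprediction.net
#   ldd wah2am3m2_um_8.12_i686-pc-linux-gnu | missing_libs
#   elf_class /lib/i386-linux-gnu/libc.so.6
```

An empty result from missing_libs together with 32-bit from elf_class matches what the post reports, which points away from a missing or wrong-architecture library.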
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
BOINC is in 2 parts; the second part is /usr/bin according to Installing BOINC on Ubuntu (under: What the installer does).
Since cpdn has become mostly Windows, I've stopped running a Linux version of BOINC, and use Windows running under Wine. So I don't know where the second lot of files is these days, but if you have other projects running OK, then they're most likely to be OK. That leaves the complete failure of cpdn tasks a mystery still to be solved. Perhaps search the net for SIGSEGV: segmentation violation and look for clues.
************
A missing 32-bit lib will fail after about 6 seconds, giving this:
../../projects/climateprediction.net/hadam3prm3pm2t_eu_7.01_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've just started a new batch of 4, one of which is here. This is its last chance; the first attempt failed because of:
../../projects/climateprediction.net/wah2_8.12_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
The 2nd attempt failed because of:
../../projects/climateprediction.net/wah2_8.12_i686-pc-linux-gnu: /opt/McAfee/runtime/2.0/lib/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by ../../projects/climateprediction.net/wah2_8.12_i686-pc-linux-gnu) |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
My ldd looks like this. However, I have a different "gnu" so this may be meaningless.
bob@Tiger4:/var/lib/boinc-client/projects/climateprediction.net$ ldd wah2_8.12_i686-pc-linux-gnu
linux-gate.so.1 => (0xf77b0000)
libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xf7769000)
libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xf7761000)
libstdc++.so.6 => /usr/lib32/libstdc++.so.6 (0xf75e9000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xf7591000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xf7571000)
libc.so.6 => /lib/i386-linux-gnu/
Edit: I know I have had the issue you are having at some point in the past. I just can't remember how I fixed it. I will look into it some more tomorrow. |
Send message Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
BOINC is in 2 parts; the second part is /usr/bin according to Installing BOINC on Ubuntu
Ah, of course. The BOINC binaries are fine. As you mention, I'm able to run other projects. I got a CPDN unit about a month ago. It produced several trickles. https://www.cpdn.org/cpdnboinc/result.php?resultid=20144359
A SEGV is a very generic message. Basically, the application is trying to access an invalid memory address, often by reading or writing through a NULL pointer. A NULL pointer could appear because something is not available on my system, but the stack trace does not really give any clues as to what this could be. Is there anywhere I can see which arguments are passed to the CPDN binary for a certain work unit? I might be able to pick up some more forensics if I can run the application outside BOINC. |
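One generic way to see the arguments of a process while it is still running is to read them from /proc on Linux. The sketch below is not specific to CPDN: show_args is a hypothetical helper name, and the pgrep pattern in the usage comment is only a guess based on the binary name in the stack trace. Since these tasks fail within seconds, the process would have to be caught quickly.

```shell
#!/bin/sh
# show_args: print the argument vector of a running process, one
# argument per line. /proc/<pid>/cmdline holds the arguments separated
# by NUL bytes, so each NUL is translated to a newline.
show_args() {
    tr '\0' '\n' < "/proc/$1/cmdline"
}

# Example use, assuming pgrep is installed and a task is running
# (the pattern is an assumption taken from the binary's file name):
#   for pid in $(pgrep -f wah2am3m2_um); do
#       show_args "$pid"
#       ls -l "/proc/$pid/cwd"    # working directory the task runs in
#   done
```

The working directory matters as much as the arguments here, because the binary is started with relative paths such as ../../projects/climateprediction.net/.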
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Is there anywhere I can see which arguments are passed to the CPDN binary for a certain work unit?
It's not as simple as that. I think that there may be several binaries, and there are also lots of data files (lists) that are accessed by them. Part of the testing is just making sure that the contents of lists match up with what is expected in several places. In Windows, there's ProcMon (Process Monitor), which can be run to see what happens when. But it generates LOTS of data. Another possibility is to look at your list of tasks, and then at the Workunit column, to select ones that have failed before, and see what happened to them. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
According to the list in your first post, the binary in question is the um. And the last line of the stack trace is /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf6)[0x2a7bb276] which is that 32 bit lib that causes problems when missing. I'm wondering if the problem is something to do with stack memory, although I don't see how. There's an article here, but it seems to say that stack size allocation is automatic. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
The only box I have running BOINC natively under nix at present is an old 32 bit machine. Two other boxes are set to no new tasks so I can try and get some Linux work when WINE tasks finish. I have normally run BOINC on nix by extracting the tarball rather than using the packaged version so it may be worth trying that as an alternative. Trouble is by the time my tasks running under WINE are finished there may not be any Linux work available :( |
Send message Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
According to the list in your first post, the binary in question is the um.
The stack trace shows you how we reached the place where we get the SEGV. Basically, there are 13 nested function calls. We can see that we start in __libc_start_main+0xf6, which does not really sound surprising. I think it might tell us that the SEGV happened in the main thread/process of the application. Unfortunately, the debug symbols of the cpdn part seem to be missing, so we get no clues about what the process is doing, other than that it ends up in the function boinc_catch_signal. If I manage to get new units, I will try to stop them before they fail. I might be able to find the command line arguments given, and maybe find out what kind of violation I get. |
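To work with the frames described in this post, the bracketed return addresses can be pulled out of the stderr text and handed to a symbolizer. This is a sketch only: frame_addrs is a hypothetical helper name, and addr2line (from binutils) is assumed to be installed. Without debug symbols in the binary, as suspected above, it would most likely print only placeholders.

```shell
#!/bin/sh
# frame_addrs: read a stack trace on stdin, keep the frames that belong
# to the binary whose name contains $1, and print the bracketed address
# found at the end of each of those lines.
frame_addrs() {
    grep -F "$1" | sed -n 's/.*\[\(0x[0-9a-f]*\)]$/\1/p'
}

# Example use, assuming the stderr text was saved to a file named
# stderr.txt (a made-up name) and addr2line is available:
#   bin=/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu
#   frame_addrs wah2am3m2_um < stderr.txt | while read -r addr; do
#       addr2line -f -e "$bin" "$addr"
#   done
```

Frames with no file name, such as the bare [0x2aa03cb0] entry, are skipped because there is no binary to resolve them against.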
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Knorr - I found the reference to my having the same error as you are having. So, my brain isn't completely dead. There is a thread with the title: Lots of tasks end with "Error While Computing". Is there a problem at my end? which has a last post from me on 06 Aug 2014. If you scroll down to message 49556 (13 July 2014) you will see a post from me which I think is exactly the error you are getting. The "last" (newest) post (49556) states that I re-installed libc6 and the "problem" went away. |
Send message Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
Knorr - Does look like the same problem (at least of the same nature). I have done an apt-get install --reinstall libc6-i386, although I'm not too optimistic. The md5sum of the libc.so.6 file is the same before and after the reinstall. Perhaps a batch of "bad" units was flushed while you were debugging, and the symptom went away. Anyway, there are no more jobs to get right now, so we will see whenever they pick up again. Thanks for your help so far. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
Just to say some more work for nix has been poured into the hopper. |
Send message Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
Got a new batch of work units, which fail in the same way. Although I'm able to determine which arguments are given to the binary, there are a lot of interdependencies going on. I have not had any luck running the application stand-alone. I have disabled the application in my settings for now. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
Have not had any luck to run the application stand-alone.
The term "sucking eggs" probably applies here, but if you are attempting to run the standalone setup, it is worth making sure boinc-client from the packaged version isn't running when you try to start the standalone one, as that seems to stop things working in my experience. What happens when you do try the standalone version? |
©2024 cpdn.org