Message boards : Number crunching : Replanca Error/Sigseg fault.
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
I notice that a retread I was crunching failed on my box at a time close enough to the earlier failure to make me wonder whether it was more than coincidence. The other machine had a noticeably shorter time but was faster. The Windows machine which had the first go had a replanca error, whereas my Linux box failed with a SIGSEGV fault. I did have a reboot shortly before the failure, but the task was still running three minutes after I resumed computation. The work unit is https://www.cpdn.org/cpdnboinc/workunit.php?wuid=10824511 and the task is now in the last chance saloon on another Windows machine. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,714,303 RAC: 6,015 |
All 3 WUs from batch 567 that I got on my Linux boxes failed with SIGSEGV: segmentation violation, e.g. this one |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,714,303 RAC: 6,015 |
Two of the WUS25 under Linux batch 583 crashed with: SIGSEGV: segmentation violation ..... ..... Exiting... Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6999, iMonCtr=2 Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=6859, iMonCtr=2 Model crash detected, will try to restart... Leaving CPDN_ain::Monitor... Calling boinc_finish...03:09:40 (6859): called boinc_finish(0) In boinc_exit called with status 0 Calloing set_signal_exit_code with status 0. They seem to be unsent after the crash. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
"they seem to be unsent after the crash" Do the re-issues only appear under the work unit once they have been sent? I am not sure whether they go out next or go to the back of the queue, behind the 10,000-odd tasks still waiting to be sent. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,714,303 RAC: 6,015 |
"they seem to be unsent after the crash" These two in particular are now re-issued. One failed on Darwin and is in progress on a Windows machine; the other is in progress on a Windows machine. I thought applications would run on a single platform only, or did I misunderstand the info? |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"Two of the WUS25 under Linux batch 583 crashed with" I've had all three batch 583 tasks that made it to the third trickle crash with SIGSEGV on my Linux box. It appears to be a Linux app problem on this particular batch. Batch 583 tasks under Windows have made it well past that point. The output was: SIGSEGV: segmentation violation Stack trace (12 frames): /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x839e357] [0x2a9e3ca0] /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x81443f4] /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814b133] /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8141220] /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x813ff46] /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8077583] /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x831cd74] /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8330985] /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x833318a] /home/gdp/BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8334c8d] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf7)[0x2a7ad637] Exiting... Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=11748, iMonCtr=2 Model crash detected, will try to restart... Leaving CPDN_ain::Monitor... Calling boinc_finish...08:10:00 (11748): called boinc_finish(0) In boinc_exit called with status 0 Calloing set_signal_exit_code with status 0 |
Send message Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0 |
I see the same issue with the segfaults: a 100% failure rate on WUs from batch 583, all after about the same amount of CPU time (so this doesn't look random at all). Looking at my wing men, I could not find a single one where the task finished OK (including a few Windows computers). These statistics are based on 14 failed WUs, several of which are already counted out with 3 failures. With quite a few of my wing men I also saw this error: ../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
"With quite a few of my wing men I also saw this error" That is an unrelated problem. Those of us who use Linux have to (in most cases) manually install some 32-bit libraries in order to get the executables to run. The SIGSEGV problem has been reported to the project. I would guess, however, that the disk taking uploads will be the first problem to be addressed. |
Send message Joined: 6 Aug 04 Posts: 124 Credit: 9,195,838 RAC: 0 |
Can confirm 583 does three trickles and then segfaults. https://www.cpdn.org/cpdnboinc/result.php?resultid=20467695 https://www.cpdn.org/cpdnboinc/result.php?resultid=20467661 https://www.cpdn.org/cpdnboinc/result.php?resultid=20467816 https://www.cpdn.org/cpdnboinc/result.php?resultid=20467712 https://www.cpdn.org/cpdnboinc/result.php?resultid=20467912 https://www.cpdn.org/cpdnboinc/result.php?resultid=20467799 https://www.cpdn.org/cpdnboinc/result.php?resultid=20467522 They all went to the exact same point of 1,539.60 credits. Linux Users Everywhere @ BOINC |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Will email project to ask if those of us with Linux boxes should abort tasks from batch 583. Certainly thinking about doing this on my three boxes. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Yes, if running Linux, please abort these tasks, as they will crash before the 4th zip is created. Oxford are trying to track down the root of the problem. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
And it seems those running Darwin should also abort tasks from this batch. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,714,303 RAC: 6,015 |
This one also failed on WIN and when my Ubuntu got the last reissue I simply aborted as advised. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
"This one also failed on WIN" However, that computer seems to be failing over 80% of the tasks thrown at it, so it is possibly not a good measure of the reliability of this batch of tasks. |
Send message Joined: 7 May 17 Posts: 16 Credit: 3,480,030 RAC: 2,845 |
How can you tell what batch a task is from? |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"How can you tell what batch a task is from?" Take wah2_eu50r_mzhp_20174_3_569_011014676_1 as an example: the batch is the number that precedes the long string of digits near the end, so here the batch is 569. The number preceding the batch number is the number of model months the task has, so it's a 3 model month task. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,373,077 RAC: 15,530 |
Batch 583 are 25-month models. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Yeah. I was just using that task as an example of how to decode the batch and run length from the task name. But I can see how someone could think my post related to the crash of batch 583 tasks after 3 months/trickles. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,884,997 RAC: 4,577 |
[Dave Jackson wrote:]And it seems those running Darwin should also abort tasks from this batch. Confirmed on my Mac - SIGSEGV after three trickles. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
When you get this: "With quite a few of my wing men I also saw this error" it means you are running a 32-bit executable on a 64-bit version of Linux that lacks the 32-bit libraries. The dynamic loader cannot find a 32-bit libstdc++.so.6, so the program fails before it starts running. Install those libraries and you should be OK. On my Red Hat Enterprise Linux 6.9 system, they are in: $ rpm -qf libstdc++.so.6 libstdc++-4.4.7-18.el6.i686 <---<<< $ locate libstdc++.so.6 /usr/lib/libstdc++.so.6 /usr/lib/libstdc++.so.6.0.13 /usr/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6.0.13 $ ls -l libstdc++* lrwxrwxrwx. 1 root root 19 Mar 21 08:37 libstdc++.so.6 -> libstdc++.so.6.0.13 -rwxr-xr-x. 1 root root 930192 Oct 18 2016 libstdc++.so.6.0.13 |
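For anyone who wants to pull the batch number out of task names automatically, here is a rough Python sketch of the decoding rule described earlier in the thread. It assumes the name has the same layout as the example quoted there, with the fields separated by underscores and ending in model months, batch, a long number, and one trailing field. The names `decode_task_name` and `TaskInfo` are made up for this illustration and are not part of BOINC or the CPDN software.

```python
from typing import NamedTuple


class TaskInfo(NamedTuple):
    """Run length and batch decoded from a task name (illustrative type)."""
    model_months: int
    batch: int


def decode_task_name(task_name: str) -> TaskInfo:
    """Extract model months and batch from a task name.

    Assumption: the name ends in ..._<months>_<batch>_<long number>_<suffix>,
    as in the example quoted in the thread.
    """
    fields = task_name.split("_")
    if len(fields) < 4:
        raise ValueError(f"unexpected task name layout: {task_name!r}")
    # Counting back from the end: suffix, long number, batch, model months.
    months, batch = fields[-4], fields[-3]
    if not (months.isdigit() and batch.isdigit()):
        raise ValueError(f"months/batch fields are not numeric: {task_name!r}")
    return TaskInfo(model_months=int(months), batch=int(batch))


if __name__ == "__main__":
    info = decode_task_name("wah2_eu50r_mzhp_20174_3_569_011014676_1")
    print(f"batch {info.batch}, {info.model_months} model months")
```

Counting fields from the end rather than the start means a different number of leading fields would not break it, but a name without the trailing field would be decoded wrongly, so treat the result as a hint and check it against the task page.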
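If you are unsure whether a 32-bit copy of libstdc++.so.6 is actually present, one way to check without any distribution-specific tools is to read the ELF header of each candidate file: the fifth byte of an ELF file is 1 for 32-bit and 2 for 64-bit. The Python sketch below does that. The two paths are simply the ones from the Red Hat listing above and are an assumption for any other system (Debian-style systems, for instance, keep 32-bit libraries under /lib/i386-linux-gnu, as the stack trace earlier in the thread shows). The function names are invented for this example.

```python
from pathlib import Path

ELF_MAGIC = b"\x7fELF"
# Byte 4 of the ELF identification header (EI_CLASS): 1 = 32-bit, 2 = 64-bit.
ELF_CLASS_NAMES = {1: "32-bit", 2: "64-bit"}


def elf_class(header: bytes) -> str:
    """Classify a file from its first five bytes."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return "not ELF"
    return ELF_CLASS_NAMES.get(header[4], "unknown")


def report(paths):
    """Print the ELF class of each path, or note that it is missing."""
    for name in paths:
        path = Path(name)
        if not path.is_file():  # follows symlinks such as libstdc++.so.6
            print(f"{name}: missing")
            continue
        with path.open("rb") as handle:
            print(f"{name}: {elf_class(handle.read(5))}")


if __name__ == "__main__":
    # Assumed locations, taken from the RHEL 6.9 listing in the post above.
    report(["/usr/lib/libstdc++.so.6", "/usr/lib64/libstdc++.so.6"])
```

If none of the candidates reports 32-bit, the 32-bit package still needs installing before the i686 executables can start.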
©2024 cpdn.org