Questions and Answers : Unix/Linux : Work units failing
Author | Message |
---|---|
Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
> What happens when you do try the standalone version?

It stops running instantly:

    ./wah2_8.12_i686-pc-linux-gnu wah2_sas50_mg69_198712_13_535_010968364 restart_atmos_s035_1995-1201_rd0001 restart_sas50_s035_1995-1201_rd0001 ic00000000_10_N96 GHGclim_ancil_14months_OSTIA_sst_v2_MIROC-ESM_1987-12-01_1989-01-30 ALLclim_ancil_14months_OSTIA_ice_1987-12-01_1989-01-30 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
    15:17:43 (9957): Can't open init data file - running in standalone mode
    Starting work on result cpdnou...
    Starting model in ...
    Cleaning up from the run...
    Application name and command line arguments retrieved from client_state.xml

Running strace reveals that the init file it is talking about is an init_data.xml BOINC file. |
Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
Seems I'm able to at least start the model standalone now. I let a workunit start in BOINC, suspending it before it failed. I then closed the BOINC client and started the application from the slots directory:

    ../../projects/climateprediction.net/wah2_8.12_i686-pc-linux-gnu wah2_sas50_mnp0_201312_13_535_010978111 restart_atmos_s049_1998-1201_rd0001 restart_sas50_s049_1998-1201_rd0001 ic00000000_10_N96 GHGclim_ancil_14months_OSTIA_sst_v2_IPSL-CM5A-LR_2013-12-01_2015-01-30 ALLclim_ancil_14months_OSTIA_ice_2013-12-01_2015-01-30 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
    Starting work on result wah2_sas50_mnp0_201312_13_535_010978111_0_r321574378...
    Starting model in /var/lib/boinc-client/projects/climateprediction.net...
    Unzipping global model data file...
    Unzipping workunit control files...
    Trying to kill old process # 28144
    Trying to kill old process # 28168
    Trying to kill old process # 28169
    Created shared memory region key = 228135 of size 73297376 bytes (version 608)
    Run for 1 Years and 1 Months
    Uploading restart files after month 12
    pShMem->PRECIS_LATITUDE 146
    pShMem->PRECIS_LONGITUDE 209
    pShMem->EWSPACEA 0.440000
    pShMem->NSSPACEA 0.440000
    pShMem->FRSTLATA 38.720001
    pShMem->FRSTLONA 324.359985
    pShMem->POLELATA 79.949997
    pShMem->POLELONA 236.679993
    Copying files for startup...
    In pre_initialise_phase (part 1 of 3)
    In initialise_phase (part 2 of 3)
    In startup_phase (part 3 of 3)
    Copying model control files...
    Starting model ID wah2_sas50_mnp0_201312_13_535_010978111 Phase 1
    Program launched with process id # 28345
    Program launched with process id # 28346
    Climate model starting - use graphics to monitor progress.
    Or visit the website to see the graphs for this run.
    Getting pthread attributes - retval=0
    Setting pthread size (-1778384896 bytes) - retval=0
    Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu
    Getting pthread attributes - retval=0
    Setting pthread size (-1778384896 bytes) - retval=0
    Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu
    wah2_sas50_mnp0_201312_13_535_010978111 - PH 1 TS 0000001 A - 01/12/2013 00:15 - H:M:S=0000:00:00 AVG= 0.25 DLT= 0.25
    Detaching shared memory...
    Done

I can rerun the application; it gives slightly varying results each time. Sometimes it says that the model crashed; other times it prints two progress lines before saying done. |
Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Knorr - When I look at /lib/i386-linux-gnu/libc.so.6 in the File Manager I see: 1.8 MB and (Type) Link to Unknown.

When I look at the Properties of libc.so.6 I see:

    Type: Link to shared library (application/x-sharedlib)
    Link Target: libc-2.23.so
    Size: 1.8 MB (1,786,484 bytes)

There is a file libc-2.23.so in this directory. It shows: 1.8 MB and (Type) Unknown. The Properties are:

    Type: shared library (application/x-sharedlib)
    Size: 1.8 MB (1,786,484 bytes)

Is there anything here that is different from yours that may be significant? |
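The same link check can be done without a file manager. The following is only a minimal sketch in Python, assuming the library path quoted in the post above; `describe_link` is a hypothetical helper name and not part of BOINC or the CPDN application.

```python
import os

def describe_link(path):
    """Return (is_symlink, resolved_target, size_in_bytes) for a library path."""
    is_link = os.path.islink(path)
    target = os.path.realpath(path)   # follows every symlink in the chain
    size = os.path.getsize(target)    # size of the real file, not of the link
    return is_link, target, size

if __name__ == "__main__":
    # Path taken from the thread; it may not exist on other distributions.
    libc = "/lib/i386-linux-gnu/libc.so.6"
    if os.path.lexists(libc):
        print(describe_link(libc))
    else:
        print(libc, "not found on this system")
```

On a machine like the ones discussed here, the resolved target would be expected to end in libc-2.23.so or libc-2.24.so, which is the difference being compared.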
Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
Knorr - My system is running libc-2.24.so instead. Otherwise, the locations and links are the same. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Is libc-2.24 from libc.so.6 or libc.so.8, which is a later version and won't work? PS: While you're testing this, it would be best if you set your percentage of processors to only get one task. |
Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
> Is libc-2.24 from libc.so.6 or libc.so.8, which is a later version and won't work.

    ls -la /lib/i386-linux-gnu/libc.so.6
    lrwxrwxrwx 1 root root 12 Nov 16 21:53 /lib/i386-linux-gnu/libc.so.6 -> libc-2.24.so

I just noticed this bit in a strace. It is from the process that runs the wah2am3m2_um_8.12_i686-pc-linux-gnu application (which is the binary where the SEGV happens). Notice that we have a system call that sets the stack limit to a very large value, which is larger than 32 bits. I don't know if this is a problem, but it does look a bit suspicious.

    set_robust_list(0xf73841f0, 12) = 0
    prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
    prlimit64(0, RLIMIT_STACK, {rlim_cur=18014398507745280*1024, rlim_max=RLIM64_INFINITY}, NULL) = 0
    prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=18014398507745280*1024, rlim_max=RLIM64_INFINITY}) = 0
    write(1, "Getting pthread attributes - retval=0\nSetting pthread size (-1778384896 bytes) - retval=0\nExecuting program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu\n", 197) = 197
|
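The two suspicious numbers in that trace can be compared directly. The sketch below is plain arithmetic on the values quoted in the post; it does not show what the application does internally, and `as_unsigned` is a hypothetical helper name.

```python
# Values copied from the strace and stdout lines in the post above.
RLIM_CUR_KIB = 18014398507745280   # strace prints rlim_cur as "<n>*1024"
PRINTED_SIZE = -1778384896         # "Setting pthread size (... bytes)"

def as_unsigned(value, bits):
    """Reinterpret a (possibly negative) integer as unsigned at the given width."""
    return value & ((1 << bits) - 1)

rlim_cur_bytes = RLIM_CUR_KIB * 1024

# Compare the limit passed to prlimit64 with the printed size,
# sign-extended to 64 bits and read as unsigned.
print(rlim_cur_bytes == as_unsigned(PRINTED_SIZE, 64))
```

The comparison prints True, so the value passed to prlimit64 is bit-for-bit the negative "pthread size" widened to 64 bits. That is consistent with the limit having been derived from the same negative number, though the trace alone does not prove how it was computed.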
Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
If I run the application in GDB with set follow-fork-mode child, I get a SEGV at

    Thread 2.1 "wah2am3m2_um_8." received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x2a821700 (LWP 13991)]
    0x08163769 in gw_satn ()

Disassembling wah2am3m2_um_8.12_i686-pc-linux-gnu, I get the following at that location (which indeed is in the gw_satn symbol):

    8163754: 0f af 8d fc fe ff ff    imul   -0x104(%ebp),%ecx
    816375b: 8b b5 60 fe ff ff       mov    -0x1a0(%ebp),%esi
    8163761: 8b bd 94 fe ff ff       mov    -0x16c(%ebp),%edi
    8163767: 03 f1                   add    %ecx,%esi
    8163769: f3 0f 10 1c 96          movss  (%esi,%edx,4),%xmm3
    816376e: 0f 28 e3                movaps %xmm3,%xmm4
    8163771: f3 0f 59 e6             mulss  %xmm6,%xmm4
    8163775: 8d 34 0f                lea    (%edi,%ecx,1),%esi

The register content is listed below:

    (gdb) info registers
    eax     0xfdb37e30   -38568400
    ecx     0xac000000   -1409286144
    edx     0x7d         125
    ebx     0xfdb37d9c   -38568548
    esp     0xfdad3d70   0xfdad3d70
    ebp     0xfdb37d88   0xfdb37d88
    esi     0xa9f45c80   -1443603328
    edi     0xfdf3cb30   -34354384
    eip     0x8163769    0x8163769 <gw_satn+825>
    eflags  0x10283      [ CF SF IF RF ]
    cs      0x23         35
    ss      0x2b         43
    ds      0x2b         43
    es      0x2b         43
    fs      0x0          0
    gs      0x63         99
|
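The address that the faulting movss tried to read can be worked out from the register dump. This is only a sketch: it assumes the registers are exactly as captured at the fault, meaning the add at 0x8163767 has already run and ecx has not changed since the imul. `effective_address` is a hypothetical helper name.

```python
MASK32 = 0xFFFFFFFF

# Register values copied from the "info registers" output above.
esi = 0xA9F45C80   # already includes the "add %ecx,%esi" at 0x8163767
ecx = 0xAC000000   # result of the imul at 0x8163754
edx = 0x7D

def effective_address(base, index, scale):
    """Address of an operand written as (base,index,scale) in 32-bit mode."""
    return (base + index * scale) & MASK32

fault_addr = effective_address(esi, edx, 4)   # movss (%esi,%edx,4),%xmm3
base_before_add = (esi - ecx) & MASK32        # value loaded from -0x1a0(%ebp)

print(hex(fault_addr))        # 0xa9f45e74
print(hex(base_before_add))   # 0xfdf45c80
```

Under those assumptions the pointer loaded from -0x1a0(%ebp) was 0xfdf45c80, close to the other pointers in the dump (edi is 0xfdf3cb30), while the offset added from ecx is 0xac000000. That may point at the offset rather than the base as the unusual value, but it is only one reading of the dump.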
Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Is it significant that I have libc-2.23 (which is working) and Knorr has libc-2.24? Edit: I should write libc-2.23.so and libc-2.24.so. Edit: I just checked my other two Ubuntu 16.04 64-bit machines and they both have libc-2.23.so. Edit: I think I saw Knorr is using Ubuntu 16.10. I am using Ubuntu 16.04. Could different (later) libraries be an issue? |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've just loaded BOINC on to an old IBM laptop with barely enough memory for 1 task. It's running Linux Mint 17 (32-bit), and the file is libc-2.19.so. The task has been running for 20 minutes, but none of this may help. |
Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
I notice that even running the standalone version, your boinc-client seems to be from the packaged version:

    Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu

I would run top and then use kill PIDno before starting BOINC Manager. I have run into this problem before when the packaged version is installed. Where did you install the standalone version? Starting it from within the directory where it is installed should ensure that the client that goes with your boinc-manager runs instead. |
Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
> I notice that even running the standalone version your boinc-client seems to be from the packaged version

By standalone I meant the CPDN binary itself, outside any BOINC context. It seemed like the easiest way to be able to run e.g. GDB/Valgrind on the application. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've sent this "upstairs", so let's see what happens. |
Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
Have stopped WINE and am going to see if it still works for me in 16.10 on my desktop. I know it works on my old 32-bit netbook. EDIT: Three minutes in and still running. So it isn't a problem with Ubuntu 16.10 (packaged version). |
Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
> have stopped WINE and am going to see if it still works for me in 16.10 on my desktop. I know it works on my old 32bit netbook.

I don't know if you can find stdout for the application anywhere, but if you can, do you also get these messages?

    Setting pthread size (-1778384896 bytes)

EDIT: Alternatively, you can use top/ps to find the PID of the wah2am3m2_um_8.12_i686-pc-linux-gnu application, and then do

    cat /proc/<pid>/limits
|
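For anyone who prefers to script the /proc check, here is a rough Python sketch that pulls only the "Max stack size" row out of a limits file. The function names are hypothetical. The row is matched by its label rather than by column spacing, on the assumption that a very large value can leave only a single space between columns.

```python
import os
import re

# Match the row by its label, then take the next three whitespace-separated fields.
_STACK_ROW = re.compile(r"^Max stack size\s+(\S+)\s+(\S+)\s+(\S+)", re.MULTILINE)

def stack_limit_from_text(text):
    """Return (soft, hard, units) from the text of a /proc/<pid>/limits file."""
    match = _STACK_ROW.search(text)
    if match is None:
        raise ValueError("no 'Max stack size' row found")
    return match.groups()

def stack_limit(pid):
    """Read the stack limit of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return stack_limit_from_text(handle.read())

if __name__ == "__main__":
    own = f"/proc/{os.getpid()}/limits"
    if os.path.exists(own):
        print(stack_limit(os.getpid()))   # limits of this Python process
    else:
        print("no /proc limits file on this system")
```

To inspect the model instead, pass the PID found with top or ps to `stack_limit`.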
Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
Now over an hour in and a second is 12 minutes in. Just waiting to see if everything happens later as it should now but that is a few days away. |
Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
Looking at my wingmen on this box, now running Nix, I see, as well as the usual crop of missing libs, at least two Linux boxes which are crashing everything with a "failed to create thread" error. One is crashing things out at the start and the other about half an hour in, from memory. Don't know if this is related or not. My problems with Linux tasks have all been either missing libs or crashing on restart after reboot or power failure etc. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Knorr The latest that I've been able to get is Linux 3.16.0-38-generic, and BOINC 7.2.42 from Ubuntu. 7.2.42 is also the latest stable release from the BOINC site. Where did you get your OS and BOINC? |
Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
> The latest that I've been able to get is Linux 3.16.0-38-generic, and BOINC 7.2.42 from Ubuntu. 7.2.42 is also the latest stable release from the BOINC site.

I am currently running the following, all from the Xubuntu 16.10 repositories:

    Operating System: Linux 4.8.0-40-generic
    BOINC version: 7.6.33

Edit: the upgrade from 38 to 40 on the kernel was only done yesterday on this box.
Edit 2: 16.10 has had kernel 4.8.x since I installed it last year. |
Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
Knorr As Dave mentioned, this is standard on *ubuntu 16.10. BOINC is in the universe repository: http://packages.ubuntu.com/yakkety/boinc Technically, I'm running Ubuntu Gnome (https://ubuntugnome.org/), but that should not affect any of this. It is just Ubuntu using Gnome3 instead of Unity for the desktop environment. Packages come from the same repositories as the "official" Canonical Ubuntu. BOINC was shipped at version 7.2.42 in Ubuntu 14.04. |
Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
Dave, were you able to get the output from above? It will list "Max stack size". For me it has a soft limit that looks like an integer overflow. It corresponds to the -1778384896 bytes interpreted as an unsigned integer. |
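The overflow reading can be checked with a few lines of arithmetic. This sketch assumes the size was held in a signed 32-bit integer, which the thread does not confirm; the helper names are hypothetical.

```python
PRINTED_SIZE = -1778384896   # from "Setting pthread size (-1778384896 bytes)"
MIB = 1024 * 1024

def to_unsigned32(value):
    """Read a value as an unsigned 32-bit integer."""
    return value & 0xFFFFFFFF

def to_signed32(value):
    """Read a value as a signed 32-bit (two's complement) integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value

size = to_unsigned32(PRINTED_SIZE)
print(size)                # 2516582400
print(divmod(size, MIB))   # (2400, 0)
print(to_signed32(size))   # -1778384896
```

Read as unsigned 32-bit, the printed value is exactly 2400 MiB, which is larger than a signed 32-bit integer can hold. That fits the integer-overflow suspicion, although the intended stack size is an inference from the numbers and not something stated in the program output.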
©2024 cpdn.org