Thread 'Work units failing'

Knorr

Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55823 - Posted: 28 Feb 2017, 14:20:27 UTC - in response to Message 55822.  

What happens when you do try the standalone version?


It stops running instantly

./wah2_8.12_i686-pc-linux-gnu wah2_sas50_mg69_198712_13_535_010968364 restart_atmos_s035_1995-1201_rd0001 restart_sas50_s035_1995-1201_rd0001 ic00000000_10_N96 GHGclim_ancil_14months_OSTIA_sst_v2_MIROC-ESM_1987-12-01_1989-01-30 ALLclim_ancil_14months_OSTIA_ice_1987-12-01_1989-01-30 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
15:17:43 (9957): Can't open init data file - running in standalone mode
Starting work on result cpdnou...
Starting model in ...
Cleaning up from the run...


Application name and command line arguments retrieved from client_state.xml.

Running strace reveals that the init file it refers to is BOINC's init_data.xml file.

Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55824 - Posted: 28 Feb 2017, 14:46:54 UTC - in response to Message 55823.  

Seems I'm able to at least start the model standalone now.
I let a workunit start in BOINC and suspended it before it failed. I then closed the BOINC client and started the application from the slots directory:

../../projects/climateprediction.net/wah2_8.12_i686-pc-linux-gnu wah2_sas50_mnp0_201312_13_535_010978111 restart_atmos_s049_1998-1201_rd0001 restart_sas50_s049_1998-1201_rd0001 ic00000000_10_N96 GHGclim_ancil_14months_OSTIA_sst_v2_IPSL-CM5A-LR_2013-12-01_2015-01-30 ALLclim_ancil_14months_OSTIA_ice_2013-12-01_2015-01-30 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
Starting work on result wah2_sas50_mnp0_201312_13_535_010978111_0_r321574378...
Starting model in /var/lib/boinc-client/projects/climateprediction.net...
Unzipping global model data file...
Unzipping workunit control files...
Trying to kill old process # 28144
Trying to kill old process # 28168
Trying to kill old process # 28169
Created shared memory region key = 228135 of size 73297376 bytes (version 608)
Run for 1 Years and 1 Months
Uploading restart files after month 12
pShMem->PRECIS_LATITUDE 146
pShMem->PRECIS_LONGITUDE 209
pShMem->EWSPACEA 0.440000
pShMem->NSSPACEA 0.440000
pShMem->FRSTLATA 38.720001
pShMem->FRSTLONA 324.359985
pShMem->POLELATA 79.949997
pShMem->POLELONA 236.679993
Copying files for startup...
In pre_initialise_phase (part 1 of 3)
In initialise_phase (part 2 of 3)
In startup_phase (part 3 of 3)
Copying model control files...
Starting model ID wah2_sas50_mnp0_201312_13_535_010978111   Phase 1
Program launched with process id # 28345
Program launched with process id # 28346
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
Getting pthread attributes - retval=0
Setting pthread size (-1778384896 bytes) - retval=0
Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu
Getting pthread attributes - retval=0
Setting pthread size (-1778384896 bytes) - retval=0
Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu
wah2_sas50_mnp0_201312_13_535_010978111 - PH 1 TS 0000001 A - 01/12/2013 00:15 - H:M:S=0000:00:00 AVG= 0.25 DLT= 0.25
Detaching shared memory...
Done


I can rerun the application, and it gives slightly different results each time. Sometimes it says that the model crashed; other times it prints two progress lines before saying "Done".

WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 55825 - Posted: 28 Feb 2017, 14:48:53 UTC

Knorr -

When I look at /lib/i386-linux-gnu/libc.so.6 in the File Manager I see:

1.8 MB and (Type) Link to Unknown

When I look at the Properties of libc.so.6 I see:

Type Link to shared library (application/x-sharedlib)
Link Target libc-2.23.so
Size 1.8 MB (1,786,484 bytes)


There is a file libc-2.23.so in this directory. It shows:

1.8 MB and (Type) Unknown

The Properties are:

Type: shared library (application/x-sharedlib)
Size: 1.8 MB (1,786,484 bytes)


Is there anything here that is different from yours that may be significant?

Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55826 - Posted: 28 Feb 2017, 15:12:25 UTC - in response to Message 55825.  

My system is running libc-2.24.so instead. Otherwise, the locations and links are the same.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55827 - Posted: 28 Feb 2017, 15:30:41 UTC - in response to Message 55826.  
Last modified: 28 Feb 2017, 15:33:06 UTC

Is libc-2.24 from libc.so.6 or from libc.so.8, which is a later version and won't work?

PS
While you're testing this, it would be best if you set your percentage of processors to only get one task.

Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55828 - Posted: 28 Feb 2017, 15:54:57 UTC - in response to Message 55827.  

Is libc-2.24 from libc.so.6 or from libc.so.8, which is a later version and won't work?


ls -la /lib/i386-linux-gnu/libc.so.6
lrwxrwxrwx 1 root root 12 Nov 16 21:53 /lib/i386-linux-gnu/libc.so.6 -> libc-2.24.so
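
The same check can be scripted, which is handy when comparing several machines. This is only a minimal Python sketch: it assumes the Debian/Ubuntu multiarch path shown above, and the helper name `describe_libc_link` is made up for illustration.

```python
import os


def describe_libc_link(path):
    """Return (link_target, resolved_path) for a shared-library path.

    link_target is None when `path` is a regular file rather than a symlink.
    """
    target = os.readlink(path) if os.path.islink(path) else None
    return target, os.path.realpath(path)


if __name__ == "__main__":
    # Path taken from the thread; it may not exist on other distributions.
    candidate = "/lib/i386-linux-gnu/libc.so.6"
    if os.path.exists(candidate):
        print(describe_libc_link(candidate))
    else:
        print(candidate, "not found on this system")
```

On a system laid out like the one above, the first element of the result would be the link target (for example `libc-2.24.so`).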


I just noticed this bit in a strace. It is from the process that runs the wah2am3m2_um_8.12_i686-pc-linux-gnu application (which is the binary where the SEGV happens).

Notice the system call that sets the stack limit to a very large value, one that does not fit in 32 bits. I don't know if this is a problem, but it does look suspicious.

set_robust_list(0xf73841f0, 12)         = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
prlimit64(0, RLIMIT_STACK, {rlim_cur=18014398507745280*1024, rlim_max=RLIM64_INFINITY}, NULL) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=18014398507745280*1024, rlim_max=RLIM64_INFINITY}) = 0
write(1, "Getting pthread attributes - retval=0\nSetting pthread size (-1778384896 bytes) - retval=0\nExecuting program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu\n", 197) = 197
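
As a sanity check on the numbers in that trace, the `rlim_cur` value is exactly what you get if the negative pthread size printed by the application is sign-extended to 64 bits. A short Python sketch of the arithmetic follows; both constants are copied from the output above, and the helper `as_unsigned` exists only for illustration.

```python
# Values copied from the log and strace output in this thread.
PTHREAD_SIZE_PRINTED = -1778384896           # "Setting pthread size (... bytes)"
STRACE_RLIM_CUR = 18014398507745280 * 1024   # rlim_cur in the prlimit64 call


def as_unsigned(value, bits):
    """Reinterpret a two's-complement integer as an unsigned `bits`-wide value."""
    return value & ((1 << bits) - 1)


sign_extended_64 = as_unsigned(PTHREAD_SIZE_PRINTED, 64)
print(sign_extended_64)                      # 18446744071931166720
print(sign_extended_64 == STRACE_RLIM_CUR)   # True
```

This only shows that the two numbers are the same bit pattern; it does not by itself show where in the application the negative value comes from.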


Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55829 - Posted: 28 Feb 2017, 16:44:57 UTC - in response to Message 55828.  

If I run the application in GDB with
set follow-fork-mode child
I get a SEGV at


Thread 2.1 "wah2am3m2_um_8." received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2a821700 (LWP 13991)]
0x08163769 in gw_satn ()

Disassembling wah2am3m2_um_8.12_i686-pc-linux-gnu, I get the following at that location (which is indeed inside the gw_satn symbol):


8163754: 0f af 8d fc fe ff ff imul -0x104(%ebp),%ecx
816375b: 8b b5 60 fe ff ff mov -0x1a0(%ebp),%esi
8163761: 8b bd 94 fe ff ff mov -0x16c(%ebp),%edi
8163767: 03 f1 add %ecx,%esi
8163769: f3 0f 10 1c 96 movss (%esi,%edx,4),%xmm3
816376e: 0f 28 e3 movaps %xmm3,%xmm4
8163771: f3 0f 59 e6 mulss %xmm6,%xmm4
8163775: 8d 34 0f lea (%edi,%ecx,1),%esi


The register contents are listed below:

(gdb) info registers
eax            0xfdb37e30	-38568400
ecx            0xac000000	-1409286144
edx            0x7d	125
ebx            0xfdb37d9c	-38568548
esp            0xfdad3d70	0xfdad3d70
ebp            0xfdb37d88	0xfdb37d88
esi            0xa9f45c80	-1443603328
edi            0xfdf3cb30	-34354384
eip            0x8163769	0x8163769 <gw_satn+825>
eflags         0x10283	[ CF SF IF RF ]
cs             0x23	35
ss             0x2b	43
ds             0x2b	43
es             0x2b	43
fs             0x0	0
gs             0x63	99
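
For reference, the address touched by the faulting `movss (%esi,%edx,4),%xmm3` can be worked out from the register dump. The sketch below is plain 32-bit arithmetic on the values shown above; it also undoes the preceding `add %ecx,%esi` to recover the pointer that was loaded from `-0x1a0(%ebp)`. Whether the large offset in `ecx` is intentional is not something the dump can tell us.

```python
MASK32 = 0xFFFFFFFF

# Register values copied from the `info registers` output above.
esi = 0xA9F45C80
edx = 0x7D
ecx = 0xAC000000


def effective_address(base, index, scale):
    """Address of an x86 (base,index,scale) operand, wrapped to 32 bits."""
    return (base + index * scale) & MASK32


fault_addr = effective_address(esi, edx, 4)

# `add %ecx,%esi` ran just before the fault, so subtracting ecx gives the
# value that `mov -0x1a0(%ebp),%esi` had loaded.
esi_before_add = (esi - ecx) & MASK32

print(hex(fault_addr))       # 0xa9f45e74
print(hex(esi_before_add))   # 0xfdf45c80
```

The recovered pointer lies in the same 0xfd... range as `ebp`, `esp` and `edi`, while the final address is far below `esp`.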


WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 55830 - Posted: 28 Feb 2017, 16:52:29 UTC
Last modified: 28 Feb 2017, 17:20:45 UTC

Is it significant that I have libc-2.23 (which is working) and Knorr has libc-2.24?

Edit: I should have written libc-2.23.so and libc-2.24.so.

Edit: I just checked my other two Ubuntu 16.04 64-bit machines and they both have libc-2.23.so.

Edit: I think I saw Knorr is using Ubuntu 16.10. I am using Ubuntu 16.04. Could different (later) libraries be an issue?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55832 - Posted: 1 Mar 2017, 7:14:17 UTC

I've just loaded BOINC on to an old IBM laptop with barely enough memory for 1 task.
It's running Linux Mint 17 (32 bits), and the file is libc-2.19.so
The task has been running for 20 minutes, but none of this may help.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 55833 - Posted: 1 Mar 2017, 8:39:52 UTC

I notice that even running the standalone version your boinc-client seems to be from the packaged version

Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu


I would run "top" and then use kill PIDno before starting BOINC Manager. I have run into this problem before when the packaged version is installed.

Where did you install the standalone version? Starting it from within the directory where it is installed should ensure that the client that goes with your boinc-manager runs instead.

Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55834 - Posted: 1 Mar 2017, 8:43:08 UTC - in response to Message 55833.  

By standalone I meant the CPDN binary itself, outside any BOINC context. That seemed like the easiest way to run tools such as GDB or Valgrind on the application.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55836 - Posted: 1 Mar 2017, 10:04:36 UTC

I've sent this "upstairs", so let's see what happens.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 55837 - Posted: 1 Mar 2017, 12:29:37 UTC
Last modified: 1 Mar 2017, 12:40:42 UTC

I have stopped WINE and am going to see if it still works for me in 16.10 on my desktop. I know it works on my old 32-bit netbook.

EDIT: Three minutes in and still running, so it isn't a problem with Ubuntu 16.10.

(packaged version)

Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55838 - Posted: 1 Mar 2017, 12:48:17 UTC - in response to Message 55837.  
Last modified: 1 Mar 2017, 13:22:25 UTC

I don't know if you can find stdout for the application anywhere, but if you can, do you also get these messages?

Setting pthread size (-1778384896 bytes)

EDIT:
Alternatively, you can use top/ps to find the PID of the wah2am3m2_um_8.12_i686-pc-linux-gnu application, and then do

cat /proc/<pid>/limits
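
If it is easier than reading the table by eye, the "Max stack size" row can be pulled out with a few lines of Python. This is a rough sketch that assumes the usual column layout of `/proc/<pid>/limits` on Linux; the function names and the sample text are illustrative and are not output from anyone's machine in this thread.

```python
def parse_stack_limit(limits_text):
    """Return (soft, hard) from the 'Max stack size' row of /proc/<pid>/limits.

    Each value is an int (bytes) or the string 'unlimited'.
    """
    label = "Max stack size"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return tuple(v if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max stack size' row found")


def read_stack_limit(pid):
    """Read the stack limit of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_stack_limit(fh.read())


# Illustrative sample in the kernel's column layout.
SAMPLE = """Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max stack size            8388608              unlimited            bytes
"""
print(parse_stack_limit(SAMPLE))   # (8388608, 'unlimited')
```

A soft limit of 8388608 bytes is the common 8192 KiB default; anything vastly larger than physical memory would be worth reporting here.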

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 55839 - Posted: 1 Mar 2017, 13:44:35 UTC

The first task is now over an hour in and a second is 12 minutes in. I'm just waiting to see whether everything later in the run happens as it should, but that is a few days away.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 55843 - Posted: 2 Mar 2017, 11:46:34 UTC

Looking at my wingmen on this box, now running Linux, I see, as well as the usual crop of missing libs, at least two Linux boxes that are crashing everything with a "failed to create thread" error. One is crashing tasks at the start and the other about half an hour in, from memory. I don't know if this is related or not. My problems with Linux tasks have all been either missing libs or crashes on restart after a reboot, power failure, etc.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55846 - Posted: 3 Mar 2017, 22:20:58 UTC

Knorr

The latest that I've been able to get is Linux 3.16.0-38-generic, and BOINC 7.2.42 from Ubuntu. 7.2.42 is also the latest stable release from the BOINC site.

Where did you get your OS and BOINC?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 55848 - Posted: 4 Mar 2017, 9:16:57 UTC
Last modified: 4 Mar 2017, 9:32:21 UTC

The latest that I've been able to get is Linux 3.16.0-38-generic, and BOINC 7.2.42 from Ubuntu. 7.2.42 is also the latest stable release from the BOINC site.


I am currently running the following, all from the Xubuntu 16.10 repositories.

Operating System Linux
4.8.0-40-generic
BOINC version 7.6.33

Edit: the upgrade from 38 to 40 on the kernel was only done yesterday on this box.

Edit 2: 16.10 has had kernel 4.8.x since I installed it last year.

Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55850 - Posted: 4 Mar 2017, 13:23:33 UTC - in response to Message 55846.  

Knorr

The latest that I've been able to get is Linux 3.16.0-38-generic, and BOINC 7.2.42 from Ubuntu. 7.2.42 is also the latest stable release from the BOINC site.

Where did you get your OS and BOINC?


As Dave mentioned, this is standard on *ubuntu 16.10. BOINC is in the universe repository: http://packages.ubuntu.com/yakkety/boinc

Technically, I'm running Ubuntu Gnome (https://ubuntugnome.org/), but that should not affect any of this. It is just an Ubuntu using Gnome3 instead of Unity for the desktop environment. Packages come from the same repositories as the "official" Canonical Ubuntu.

Boinc was shipped at version 7.2.42 in Ubuntu 14.04.

Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55851 - Posted: 4 Mar 2017, 13:27:18 UTC - in response to Message 55838.  


I don't know if you can find stdout for the application anywhere, but if you can, do you also get these messages?

Setting pthread size (-1778384896 bytes)

EDIT:
Alternatively, you can use top/ps to find the PID of the wah2am3m2_um_8.12_i686-pc-linux-gnu application, and then do

cat /proc/<pid>/limits


Dave, were you able to get the output from above? It will list "Max stack size". For me it has a soft limit that looks like an integer overflow. It corresponds to the -1778384896 bytes interpreted as an unsigned integer.
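
To make the overflow concrete: read as an unsigned 32-bit number, -1778384896 is exactly 2400 MiB, which is too large for a signed 32-bit int. The Python sketch below uses `ctypes` purely to mimic fixed-width C integers. That the application actually intended to request 2400 MiB is an inference from the round number, not something confirmed in this thread.

```python
import ctypes

MIB = 1024 * 1024
printed = -1778384896                      # value from the application's log line

# Same bit pattern, read as an unsigned 32-bit integer.
as_u32 = ctypes.c_uint32(printed).value
print(as_u32, as_u32 / MIB)                # 2516582400 2400.0

# Going the other way: 2400 MiB stored in a signed 32-bit int wraps around.
wrapped = ctypes.c_int32(2400 * MIB).value
print(wrapped)                             # -1778384896
```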
©2024 cpdn.org