Thread 'Work units failing'

Questions and Answers : Unix/Linux : Work units failing

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 55852 - Posted: 4 Mar 2017, 15:47:13 UTC

cat /proc/<pid>/limits


cat /proc/2083/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            18446744071931166720 unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             15592                15592                processes
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       15592                15592                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

I will also look at stdout.
ID: 55852
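For comparing limits between hosts, the listing can be parsed rather than read by eye. The following is a minimal Python sketch, not anything shipped by BOINC or CPDN: `parse_limits` is a made-up name, and the layout (a 25-character name field followed by whitespace-separated values) is an assumption about how Linux formats /proc/<pid>/limits.

```python
def parse_limits(text):
    """Map each limit name to its (soft, hard) values, kept as strings.

    Assumes the name occupies the first 25 characters of each row and
    that the soft and hard values follow, separated by whitespace.
    """
    limits = {}
    for line in text.splitlines()[1:]:       # skip the header row
        if not line.strip():
            continue
        name = line[:25].strip()
        soft, hard = line[25:].split()[:2]   # a trailing units column is ignored
        limits[name] = (soft, hard)
    return limits


def read_limits(pid):
    """Read and parse the limits of one running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_limits(fh.read())
```

With the file behind the listing in this post, `read_limits(2083)["Max stack size"]` would give the pair `('18446744071931166720', 'unlimited')`, which makes the odd soft limit easy to spot in a script.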
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 55853 - Posted: 4 Mar 2017, 15:53:58 UTC

stdout_rm.text for one of two tasks currently running.


Starting HadRM3 model for ID# 233635...
Attaching shared memory segment...
Setting run-time Fortran environment...

UM environment variables in use:
ASTART=dataout/region_restart.day
UNIT11=dataout/xacxf.phist
UM_SECTOR_SIZE=2048
UNIT02=jobs/xacxf
UM_LBC_COUP=0
VN=4.5
TYPE=CRUN
UNIT09=tmp/xacxf.namelists
UNIT22=datain/ancil/ctldata/STASHmaster
STASETS_DIR=datain/ancil/ctldata/stasets
CACHE2=tmp/xacxf.cache2
UNIT08=tmp/xacxf.pipe_dummy
UNIT14=tmp/xacxf.errors
APSUM1=tmp/xacxf.apsum1
APSTMP1=tmp/xacxf.apstmp1
AOTRANS=tmp/xacxf.aotrans
UNIT04=jobs/xacxf.stashc
UNIT05=jobs/xacxf.namelists
DATAM=dataout/
UNIT12=dataout/xacxf.thist
UNIT10=dataout/xacxf.phist
UNIT06=dataout/xacxf.out
UNIT00=dataout/xacxf.err
AINITIAL=dataout/region_restart.day
UNIT57=jobs/spec3a_sw_3_asol2c_hadcm3
UNIT80=jobs/spec3a_lw_3_asol2c_hadcm3
SWSPECTD=jobs/spec3a_sw_3_asol2c_hadcm3
LWSPECTD=jobs/spec3a_lw_3_asol2c_hadcm3

Changing to slots dir /var/lib/boinc-client/slots/1
ID: 55853
Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55855 - Posted: 4 Mar 2017, 17:47:57 UTC - in response to Message 55852.  


Max stack size 18446744071931166720 unlimited bytes


Identical to what I see on my system. I am not sure what happens when you set such a large stack size limit.
ID: 55855
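One way to read the odd soft limit quoted above is as a negative number stored in an unsigned 64-bit field. The arithmetic in this Python sketch is exact; the interpretation in the comments (a 32-bit request of 2400 MiB that overflowed a signed integer and was then sign-extended) is only a guess and is not confirmed anywhere in this thread. The helper name `as_signed` is invented for the example.

```python
SOFT_LIMIT = 18446744071931166720  # "Max stack size" soft limit from the listing


def as_signed(value, bits):
    """Reinterpret an unsigned integer as a two's-complement signed one."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


signed64 = as_signed(SOFT_LIMIT, 64)   # -1778384896
low32 = SOFT_LIMIT & 0xFFFFFFFF        # 2516582400
mib = low32 / 2**20                    # 2400.0

# The low 32 bits, read as a signed 32-bit value, give the same negative
# number as the full 64-bit value: consistent with sign extension.
print(signed64, low32, mib, as_signed(low32, 32) == signed64)
```

If that guess is right, the limit the application intended would be 2400 MiB rather than something close to 16 EiB.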
Desti
Joined: 6 Aug 04
Posts: 124
Credit: 9,195,838
RAC: 0
Message 56075 - Posted: 14 Apr 2017, 19:27:22 UTC - in response to Message 55830.  

Is it significant that I have libc-2.23 (which is working) and Knorr has libc-2.24?

Edit: I should write libc-2.23.so and libc-2.24.so.

Edit: I just checked my other two Ubuntu 16.04 64-bit machines and they both have libc-2.23.so.

Edit: I think I saw Knorr is using Ubuntu 16.10. I am using Ubuntu 16.04. Could different (later) libraries be an issue?



It's not an Ubuntu problem; I have the same problem on Gentoo.

It would be interesting to know on what system the binaries were compiled.
Linux Users Everywhere @ BOINC
ID: 56075
Desti
Joined: 6 Aug 04
Posts: 124
Credit: 9,195,838
RAC: 0
Message 56231 - Posted: 16 May 2017, 23:07:57 UTC

The work units in the new batch are failing en masse, either after a few seconds or after restarting computation, but this time with a new error:

<message>
finish file present too long
</message>

Linux Users Everywhere @ BOINC
ID: 56231
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 56232 - Posted: 17 May 2017, 4:42:06 UTC - in response to Message 56231.  
Last modified: 17 May 2017, 5:14:48 UTC

The work units in the new batch are failing en masse, either after a few seconds or after restarting computation, but this time with a new error:


And not just under Linux. A message has been sent to the project.

Edit: Sarah has replied that about 400 of this batch of 7200 seem to have a corrupted restart file. If it becomes apparent that it is a bigger problem, they will look at whether the batch should be withdrawn.
ID: 56232
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 56256 - Posted: 18 May 2017, 11:51:16 UTC

Yours was batch 569. I have just had two tasks from batch 568 fail just before creation of the first upload, both with SIGSEGV: segmentation violation, and both within 20 seconds of the same CPU time.
ID: 56256
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 56257 - Posted: 19 May 2017, 10:45:47 UTC

And batch 564: two tasks just fell over at the start with

<stderr_txt>
forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /var/lib/boinc-client/projects/climateprediction.net/hadcm3s_83ys_201412_120_564_011007447/jobs/climate.cpdc, line 396, position 20
Image PC Routine Line Source
ID: 56257
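The forrtl message quoted above already names the file, line and position of the bad NAMELIST entry, so the offending text can be pulled out without opening an editor. This Python sketch is a rough illustration only: the regular expression is modelled on that single quoted message and may not fit other forrtl errors, the function names are invented, and treating the position as a 1-based character column is an unverified assumption.

```python
import re

# Pattern modelled on the one message quoted in this thread.
NAMELIST_ERROR = re.compile(
    r"forrtl: severe \((?P<code>\d+)\): (?P<text>.+?), unit (?P<unit>\d+), "
    r"file (?P<path>\S+), line (?P<line>\d+), position (?P<pos>\d+)"
)


def locate_error(message):
    """Return (path, line, position) from the error text, or None."""
    match = NAMELIST_ERROR.search(message)
    if match is None:
        return None
    return match["path"], int(match["line"]), int(match["pos"])


def show_line(path, line_no, position):
    """Print the reported line with a caret under the reported column.

    Assumes `position` counts characters from 1 (not verified).
    """
    with open(path, errors="replace") as fh:
        for number, text in enumerate(fh, start=1):
            if number == line_no:
                print(text.rstrip("\n"))
                print(" " * (position - 1) + "^")
                return
```

Feeding the stderr text to `locate_error` and the result to `show_line` would print line 396 of climate.cpdc, which is where a corrupted or truncated namelist entry should be visible.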
Desti
Joined: 6 Aug 04
Posts: 124
Credit: 9,195,838
RAC: 0
Message 56259 - Posted: 19 May 2017, 14:37:05 UTC

ID: 56259
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 56260 - Posted: 19 May 2017, 15:03:08 UTC - in response to Message 56259.  

What's with these 3 units? They all ended as completed but got 0 credits.


The credits don't appear until after the next run of the credit script, which happens only once a week these days.
ID: 56260
[B@P] Daniel
Joined: 4 Apr 17
Posts: 3
Credit: 2,709,972
RAC: 0
Message 56270 - Posted: 20 May 2017, 11:34:51 UTC

I got a bunch of WU failures recently. It turned out that the 32-bit version of libz.so.1 was missing on my systems. After installing it, "ldd wah2_se_8.25_i686-pc-linux-gnu.so" stopped complaining about a missing dependency. Unfortunately, all my computing time here is lost. It would be good if your libs were statically linked with all needed libs to avoid such problems, especially when you use 32-bit apps on 64-bit systems.
ID: 56270
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 56273 - Posted: 20 May 2017, 19:29:19 UTC - in response to Message 56270.  

I and all of the forum admins would agree with you. At least for Ubuntu, there are clear instructions on this forum for installing the requisite 32-bit libs, which hasn't always been the case. I think the problem is in ensuring that the libs will work across the various flavours of Linux.

However, it will require an initiate of higher degree than myself to confirm whether or not this is an issue.
ID: 56273
[B@P] Daniel
Joined: 4 Apr 17
Posts: 3
Credit: 2,709,972
RAC: 0
Message 56275 - Posted: 21 May 2017, 10:56:34 UTC
Last modified: 21 May 2017, 10:57:22 UTC

Usually static linking of all required libs is enough; most BOINC projects do so.
One problem may be linking against old versions of glibc. Einstein@Home and Albert@home built their ARM apps against a very old version of glibc, so they need a symbol that is no longer present in the glibc version installed on Odroids. Besides this, other differences between Linux distributions usually mean a different platform string, so the project can prepare another app for them.

Another option is to load all used libs early. That way the app would crash early (within the first minute), so not much time would be lost.

BTW, today I noticed that I got credits for some of these failed WUs. I do not know why, but thanks for this.
ID: 56275
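The "load all used libs early" idea from this post can be illustrated with a fail-fast check at startup. This is a hedged Python sketch of the general technique, not how the CPDN applications work: `preload` and `startup_check` are invented names, the library list is just the two names mentioned in this thread, and `ctypes.CDLL` loads libraries matching the interpreter's own word size, so it does not by itself prove that the 32-bit variants are installed.

```python
import ctypes
import sys

# Library names taken from this thread; a real application would list its own.
REQUIRED = ["libz.so.1", "libstdc++.so.6"]


def preload(names, loader=ctypes.CDLL):
    """Try to load every shared library now; return the names that failed."""
    missing = []
    for name in names:
        try:
            loader(name)
        except OSError:          # raised when the library cannot be loaded
            missing.append(name)
    return missing


def startup_check(names=REQUIRED):
    """Exit immediately, before any long computation, if a library is absent."""
    missing = preload(names)
    if missing:
        sys.exit("missing shared libraries: " + ", ".join(missing))
```

Calling `startup_check()` as the first step of a program means a missing library costs seconds instead of days of model time, which is the point being made above.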
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 56276 - Posted: 21 May 2017, 14:25:35 UTC - in response to Message 56275.  

BTW, today I noticed that I got credits for some of these failed WUs. I do not know why, but thanks for this.

CPDN awards credits based on how many trickles are returned before a completion or crash. This is because of how long the tasks run relative to most other projects, and the number of things that can go wrong over the course of a long run that may have nothing to do with the stability of the computer it is running on.
ID: 56276
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 56277 - Posted: 21 May 2017, 14:51:56 UTC - in response to Message 56275.  
Last modified: 21 May 2017, 15:07:36 UTC

Another option is to load all used libs early. That way the app would crash early (within the first minute), so not much time would be lost.


In my experience, when tasks crash due to missing libs it is always within the first 20 seconds.

Edit: An exception seems to be the libz.so.1 which I can only assume is needed at time of creation of zips. I think, on my systems it has always been installed as part of getting the tasks to run at all, so I have never experienced that particular problem.
ID: 56277
[B@P] Daniel
Joined: 4 Apr 17
Posts: 3
Credit: 2,709,972
RAC: 0
Message 56282 - Posted: 21 May 2017, 16:42:25 UTC - in response to Message 56277.  

Another option is to load all used libs early. That way the app would crash early (within the first minute), so not much time would be lost.


In my experience, when tasks crash due to missing libs it is always within the first 20 seconds.

Edit: An exception seems to be the libz.so.1 which I can only assume is needed at time of creation of zips. I think, on my systems it has always been installed as part of getting the tasks to run at all, so I have never experienced that particular problem.

zlib is used for many different things, so it is usually present; in fact, all my systems have the 64-bit version of that lib. The problem is with 32-bit libs on 64-bit systems: they are usually not present there and have to be installed manually.
ID: 56282
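Whether a given binary or library is the 32-bit or the 64-bit variant can be read from its ELF header: after the four magic bytes, the fifth byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. A minimal Python sketch follows; the function names are invented for the example and the paths a reader would pass in are their own.

```python
ELF_MAGIC = b"\x7fELF"


def elf_class(header):
    """Return 32 or 64 from the first bytes of an ELF file, else None."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return {1: 32, 2: 64}.get(header[4])   # EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64


def file_elf_class(path):
    """Read just enough of a file on disk to classify it."""
    with open(path, "rb") as fh:
        return elf_class(fh.read(5))
```

A result of 32 for the wah2 application files next to 64 for the only installed copy of libz.so.1 would match the situation described in this post.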
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,028,039
RAC: 20,189
Message 56378 - Posted: 13 Jun 2017, 8:25:26 UTC

Just got this on a retread, as did one of the two previous attempts; the other was a missing lib problem.


Model crashed: DRLANDF1 : Error in READ_LAND_SEA.

https://www.cpdn.org/cpdnboinc/result.php?resultid=20449729

My guess is most of that batch have finished by now or failed.
ID: 56378
Bernhard
Joined: 13 May 06
Posts: 1
Credit: 0
RAC: 0
Message 56414 - Posted: 21 Jun 2017, 16:34:47 UTC

Hello,

All my work units are failing with messages like

Outputfile .... absent


21-Jun-2017 18:17:52 [climateprediction.net] Computation for task wah2_sas50_n6qy_201612_8_591_011098479_0 finished
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_1.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_2.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_3.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_4.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_5.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_6.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_7.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_8.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_restart.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_out.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent

I am running BOINC on Fedora.

Does anyone have an idea how to fix the problem?

Thanks in advance

Bernhard
ID: 56414
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 56417 - Posted: 21 Jun 2017, 16:59:16 UTC - in response to Message 56414.  
Last modified: 22 Jun 2017, 0:55:59 UTC

The error message in stderr on the task pages for your account suggests a missing library.

CPDN models are 32-bit. Nearly all Linux distributions no longer ship 32-bit libraries in the default install of a 64-bit distribution. You will need to install some 32-bit libraries in order for the models to work. This page in the BOINC wiki may help.

https://boinc.berkeley.edu/wiki/Installing_BOINC#64_Bit_Considerations

I haven't run Fedora for a long time so I don't know how up to date those instructions are for the latest versions of Fedora.


You can also do a

ldd wah2*

in the BOINC/projects/climateprediction.net directory which should list the dependencies that are missing.

Edit: It looks like the command on the wiki page works for everything but one file. After you've done that, you can download the libstdc++.so.6 file in an rpm from rpmfind.
ftp://rpmfind.net/linux/fedora/linux/updates/25/x86_64/l/libstdc++-6.3.1-1.fc25.i686.rpm
ID: 56417
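The `ldd` check suggested above can be scripted so that only the unresolved entries are shown. In ldd output a missing dependency appears as `name => not found`; the Python sketch below filters for that. The function names are illustrative, and the directory a reader would scan depends on where their BOINC data lives.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, arrow, target = line.partition("=>")
        if arrow and target.strip() == "not found":
            missing.append(name.strip())
    return missing


def check_binary(path):
    """Run ldd on one file and return its unresolved dependencies."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)
```

Running `check_binary` over each wah2* file in the climateprediction.net project directory would give the list of 32-bit packages still to install; an empty list for every file suggests the libraries are in place.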