Questions and Answers : Unix/Linux : Work units failing
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
cat /proc/<pid>/limits
cat /proc/2083/limits
Limit                  Soft Limit            Hard Limit  Units
Max cpu time           unlimited             unlimited   seconds
Max file size          unlimited             unlimited   bytes
Max data size          unlimited             unlimited   bytes
Max stack size         18446744071931166720  unlimited   bytes
Max core file size     0                     unlimited   bytes
Max resident set       unlimited             unlimited   bytes
Max processes          15592                 15592       processes
Max open files         1024                  4096        files
Max locked memory      65536                 65536       bytes
Max address space      unlimited             unlimited   bytes
Max file locks         unlimited             unlimited   locks
Max pending signals    15592                 15592       signals
Max msgqueue size      819200                819200      bytes
Max nice priority      0                     0
Max realtime priority  0                     0
Max realtime timeout   unlimited             unlimited   us
Will also look at stdout |
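For anyone comparing hosts, the soft stack limit can be pulled out of that output on its own. This is only a minimal sketch, assuming the usual layout of /proc/<pid>/limits as posted above; the function name `stack_soft_limit` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the soft "Max stack size" limit from text in
# the format of /proc/<pid>/limits, read from standard input.
stack_soft_limit() {
    # The limit name is three words ("Max stack size"), so the soft limit
    # is field 4 and the hard limit is field 5.
    awk '/^Max stack size/ { print $4 }'
}

# Example: show the limit that applies to this shell process (Linux only).
if [ -r /proc/self/limits ]; then
    stack_soft_limit < /proc/self/limits
fi
```

To look at a running task instead, feed it `/proc/<pid>/limits` for that task's process ID. In bash, `ulimit -s` reports the same soft limit for the current shell, but in kilobytes rather than bytes.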
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
stdout_rm.text for one of two tasks currently running.
Starting HadRM3 model for ID# 233635...
Attaching shared memory segment...
Setting run-time Fortran environment...
UM environment variables in use:
ASTART=dataout/region_restart.day
UNIT11=dataout/xacxf.phist
UM_SECTOR_SIZE=2048
UNIT02=jobs/xacxf
UM_LBC_COUP=0
VN=4.5
TYPE=CRUN
UNIT09=tmp/xacxf.namelists
UNIT22=datain/ancil/ctldata/STASHmaster
STASETS_DIR=datain/ancil/ctldata/stasets
CACHE2=tmp/xacxf.cache2
UNIT08=tmp/xacxf.pipe_dummy
UNIT14=tmp/xacxf.errors
APSUM1=tmp/xacxf.apsum1
APSTMP1=tmp/xacxf.apstmp1
AOTRANS=tmp/xacxf.aotrans
UNIT04=jobs/xacxf.stashc
UNIT05=jobs/xacxf.namelists
DATAM=dataout/
UNIT12=dataout/xacxf.thist
UNIT10=dataout/xacxf.phist
UNIT06=dataout/xacxf.out
UNIT00=dataout/xacxf.err
AINITIAL=dataout/region_restart.day
UNIT57=jobs/spec3a_sw_3_asol2c_hadcm3
UNIT80=jobs/spec3a_lw_3_asol2c_hadcm3
SWSPECTD=jobs/spec3a_sw_3_asol2c_hadcm3
LWSPECTD=jobs/spec3a_lw_3_asol2c_hadcm3
Changing to slots dir /var/lib/boinc-client/slots/1 |
Send message Joined: 18 Feb 06 Posts: 21 Credit: 128,450 RAC: 0 |
Identical to what I see on my system. Not sure what happens when you set such a large stack size limit. |
Send message Joined: 6 Aug 04 Posts: 124 Credit: 9,195,838 RAC: 0 |
Is it significant that I have libc-2.23 (which is working) and Knorr has libc-2.24? It's not an Ubuntu problem; I have the same problem with Gentoo. It would be interesting to know on what system the binaries were compiled. Linux Users Everywhere @ BOINC |
Send message Joined: 6 Aug 04 Posts: 124 Credit: 9,195,838 RAC: 0 |
The new batch of work units are mass failing either after a few seconds or after restarting computation. But this time with a new error: <message> finish file present too long </message> Linux Users Everywhere @ BOINC |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
The new batch of work units are mass failing either after a few seconds or after restarting computation. But this time with a new error: And not just under Linux. Message has been sent to project. Edit: And Sarah has replied that it is about 400 of this batch of 7200 which seem to have a corrupted restart file. If it becomes apparent that it is a bigger problem, then they will look at whether the batch should be withdrawn. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
Yours was batch 569. I have just had two fail on batch 568 just before creation of the first upload, both with SIGSEGV: segmentation violation, and both within 20 seconds of the same CPU time. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
And batch 564 2 just fell over at start with <stderr_txt> forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /var/lib/boinc-client/projects/climateprediction.net/hadcm3s_83ys_201412_120_564_011007447/jobs/climate.cpdc, line 396, position 20 Image PC Routine Line Source |
Send message Joined: 6 Aug 04 Posts: 124 Credit: 9,195,838 RAC: 0 |
What's with these 3 units? They all ended as completed, but got 0 credits. https://www.cpdn.org/cpdnboinc/result.php?resultid=20393719 https://www.cpdn.org/cpdnboinc/result.php?resultid=20399859 https://www.cpdn.org/cpdnboinc/result.php?resultid=20401131 Linux Users Everywhere @ BOINC |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
What's with these 3 units? They all ended as completed, but got 0 credits. The credits don't appear until after the next run of the credit script, which is only once a week these days. |
Send message Joined: 4 Apr 17 Posts: 3 Credit: 2,709,972 RAC: 0 |
I got a bunch of WU failures recently. It turned out that the 32-bit version of libz.so.1 was missing on my systems. After installing it, "ldd wah2_se_8.25_i686-pc-linux-gnu.so" stopped complaining about a missing dependency. Unfortunately all my computing time is lost here. It would be good if your apps were statically linked with all the libs they need to avoid such problems, especially when you use 32-bit apps on 64-bit systems. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
I and all of the forum admins would agree with you. At least for Ubuntu there are clear instructions on this forum for installing the requisite 32bit libs which hasn't always been the case. I think the problem is in ensuring that the libs will work across the various flavours of Linux. However it will require an initiate of higher degree than myself to confirm that this is or possibly is not an issue. |
Send message Joined: 4 Apr 17 Posts: 3 Credit: 2,709,972 RAC: 0 |
Usually static linking of all required libs is enough; most BOINC projects do so. One problem may be linking with old versions of glibc. Einstein@Home and Albert@home created their ARM apps by linking them with some very old version of glibc, so they need a symbol which is no longer present in the glibc version installed on Odroids. Besides this, other differences between Linuxes usually mean a different platform string, so a project can prepare another app for them. Another option is to load all used libs early; by doing so the app would crash early (within the first minute), so not much time would be lost. BTW, today I noticed that I got credits for some of these failed WUs. I do not know why, but thanks for this. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
BTW, today I noticed that I got credits for some of this failed WUs. I do not know why, but thanks for this. CPDN awards credits based on how many trickles are returned before a completion or crash. This is because of how long the tasks run relative to most other projects, and the number of things that can go wrong over the course of a long run that may have nothing to do with the stability of the computer it is running on. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
Another option is to load all used libs early, by doing so app would crash early (within 1st minute) so not much time will be lost. In my experience, when tasks crash due to missing libs it is always within the first 20 seconds. Edit: An exception seems to be the libz.so.1 which I can only assume is needed at time of creation of zips. I think, on my systems it has always been installed as part of getting the tasks to run at all, so I have never experienced that particular problem. |
Send message Joined: 4 Apr 17 Posts: 3 Credit: 2,709,972 RAC: 0 |
Another option is to load all used libs early, by doing so app would crash early (within 1st minute) so not much time will be lost. zlib is used for many different things, so usually it is present; in fact all my systems have the 64-bit version of that lib. The problem is with 32-bit libs on 64-bit systems: they are usually not present there and have to be installed manually. |
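Since the mismatch is between 32-bit apps and 64-bit libs, it can help to confirm which kind a given file actually is. A rough sketch, assuming only the standard ELF header layout (byte 4 of the header is the class: 1 for 32-bit, 2 for 64-bit); the function name `elf_class` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report whether a file is a 32-bit or 64-bit ELF binary.
# An ELF file starts with the bytes 7f 45 4c 46 ("\177ELF"); the next byte
# (EI_CLASS, offset 4) is 1 for 32-bit and 2 for 64-bit.
elf_class() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not ELF"
        return 1
    fi
    # Skip the 4 magic bytes and read the single class byte as a decimal.
    case $(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n') in
        1) echo "32-bit" ;;
        2) echo "64-bit" ;;
        *) echo "unknown" ;;
    esac
}
```

For example, `elf_class wah2_se_8.25_i686-pc-linux-gnu.so` run in the project directory, or the same call on the installed libz.so.1, shows whether the app and the library are of the same class.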
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
Just got this on a retread as did one of the two previous attempts, the other was a missing lib problem. Model crashed: DRLANDF1 : Error in READ_LAND_SEA. https://www.cpdn.org/cpdnboinc/result.php?resultid=20449729 My guess is most of that batch have finished by now or failed. |
Send message Joined: 13 May 06 Posts: 1 Credit: 0 RAC: 0 |
Hello, all my work units are failing with messages like "Output file .... absent":
21-Jun-2017 18:17:52 [climateprediction.net] Computation for task wah2_sas50_n6qy_201612_8_591_011098479_0 finished
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_1.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_2.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_3.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_4.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_5.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_6.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_7.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_8.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_restart.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
21-Jun-2017 18:17:52 [climateprediction.net] Output file wah2_sas50_n6qy_201612_8_591_011098479_0_r1137055161_out.zip for task wah2_sas50_n6qy_201612_8_591_011098479_0 absent
I am running BOINC on Fedora. Does anyone have an idea how to fix the problem? Thanks in advance, Bernhard |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
The error message in stderr on the task pages for your account suggests a missing library. CPDN models are 32-bit. Nearly all Linux distributions no longer ship 32-bit libraries in the default install of a 64-bit distribution. You will need to install some 32-bit libraries in order for the models to work. This page in the BOINC wiki may help. https://boinc.berkeley.edu/wiki/Installing_BOINC#64_Bit_Considerations I haven't run Fedora for a long time so I don't know how up to date those instructions are for the latest versions of Fedora. You can also run ldd wah2* in the BOINC/projects/climateprediction.net directory, which should list the dependencies that are missing. Edit...It looks like the command on the wiki page works for everything but one file. After you've done that, you can download the libstdc++.so.6 file in an rpm from rpmfind. ftp://rpmfind.net/linux/fedora/linux/updates/25/x86_64/l/libstdc++-6.3.1-1.fc25.i686.rpm |
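The ldd check suggested above can be narrowed down to just the unresolved names. This is a sketch rather than a tested recipe for any particular distribution: it assumes ldd marks unresolved libraries with "=> not found", and the function name `missing_libs` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read ldd output on standard input and print the name
# of each library that could not be resolved, one per line, without repeats.
missing_libs() {
    # Unresolved entries look like: "libz.so.1 => not found"
    awk '/=> not found/ { print $1 }' | sort -u
}

# Example: check every wah2 file in the current directory.
for f in wah2*; do
    [ -e "$f" ] || continue          # skip if the pattern matched nothing
    ldd "$f" 2>/dev/null | missing_libs
done
```

Run it from the projects/climateprediction.net directory. No output means ldd found every dependency; any names printed are the 32-bit libraries still to be installed.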
©2024 cpdn.org