Questions and Answers : Unix/Linux : Failing tasks with exit codes 12 and 25
Author | Message |
---|---|
Send message Joined: 9 Nov 15 Posts: 8 Credit: 310,778 RAC: 0 |
I have a bunch of tasks on one of my computers which failed with exit code 12 or 25. On the ones with exit code 12 I see an error like:
checkdir: cannot create extraction directory: hadam4h_a21t_209911_4_867_012014556 File exists
On the ones with exit code 25 I see a bunch of errors like:
Could not read directory attributes: Value too large for defined data type
or
checkdir error: cannot create hadam4h_a0wt_209411_4_868_012016230/datain/ancil/ctldata File exists
unable to process datain/ancil/ctldata/stasets/. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,016,442 RAC: 21,024 |
After looking at several of the tasks, I found a few others failing with similar, though not exactly the same, messages as yours. It took a while because most of the failures were due to missing 32-bit libs. The fact that others also failed with similar errors suggests a problem with the tasks. I will let the project know. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,016,442 RAC: 21,024 |
I have seen this error before when looking at tasks that have failed for other reasons. I am clutching at straws a bit here, but a couple of things are worth checking.
1. That you have enough disk space allocated. (Unlikely this is a problem with only 8 cores.)
2. Something to do with RAM and/or cache memory. If the tasks complete fine when you restrict BOINC to only 4 cores at a time, then cache memory would be the most likely reason. Sarah at the project replied to my post and thinks the strange error messages you are getting are likely not directly from the crash, but because something doesn't clean up properly after the crash.
3. Just thought of this: it could be that they are crashing because you have a corrupted downloaded file. If you detach from CPDN and then re-attach, that will download fresh copies of all the relevant files and resolve the problem. (Might be worth trying that one first.) It isn't that common an issue, I think, as I have never seen it on my own boxes and only rarely when looking through crashed tasks for patterns. |
Send message Joined: 9 Nov 15 Posts: 8 Credit: 310,778 RAC: 0 |
1. BOINC is using 6.7 GB, and it says it has another 125.46 GB available.
2. I have already restricted BOINC to 1 core.
3. I have now detached and reattached the project, but it says communication deferred 1 day, so we'll have to wait and see what happens. |
Send message Joined: 9 Nov 15 Posts: 8 Credit: 310,778 RAC: 0 |
I'm still getting the same errors after detaching and reattaching. |
Send message Joined: 9 Nov 15 Posts: 8 Credit: 310,778 RAC: 0 |
I have an idea about what I think is causing this. The computer getting these errors has the BOINC directory on XFS, which uses 64-bit inode numbers, but the CPDN application seems to be 32-bit, and by default in 32-bit applications the stat() and readdir() functions use 32-bit inode numbers, hence the:
Could not read directory attributes: Value too large for defined data type
To fix this, CPDN needs to be compiled with _FILE_OFFSET_BITS=64 or use stat64 and readdir64; or, even better, be compiled as 64-bit. See https://www.mjr19.org.uk/sw/inodes64.html for a longer explanation. |
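For anyone who wants to see the failure mode for themselves, here is a minimal C sketch (not CPDN code; the file name and build lines are purely illustrative). Built as a plain 32-bit binary it fails with EOVERFLOW on any file whose inode number needs more than 32 bits; rebuilding the same source with -D_FILE_OFFSET_BITS=64 makes the call succeed.

```c
/* inode_overflow_demo.c -- hypothetical demo, not CPDN code.
 *
 * Built the old way, as the CPDN models apparently are:
 *     gcc -m32 inode_overflow_demo.c -o demo32
 * Built with large-file support, the same call succeeds:
 *     gcc -m32 -D_FILE_OFFSET_BITS=64 inode_overflow_demo.c -o demo32_lfs
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    struct stat st;

    if (stat(path, &st) != 0) {
        /* On XFS, a file whose inode number does not fit in 32 bits prints:
         *   stat failed: Value too large for defined data type   */
        fprintf(stderr, "stat failed: %s\n", strerror(errno));
        return 1;
    }
    printf("%s: inode %llu\n", path, (unsigned long long)st.st_ino);
    return 0;
}
```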
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The Met Office programs used by the researchers are all 32-bit code, and will stay that way for historical comparison of data. It's up to users to make sure that their computers have the necessary 32-bit libraries. See '*** Running 32bit CPDN from 64bit Linux - Important ***' at the top of this Linux section for how to do this for various Linux versions. |
Send message Joined: 9 Nov 15 Posts: 8 Credit: 310,778 RAC: 0 |
The problem here has nothing to do with missing libraries. The problem is that XFS uses 64-bit inode numbers, so the stat and readdir system calls return 64-bit inode numbers, but hadam4_8.52_i686-pc-linux-gnu uses the old stat and readdir functions, which only work with 32-bit inodes. It's not something you can fix just by installing extra dependencies. If it can't be 64-bit (and unless it's poorly written, that shouldn't change the data), then that leaves the other two workarounds I mentioned:
1. Compile CPDN with _FILE_OFFSET_BITS=64. Unless the inode numbers are actually used for anything, this should not change anything else.
2. Replace calls to stat and readdir with stat64 and readdir64 (see the sketch below). |
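A sketch of that second workaround, assuming glibc's explicit large-file interfaces (stat64, readdir64 and struct dirent64); it is illustrative only and not taken from the CPDN source:

```c
/* readdir64_demo.c -- illustrative only, not CPDN code.
 * Build as 32-bit without changing the default types:
 *     gcc -m32 readdir64_demo.c -o readdir64_demo
 */
#define _LARGEFILE64_SOURCE   /* exposes stat64/readdir64 in glibc */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";

    DIR *d = opendir(path);
    if (!d) {
        perror("opendir");
        return 1;
    }
    /* struct dirent64 carries a 64-bit d_ino, so entries with large inode
     * numbers are returned instead of readdir() failing with EOVERFLOW. */
    struct dirent64 *e;
    while ((e = readdir64(d)) != NULL)
        printf("%llu  %s\n", (unsigned long long)e->d_ino, e->d_name);
    closedir(d);

    /* Likewise stat64() fills a struct stat64 with a 64-bit st_ino. */
    struct stat64 st;
    if (stat64(path, &st) == 0)
        printf("inode of %s: %llu\n", path, (unsigned long long)st.st_ino);
    return 0;
}
```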
Send message Joined: 9 Nov 15 Posts: 8 Credit: 310,778 RAC: 0 |
Or you can put it in VirtualBox, as some other projects have done, which should avoid both file system and library issues. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,016,442 RAC: 21,024 |
> Or you can put it in VirtualBox, as some other projects have done, which should avoid both file system and library issues.
VirtualBox would, as you say, solve the problems of systems missing the 32-bit libraries. However, there would be some performance hit, and from time to time over on the BOINC boards I see users who have problems with it, so it adds another layer where problems might occur. I don't know how straightforward it would be for the people at the project to set up VirtualBox for the Linux applications either, or whether anyone there has experience of doing so. For other reasons, I am going to do a clean install of Ubuntu on my laptop when the work currently on it is finished, and will try using XFS to test this, but it is not a fast machine so it is likely to be over a month till I do so. If there is anyone here using XFS who is either running tasks successfully or has the same problem, it would be good if you could post to help us sort this one out and at least either confirm or disprove that XFS is the root of the problem. Edit: If the XFS file system does prevent things working, I am a bit surprised nothing came up when I did a search on the BOINC forums. |
Send message Joined: 9 Nov 15 Posts: 8 Credit: 310,778 RAC: 0 |
Keep in mind that 64-bit inodes are only used for file systems bigger than 1 TiB, and AFAIK only inodes that are not in the first 1 TiB of the drive will be too large to fit in 32 bits. So the problem likely won't happen on a nearly empty file system, and it will never happen on file systems smaller than 1 TiB. Testing it may require filling the file system with 1 TiB of data first. I'm not sure how common large XFS file systems are, especially for /var/lib, which AFAIK is where the BOINC directory is by default on most distros. So this could be a pretty rare issue, but based on your first comment, it does seem to happen sometimes. |
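Anyone who wants to check whether their BOINC directory is actually affected can list the inode numbers with a native 64-bit tool and see whether any exceed 2^32 - 1. Here is a small, hypothetical helper along those lines (the file name and the /var/lib/boinc path are just examples, and it only checks one directory level; `ls -li` in the directory shows the same numbers).

```c
/* check_inodes.c -- hypothetical helper, not part of BOINC or CPDN.
 * Build as a native 64-bit binary so stat()/readdir() cannot overflow:
 *     gcc check_inodes.c -o check_inodes
 *     ./check_inodes /var/lib/boinc
 */
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : ".";
    DIR *d = opendir(dir);
    if (!d) {
        perror("opendir");
        return 1;
    }

    struct dirent *e;
    int found = 0;
    while ((e = readdir(d)) != NULL) {
        /* Any inode number above 2^32 - 1 cannot be represented by the
         * legacy 32-bit stat()/readdir() interfaces a 32-bit app uses. */
        if ((uint64_t)e->d_ino > 0xFFFFFFFFull) {
            printf("64-bit inode %llu: %s/%s\n",
                   (unsigned long long)e->d_ino, dir, e->d_name);
            found = 1;
        }
    }
    closedir(d);

    if (!found)
        printf("no inode numbers above 32 bits found in %s\n", dir);
    return 0;
}
```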
Send message Joined: 9 Nov 15 Posts: 8 Credit: 310,778 RAC: 0 |
LD_PRELOAD seems to work. What I did was:
1. Compile inode64.c from the link above, based on the instructions in that file.
2. Place it at /usr/local/lib/inode64.so.
3. Add LD_PRELOAD=/usr/local/lib/inode64.so to /etc/sysconfig/boinc-client (the EnvironmentFile in the boinc-client systemd service points to this). |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,016,442 RAC: 21,024 |
Thanks for posting a solution. I got confirmation from Richard over on the BOINC forums that this almost certainly was a problem with the CPDN setup. I have informed the project to see if someone knows how to fix it at their end. I worked out after I last posted that, with my system disk only being a 500 GB SSD and my data disk being a 1 TB mechanical, I probably wouldn't see the problem. I will post your solution over on the BOINC forums in case anyone who reads them needs it. |