Message boards : Number crunching : Computation Errors
Author | Message |
---|---|
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Climate models like lots of L3 cache. |
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
I just downloaded 4 SM4 work units to see whether my missing-library problem (libnsl.so.1) has been fixed so that I can get some successful work done. They were all re-sends from other users whose tasks failed with a different missing library, libstdc++.so.6: Computer 1460610 (Bartosz Toczek) only started having this issue with SM4 and had no problems with AM4; Computer 1531595 (Anonymous) has 79 failures, all SM4; Computer 1532546 (Science United) has 30 failures, all SM4. And of course I also got Computer 1517479 (Eric Korpela), with over 11,000 failures. I think they could be permission problems, but I am not sure; he isn't showing that library as missing. Conan |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"And of course I also got Computer 1517479 - Eric Korpela over 11,000 failures and I think they could be permission problems? but I am not sure he isn't showing that library as missing." From his stderr messages, there seems to be no way to tell whether libraries are missing or not. The same goes for permission problems. Everything is missing, or unreadable, or unexecutable. Somehow I got the idea that he keeps his files on a file server, and it has been reorganized, or it has just been moved to a file server, or moved from a file server to somewhere else, so the BOINC client cannot find any of the files. I do not remember where I got that idea, so I certainly could be wrong. But whatever the reason, he should have noticed problems after one or two attempts. It seems irresponsible to go through 11000 work units and not notice he is getting no credits. And for some reason four or five are very old (but under one year old) and still "in progress." Note that some of those bold files cannot exist, but would be executable files if they did. .zip.zip!!!! <core_client_version>7.9.3</core_client_version> <![CDATA[ <message> process exited with code 12 (0xc, -244)</message> <stderr_txt> unzip: cannot find or open /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip, /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip.zip or /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip.ZIP. unzip: cannot find or open /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip, /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip.zip or /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip.ZIP. unzip: cannot find or open hadsm4_data_8.02_i686-pc-linux-gnu.zip, hadsm4_data_8.02_i686-pc-linux-gnu.zip.zip or hadsm4_data_8.02_i686-pc-linux-gnu.zip.ZIP. 
unzip: cannot find or open hadsm4_a0q5_201310_6_933_012144403.zip, hadsm4_a0q5_201310_6_933_012144403.zip.zip or hadsm4_a0q5_201310_6_933_012144403.zip.ZIP. cpdnmonitor: cannot open input file /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.so after 11 attempts cpdnmonitor: cannot open input file /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu after 11 attempts </stderr_txt> ]]> |
Joined: 22 Feb 06 Posts: 491 Credit: 31,033,903 RAC: 14,766 |
"Somehow I got the idea that he keeps his files on a file server, and it has been reorganized, or it has just been moved to a file server, or moved from a file server to somewhere else, so the boinc client cannot find any of the files. I do not remember where I got that idea, so I certainly could be wrong." That could be something I said in an earlier post. The file structure that is giving the error is fairly typical of the centrally based file systems that institutions use so that users can access their files from any computer within a certain group. It could just be a permissions problem, but it is more likely that the BOINC client hasn't been set up to see that folder as the default location. I expect the event log would make interesting reading. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I expect the event log would make interesting reading." It sure would. Too bad the user never looks at it. Is he not even curious that he has obtained no credit for years of work? Maybe he died and no one has found out, or even turned off his machine. |
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
"I expect the event log would make interesting reading." Last contact was today, 25th July 2022, so it is not turned off. He has another computer, ID 1517679, with the same problem and over 6,100 failures. But his computer ID 1517434 is connecting, downloading and processing work; it does get errors, but he gets credits as well, since it sends back trickles. Conan |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"It could just be a permissions problem but more likely that the BOINC client hasn't been set up to see that folder as the default location. I expect the event log would make interesting reading." I had another idea: his machine(s) are running, but his file server is down. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
This is annoying. I was running 8 HadSM4 at N144 on a Ryzen 5700X, using 50% of the virtual cores to provide enough cache. This was running under WSL on Windows 10, and I had no problem, even rebooting without errors. They were running fine, and estimated to complete in 3 days 4 hours. But just short of 3 days, they all failed. https://www.cpdn.org/results.php?hostid=1533683 It looks like everyone else who errored out on them did so after a short period of time, so they probably did not have the libraries installed. So I don't know if it is a problem with the work units, or with WSL. |
Joined: 12 Apr 21 Posts: 317 Credit: 14,884,880 RAC: 19,188 |
I think the problem is with the reboots or BOINC client restarts. The more often the tasks are interrupted, the more likely they are to fail. Just because the tasks can survive a restart sometimes doesn't mean they should be "tempted". It looks like those tasks have already experienced 2 restarts. I'm guessing they errored out after a 3rd interruption. This has happened to me (the interruptions were unintentional, so it was very annoying). I also use WSL2 under Windows 10 and find it to be very stable for BOINC. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"It looks like those tasks have already experienced 2 restarts. I'm guessing they errored out after a 3rd interruption. This has happened to me (interruptions were unintentional so it was very annoying)." I suspect WSL, but it had not been rebooted all day. So it was something else. (You have to reboot a lot in quick succession to get it to fail that way. I think it usually happens when notebooks come out of hibernation frequently.) I have seen strange errors with WSL before, but don't know if they are limited to CPDN or affect all projects. I will try something else. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Computer ID 1517679 is another machine of his with 6219 failures. |
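
The stderr output quoted earlier in the thread cannot distinguish a missing file from a permissions problem, because unzip and cpdnmonitor only report that they cannot open the files. A minimal sketch, assuming you have shell access to the affected host, of how one might tell the cases apart is shown here. The function name `diagnose_path` and its return strings are made up for illustration; the only path used is the one that appears in the quoted log.

```python
import os


def diagnose_path(path, need_exec=False):
    """Return a short, human-readable reason why `path` may be unusable.

    Hypothetical helper for illustration only. Note that a directory the
    current user is not allowed to traverse is also reported as missing,
    so run this as the same user the BOINC client runs as.
    """
    parent = os.path.dirname(path) or "."
    if not os.path.isdir(parent):
        return "parent directory missing"
    if not os.path.exists(path):
        return "file missing"
    if not os.access(path, os.R_OK):
        return "not readable"
    if need_exec and not os.access(path, os.X_OK):
        return "not executable"
    return "ok"


if __name__ == "__main__":
    # Path copied from the stderr output quoted in the thread.
    log_path = (
        "/mydisks/a/boinc_lib/projects/climateprediction.net/"
        "hadsm4_se_8.02_i686-pc-linux-gnu.zip"
    )
    print(log_path, "->", diagnose_path(log_path))
```

If the result is "parent directory missing" for every file, that would be consistent with the file-server theories raised above (a moved or unavailable mount) rather than with a missing library, though it would not prove either.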
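
For the other failure mode discussed in this thread, the missing libnsl.so.1 and libstdc++.so.6, a small sketch of how one might pull the library names out of a task's stderr text follows. It assumes the stderr contains the usual glibc dynamic-loader message ("error while loading shared libraries: NAME: cannot open shared object file"); the thread does not quote those messages, so the sample text and the program name `model` in it are invented for the example.

```python
import re

# Standard wording of the glibc loader error; the library name sits
# between the two fixed phrases.
LOADER_ERROR = re.compile(
    r"error while loading shared libraries: (\S+): cannot open shared object file"
)


def missing_libraries(stderr_text):
    """Return the sorted, de-duplicated library names the loader could not open."""
    return sorted(set(LOADER_ERROR.findall(stderr_text)))


if __name__ == "__main__":
    # Invented sample; "model" is a placeholder program name.
    sample = (
        "model: error while loading shared libraries: libnsl.so.1: "
        "cannot open shared object file: No such file or directory\n"
        "model: error while loading shared libraries: libstdc++.so.6: "
        "cannot open shared object file: No such file or directory\n"
    )
    for name in missing_libraries(sample):
        print("missing:", name)
```

An empty result, as for the unzip and cpdnmonitor messages quoted above, only means the loader never got far enough to complain; it does not show that the libraries are installed.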
©2024 cpdn.org