Thread 'Computation Errors'

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 65673 - Posted: 20 Jul 2022, 22:16:29 UTC

Climate models like lots of L3 cache.
ID: 65673
Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 65675 - Posted: 23 Jul 2022, 6:23:11 UTC

I just downloaded 4 SM4 work units to see if my missing library problem (libnsl.so.1) has been fixed and I can get some successful work done.

I found that they were all re-sends from others whose tasks had failed with a different missing library, libstdc++.so.6.

Computer 1460610 - Bartosz Toczek only started having this issue with SM4, no probs with AM4.

Computer 1531595 - Anonymous has 79 failures all SM4

Computer 1532546 - Science United has 30 failures, all SM4

And of course I also got Computer 1517479 - Eric Korpela, with over 11,000 failures. I think they could be permission problems, but I am not sure; he isn't showing that library as missing.

Conan
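
For anyone wanting to check a host for the missing-library failures described above, here is a minimal sketch. It only parses text in the format that `ldd` prints, where an unresolved dependency appears as `name => not found`. The sample text and the function name are made up for illustration; they are not output from any of the machines listed in this post.

```python
import re


def missing_libraries(ldd_output: str) -> list[str]:
    """Return the shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd prints an unresolved dependency as "<name> => not found"
        match = re.match(r"\s*(\S+)\s+=>\s+not found", line)
        if match and match.group(1) not in missing:
            missing.append(match.group(1))
    return missing


# Illustrative sample only, not taken from a real host.
SAMPLE = """\
\tlinux-gate.so.1 (0xf7f5a000)
\tlibnsl.so.1 => not found
\tlibstdc++.so.6 => not found
\tlibc.so.6 => /lib32/libc.so.6 (0xf7d00000)
"""

print(missing_libraries(SAMPLE))
```

To use it on a real machine, you would feed it the text produced by running `ldd` on the model executable in the project directory. The models are 32-bit (i686) builds, so a library that is installed only in its 64-bit form would presumably still show up as not found.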
ID: 65675
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 65676 - Posted: 23 Jul 2022, 11:03:18 UTC - in response to Message 65675.  

And of course I also got Computer 1517479 - Eric Korpela, with over 11,000 failures. I think they could be permission problems, but I am not sure; he isn't showing that library as missing.


From his stderr messages, there seems to be no way to tell whether libraries are missing or not. The same goes for permission problems.
Everything is missing, or unreadable or unexecutable.

Somehow I got the idea that he keeps his files on a file server, and it has been reorganized, or it has just been moved to a file server, or moved from a file server to somewhere else, so the boinc client cannot find any of the files. I do not remember where I got that idea, so I certainly could be wrong. But whatever the reason, he should have noticed problems after one or two attempts. It seems irresponsible to go through 11000 work units and not notice he is getting no credits. And for some reason four or five are very old (but under one year old) and still "in progress."

Note that some of those bold files cannot exist, but would be executable files if they did. .zip.zip!!!!

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 12 (0xc, -244)</message>
<stderr_txt>
unzip: cannot find or open /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip, /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip.zip or /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.zip.ZIP.
unzip: cannot find or open /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip, /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip.zip or /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu.zip.ZIP.
unzip: cannot find or open hadsm4_data_8.02_i686-pc-linux-gnu.zip, hadsm4_data_8.02_i686-pc-linux-gnu.zip.zip or hadsm4_data_8.02_i686-pc-linux-gnu.zip.ZIP.
unzip: cannot find or open hadsm4_a0q5_201310_6_933_012144403.zip, hadsm4_a0q5_201310_6_933_012144403.zip.zip or hadsm4_a0q5_201310_6_933_012144403.zip.ZIP.
cpdnmonitor: cannot open input file /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_se_8.02_i686-pc-linux-gnu.so after 11 attempts
cpdnmonitor: cannot open input file /mydisks/a/boinc_lib/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-linux-gnu after 11 attempts

</stderr_txt>
]]>
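
The stderr above cannot say whether the files are absent or merely unreadable. A rough sketch of how one might tell the two apart on the host itself follows. The function names are hypothetical, and the directory and file names in the demo are copied from the stderr text only as an example. Note the caveat in the comments: a file can look "missing" when the real problem is a parent directory the user cannot search.

```python
import os


def classify_path(path: str) -> str:
    """Classify a path as 'missing', 'unreadable' or 'ok' for the current user."""
    # os.path.exists() is also False when a parent directory cannot be
    # searched, so "missing" can still hide a permissions problem higher up.
    if not os.path.exists(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "unreadable"
    return "ok"


def check_project_files(project_dir: str, names: list[str]) -> dict[str, str]:
    """Check each named file inside project_dir and report its status."""
    if not os.path.isdir(project_dir):
        # Covers an unmounted or offline file server as well as a wrong path.
        return {name: "directory not reachable" for name in names}
    return {name: classify_path(os.path.join(project_dir, name)) for name in names}


if __name__ == "__main__":
    # Paths taken from the stderr above, purely as an example.
    report = check_project_files(
        "/mydisks/a/boinc_lib/projects/climateprediction.net",
        [
            "hadsm4_se_8.02_i686-pc-linux-gnu.zip",
            "hadsm4_um_8.02_i686-pc-linux-gnu.zip",
        ],
    )
    for name, status in report.items():
        print(f"{status:>24}  {name}")
```

It would need to be run as the same user the BOINC client runs as, since a check made under a different account says nothing about what the client can read.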
ID: 65676
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,033,903
RAC: 14,766
Message 65678 - Posted: 24 Jul 2022, 22:46:20 UTC - in response to Message 65676.  

"Somehow I got the idea that he keeps his files on a file server, and it has been reorganized, or it has just been moved to a file server, or moved from a file server to somewhere else, so the boinc client cannot find any of the files. I do not remember where I got that idea, so I certainly could be wrong."

That could be something I said in an earlier post. The file structure that is giving the error is fairly typical of the centrally based file systems that institutions use so that users can access their files from any computer within a certain group. It could just be a permissions problem, but it is more likely that the BOINC client hasn't been set up to see that folder as the default location. I expect the event log would make interesting reading.
ID: 65678
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 65679 - Posted: 25 Jul 2022, 0:28:19 UTC - in response to Message 65678.  
Last modified: 25 Jul 2022, 0:29:44 UTC

I expect the event log would make interesting reading.


It sure would. Too bad the user never looks at it. Is he not even curious that he has obtained no credit for years of work?

Maybe he died and no one has found out, or even turned off his machine.
ID: 65679
Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 65680 - Posted: 25 Jul 2022, 2:22:03 UTC - in response to Message 65679.  

I expect the event log would make interesting reading.


It sure would. Too bad the user never looks at it. Is he not even curious that he has obtained no credit for years of work?

Maybe he died and no one has found out, or even turned off his machine.


Last contact was today 25th July 2022, so not turned off.

He has another computer, ID 1517679, with the same problem and over 6,100 failures.

But he also has computer ID 1517434, which is connecting, downloading and processing work. It does get errors, but he gets credits as well, since it sends back trickles.

Conan
ID: 65680
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 65681 - Posted: 25 Jul 2022, 16:54:28 UTC - in response to Message 65678.  

It could just be a permissions problem, but it is more likely that the BOINC client hasn't been set up to see that folder as the default location. I expect the event log would make interesting reading.


I had another idea: his machine(s) are running, but his file server is down.
ID: 65681
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 65682 - Posted: 25 Jul 2022, 17:48:40 UTC

This is annoying. I was running 8 HadSM4 at N144 on a Ryzen 5700X, using 50% of the virtual cores to provide enough cache.
This was running under WSL on Windows 10, and I had no problem, even rebooting without errors. They were running fine, and estimated to complete in 3 days 4 hours.

But just short of 3 days, they all failed.
https://www.cpdn.org/results.php?hostid=1533683

It looks like everyone else who errored out on them did so after a short period of time, so they probably did not have the libraries installed.
So I don't know if it is a problem with the work units, or with WSL.
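
On the cache point in the first paragraph of this post, the arithmetic behind "50% of the virtual cores" can be sketched as below. The figures of 32 MB of L3 and 16 hardware threads are the published specification for the Ryzen 7 5700X and are an assumption here, as is the idea that the cache is shared evenly between tasks. How much cache a HadSM4 task actually needs is not stated anywhere in this thread.

```python
def l3_per_task_mb(l3_total_mb: float, hardware_threads: int, fraction_in_use: float) -> float:
    """Rough L3 share per task, assuming one task per thread in use and even sharing."""
    tasks = int(hardware_threads * fraction_in_use)
    if tasks < 1:
        raise ValueError("at least one task must be running")
    return l3_total_mb / tasks


# Assumed Ryzen 7 5700X figures: 32 MB L3, 16 threads.
print(l3_per_task_mb(32, 16, 0.5))  # 8 tasks  -> 4.0 MB each
print(l3_per_task_mb(32, 16, 1.0))  # 16 tasks -> 2.0 MB each
```

Halving the number of running tasks doubles the nominal share per task, which is the reasoning behind limiting BOINC to 50% of the threads.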
ID: 65682
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,884,880
RAC: 19,188
Message 65685 - Posted: 25 Jul 2022, 21:18:03 UTC - in response to Message 65682.  

I think the problem is with the reboots or BOINC client restarts. The more often the tasks are interrupted, the more likely they are to fail. Just because the tasks can sometimes survive a restart doesn't mean they should be "tempted". It looks like those tasks have already experienced 2 restarts. I'm guessing they errored out after a 3rd interruption. This has happened to me (the interruptions were unintentional, so it was very annoying).

I use WSL2 under Windows 10 also and find it to be very stable for BOINC.
ID: 65685
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 65686 - Posted: 25 Jul 2022, 21:55:47 UTC - in response to Message 65685.  

It looks like those tasks have already experienced 2 restarts. I'm guessing they errored out after a 3rd interruption. This has happened to me (interruptions were unintentional so it was very annoying).

I suspect WSL, but it had not been rebooted all day. So it was something else.
(You have to reboot a lot in quick succession to get it to fail that way. I think it usually happens when notebooks come out of hibernation frequently.)

I have seen strange errors with WSL before, but don't know if they are limited to CPDN or affect all projects. I will try something else.
ID: 65686
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 65708 - Posted: 30 Jul 2022, 19:13:31 UTC - in response to Message 65659.  


But I got so many from machine 1517479 that I looked up that machine, and it fails everything it attempts.


Computer ID 1517679 is another machine of his with 6219 failures.
ID: 65708