New work discussion - 2

Richard Haselgrove
Message 66134 - Posted: 23 Sep 2022, 8:02:36 UTC

There's quite a good discussion of this issue in https://github.com/BOINC/boinc/issues/2120

Dave Jackson
Volunteer moderator
Message 66135 - Posted: 23 Sep 2022, 9:25:57 UTC - in response to Message 66134.  

There's quite a good discussion of this issue in https://github.com/BOINC/boinc/issues/2120


If there is interest from Andy, I could add
vsyscall=emulate
as a kernel boot parameter and run some tasks on the testing site. However we are never going to get the majority of Linux users to do that given the problems we have with the number who don't install 32bit libraries.
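
For anyone wanting to confirm that the boot parameter actually took effect, here is a minimal Python sketch. It assumes a Linux host where the running kernel's command line is readable from /proc/cmdline; the function names are made up for illustration. If no vsyscall= option is present the kernel falls back to its compiled-in default, which this sketch does not try to determine.

```python
from pathlib import Path
from typing import Optional


def vsyscall_mode(cmdline: str) -> Optional[str]:
    """Return the value of the last vsyscall= option on a kernel command line.

    Returns None when the option is absent, in which case the kernel uses
    its built-in default (not detected here).
    """
    mode = None
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            mode = token.split("=", 1)[1]
    return mode


def running_kernel_vsyscall_mode(path: str = "/proc/cmdline") -> Optional[str]:
    """Read the running kernel's command line and extract vsyscall=, if any."""
    try:
        return vsyscall_mode(Path(path).read_text())
    except OSError:
        return None  # not Linux, or /proc is not readable
```

Calling running_kernel_vsyscall_mode() after a reboot should give "emulate" if the parameter was picked up.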

AndreyOR
Message 66136 - Posted: 23 Sep 2022, 11:53:10 UTC - in response to Message 66135.  

However we are never going to get the majority of Linux users to do that given the problems we have with the number who don't install 32bit libraries.

That's so true. At the same time, the availability of older Macs is also very low, so there'll be a lot of errors from newer Macs trying to run these tasks, just as with Linux PCs that lack the 32-bit libraries. It seems to me that it should be a relatively high priority for the project to find ways to identify and restrict incompatible computers. Richard, do you know of ways to set up the server side of BOINC to do something like this? Otherwise, like Glenn said earlier, the project might not look that appealing to scientists. Maybe it already doesn't, and that's why we're not getting consistent work. Given the extremely long deadlines and the high failure rates, I wonder how scientists feel about the amount of data they get from us and how fast they get it.

Richard Haselgrove
Message 66137 - Posted: 23 Sep 2022, 14:03:54 UTC - in response to Message 66136.  

It would be a good idea, but I don't think that the standard BOINC server package reports the right sort of operating system information back to the server. 32-bit libraries, and vsyscall emulation, are both a bit niche.

My only suggestion would be that someone could write a small probe app that asserted that both features were in place, and reported back yay or nay, either to the user (probably not much use) or to the project. If that could be sent in place of a full task download, say at the start of each batch, it would save a great deal of time and bandwidth by automatically inhibiting work from being sent to misconfigured devices. There would need to be a route for "I've installed them - please retest".
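
A rough Python sketch of what such a probe might look like, purely as an illustration and not anything the project actually ships. The names run_probe and has_32bit_loader are invented here; the loader path /lib/ld-linux.so.2 is the one the ldd listings later in this thread show for the i686 binaries. Any further checks (vsyscall, individual libraries) would be added to the same dictionary.

```python
import json
import os
from typing import Callable, Dict


def has_32bit_loader(path: str = "/lib/ld-linux.so.2") -> bool:
    """True if the 32-bit dynamic loader exists at the given path."""
    return os.path.exists(path)


def run_probe(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named check and collect a yay/nay report.

    A check that raises is counted as a failure rather than crashing
    the probe, so the report always comes back.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return {"ok": all(results.values()), "checks": results}


if __name__ == "__main__":
    # The JSON report is what would be shown to the user or sent to the project.
    print(json.dumps(run_probe({"32bit_loader": has_32bit_loader}), indent=2))
```

The "please retest" route would then amount to running the same probe again after the user has installed the missing pieces.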

Dave Jackson
Volunteer moderator
Message 66138 - Posted: 23 Sep 2022, 14:11:35 UTC - in response to Message 66137.  

It would be a good idea, but I don't think that the standard BOINC server package reports the right sort of operating system information back to the server. 32-bit libraries, and vsyscall emulation, are both a bit niche.


In the case of the missing libraries, wouldn't a batch file that finds the computers producing that error and then sets their maximum number of work units per day to -1 be easier? vsyscall would be a bit (a lot?) more difficult. I have set the boot parameter on my box. In the last batch before the Linux hadcm3s tasks were withdrawn, all the ones I ran from testing worked, as did quite a few on the main site; then, suddenly, I got a run of failures with the segmentation violation. I don't understand what is going on well enough to see that run of failures as more than a statistical anomaly.

Glenn Carver
Message 66139 - Posted: 23 Sep 2022, 16:21:53 UTC - in response to Message 66132.  

I've coded met models for 40yrs. Segv errors point to code only, nothing to do with hardware. And yes Fortran does allow pointers. But in this case it could also be an issue with the 32bit addressing going wrong. I doubt there is anything wrong with the code, it's the environment the model is running in that's the problem. This was my point earlier about making these models run on a wide range of systems.

SolarSyonyk
Message 66140 - Posted: 23 Sep 2022, 18:18:04 UTC - in response to Message 66137.  

It would be a good idea, but I don't think that the standard BOINC server package reports the right sort of operating system information back to the server. 32-bit libraries, and vsyscall emulation, are both a bit niche.


For MacOS, it should be straightforward. The version of the OS is reported.

https://www.cpdn.org/show_host_detail.php?hostid=1526717 is one of my VMs.

Anything later than that will not run the 32-bit binaries. The support is dropped. There's no "Maybe" there.
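
As a sketch of how a version-based cut-off could be applied, here is a small Python example. Two assumptions to flag: the OS version of the VM linked above is not stated in the post, so the cut-off used here is macOS 10.15 (Catalina), the release in which Apple removed 32-bit application support; and the reported version is assumed to be a plain dotted number such as "10.14.6". The function names are hypothetical.

```python
from typing import Tuple

# Assumption: 10.15 (Catalina) is the first macOS release that cannot run
# 32-bit applications. Adjust if the project uses a different cut-off.
FIRST_64BIT_ONLY: Tuple[int, int] = (10, 15)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string like '10.14.6' into (10, 14, 6)."""
    return tuple(int(part) for part in text.strip().split("."))


def can_run_32bit(os_version: str) -> bool:
    """True if the reported macOS version predates the 64-bit-only cut-off."""
    return parse_version(os_version) < FIRST_64BIT_ONLY
```

Comparing tuples rather than strings matters here, because "10.9" sorts after "10.15" as text but before it as a version.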

I've coded met models for 40yrs. Segv errors point to code only, nothing to do with hardware.


Eh, I've seen segfaults when I was abusing hardware into error conditions before. They're rare, usually the system just crashes, but if you're doing a kernel build and gcc starts throwing random segfaults, check your CPU's cooling fan... it may not be turning.

Glenn Carver
Message 66141 - Posted: 23 Sep 2022, 21:32:07 UTC - in response to Message 66140.  

A probe test might be possible at registration time, but it's creating more work. A simpler route might be blacklisting machines with repeated failures for a while. When I asked about high failure rates, the view was that CPDN is so oversubscribed with volunteers that the problem didn't justify spending time on. I personally don't agree, though I understand they are very under-resourced.

The vsyscall issue looks interesting. I will read up on that. It would be nice if there were a single page of information on BOINC with WSL; I've only seen bits on various pages and in forums. I note that WSL now supports systemd (and so systemctl).

As for segv, bus errors and the like I have seen them from hardware but they tend not to be repeatable. I'm talking about repeatable failures that I can trace in a debugger.

Les Bayliss
Volunteer moderator
Message 66142 - Posted: 24 Sep 2022, 1:25:49 UTC - in response to Message 66141.  

The biggest problem may be Science United, which just pushes people onto all projects.

AndreyOR
Message 66143 - Posted: 24 Sep 2022, 9:07:48 UTC - in response to Message 66141.  

That's what I had in mind too: if a machine returns say 5 tasks in a row as errors, blacklist it or at least restrict it to no more than 1 task a day until it can return completed tasks. Could something like that be automated with the existing BOINC server software?

Are there any current settings in BOINC that projects can use to restrict the number of tasks assigned per machine? Say, 1 per core per day or 5 per day per machine? Something like this could significantly slow down the serial crashers and not really impact regular users.
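
To make the idea concrete, here is a minimal Python sketch of the rule described above: after 5 errors in a row a host drops to 1 task a day, and it returns to the normal allowance once it completes a task. This is only an illustration of the logic, not how the BOINC server implements quotas; the class name and the normal allowance of 5 per day are assumptions.

```python
from dataclasses import dataclass


@dataclass
class HostQuota:
    """Per-host daily task allowance driven by consecutive errors (hypothetical)."""

    normal_per_day: int = 5          # assumed normal allowance
    restricted_per_day: int = 1      # allowance for a host that keeps failing
    max_consecutive_errors: int = 5  # errors in a row before restricting
    consecutive_errors: int = 0

    def record(self, success: bool) -> None:
        """Update the error streak with the outcome of one returned task."""
        self.consecutive_errors = 0 if success else self.consecutive_errors + 1

    @property
    def tasks_per_day(self) -> int:
        """Current allowance: restricted while the error streak is too long."""
        if self.consecutive_errors >= self.max_consecutive_errors:
            return self.restricted_per_day
        return self.normal_per_day
```

Because a single success resets the streak, a regular user with the occasional failed task would never notice the limit, while a serial crasher is held to one task a day.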

If the project is already oversubscribed then perhaps finding a way to automate the banning of mis-configured machines should be the focus and not making the Hadley models be able to run on a wider range of systems. I'd guess it should be easier to do also.

Science United is ridiculous. Almost 8700 machines are attached to CPDN, of which about 1200 are "active in past 30 days". The project only needs a few hundred at any given time. I wonder if projects can opt out of Science United? CPDN is a prime example of one that should: it requires extra setup steps and only needs a small number of volunteers. Perhaps it could ban Science United as a user?

WSL with systemd... just looked it up, so far it's only available as a preview in Windows 11. Hopefully it'll eventually be available in Windows 10. Maybe it'll fix the problem of not being able to run LHC native ATLAS tasks on more than one core. So far it's the only WSL BOINC problem I'm aware of that's unsolved.

Jean-David Beyer
Message 66144 - Posted: 24 Sep 2022, 12:14:42 UTC - in response to Message 66143.  

WSL with systemd... just looked it up, so far it's only available as a preview in Windows 11.


I have been running my machine with systemd since it was new in late 2020.
The OS is Red Hat Enterprise Linux release 8.6 (Ootpa)
and it seems to run CPDN pretty well. It started with Red Hat Enterprise Linux release 8.1 (Ootpa)

By far, the greatest number of Error tasks were

UK Met Office HadCM3 short v8.36
i686-pc-linux-gnu

Of these, 14 died with segmentation errors and 10 completed successfully. (The others that died were negative theta errors.)

All tasks for computer 1511241

State: All (287) · In progress (0) · Validation pending (0) · Validation inconclusive (0) · Valid (242) · Invalid (0) · Error (45)
Application: All (287) · OpenIFS 43r3 (0) · OpenIFS 43r3 ARM (0) · UK Met Office Coupled Model Full Resolution Ocean (0) · UK Met Office HadAM4 at N144 resolution (33) · UK Met Office HadAM4 at N216 resolution (177) · UK Met Office HadCM3 short (25) · UK Met Office HadSM4 at N144 resolution (52) · Weather At Home 2 (wah2) (0) · Weather At Home 2 (wah2) (region independent) (0) 


Alan K
Message 66145 - Posted: 25 Sep 2022, 21:57:23 UTC - in response to Message 66143.  

That's what I had in mind too: if a machine returns say 5 tasks in a row as errors, blacklist it or at least restrict it to no more than 1 task a day until it can return completed tasks. Could something like that be automated with the existing BOINC server software?


Would it be possible to interrogate the task data - total tasks and errors - and blacklist machines that have greater than a certain percentage of errors?
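
A percentage rule like this is simple to express. The Python sketch below is only an illustration: the 50% threshold and the minimum of 10 tasks are arbitrary assumptions, and the function names are invented. For scale, the host quoted earlier in this thread shows 45 errors out of 287 tasks, an error rate of roughly 16%.

```python
def error_rate(total: int, errors: int) -> float:
    """Fraction of a host's tasks that ended in error (0.0 if it has no tasks)."""
    if total <= 0:
        return 0.0
    return errors / total


def should_blacklist(total: int, errors: int,
                     threshold: float = 0.5, min_tasks: int = 10) -> bool:
    """Flag a host whose error rate exceeds the threshold.

    min_tasks avoids flagging a new host on the strength of one or two
    early failures. Both defaults are assumptions, not project policy.
    """
    return total >= min_tasks and error_rate(total, errors) > threshold
```

One design point: a lifetime percentage reacts slowly, so a host that was fine for years and then broke would take a long time to cross the threshold. Counting only recent tasks would respond faster.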

Dave Jackson
Volunteer moderator
Message 66146 - Posted: 27 Sep 2022, 11:26:17 UTC

A batch of 5 hadsm4 tasks is currently being run on testing. The estimated run time for my two on a Ryzen 7 is 20 hours, but that could be wildly out either way. No clues yet as to whether further tests will be needed before any main site work.

Dark Angel
Message 66147 - Posted: 28 Sep 2022, 23:50:16 UTC

I admittedly have no idea how hard this would be, but perhaps it would be beneficial to have some zero credit "test units" that call the required libraries and have to be returned successfully before any real work is sent. As they're simply checking the system environment they could be recycled, with the rate they are sent reducing quickly on errors to one a day, and the logs just showing what was checked for and if the checks passed or failed, until the user gets their stuff in order. I know I've previously run into the situation where I thought I had the required libraries installed but in reality did not have everything which resulted in a number of errors on real work.

Jean-David Beyer
Message 66150 - Posted: 29 Sep 2022, 4:29:22 UTC - in response to Message 66147.  

I know I've previously run into the situation where I thought I had the required libraries installed but in reality did not have everything which resulted in a number of errors on real work.


I had that problem once.

Work Units like

$ ldd hadam4_8.09_i686-pc-linux-gnu
linux-gate.so.1 (0xf7f6f000)
libpthread.so.0 => /lib/libpthread.so.0 (0xf7f39000)
libdl.so.2 => /lib/libdl.so.2 (0xf7f34000)
libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7da1000)
libm.so.6 => /lib/libm.so.6 (0xf7ccf000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7cb2000)
libc.so.6 => /lib/libc.so.6 (0xf7b0a000)
/lib/ld-linux.so.2 (0xf7f71000)
$ ldd hadam4_se_8.09_i686-pc-linux-gnu.so
linux-gate.so.1 (0xf7fc0000)
libnsl.so.1 => /lib/libnsl.so.1 (0xf7e71000)
libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7cde000)
libm.so.6 => /lib/libm.so.6 (0xf7c0c000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7bef000)
libc.so.6 => /lib/libc.so.6 (0xf7a47000)
/lib/ld-linux.so.2 (0xf7fc2000)

hadam4_8.52_i686-pc-linux-gnu
hadam4_se_8.52_i686-pc-linux-gnu.so

have all they need, provided libnsl.so.1 (/lib/libnsl.so.1) and libstdc++.so.6 (/lib/libstdc++.so.6) are in there.

But these need one more:

$ ldd hadcm3s_8.36_i686-pc-linux-gnu
linux-gate.so.1 (0xf7f4a000)
libpthread.so.0 => /lib/libpthread.so.0 (0xf7f14000)
libdl.so.2 => /lib/libdl.so.2 (0xf7f0f000)
libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7d7c000)
libm.so.6 => /lib/libm.so.6 (0xf7caa000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7c8d000)
libc.so.6 => /lib/libc.so.6 (0xf7ae5000)
/lib/ld-linux.so.2 (0xf7f4c000)
$ ldd hadcm3s_se_8.36_i686-pc-linux-gnu.so
linux-gate.so.1 (0xf7f14000)
libz.so.1 => /lib/libz.so.1 (0xf7e5b000)
libnsl.so.1 => /lib/libnsl.so.1 (0xf7e3f000)
libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7cac000)
libm.so.6 => /lib/libm.so.6 (0xf7bda000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7bbd000)
libc.so.6 => /lib/libc.so.6 (0xf7a15000)
/lib/ld-linux.so.2 (0xf7f16000)

libz.so.1 (/lib/libz.so.1) has to be there as well.
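
When a library is absent, ldd prints "not found" in place of the resolved path, so the check can be automated. A minimal Python sketch, with hypothetical function names, that lists the unresolved libraries from ldd output:

```python
import subprocess
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=>" in line and "not found" in line:
            # Lines look like: "libz.so.1 => not found"
            missing.append(line.split("=>", 1)[0].strip())
    return missing


def check_binary(path: str) -> List[str]:
    """Run ldd on a binary and return its unresolved libraries."""
    result = subprocess.run(["ldd", path], capture_output=True,
                            text=True, check=False)
    return missing_libraries(result.stdout)
```

For example, check_binary("hadcm3s_se_8.36_i686-pc-linux-gnu.so"), run from the directory holding that file, would return an empty list on a machine set up like the one above.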

Dark Angel
Message 66151 - Posted: 29 Sep 2022, 5:10:27 UTC - in response to Message 66150.  

Sweet, I have that last one as well.

Jean-David Beyer
Message 66152 - Posted: 29 Sep 2022, 10:33:22 UTC - in response to Message 66151.  

Sweet, I have that last one as well.


In Linux, ldd is your friend.

Dark Angel
Message 66153 - Posted: 29 Sep 2022, 11:46:04 UTC - in response to Message 66152.  

Sweet, I have that last one as well.


In Linux, ldd is your friend.


So long as you have something to run it against, yes.

Jean-David Beyer
Message 66154 - Posted: 29 Sep 2022, 14:08:14 UTC - in response to Message 66153.  

In Linux, ldd is your friend.

So long as you have something to run it against, yes.


True, but at least on my machine, I have lots. IIRC, I have not done any work since late July 2022. I do not know if the server can delete any, or if I must do it.

Mar 31  2021 hadsm4_8.02_i686-pc-linux-gnu
Jan 18  2021 hadsm4_se_8.02_i686-pc-linux-gnu.so
Jan 18  2021 hadsm4_um_8.02_i686-pc-linux-gnu

Dec 18  2021 hadam4_8.09_i686-pc-linux-gnu
May  1  2019 hadam4_se_8.09_i686-pc-linux-gnu.so
May  1  2019 hadam4_um_8.09_i686-pc-linux-gnu

Dec 18  2021 hadcm3s_8.36_i686-pc-linux-gnu
Jun 10  2019 hadcm3s_se_8.36_i686-pc-linux-gnu.so
Jun 10  2019 hadcm3s_um_8.36_i686-pc-linux-gnu

Dec 18  2021 hadam4_8.52_i686-pc-linux-gnu
May  1  2019 hadam4_se_8.52_i686-pc-linux-gnu.so
May  1  2019 hadam4_um_8.52_i686-pc-linux-gnu


Dave Jackson
Volunteer moderator
Message 66171 - Posted: 4 Oct 2022, 9:57:58 UTC

More HADCM3s tasks in testing. Now that October is here, I am checking daily for when OpenIFS starts testing, but I suspect it won't be before mid-month.