Message boards : Number crunching : New work discussion - 2
Author | Message |
---|---|
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,709,106 RAC: 9,121 |
There's quite a good discussion of this issue in https://github.com/BOINC/boinc/issues/2120 |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468 |
There's quite a good discussion of this issue in https://github.com/BOINC/boinc/issues/2120 If there is interest from Andy, I could add vsyscall=emulate as a kernel boot parameter and run some tasks on the testing site. However, we are never going to get the majority of Linux users to do that, given the problems we have with the number who don't install 32-bit libraries. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,852,553 RAC: 19,917 |
However we are never going to get the majority of Linux users to do that given the problems we have with the number who don't install 32bit libraries. That's so true. At the same time, the availability of older Macs is also very low, so there'll be a lot of errors due to newer Macs trying to run these tasks, same as Linux PCs not having 32-bit libraries. It seems to me that it should be a relatively high priority for the project to find ways to identify and restrict non-compatible computers. Richard, do you know of ways to set up the server side of BOINC to do something like this? Otherwise, like Glenn said earlier, the project might not look that appealing to scientists. Maybe it already doesn't, and that's why we're not getting consistent work. Given the extremely long deadlines and the high failure rates, I wonder how scientists feel about the amount of data they're getting from us and how fast they're getting it. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,709,106 RAC: 9,121 |
It would be a good idea, but I don't think that the standard BOINC server package reports the right sort of operating system information back to the server. 32-bit libraries, and vsyscall emulation, are both a bit niche. My only suggestion would be that someone could write a small probe app that asserted that both features were in place, and reported back yay or nay, either to the user (probably not much use) or to the project. If that could be sent in place of a full task download, say at the start of each batch, it would save a great deal of time and bandwidth by automatically inhibiting work being sent to misconfigured devices. There would need to be a route for "I've installed them - please retest". |
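A minimal sketch of the probe idea above, in Python. It assumes that finding a 32-bit dynamic loader at one of a few common paths is a fair proxy for "32-bit libraries installed", and that vsyscall emulation can be read from the kernel command line. The path list, function names and result fields are all hypothetical, and none of this is part of BOINC.

```python
import os
from typing import Dict, Optional

# Assumed locations of the 32-bit dynamic loader; real paths vary by distribution.
LOADER_CANDIDATES = (
    "/lib/ld-linux.so.2",
    "/lib32/ld-linux.so.2",
    "/usr/lib32/ld-linux.so.2",
)


def has_32bit_loader(candidates=LOADER_CANDIDATES, exists=os.path.exists) -> bool:
    """True if any candidate path for the 32-bit loader exists."""
    return any(exists(path) for path in candidates)


def vsyscall_mode(cmdline: str) -> Optional[str]:
    """Return the value of vsyscall= on a kernel command line, or None if unset."""
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            return token.split("=", 1)[1]
    return None


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line; return an empty string if it is unavailable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        return ""


def probe(cmdline: str, loader_present: bool) -> Dict[str, object]:
    """Combine both checks into the yes/no report the probe would send back."""
    mode = vsyscall_mode(cmdline)
    return {
        "lib32": loader_present,
        "vsyscall": mode if mode is not None else "kernel default",
        "ok": loader_present and mode == "emulate",
    }
```

On a real host this would be called as `probe(read_cmdline(), has_32bit_loader())`. A machine that relies on the kernel's built-in vsyscall default is reported as not ok here, so a real probe would need a better test than the command line alone.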
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468 |
It would be a good idea, but I don't think that the standard BOINC server package reports the right sort of operating system information back to the server. 32-bit libraries, and vsyscall emulation, are both a bit niche. In the case of the missing libraries, wouldn't a batch file to find the computers producing that error and then set the maximum number of work units/day to -1 be easier? vsyscall would be a bit (a lot?) more difficult. I have set the boot parameter on my box. In the last batch before Linux tasks for hadcm3s were withdrawn, all the ones I ran from testing worked, as did quite a few on the main site; then, suddenly, I got a run of failures with the segmentation violation. I don't understand what is going on well enough to see the runs as more than a statistical anomaly. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
I've coded met models for 40 years. Segv errors point to code only, nothing to do with hardware. And yes, Fortran does allow pointers. But in this case it could also be an issue with the 32-bit addressing going wrong. I doubt there is anything wrong with the code; it's the environment the model is running in that's the problem. This was my point earlier about making these models run on a wide range of systems. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
It would be a good idea, but I don't think that the standard BOINC server package reports the right sort of operating system information back to the server. 32-bit libraries, and vsyscall emulation, are both a bit niche. For macOS, it should be straightforward. The version of the OS is reported. https://www.cpdn.org/show_host_detail.php?hostid=1526717 is one of my VMs. Anything later than that will not run the 32-bit binaries. The support has been dropped. There's no "Maybe" there. I've coded met models for 40yrs. Segv errors point to code only, nothing to do with hardware. Eh, I've seen segfaults before when I was abusing hardware into error conditions. They're rare, and usually the system just crashes, but if you're doing a kernel build and gcc starts throwing random segfaults, check your CPU's cooling fan... it may not be turning. |
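The macOS rule above could be sketched as a version comparison. This is a sketch under assumptions: it takes macOS 10.15 (Catalina) as the first release without 32-bit support, and it assumes the host record carries the macOS product version (for example 10.14.6). If the server stores the Darwin kernel version instead, the threshold would have to be translated. The function names are hypothetical.

```python
from typing import Tuple

# Assumption: macOS 10.15 (Catalina) is the first release that cannot run 32-bit binaries.
FIRST_64BIT_ONLY = (10, 15)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '10.14.6' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def can_run_32bit(os_version: str) -> bool:
    """True if the reported macOS version predates the 64-bit-only releases."""
    return parse_version(os_version)[:2] < FIRST_64BIT_ONLY
```

A scheduler-side filter built on this would simply skip any Mac host for which `can_run_32bit` returns False.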
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
A probe test might be possible at registration time, but it's creating more work. A simpler route might be blacklisting machines with repeated failures for a while. When I asked about high failure rates, the view was that CPDN is so oversubscribed with volunteers that it didn't justify spending time on it. I personally don't agree, though I understand they are very under-resourced. The vsyscall issue looks interesting. I will read up on that. It would be nice if there was a single page of information on BOINC with WSL; I've only seen bits on various pages and in forums. I note that WSL now supports systemd. As for segv, bus errors and the like, I have seen them from hardware, but they tend not to be repeatable. I'm talking about repeatable failures that I can trace in a debugger. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The biggest problem may be Science United, which just pushes people onto all projects. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,852,553 RAC: 19,917 |
That's what I had in mind too: if a machine returns, say, 5 tasks in a row as errors, blacklist it or at least restrict it to no more than 1 task a day until it can return completed tasks. Could something like that be automated with the existing BOINC server software? Are there any current settings in BOINC that projects can use to restrict the number of tasks assigned per machine? Say, 1 per core per day or 5 per day per machine? Something like this could significantly slow down the serial crashers and not really impact regular users. If the project is already oversubscribed, then perhaps finding a way to automate the banning of misconfigured machines should be the focus, rather than making the Hadley models able to run on a wider range of systems. I'd guess it should be easier to do, too. Science United is ridiculous. Almost 8700 machines are attached to CPDN, of which about 1200 are "active in past 30 days". The project only needs a few hundred at any given time. I wonder if projects can opt out of Science United? CPDN is a prime example of one that should: it requires extra setup steps and only needs a small number of volunteers. Perhaps it could ban Science United as a user? WSL with systemd... just looked it up, so far it's only available as a preview in Windows 11. Hopefully it'll eventually be available in Windows 10. Maybe it'll fix the problem of not being able to run LHC native ATLAS tasks on more than one core. So far it's the only WSL BOINC problem I'm aware of that's unsolved. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
WSL with systemd... just looked it up, so far it's only available as a preview in Windows 11. I have been running my machine with systemd since it was new in late 2020. The OS is Red Hat Enterprise Linux release 8.6 (Ootpa) and it seems to run CPDN pretty well. It started with Red Hat Enterprise Linux release 8.1 (Ootpa). By far the greatest number of Error tasks were UK Met Office HadCM3 short v8.36 i686-pc-linux-gnu: 14 of those that died were segmentation errors, and 10 completed successfully. (The others that died were negative theta errors.) All tasks for computer 1511241: State: All (287) · In progress (0) · Validation pending (0) · Validation inconclusive (0) · Valid (242) · Invalid (0) · Error (45). Application: All (287) · OpenIFS 43r3 (0) · OpenIFS 43r3 ARM (0) · UK Met Office Coupled Model Full Resolution Ocean (0) · UK Met Office HadAM4 at N144 resolution (33) · UK Met Office HadAM4 at N216 resolution (177) · UK Met Office HadCM3 short (25) · UK Met Office HadSM4 at N144 resolution (52) · Weather At Home 2 (wah2) (0) · Weather At Home 2 (wah2) (region independent) (0) |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,000,748 RAC: 14,638 |
That's what I had in mind too: if a machine returns say 5 tasks in a row as errors, blacklist it or at least restrict it to no more than 1 task a day until it can return completed tasks. Could something like that be automated with the existing BOINC server software? Would it be possible to interrogate the task data - total tasks and errors - and blacklist machines that have greater than a certain percentage of errors? |
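The two throttling rules suggested in this thread (a run of consecutive errors, or an error percentage above some limit) could be combined roughly as follows. This is only a sketch of the policy, not a description of anything the BOINC server does; the class, the function and every threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostStats:
    total: int               # tasks returned by the host so far
    errors: int              # how many of those ended in error
    consecutive_errors: int  # length of the current unbroken run of errors


def daily_quota(
    stats: HostStats,
    normal_quota: int = 5,       # hypothetical: tasks per day for a healthy host
    max_consecutive: int = 5,    # "5 tasks in a row as errors"
    max_error_rate: float = 0.5, # hypothetical percentage cut-off
    min_sample: int = 10,        # do not judge the rate on very few tasks
) -> int:
    """Return how many tasks per day the host may receive."""
    if stats.consecutive_errors >= max_consecutive:
        return 1
    if stats.total >= min_sample and stats.errors / stats.total > max_error_rate:
        return 1
    return normal_quota
```

With these example thresholds, the host quoted earlier in the thread (287 tasks, 45 errors, roughly a 16% error rate) keeps its normal quota, while a machine that has just failed five tasks in a row drops to one task a day until it returns a success.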
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468 |
A batch of 5 hadsm4 tasks is currently being run on testing. Estimated run time for my two on a Ryzen 7 is 20 hours, but that could be wildly out either way. No clues yet as to whether further tests will be needed before any main site work. |
Send message Joined: 31 May 18 Posts: 53 Credit: 4,725,987 RAC: 9,174 |
I admittedly have no idea how hard this would be, but perhaps it would be beneficial to have some zero-credit "test units" that call the required libraries and have to be returned successfully before any real work is sent. As they're simply checking the system environment, they could be recycled, with the rate at which they are sent reducing quickly on errors to one a day, and the logs just showing what was checked for and whether the checks passed or failed, until the user gets their stuff in order. I know I've previously run into the situation where I thought I had the required libraries installed but in reality did not have everything, which resulted in a number of errors on real work. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I know I've previously run into the situation where I thought I had the required libraries installed but in reality did not have everything which resulted in a number of errors on real work. I had that problem once. Work units like these:
$ ldd hadam4_8.09_i686-pc-linux-gnu
    linux-gate.so.1 (0xf7f6f000)
    libpthread.so.0 => /lib/libpthread.so.0 (0xf7f39000)
    libdl.so.2 => /lib/libdl.so.2 (0xf7f34000)
    libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7da1000)
    libm.so.6 => /lib/libm.so.6 (0xf7ccf000)
    libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7cb2000)
    libc.so.6 => /lib/libc.so.6 (0xf7b0a000)
    /lib/ld-linux.so.2 (0xf7f71000)
$ ldd hadam4_se_8.09_i686-pc-linux-gnu.so
    linux-gate.so.1 (0xf7fc0000)
    libnsl.so.1 => /lib/libnsl.so.1 (0xf7e71000)
    libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7cde000)
    libm.so.6 => /lib/libm.so.6 (0xf7c0c000)
    libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7bef000)
    libc.so.6 => /lib/libc.so.6 (0xf7a47000)
    /lib/ld-linux.so.2 (0xf7fc2000)
and also
    hadam4_8.52_i686-pc-linux-gnu
    hadam4_se_8.52_i686-pc-linux-gnu.so
have all they need, provided libnsl.so.1 => /lib/libnsl.so.1 (0xf7e71000) and libstdc++.so.6 => /lib/libstdc++.so.6 are in there. But these need one more:
$ ldd hadcm3s_8.36_i686-pc-linux-gnu
    linux-gate.so.1 (0xf7f4a000)
    libpthread.so.0 => /lib/libpthread.so.0 (0xf7f14000)
    libdl.so.2 => /lib/libdl.so.2 (0xf7f0f000)
    libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7d7c000)
    libm.so.6 => /lib/libm.so.6 (0xf7caa000)
    libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7c8d000)
    libc.so.6 => /lib/libc.so.6 (0xf7ae5000)
    /lib/ld-linux.so.2 (0xf7f4c000)
$ ldd hadcm3s_se_8.36_i686-pc-linux-gnu.so
    linux-gate.so.1 (0xf7f14000)
    libz.so.1 => /lib/libz.so.1 (0xf7e5b000)
    libnsl.so.1 => /lib/libnsl.so.1 (0xf7e3f000)
    libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7cac000)
    libm.so.6 => /lib/libm.so.6 (0xf7bda000)
    libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7bbd000)
    libc.so.6 => /lib/libc.so.6 (0xf7a15000)
    /lib/ld-linux.so.2 (0xf7f16000)
libz.so.1 => /lib/libz.so.1 (0xf7e5b000) has to be there. |
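The ldd check in this post can be scripted so that missing libraries are listed directly. A small sketch in Python, relying on the fact that ldd marks an unresolved dependency with "=> not found"; the function names are hypothetical, and the binary path passed in is whatever model executable happens to be in the project directory.

```python
import subprocess
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the names of libraries that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # The library name is everything before the '=>' arrow.
            missing.append(line.split("=>", 1)[0].strip())
    return missing


def check_binary(path: str) -> List[str]:
    """Run ldd on a binary and return its unresolved dependencies."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

For example, `check_binary("hadcm3s_se_8.36_i686-pc-linux-gnu.so")` would return an empty list on a correctly configured machine, and a list containing `libz.so.1` on one where the 32-bit zlib is absent.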
Send message Joined: 31 May 18 Posts: 53 Credit: 4,725,987 RAC: 9,174 |
Sweet, I have that last one as well. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Sweet, I have that last one as well. In Linux, ldd is your friend. |
Send message Joined: 31 May 18 Posts: 53 Credit: 4,725,987 RAC: 9,174 |
Sweet, I have that last one as well. So long as you have something to run it against, yes. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
In Linux, ldd is your friend. True, but at least on my machine, I have lots. IIRC, I have not done any work since late July 2022. I do not know if the server can delete any, or if I must do it.
Mar 31 2021 hadsm4_8.02_i686-pc-linux-gnu
Jan 18 2021 hadsm4_se_8.02_i686-pc-linux-gnu.so
Jan 18 2021 hadsm4_um_8.02_i686-pc-linux-gnu
Dec 18 2021 hadam4_8.09_i686-pc-linux-gnu
May 1 2019 hadam4_se_8.09_i686-pc-linux-gnu.so
May 1 2019 hadam4_um_8.09_i686-pc-linux-gnu
Dec 18 2021 hadcm3s_8.36_i686-pc-linux-gnu
Jun 10 2019 hadcm3s_se_8.36_i686-pc-linux-gnu.so
Jun 10 2019 hadcm3s_um_8.36_i686-pc-linux-gnu
Dec 18 2021 hadam4_8.52_i686-pc-linux-gnu
May 1 2019 hadam4_se_8.52_i686-pc-linux-gnu.so
May 1 2019 hadam4_um_8.52_i686-pc-linux-gnu |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468 |
More HADCM3s tasks in testing. Now October is here, I am checking daily for when the OpenIFS tasks start testing, but I suspect it won't be before mid-month. |