Thread 'My first four tasks failed on new machine.'

Author	Message
Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 63064 - Posted: 2 Dec 2020, 11:45:04 UTC
I just completed my first four tasks on my new machine running Red Hat Enterprise Linux 8. Each completed 5 trickles and did 5 uploads. They seem to have the right libraries:
[root@localhost climateprediction.net]# ldd hadam4_um_8.52_i686-pc-linux-gnu
	linux-gate.so.1 (0xf7ef4000)
	libdl.so.2 => /lib/libdl.so.2 (0xf7ed4000)
	libm.so.6 => /lib/libm.so.6 (0xf7e02000)
	libpthread.so.0 => /lib/libpthread.so.0 (0xf7de1000)
	libc.so.6 => /lib/libc.so.6 (0xf7c3a000)
	/lib/ld-linux.so.2 (0xf7ef6000)
[root@localhost climateprediction.net]# ldd hadam4_se_8.52_i686-pc-linux-gnu.so
	linux-gate.so.1 (0xf7f4a000)
	libnsl.so.1 => /lib/libnsl.so.1 (0xf7df5000)
	libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7c62000)
	libm.so.6 => /lib/libm.so.6 (0xf7b90000)
	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7b73000)
	libc.so.6 => /lib/libc.so.6 (0xf79cc000)
	/lib/ld-linux.so.2 (0xf7f4c000)
The tasks are 219230, 21967787, 21973109, and 21973090. They report REPLANCS IO error for the first one and zipfile not found for the last three. Is it you, or is it me? I assume it is me, but how should I fix this? I have 12 more work units and four of them are just beginning.
ID: 63064
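
The library check in the post above can also be automated. The following is only a minimal sketch, assuming the usual ldd convention of printing "=> not found" for an unresolved dependency; the function name `missing_libraries` is made up for illustration, and the sample is an excerpt of the output quoted in the post.

```python
def missing_libraries(ldd_output: str) -> list[str]:
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        line = line.strip()
        # Unresolved dependencies are printed as "libfoo.so.1 => not found".
        if "=>" in line and line.endswith("not found"):
            missing.append(line.split("=>")[0].strip())
    return missing


# Excerpt of the ldd output quoted in the post.
sample = """\
linux-gate.so.1 (0xf7f4a000)
libnsl.so.1 => /lib/libnsl.so.1 (0xf7df5000)
libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7c62000)
libc.so.6 => /lib/libc.so.6 (0xf79cc000)
/lib/ld-linux.so.2 (0xf7f4c000)
"""

print(missing_libraries(sample))
```

For the output in the post this prints an empty list, which matches the poster's reading that every 32-bit library resolves.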

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 63065 - Posted: 2 Dec 2020, 11:55:50 UTC - in response to Message 63064. "tasks are 219230, 21967787, 21973109, and 21973090." Oops: the first task is 21923056. ID: 63065

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,016,442 RAC: 21,024	Message 63066 - Posted: 2 Dec 2020, 12:09:02 UTC I just looked at the task pages. This is a new one on me, so I'm afraid I can't offer any help. It doesn't look like a missing-library problem to me, though. Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer. ID: 63066

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 63068 - Posted: 2 Dec 2020, 16:15:20 UTC - in response to Message 63064.
It looks like some glitch occurred when the three failed models tried to access the .so file to create the first zip file. Only the first monthly file failed to upload; the other monthly zips for those three models did upload. I can't know what caused it for only the first month of the 3 tasks, but we can hope it was a one-time deal. These are likely the pertinent lines from stderr on the failed models' task pages:
Unable to load library hadam4_se_8.52_i686-pc-linux-gnu.so
dlopen error: /var/lib/boinc/projects/climateprediction.net/hadam4_se_8.52_i686-pc-linux-gnu.so: cannot restore segment prot after reloc: Permission denied
ID: 63068
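
To spot the same loader failure across several task pages, one could scan each stderr dump for the message quoted above. This is a rough sketch only: the function name `find_reloc_failures` is hypothetical, and the sample text is simply the two stderr lines from the post.

```python
import re

# Loader message quoted in the post above.
RELOC_ERROR = "cannot restore segment prot after reloc"


def find_reloc_failures(stderr_text: str) -> list[str]:
    """Return the paths of shared objects whose dlopen failed with RELOC_ERROR."""
    pattern = re.compile(r"dlopen error: (\S+): " + re.escape(RELOC_ERROR))
    return pattern.findall(stderr_text)


stderr_sample = (
    "Unable to load library hadam4_se_8.52_i686-pc-linux-gnu.so\n"
    "dlopen error: /var/lib/boinc/projects/climateprediction.net/"
    "hadam4_se_8.52_i686-pc-linux-gnu.so: "
    "cannot restore segment prot after reloc: Permission denied\n"
)

print(find_reloc_failures(stderr_sample))
```

On Linux systems with SELinux enforcing, this particular message is commonly associated with a denied text relocation in a shared library, though the thread does not confirm that this was the cause here.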

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 63070 - Posted: 2 Dec 2020, 19:06:30 UTC - in response to Message 63068.
1.) Red Hat Enterprise Linux 8 (RHEL8), which I am running, uses an SELinux-enabled kernel that is very fussy about who can do what. If a program tries to violate one of the restrictions, the operation is aborted. On the other hand, so did RHEL6, which I used to use.
2.) RHEL8 uses a very different approach (systemd) to initiating daemon processes such as the BOINC client. In the past, I made a separate disk partition and mounted it as just another user (/home/boinc). But with RHEL8, it goes in /var/lib/boinc.
3.) It seems that the SELinux permissions are a little different, and BOINC may have violated one of the rules. It is a shame that it took about two weeks for the problems to show up.
In any case, looking at the system logs indicated a possible problem, so I changed a rule. So I will find out sometime.
ID: 63070
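
The post does not say which log entry or rule was involved. As a hedged illustration of what "looking at system logs" can mean on an SELinux system, the sketch below pulls the interesting fields out of one AVC denial record of the kind normally found in the audit log. The sample record is entirely hypothetical (PID, inode, contexts and the `execmod` permission are assumptions for illustration, not taken from the poster's machine), and `parse_avc` is a made-up helper name.

```python
import re

# Matches the core fields of an SELinux AVC denial record.
AVC_RE = re.compile(
    r'avc:\s+denied\s+\{ (?P<perms>[^}]+) \}.*?'
    r'comm="(?P<comm>[^"]*)".*?'
    r'scontext=(?P<scontext>\S+)\s+tcontext=(?P<tcontext>\S+)\s+tclass=(?P<tclass>\S+)'
)


def parse_avc(line: str):
    """Return a dict describing one AVC denial, or None if the line is not one."""
    match = AVC_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["perms"] = record["perms"].split()
    path = re.search(r'path="([^"]*)"', line)  # not every record carries a path
    record["path"] = path.group(1) if path else None
    return record


# Hypothetical record, for illustration only.
sample_line = (
    'type=AVC msg=audit(1606900000.000:100): avc:  denied  { execmod } for  '
    'pid=1234 comm="hadam4_um" '
    'path="/var/lib/boinc/projects/climateprediction.net/hadam4_se_8.52_i686-pc-linux-gnu.so" '
    'dev="dm-0" ino=42 scontext=system_u:system_r:boinc_t:s0 '
    'tcontext=system_u:object_r:boinc_var_lib_t:s0 tclass=file permissive=0'
)

print(parse_avc(sample_line))
```

Fields such as the denied permission and the target context are usually what one needs in order to decide which rule, boolean or file label to change.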

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,016,442 RAC: 21,024	Message 63071 - Posted: 2 Dec 2020, 19:13:50 UTC - in response to Message 63070. Do post the answer if this does turn out to be the issue, and I will put something in the BOINC fora as well. ID: 63071

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 63072 - Posted: 2 Dec 2020, 19:17:21 UTC It may have run into memory problems, if the models all tried to do the same thing at once. You may need double what you have. Look at the Properties of one that's running, and see how much memory it's using. ID: 63072

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 63073 - Posted: 2 Dec 2020, 22:27:53 UTC - in response to Message 63072.
I currently have 64 gigabytes of RAM, and the four instances running are taking 1.3 GB each. Once they get going, this goes up to 1.4 GB each.
top - 17:25:01 up 1 day, 14 min, 1 user, load average: 13.71, 13.52, 13.46
Tasks: 482 total, 14 running, 467 sleeping, 0 stopped, 1 zombie
%Cpu(s): 3.0 us, 0.5 sy, 80.6 ni, 15.4 id, 0.1 wa, 0.3 hi, 0.0 si, 0.0 st
MiB Mem : 63944.0 total, 12570.7 free, 11707.1 used, 39666.2 buff/cache
MiB Swap: 15992.0 total, 15992.0 free, 0.0 used. 51386.3 avail Mem

  PID  PPID USER  PR NI S  RES   SHR %MEM %CPU  P     TIME+ COMMAND
25133 25130 boinc 39 19 R 1.3g 19972  2.1  6.0  4 753:19.10 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
22591 22588 boinc 39 19 R 1.3g 19872  2.1  6.2  0 829:50.82 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
25164 25162 boinc 39 19 R 1.3g 19984  2.1  6.2 15 751:17.98 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
 5527  5508 boinc 39 19 R 1.3g 19932  2.1  6.2 14   1438:55 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
ID: 63073
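
The memory figures in the post can be checked with a little arithmetic. This is a sketch, assuming top's usual convention that a bare RES number is in KiB and that the `m` and `g` suffixes mean MiB and GiB; the helper name `res_to_mib` is made up for illustration.

```python
def res_to_mib(res: str) -> float:
    """Convert a top RES field such as '1.3g', '512m' or '19972' (KiB) to MiB."""
    units = {"g": 1024.0, "m": 1.0}
    suffix = res[-1].lower()
    if suffix in units:
        return float(res[:-1]) * units[suffix]
    return float(res) / 1024.0  # plain numbers are KiB


total_mib = 63944.0                          # "MiB Mem : 63944.0 total"
res_fields = ["1.3g", "1.3g", "1.3g", "1.3g"]  # RES column of the four CPDN rows

used = sum(res_to_mib(r) for r in res_fields)
print(f"{used:.0f} MiB resident, {100 * used / total_mib:.1f}% of RAM")
```

Four models at 1.3 GiB each come to a little over 5 GiB, well under a tenth of the installed memory, which supports the poster's point that RAM is not the bottleneck.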

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,016,442 RAC: 21,024	Message 63074 - Posted: 2 Dec 2020, 23:02:26 UTC "I currently have 64 gigabytes of RAM, and the four instances running are taking 1.3 GB each. Once they get going, this goes up to 1.4 GB each." How much level 3 cache does that CPU have? These tasks take a fraction over 4 MB each at peak, I think. That is another good reason for staggering tasks a bit so they don't peak at the same time, though my main reason is my upload bandwidth! ID: 63074

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 63079 - Posted: 3 Dec 2020, 3:35:37 UTC - in response to Message 63073. Your task list shows 12 running, so I got that wrong. It looks like you're right about an OS setting. Touchy version of Linux. :( ID: 63079

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 63080 - Posted: 3 Dec 2020, 4:34:09 UTC - in response to Message 63074.
"How much level 3 cache does that CPU have? These tasks take a fraction over 4 MB each at peak, I think. That is another good reason for staggering tasks a bit so they don't peak at the same time, though my main reason is my upload bandwidth!"
It has more L3 cache than my first computer, an IBM 704, had main memory. That is why I am limiting my work to only 4 CPDN processes at a time, 4 WCG processes at a time, two Rosetta processes at a time, and three Universe@home processes at a time. This is a little too much, and I propose to lower some of the others a bit after the initial starting transient dies down. This may take several weeks.
Operating System: Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.2 (Ootpa) [4.18.0-193.28.1.el8_2.x86_64|libc 2.28 (GNU libc)]
BOINC version: 7.16.6
Memory: 62.45 GB
Cache: 16896 KB <---<<<
Swap space: 15.62 GB
Total disk space: 117.21 GB
Free Disk Space: 68.38 GB
Measured floating point speed: 5.84 billion ops/sec
Measured integer speed: 22.79 billion ops/sec
Average upload rate: 107.97 KB/sec
Average download rate: 10660.89 KB/sec
Average turnaround time: 11.87 days
ID: 63080
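
The cache discussion can be put into numbers. The sketch below is a deliberate simplification: it assumes the "Cache" value in the host details is the L3 cache being asked about, and it divides that cache evenly between processes, which real hardware does not do. The function name `cache_share_kb` is made up for illustration.

```python
def cache_share_kb(total_cache_kb: int, processes: int) -> float:
    """Even split of the cache between processes (a simplification)."""
    return total_cache_kb / processes


CACHE_KB = 16896  # "Cache: 16896 KB" from the host details above
# Concurrent processes per project, as listed in the post.
workload = {"CPDN": 4, "WCG": 4, "Rosetta": 2, "Universe@home": 3}

cpdn_only = cache_share_kb(CACHE_KB, workload["CPDN"])
everything = cache_share_kb(CACHE_KB, sum(workload.values()))

print(f"CPDN alone: {cpdn_only:.0f} KB per task")
print(f"All {sum(workload.values())} processes: {everything:.0f} KB per process")
```

With only the four CPDN models running, each would get about 4.1 MB, close to the "fraction over 4 MB" peak mentioned earlier in the thread; with all thirteen processes competing, the even share drops to roughly 1.3 MB, which is one way to read the poster's remark that the current mix is "a little too much".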

ian Dell white Joined: 6 Dec 15 Posts: 3 Credit: 4,487,311 RAC: 0	Message 65092 - Posted: 6 Feb 2022, 17:20:38 UTC
Problem: Many failures of workloads with similar problems
Computer specification:
OS - Ubuntu 20.04
Memory - 32 GB
Cores - 8, hyper-threading switched off
BOINC Manager - 7.16.6
I think I have copied below what the system is trying to tell me:
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Signal 3 received, exiting...
05:56:26 (1444): called boinc_finish(193)
</stderr_txt>
]]>
I have been very careful switching the system off at night and allow VirtualBox to exit before switching the computer off. Does anyone out there have any answers? Thanks for your help.
ID: 65092
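
When comparing several failed tasks, it can help to reduce each stderr dump to a few numbers. This is a minimal sketch that only counts and extracts what is literally in the text; it makes no claim about what the exit code means. The function name `summarise_stderr` is hypothetical, and the sample is abbreviated from the post above.

```python
import re


def summarise_stderr(text: str) -> dict:
    """Count suspend requests and pull out the signal and boinc_finish code."""
    signal = re.search(r"Signal (\d+) received", text)
    finish = re.search(r"called boinc_finish\((\d+)\)", text)
    return {
        "suspend_requests": text.count("Suspend request from BOINC"),
        "quit_requested": "Quit request from BOINC" in text,
        "signal": int(signal.group(1)) if signal else None,
        "exit_code": int(finish.group(1)) if finish else None,
    }


# Abbreviated from the stderr in the post (only three of the suspend lines).
sample_stderr = """\
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Signal 3 received, exiting...
05:56:26 (1444): called boinc_finish(193)
"""

print(summarise_stderr(sample_stderr))
```

A task that shows many suspend requests followed by a quit request and a signal, as here, ended while being shut down rather than in the middle of normal computation.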

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,016,442 RAC: 21,024	Message 65094 - Posted: 6 Feb 2022, 17:51:11 UTC - in response to Message 65092. "I have been very careful switching the system off at night and allow VirtualBox to exit before switching the computer off." You allow VB to exit. I presume you have also stopped the BOINC tasks from running and exited BOINC first. Secondly, closing the computer down and restarting increases the failure rate even without using virtualisation. I don't know whether running in VB increases the risk of a shutdown causing failure. ID: 65094

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 65100 - Posted: 6 Feb 2022, 20:39:45 UTC - in response to Message 63064. Last modified: 6 Feb 2022, 21:40:29 UTC "REPLANCA error" is when some physical result of the series of calculations has returned something way outside of normal, e.g. the atmosphere has heated to a million degrees, or the oceans have all disappeared. There are value constraints for numerous parameters which will stop the program if exceeded. So it is just normal for some climate models. (I haven't had one of these for a few years now.) ID: 65100

wateroakley Joined: 6 Aug 04 Posts: 195 Credit: 28,342,480 RAC: 10,485	Message 65137 - Posted: 10 Feb 2022, 17:54:15 UTC - in response to Message 65094. Last modified: 10 Feb 2022, 17:55:45 UTC
Unless the Virtual Machine/CPDN is shut down cleanly, the loss rate is quite high. We found an automatic Windoze update would typically crash 75% of the running CPDN tasks.
Precautions to take with a Windoze10 host running an Oracle VirtualBox VM (Ubuntu) with BOINC/CPDN:
1. Pause Windows updates (Settings, Advanced options). The maximum delay in Win10 is 35 days.
1a. If you don't do this, Windoze updates will silently reboot your PC when it's least expected and WILL crash the VM with your long-running CPDN tasks.
1b. At least once a month, carefully close down CPDN/BOINC/VM and run the Windoze updates.
2. Close down CPDN cleanly.
2a. Suspend activity in BOINC Manager.
2b. Check that your tasks have been 'suspended'.
2c. 'Power off' the Virtual Machine.
2d. Close down VirtualBox.
3. Then resume the Windoze update to check for updates and let it run.
3a. When Windoze update has done its thing, pause the updates again (1).
3b. We usually do a post-update power-off and reboot.
4. Restart VirtualBox and the VM.
4a. It's debatable whether the VM and CPDN will start cleanly. You may need to power off the VM and restart. We found that, sometimes, it takes a post-update PC power-off and reboot (hence 3b).
4b. Cross your fingers. In the VM, open BOINC Manager, Activity, Restart and check the tasks are 'running'. Keep your fingers crossed, as that's the point when you'll find out if you have a full set of running tasks or three crashes.
4c. If you look in the Event Log, it may not report the activity restart, just the suspend.
Good luck. Hope this helps.
ID: 65137
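
For anyone who prefers the command line, steps 2a to 2c above could be approximated as follows. This is a sketch under stated assumptions, not the poster's method: it uses `boinccmd` inside the guest in place of BOINC Manager, and `VBoxManage controlvm ... acpipowerbutton` on the host, which requests an orderly guest shutdown rather than a hard power-off. The VM name `cpdn-ubuntu` and the function name `shutdown_plan` are made up. The script only prints the plan and runs nothing, because the guest and host commands belong on different machines.

```python
import shlex


def shutdown_plan(vm_name: str) -> list[tuple[str, list[str]]]:
    """Ordered (where, command) pairs for a clean CPDN shutdown."""
    return [
        # 2a. Suspend all computation (run inside the Ubuntu guest).
        ("guest", ["boinccmd", "--set_run_mode", "never"]),
        # 2b. List the tasks so their state can be checked by eye.
        ("guest", ["boinccmd", "--get_tasks"]),
        # 2c. Ask the VM to shut down via ACPI (run on the host).
        ("host", ["VBoxManage", "controlvm", vm_name, "acpipowerbutton"]),
    ]


# "cpdn-ubuntu" is a hypothetical VM name.
for where, command in shutdown_plan("cpdn-ubuntu"):
    print(f"[{where}] {shlex.join(command)}")
```

Step 2d (closing VirtualBox) and the Windows update steps remain manual. The important design point is the ordering: computation is suspended and verified in the guest before anything is done to the VM from the host.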

ian Dell white Joined: 6 Dec 15 Posts: 3 Credit: 4,487,311 RAC: 0	Message 65152 - Posted: 13 Feb 2022, 17:45:24 UTC - in response to Message 65094. Thanks for the clues. Ian Dell white ID: 65152

ian Dell white Joined: 6 Dec 15 Posts: 3 Credit: 4,487,311 RAC: 0	Message 65153 - Posted: 13 Feb 2022, 17:45:57 UTC - in response to Message 65137. Thanks for the clues. Regards, Ian Dell White ID: 65153