I keep getting Client Errors
Author | Message |
---|---|
Joined: 5 Aug 04 Posts: 1120 Credit: 17,193,804 RAC: 2,852 |
I am running BOINC 4.19 and Climate Prediction hadsm3um_4.13_i686-pc-linux-gnu. This seems to return Client Error almost always, though it grants me credit. Is this normal, or is something the matter? Here is my recent result list:

866211 564132 22 May 2005 10:55:07 UTC --- In Progress Unknown New 40560.00 94.52
761356 506335 16 Apr 2005 21:48:34 UTC 22 May 2005 17:19:55 UTC Over Client error Computing 1975228.26 4536.84
720723 480141 8 Apr 2005 1:43:48 UTC --- In Progress Unknown New 3008721.00 6805.26
708403 461438 5 Apr 2005 8:31:58 UTC 15 May 2005 13:56:36 UTC Over Success Done 2983085.82 6805.26
700305 471527 3 Apr 2005 12:21:39 UTC 8 Apr 2005 1:43:48 UTC Over Client error Computing 370877.60 850.66
679377 451406 30 Mar 2005 10:22:57 UTC 5 Apr 2005 8:31:57 UTC Over Client error Computing 489617.30 1134.21
677569 433534 29 Mar 2005 13:17:15 UTC 3 Apr 2005 12:21:39 UTC Over Client error Computing 413430.60 945.18
662762 449135 24 Mar 2005 6:16:10 UTC 29 Mar 2005 13:17:08 UTC Over Client error Computing 393875.16 945.18
652342 438851 22 Mar 2005 5:12:24 UTC 29 Mar 2005 13:17:08 UTC Over Client error Done 411095.97 945.18
639173 426016 19 Mar 2005 16:55:05 UTC 24 Mar 2005 6:16:10 UTC Over Client error Computing 379601.33 1134.21
637329 424926 18 Mar 2005 4:14:31 UTC 22 Mar 2005 5:12:24 UTC Over Client error Computing 341817.22 1134.21
621259 413681 12 Mar 2005 22:43:54 UTC 18 Mar 2005 4:14:31 UTC Over Client error Computing 426647.81 945.18
620913 412400 12 Mar 2005 20:06:32 UTC 18 Mar 2005 19:28:20 UTC Over Client error Computing 485309.29 1039.69
603440 403766 7 Mar 2005 21:49:37 UTC 12 Mar 2005 22:43:52 UTC Over Client error Computing 400145.98 945.18
599686 399710 6 Mar 2005 13:32:14 UTC 12 Mar 2005 20:06:32 UTC Over Client error Computing 486399.02 1039.69
593043 396887 4 Mar 2005 12:12:34 UTC 7 Mar 2005 21:49:36 UTC Over Client error Computing 260941.51 945.18 |
Joined: 7 Aug 04 Posts: 2183 Credit: 64,822,615 RAC: 5,275 |
Hmmm. Looks like you were doing quite well until the unstable versions in March were downloaded. All versions that crashed from March into early April were likely due to unstable hadsm versions with Linux. However, in those since, you are sometimes getting a 251 error, which may be indicative of a hardware problem. Is cooling perhaps a problem? Dust inside the computer? The main problem I've had with 4.13 in Linux is lost trickles, but that appears to be a database problem, not necessarily a Linux problem. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,193,804 RAC: 2,852 |
> Hmmm. Looks like you were doing quite well until the unstable versions in
> March were downloaded. All versions that crashed from March into early April
> were likely due to unstable hadsm versions with Linux. However, in those
> since, you are sometimes getting a 251 error, which may be indicative of a
> hardware problem. Is cooling perhaps a problem? Dust inside the computer?

My computer has 13 cooling fans. There are a 120mmx25mm, an 80mmx25mm, and a 40mmx25mm intake fan in the bottom of the tower. Each 3.06 Xeon processor has an Intel-supplied 60mmx38mm temperature-controlled, variable-speed cooling fan. There is also a 90mmx25mm exhaust fan on the bottom level of the full tower. At the top level of the tower are an 80mmx25mm exhaust fan, two 80mm fans (in series) in the power supply, two 80mmx25mm intake fans that cool the hard drives, and two 80mmx25mm exhaust fans that cool the hard drives.

According to the sensors software package, the system temperature runs around 40C and the processors run around 53C. Intel says you should not run them over 70C, so I don't. They have never been anywhere near that hot. The Intel-supplied fans are running around 2500rpm at the moment, but in the hottest part of last summer they got up around 5000rpm. The intake fans on the bottom get their air from a plenum that has an air filter (cleaned monthly), and inspection of the inside of the box at intervals reveals only a bit of very fine dust that escapes the filter. So I do not think there is a cooling problem.

Instead, when I look at the stderr listing in the results list, I see a lot of these:

4.19 process exited with code 251 (0xfb) 1 0
No heartbeat from core client for 31 sec - exiting
No heartbeat from core client for 31 sec - exiting

So whatever causes heartbeats to be lost for 31 seconds seems to be the problem. I have no idea what heartbeats are, so I cannot tell what it means.

If it is just that a higher-priority task hogs one of the processors for that long, that would surprise me, because I very seldom load the processors heavily (not every day, for sure: perhaps once a week). These are two hyperthreaded Xeon processors, after all, running Red Hat Linux 3 ES with the kernel-smp-2.4.21-27.0.4.EL and kernel-smp-2.4.21-32.EL kernels: the first one for a while, and the second one starting 3 days ago.

> The main problem I've had with 4.13 in Linux is lost trickles, but that
> appears to be a database problem, not necessarily a Linux problem. |
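To see how often a result ended this way, one could count the heartbeat lines in a saved copy of the stderr listing. The following is only a minimal sketch: the function name `count_heartbeat_exits` and the file it reads are hypothetical, and the only thing taken from the thread is the message text itself, which suggests that the science application exits once it has not heard from the core client for about half a minute.

```shell
#!/bin/bash
# Hypothetical helper: count "No heartbeat" messages in a saved stderr
# listing. The file name passed as $1 is an assumption for illustration.
count_heartbeat_exits() {
    # grep -c prints the number of matching lines. It returns a non-zero
    # status when nothing matches, so "|| true" keeps the status clean.
    grep -c 'No heartbeat from core client' "$1" || true
}
```

Called as `count_heartbeat_exits stderr.txt` (with `stderr.txt` being whatever file the listing was pasted into), it prints a single number, which makes it easy to compare one result against another or one machine against another.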
Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
Do you start BOINC with nohup or just from a terminal session that might send some signal to the process on closing the TTY? AFAIK, BOINC itself doesn't detach from the TTY. |
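For anyone who does start the client by hand from a terminal, a rough sketch of a nohup-style launch follows. The helper name `start_detached` is made up for illustration, and the command and log file passed to it are placeholders, not the real BOINC paths on any particular machine.

```shell
#!/bin/bash
# Hypothetical helper: run a command so that it keeps going after the
# terminal that started it is closed.
#   $1      file that collects the command's error output (placeholder)
#   $2...   the command to run and its arguments
start_detached() {
    local logfile=$1
    shift
    # nohup makes the command ignore the hangup signal sent when the TTY
    # closes; stdin and stdout are pointed away from the terminal as well.
    nohup "$@" </dev/null >/dev/null 2>>"$logfile" &
    # Print the process ID so the caller can check on it later.
    echo $!
}
```

For example, `start_detached error.log ./boinc_client` would start a client binary named `boinc_client` (an assumed name) and print its process ID; `kill -0` with that ID then tells you whether it is still running.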
Joined: 5 Aug 04 Posts: 1120 Credit: 17,193,804 RAC: 2,852 |
> Do you start BOINC with nohup or just from a terminal session that might send
> some signal to the process on closing the TTY? AFAIK, BOINC itself doesn't
> detach from the TTY.

Neither. I use the following script to start BOINC at boot time. BTW, BOINC is also running setiathome and protein-folding, and they do not get this error. This script is in /etc/rc.d/init.d, with suitable links in /etc/rc5.d (and so on). I have left out the rest of the script. Note that it also sources /etc/sysconfig/boinc, which follows.

$ cat boinc
#!/bin/bash
#
# Red Hat Linux start/stop script to run the BOINC client in background
# at system startup, as the boinc user (not root).
#
# chkconfig: 345 71 29
# description: start boinc client at boot time
# processname: boinc
# config: /etc/sysconfig/boinc
#
# Eric Myers - 27 July 2004
# Department of Physics and Astronomy, Vassar College, Poughkeepsie NY
# @(#) $Revision: 1.5 $ -- $Date: 2004/07/27 14:43:24 $

declare -i CLIENT_PID

PATH=/sbin:/bin:/usr/sbin:/usr/bin
export PATH

# Source function library.
. /etc/rc.d/init.d/functions

# Defaults, which can be overridden by /etc/sysconfig/boinc
BOINCUSER=boinc
BOINCDIR=/home/boinc
BUILD_ARCH=i686-pc-linux-gnu
LOGFILE=boinc.log
ERRORLOG=error.log

if [ -f /etc/sysconfig/boinc ]; then
    . /etc/sysconfig/boinc
fi

## Locate the working directory
if [ ! -d $BOINCDIR ]; then
    echo "Cannot find boinc directory $BOINCDIR "
    exit 1
fi

## Locate the executable with highest version
BOINCEXE=`/bin/ls -1 $BOINCDIR/boinc_*_$BUILD_ARCH 2>/dev/null | tail -n 1 `
if [ ! -x "$BOINCEXE" ]; then
    echo "Cannot find/run boinc executable $BOINCEXE "
    exit 2
fi

## Functions: start/stop/status/restart
case "$1" in
  start)
    cd $BOINCDIR
    if [ ! -f client_state.xml ] ; then
        echo -n "BOINC client requires initialization first."
        echo_failure
        echo
        exit 3
    fi
    echo -n "Starting BOINC client: "
    # su - boinc -c "$BOINCEXE >>$BOINCDIR/$LOGFILE 2>>$BOINCDIR/$ERRORLOG &"
    su - boinc -c "$BOINCEXE >/dev/null 2>>$BOINCDIR/$ERRORLOG &"
    echo
    ;;

$ cat boinc
# Configuration for boinc client.
BOINCUSER=boinc
BOINCDIR=/boinc
BUILD_ARCH=i686-pc-linux-gnu
LOGFILE=boinc.log
ERRORLOG=error.log |
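One detail of the start script may be worth a second look: it picks the executable with `ls -1 ... | tail -n 1`, and `ls` sorts names character by character, so a file such as boinc_4.9_... would sort after boinc_4.19_... and be chosen as the "highest" version. The sketch below shows one possible alternative, assuming GNU `sort` with the `-V` (version sort) option is available, which may not be true of older systems. The function name `newest_client` is hypothetical, and the sketch is unrelated to the heartbeat errors discussed in this thread.

```shell
#!/bin/bash
# Hypothetical helper: print the client binary with the highest version.
#   $1  directory holding the boinc_<version>_<arch> files
#   $2  architecture suffix, e.g. i686-pc-linux-gnu
newest_client() {
    local dir=$1 arch=$2
    # sort -V compares the embedded version numbers numerically, so 4.19
    # ranks above 4.9; tail keeps only the last (highest) entry.
    # ASSUMPTION: GNU sort with -V support is installed.
    ls -1 "$dir"/boinc_*_"$arch" 2>/dev/null | sort -V | tail -n 1
}
```

This only matters when two versions with a different number of digits sit side by side in the same directory; with a single client binary installed, both approaches give the same answer.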
Joined: 5 Aug 04 Posts: 1120 Credit: 17,193,804 RAC: 2,852 |
> Hmmm. Looks like you were doing quite well until the unstable versions in
> March were downloaded. All versions that crashed from March into early April
> were likely due to unstable hadsm versions with Linux. However, in those
> since, you are sometimes getting a 251 error, which may be indicative of a
> hardware problem. Is cooling perhaps a problem? Dust inside the computer?
>
> The main problem I've had with 4.13 in Linux is lost trickles, but that
> appears to be a database problem, not necessarily a Linux problem.

Let me add that I have another machine, running Red Hat Linux 9 with all the updates they have. It, too, gets these "no heartbeat" messages and exits with 251 errors. So it does not depend on the hardware or the OS.

This seems to go on forever. If I am wasting my time running climate-prediction, I might as well quit and let my machines do more work on setiathome and protein-folding. Is no one else experiencing this problem? Is anyone working on solving it? |
©2024 cpdn.org