I keep getting Client Errors

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,193,804
RAC: 2,852
Message 12794 - Posted: 23 May 2005, 11:00:59 UTC

I am running BOINC 4.19 and Climate Prediction hadsm3um_4.13_i686-pc-linux-gnu. This almost always returns Client Error, though it still grants me credit. Is this normal, or is something the matter? Here is my recent result list:

866211 564132 22 May 2005 10:55:07 UTC --- In Progress Unknown New 40560.00 94.52
761356 506335 16 Apr 2005 21:48:34 UTC 22 May 2005 17:19:55 UTC Over Client error Computing 1975228.26 4536.84
720723 480141 8 Apr 2005 1:43:48 UTC --- In Progress Unknown New 3008721.00 6805.26
708403 461438 5 Apr 2005 8:31:58 UTC 15 May 2005 13:56:36 UTC Over Success Done 2983085.82 6805.26
700305 471527 3 Apr 2005 12:21:39 UTC 8 Apr 2005 1:43:48 UTC Over Client error Computing 370877.60 850.66
679377 451406 30 Mar 2005 10:22:57 UTC 5 Apr 2005 8:31:57 UTC Over Client error Computing 489617.30 1134.21
677569 433534 29 Mar 2005 13:17:15 UTC 3 Apr 2005 12:21:39 UTC Over Client error Computing 413430.60 945.18
662762 449135 24 Mar 2005 6:16:10 UTC 29 Mar 2005 13:17:08 UTC Over Client error Computing 393875.16 945.18
652342 438851 22 Mar 2005 5:12:24 UTC 29 Mar 2005 13:17:08 UTC Over Client error Done 411095.97 945.18
639173 426016 19 Mar 2005 16:55:05 UTC 24 Mar 2005 6:16:10 UTC Over Client error Computing 379601.33 1134.21
637329 424926 18 Mar 2005 4:14:31 UTC 22 Mar 2005 5:12:24 UTC Over Client error Computing 341817.22 1134.21
621259 413681 12 Mar 2005 22:43:54 UTC 18 Mar 2005 4:14:31 UTC Over Client error Computing 426647.81 945.18
620913 412400 12 Mar 2005 20:06:32 UTC 18 Mar 2005 19:28:20 UTC Over Client error Computing 485309.29 1039.69
603440 403766 7 Mar 2005 21:49:37 UTC 12 Mar 2005 22:43:52 UTC Over Client error Computing 400145.98 945.18
599686 399710 6 Mar 2005 13:32:14 UTC 12 Mar 2005 20:06:32 UTC Over Client error Computing 486399.02 1039.69
593043 396887 4 Mar 2005 12:12:34 UTC 7 Mar 2005 21:49:36 UTC Over Client error Computing 260941.51 945.18
ID: 12794
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2183
Credit: 64,822,615
RAC: 5,275
Message 12799 - Posted: 23 May 2005, 13:56:31 UTC

Hmmm. Looks like you were doing quite well until the unstable versions in March were downloaded. All versions that crashed from March into early April were likely due to unstable hadsm versions with Linux. However, in those since, you are sometimes getting a 251 error, which may be indicative of a hardware problem. Is cooling perhaps a problem? Dust inside the computer?

The main problem I've had with 4.13 in Linux is lost trickles, but that appears to be a database problem, not necessarily a Linux problem.
ID: 12799
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,193,804
RAC: 2,852
Message 12803 - Posted: 23 May 2005, 16:24:39 UTC - in response to Message 12799.  
Last modified: 23 May 2005, 16:25:48 UTC

> Hmmm. Looks like you were doing quite well until the unstable versions in
> March were downloaded. All versions that crashed from March into early April
> were likely due to unstable hadsm versions with Linux. However, in those
> since, you are sometimes getting a 251 error, which may be indicative of a
> hardware problem. Is cooling perhaps a problem? Dust inside the computer?

My computer has 13 cooling fans. There are three intake fans in the bottom of the tower: a 120mmx25mm, an 80mmx25mm, and a 40mmx25mm.
Each 3.06 GHz Xeon processor has an Intel-supplied 60mmx38mm variable-speed, temperature-controlled cooling fan. There is also a 90mmx25mm exhaust fan on the bottom level of the full tower. At the top level of the tower there are an 80mmx25mm exhaust fan, two 80mm fans (in series) in the power supply, two 80mmx25mm intake fans that cool the hard drives, and two 80mmx25mm exhaust fans that cool the hard drives.

According to the sensors software package, the system temperature runs around 40C and the processors run around 53C. Intel says you should not run them over 70C, so I don't. They have never been anywhere near that hot. The Intel-supplied fans are running around 2500rpm at the moment, but in the hottest part of last summer they got up to around 5000rpm.

The intake fans on the bottom get their air from a plenum that has an air filter (cleaned monthly), and inspection of the inside of the box at intervals reveals only a bit of very fine dust that escapes the filter.

So I do not think there is a cooling problem.
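The comparison being made here (a reading against a limit) can be sketched in shell. This is only an illustration: the function name is made up, and the numbers are the 53C reading and 70C limit quoted above, typed in by hand rather than parsed from the sensors output, whose format varies from machine to machine.

```shell
# Succeeds (exit status 0) when a temperature reading is over the limit.
# Arguments: reading, limit (both in degrees C).
# awk does the comparison because readings may be fractional.
over_limit() {
    awk -v t="$1" -v max="$2" 'BEGIN { exit !((t + 0) > (max + 0)) }'
}

# Example with the figures quoted above.
if over_limit 53 70; then
    echo "too hot"
else
    echo "within limit"
fi
```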

Instead, when I look at the stderr listing in the results list, I see a lot of these:

4.19
process exited with code 251 (0xfb)

1
0

No heartbeat from core client for 31 sec - exiting
No heartbeat from core client for 31 sec - exiting



So whatever causes heartbeats to be lost for 31 seconds seems to be the problem. I have no idea what heartbeats are, so I cannot tell what this means. If it is just that a higher-priority task hogs one of the processors for that long, that would surprise me, because I very seldom load the processors heavily (certainly not every day; perhaps once a week). These are two hyperthreaded Xeon processors, after all, running Red Hat Enterprise Linux 3 ES with their

kernel-smp-2.4.21-27.0.4.EL
kernel-smp-2.4.21-32.EL

kernels. I ran the first one for a while and switched to the second one 3 days ago.
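To see how often this happens across results, one rough approach is to save each stderr listing to a file and count the heartbeat lines. A minimal sketch, assuming the listing has been saved locally; the function name is made up for illustration, and the last line only confirms that 251 and 0xfb are the same exit code.

```shell
# Count the "No heartbeat" lines in a saved stderr listing.
# Argument: path to the file the listing was saved in.
count_heartbeat_losses() {
    grep -c 'No heartbeat from core client' "$1"
}

# Show exit code 251 in hex, to match the 0xfb in the listing.
printf '0x%x\n' 251
```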

>
> The main problem I've had with 4.13 in Linux is lost trickles, but that
> appears to be a database problem, not necessarily a Linux problem.
>
ID: 12803
Ananas
Volunteer moderator

Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 12819 - Posted: 24 May 2005, 7:04:06 UTC

Do you start BOINC with nohup, or just from a terminal session that might send a signal to the process when the TTY closes? AFAIK, BOINC itself doesn't detach from the TTY.
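For reference, starting a program with nohup so that it ignores the hangup signal sent when the terminal closes might look like the sketch below. The function name is made up, and the client path in the usage comment is a placeholder, not the real executable name.

```shell
# Start a command in the background so that it survives the terminal closing.
# nohup makes the command ignore SIGHUP; the redirections keep it from
# holding on to the terminal. Prints the PID of the started process.
start_detached() {
    nohup "$@" </dev/null >/dev/null 2>&1 &
    echo "$!"
}

# Usage (placeholder path):
#   start_detached /path/to/boinc_client
```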
ID: 12819
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,193,804
RAC: 2,852
Message 12824 - Posted: 24 May 2005, 9:59:36 UTC - in response to Message 12819.  

> Do you start BOINC with nohup or just from a terminal session that might send
> some signal to the process on closing the TTY? Afaik. BOINC itself doesn't
> detach from the TTY.
>
Neither. I use the following script to start BOINC at boot time. BTW, BOINC is also running setiathome and protein-folding, and they do not get this error. The script is in /etc/rc.d/init.d, with suitable links in /etc/rc5.d (and so on). I have cut the rest of the script from this listing. Note that it also sources /etc/sysconfig/boinc, which follows.

$ cat boinc
#!/bin/bash
#
# Red Hat Linux start/stop script to run the BOINC client in background
# at system startup, as the boinc user (not root).
#
# chkconfig: 345 71 29
# description: start boinc client at boot time
# processname: boinc
# config: /etc/sysconfig/boinc
#
# Eric Myers - 27 July 2004
# Department of Physics and Astronomy, Vassar College, Poughkeepsie NY
# @(#) $Revision: 1.5 $ -- $Date: 2004/07/27 14:43:24 $

declare -i CLIENT_PID

PATH=/sbin:/bin:/usr/sbin:/usr/bin
export PATH

# Source function library.
. /etc/rc.d/init.d/functions


# Defaults, which can be overridden by /etc/sysconfig/boinc

BOINCUSER=boinc
BOINCDIR=/home/boinc
BUILD_ARCH=i686-pc-linux-gnu
LOGFILE=boinc.log
ERRORLOG=error.log

if [ -f /etc/sysconfig/boinc ]; then
    . /etc/sysconfig/boinc
fi

## Locate the working directory

if [ ! -d $BOINCDIR ]; then
    echo "Cannot find boinc directory $BOINCDIR "
    exit 1
fi


## Locate the executable with highest version

BOINCEXE=`/bin/ls -1 $BOINCDIR/boinc_*_$BUILD_ARCH 2>/dev/null | tail -n 1 `
if [ ! -x "$BOINCEXE" ]; then
    echo "Cannot find/run boinc executable $BOINCEXE "
    exit 2
fi


## Functions: start/stop/status/restart

case "$1" in
  start)
    cd $BOINCDIR
    if [ ! -f client_state.xml ] ; then
        echo -n "BOINC client requires initialization first."
        echo_failure
        echo
        exit 3
    fi
    echo -n "Starting BOINC client: "
    # su - boinc -c "$BOINCEXE >>$BOINCDIR/$LOGFILE 2>>$BOINCDIR/$ERRORLOG &"
    su - boinc -c "$BOINCEXE >/dev/null 2>>$BOINCDIR/$ERRORLOG &"
    echo
    ;;


$ cat boinc
# Configuration for boinc client.

BOINCUSER=boinc
BOINCDIR=/boinc
BUILD_ARCH=i686-pc-linux-gnu
LOGFILE=boinc.log
ERRORLOG=error.log
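A side note on the "highest version" step in the script: ls sorts names as text, so the last name in its output is not always the highest version number. A minimal sketch of a version-aware pick, assuming GNU sort with the -V option is available (it may not be on older systems); the function name and the two file names are made up for illustration.

```shell
# Pick the highest-versioned name from a list given on stdin, one per line.
# sort -V (GNU coreutils) compares embedded numbers numerically,
# so 4.19 ranks above 4.9; plain text ordering would put 4.9 last.
highest_version() {
    sort -V | tail -n 1
}

# Example with two made-up executable names.
printf '%s\n' boinc_4.9_i686-pc-linux-gnu boinc_4.19_i686-pc-linux-gnu | highest_version
```

This does not appear to be related to the heartbeat errors; it only affects which client executable gets started when several are installed.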
ID: 12824
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,193,804
RAC: 2,852
Message 13052 - Posted: 2 Jun 2005, 9:37:22 UTC - in response to Message 12799.  

> Hmmm. Looks like you were doing quite well until the unstable versions in
> March were downloaded. All versions that crashed from March into early April
> were likely due to unstable hadsm versions with Linux. However, in those
> since, you are sometimes getting a 251 error, which may be indicative of a
> hardware problem. Is cooling perhaps a problem? Dust inside the computer?
>
> The main problem I've had with 4.13 in Linux is lost trickles, but that
> appears to be a database problem, not necessarily a Linux problem.
>
Let me add that I have another machine, running Red Hat Linux 9 with all available updates. It, too, gets these "no heartbeat" messages and exits with error 251.

So it does not depend on the hardware or the OS. This seems to go on forever. If I am wasting my time running climate-prediction, I might as well quit and let my machines do more work on setiathome and protein-folding. Is no one else experiencing this problem? Is anyone working on solving it?
ID: 13052
