Questions and Answers : Unix/Linux : Problems after Climate site down
Author | Message |
---|---|
Send message Joined: 21 Mar 05 Posts: 13 Credit: 1,886 RAC: 0 |
I run boinc under SuSE 9.0. I have my time evenly divided between seti@home and climateprediction.net. Everything appeared to be running OK, and still does on seti@home. But it appears that the climateprediction site was down or inaccessible to me for a couple of days. After a day or so of this, boinc was still getting and processing data from seti@home but was unable to connect to climateprediction. I stopped boinc and restarted it, telling it to stop when it was done with the current seti data. When it stopped I ran the old seti@home for about 3 days. When the climate site came back up yesterday I had the old seti@home stop when it finished the current data set. Then I restarted boinc. Now all of my results on the climate site say, "client error". I figured the software was smart enough to pick up where it left off when the site went down. This did not happen to the seti results. boinc appears to have picked up at seti right where it left off and is crunching away. Any thoughts on what is wrong on the climate side and how to fix it? This is still happening to every work unit. They all end due to client error. Looking at the error codes I see that they were all 26 or 251. core_client_version>4.19 process exited with code 251 (0xfb)10 No heartbeat from core client for 31 sec - exiting |
Send message Joined: 21 Mar 05 Posts: 13 Credit: 1,886 RAC: 0 |
I have searched the site looking for the meaning of error code 251 and cannot find it. Does anyone have any idea why this is happening to every work unit? |
Send message Joined: 17 Aug 04 Posts: 753 Credit: 9,804,700 RAC: 0 |
Error 251 seems to be a variant of error -5 as in <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2348">this thread</a>. It is difficult to give clear advice, because it could be hardware, OS, program incompatibility, etc. But it could also be a problem with the CPDN client - 4.12 has only been released recently to fix other problems. |
Send message Joined: 21 Mar 05 Posts: 13 Credit: 1,886 RAC: 0 |
Andrew, Thank you for the response. I have run seti@home classic for over 4000 work units without a problem. seti@home under boinc runs fine. But the Climateapps seem to error out on every work unit. None have gone without error. Should I just stop this until a new version of the client software comes out? If I continue to run it like this will it screw up things on the science end? When a new version of the client comes out will boinc upload it and use it automagically or do I need to do something? Thanks Steve |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Hi, Stephen, I see that you had four Trickles from the first WU. Judging from the Model number, it was possibly one of the not-so-good WU. Yesterday's failure might be from the same WU set. Not today's, though. (If the failures leave a RunID Directory in Projects Directory with 3 files, the end of the zipped yabsd.out file may have the reason for the failure, if Negative Pressure or Negative Theta.) Is your Athlon overclocked? CPDN hammers a machine, both CPU and HD, and is apt to fail, whereas Projects with short WU get through okay. Overclocked machines are especially vulnerable. Verifying with Prime95 is a good idea. Folks running two or more Projects report a lot of the problems we see on the Boards. Do you have your Preferences option (Edit: in "Your account") set to leave in memory when suspended? (It's a good idea for CPDN.) Looks like your OS could be SuSE 9.0. Should not be problems there. Which boinc version? How much memory? "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 21 Mar 05 Posts: 13 Credit: 1,886 RAC: 0 |
Boinc version is 4.19. My cpu is an AMD xp2400+ and is not overclocked. I have 1GB of PC2700 ddr memory. I will look at the files you mentioned if it fails again. I will post the data here. As far as folks with Boinc running more than one project having problems, I thought that was what Boinc was for? Thanks Steve |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
> Boinc version is 4.19. My cpu is an AMD xp2400+ and is not overclocked. I > have 1GB of PC2700 ddr memory. I will look at the files you mentioned if it > fails again. I will post the data here. > > As far as folks with Boinc running more than one project having problems, I > thought that was what Boinc was for? > > Thanks Steve Hi, Steve, To be sure. And in my case, running P4s, boinc allows parallel CPDN runs -- something we couldn't do in Classic CPDN, thanks to M$ Registry limitations. (There were no Linux or MAC versions in Classic.) You have a heavy setup and I don't see an obvious problem. From what I've read over time on these Boards, though, some AMD rigs have problems with CPDN, though they more than meet Specs required to run this beast. ... Someone with more tech savvy than me will have to wade in to help. This creature we run is a million-plus-line Fortran program developed over decades by climate scientists to run on super-computers. (In fact, the British Met. Office runs it on such machines for daily forecasts.) That it was ported and runs on PCs at all, I find quite amazing. Perhaps we shouldn't be surprised that some hardware combinations have difficulties -- while similar machines continually turn out successfully completed Models. I hope you find the culprit and are able to stay with the Project. Regards, Jim "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 21 Mar 05 Posts: 13 Credit: 1,886 RAC: 0 |
I found the file. At the end it said: ********************************************************************************* Model aborted with error code - 1 Routine and message:- P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. ********************************************************************************* Steve |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
> Boinc version is 4.19. My cpu is an AMD xp2400+ and is not overclocked. I > have 1GB of PC2700 ddr memory. I will look at the files you mentioned if it > fails again. I will post the data here. > > As far as folks with Boinc running more than one project having problems, I > thought that was what Boinc was for? > > Thanks Steve Hi, Steve, To be sure. ... [Edit. Now I see that this WAS posted last evening, so I removed the Body of text. No evidence of successful posting was given and I couldn't connect with any other part of the BB. Odd.] (Rats. The Board went down while I wrote this ... Sorry for the delay.) Wrote that last night, US Pacific Coast time. Just saw your "Negative Pressure" post. That confirms that at least that one Model was from the bad batch. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 21 Mar 05 Posts: 13 Credit: 1,886 RAC: 0 |
Jim, I looked at the latest run and it is looking good. It may be the case that I got 6 consecutive bad batches. What luck. I hope that this is the case as this was making me crazy. I could not find anything wrong on my end. You won't believe how many hours I devoted to going through my computer with a fine-tooth comb trying to find something wrong. I'll keep you posted. Thank You very much for your help. I'll let you know how this goes. Stephen Hawkins NG0G |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
> Jim, > > I looked at the latest run and it is looking good. It may be the case that I > got 6 consecutive bad batches. What luck. I hope that this is the case as > this was making me crazy. I could not find anything wrong on my end. You > won't believe how many hours I devoted to going through my computer with a > fine-tooth comb trying to find something wrong. I'll keep you posted. > > Thank You very much for your help. I'll let you know how this goes. > Stephen Hawkins NG0G Pleased to see it, Steve. Best of luck. Jim "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 21 Mar 05 Posts: 13 Credit: 1,886 RAC: 0 |
Although it went a lot farther this time, it happened again. But this time with a different error msg. See Below: Result ID: 694277 Name 1wgy_300109643_0 ********************************************************************************* Model aborted with error code - 1 Routine and message:- ATM_DYN : NEGATIVE THETA DETECTED. ********************************************************************************* |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
> Although it went a lot farther this time, it happened again. But this time > with a different error msg. See Below: > > Result ID: 694277 Name 1wgy_300109643_0 > > > ********************************************************************************* > Model aborted with error code - 1 Routine and message:- > ATM_DYN : NEGATIVE THETA > DETECTED. > > ********************************************************************************* > Hmmm. More bad news about the current Linux version. Apparently, it's unstable, too. In a message from Tolu replying to my Email about a similar problem in Alpha, he stated that it is his #1 priority. That's good news, given the many high-priority things he has to do. <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2353"> See this Thread </a> "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 21 Mar 05 Posts: 13 Credit: 1,886 RAC: 0 |
Well I guess the last two pieces of information that I need are: 1. Should I stop running this until a new, stable version is out and if so how will I know? I mean are these repeated aborts due to "client error" screwing up your data? 2. Is boinc smart enough to see that there is a new version of your software, and download and install it without intervention from me? |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
> Well I guess the last two pieces of information that I need are: > > 1. Should I stop running this until a new, stable version is out and if so how > will I know? I mean are these repeated aborts due to "client error" screwing > up your data? > > 2. Is boinc smart enough to see that there is a new version of your software, > and download and install it without intervention from me? Hi, again, Steve, We have to stop meeting like this; people will talk! Seriously, though, one of my machines crashed and downloaded 4.13 and a new Workunit about a half hour ago. I have no information on this release -- haven't seen a post here or on the Alpha BB yet. ... at least it is new and hope springs eternal. Or some such thing. Edit: Oops. Re. your #2, in the course of processing a completed run, normal or crashed, the new version will be detected and downloaded. Or, you can force the issue with -detach_project, then go through the -attach_project drill again. (You'll get a new machine ID in that process and have to do a "merge machines" drill to put the pieces together.) Jim "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 21 Mar 05 Posts: 13 Credit: 1,886 RAC: 0 |
> > Hi, again, Steve, > > We have to stop meeting like this; people will talk! Jim, I know. I just heard about this on the BBC World News on the 49 meter band, and, dare I say it, Foxnews, and CNN. What will we do when Mom hears about this???? Seriously, I am still cooking along on the new data but on 4.12. I will keep you posted and try not to let the media know. Secret password = 1.4142135 * .707 Thank You, 73 49 111 01001001 Stephen Hawkins NG0G |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
> Edit: Oops. Re. your #2, in the course of processing a completed run, normal > or crashed, the new version will be detected and downloaded. Or, you can > force the issue with -detach_project, then go through the -attach_project > drill again. (You'll get a new machine ID in that process and have to do a > "merge machines" drill to put the pieces together.) > > He should be able to do a -reset_project and then won't have to merge any hosts. At least it's worked that way for me. George |
Send message Joined: 21 Mar 05 Posts: 13 Credit: 1,886 RAC: 0 |
George, Thank You. I will give that a shot. Steve NG0G 73 49 111 01001001 |
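
A side note on the two diagnostics discussed in this thread, as a rough sketch rather than anything from the CPDN or BOINC code. On Unix an exit status is reported as a single byte, so a program that exits with -5 shows up as 251 (0xfb), which may be why 251 looks like "a variant of error -5". The second helper pulls the routine name and message out of an abort banner like the ones quoted above. The function names `as_signed_byte` and `abort_reason` are made up for illustration, and the banner layout is assumed to match the text pasted in the posts.

```python
import re
from typing import Optional


def as_signed_byte(status: int) -> int:
    """Reinterpret an 8-bit exit status (0..255) as a signed value (-128..127)."""
    status &= 0xFF  # only the low byte of an exit status is reported on Unix
    return status - 256 if status >= 128 else status


def abort_reason(log_text: str) -> Optional[str]:
    """Return 'ROUTINE: MESSAGE' from a model abort banner, or None if absent.

    Assumes the banner looks like the text quoted in this thread, e.g.
    'Routine and message:- P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED.'
    """
    # Collapse line breaks and runs of spaces so a wrapped message still matches.
    flat = " ".join(log_text.split())
    match = re.search(r"Routine and message:-\s*(\w+) : ([A-Z ]+?)\.", flat)
    if match is None:
        return None
    return f"{match.group(1)}: {match.group(2)}"


if __name__ == "__main__":
    print(as_signed_byte(251))
    print(abort_reason(
        "Model aborted with error code - 1 Routine and message:- "
        "ATM_DYN : NEGATIVE THETA\nDETECTED. ****"
    ))
```

To use the second helper on a failed run, one would unzip the model output file mentioned earlier and pass its last few lines in as a string; how that file is named inside the archive is not stated in the thread, so that step is left out.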
©2024 cpdn.org