climateprediction.net (CPDN) home page
Thread 'This good or bad?'

old_user11395

Joined: 3 Sep 04
Posts: 3
Credit: 796,077
RAC: 0
Message 13738 - Posted: 22 Jun 2005, 22:53:54 UTC
Last modified: 22 Jun 2005, 22:54:54 UTC

Today I was greeted with a Windows XP closed application error message:

Faulting application hadsm3um_4.12_windows_intelx86.exe, version 0.0.0.0, faulting module unknown, version 0.0.0.0, fault address 0x00000001.

Is this due to the server outage, or am I the only one? :) I've never had a CPDN WU do this.

Yesterday hadsm began this activity:

2005-06-21 23:32:07 [climateprediction.net] Result 3q7u_200195693_0 exited with zero status but no 'finished' file
2005-06-21 23:32:07 [climateprediction.net] If this happens repeatedly you may need to reset the project.
2005-06-21 23:32:07 [climateprediction.net] Restarting result 3q7u_200195693_0 using hadsm3 version 4.12

(repeats a few times and an hour later)

2005-06-22 00:26:19 [climateprediction.net] Sending request to scheduler: http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
2005-06-22 00:26:22 [climateprediction.net] Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded

Everything was okay after that and the error messages disappeared. The application error occurred half a day later, apparently while hadsm was running:

2005-06-22 15:20:39 [climateprediction.net] Restarting result 3q7u_200195693_0 using hadsm3 version 4.12
2005-06-22 16:20:40 [climateprediction.net] Pausing result 3q7u_200195693_0 (removed from memory)

The error happened during this time, at 15:31:00 according to the Windows application event log. But it looks as if hadsm never stopped. The WU has kept crunching since then:

2005-06-22 16:27:55 [climateprediction.net] Restarting result 3q7u_200195693_0 using hadsm3 version 4.12
2005-06-22 17:27:55 [climateprediction.net] Pausing result 3q7u_200195693_0 (removed from memory)

I'm not worried about the credit situation with the server; this is all for the science. But I'm not sure whether this is a good thing or not: a faulting application that apparently kept running. Should I ditch this WU, or just wait until the trickle server is up again and see what happens?
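
For anyone who wants to check how often a result hits the "no 'finished' file" restart, here is a minimal Python sketch that counts those messages per result. The line format is assumed from the log excerpts quoted above; other BOINC versions may word or lay out their messages differently, and the function name `count_no_finished` is only illustrative.

```python
import re
from collections import Counter

# Pattern assumed from the log lines quoted in this post;
# it is not taken from any BOINC documentation.
NO_FINISHED = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"\[(?P<project>[^\]]+)\] "
    r"Result (?P<result>\S+) exited with zero status but no 'finished' file"
)

def count_no_finished(lines):
    """Return a Counter mapping result name -> number of 'no finished file' exits."""
    counts = Counter()
    for line in lines:
        match = NO_FINISHED.match(line.strip())
        if match:
            counts[match.group("result")] += 1
    return counts

# Sample lines copied from the messages above.
SAMPLE = [
    "2005-06-21 23:32:07 [climateprediction.net] Result 3q7u_200195693_0 exited with zero status but no 'finished' file",
    "2005-06-21 23:32:07 [climateprediction.net] If this happens repeatedly you may need to reset the project.",
    "2005-06-21 23:32:07 [climateprediction.net] Restarting result 3q7u_200195693_0 using hadsm3 version 4.12",
]

if __name__ == "__main__":
    for result, n in count_no_finished(SAMPLE).items():
        print(f"{result}: {n} exit(s) without a 'finished' file")
```

To use it on a real log, copy the messages out of the BOINC Manager into a text file and pass the file's lines to `count_no_finished`. A count that keeps climbing for the same result would fit the "if this happens repeatedly" warning in the log.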

Andrew Hingston
Volunteer moderator

Joined: 17 Aug 04
Posts: 753
Credit: 9,804,700
RAC: 0
Message 13756 - Posted: 23 Jun 2005, 8:10:50 UTC

Can you say which BOINC version you are running?
old_user11395

Joined: 3 Sep 04
Posts: 3
Credit: 796,077
RAC: 0
Message 13787 - Posted: 23 Jun 2005, 23:24:01 UTC - in response to Message 13756.  
Last modified: 23 Jun 2005, 23:24:27 UTC

4.19
Andrew Hingston
Volunteer moderator

Joined: 17 Aug 04
Posts: 753
Credit: 9,804,700
RAC: 0
Message 13788 - Posted: 24 Jun 2005, 0:05:45 UTC

I asked about the BOINC version because some of us have experienced problems with the new BOINC Manager apparently losing contact with the application, but this would not apply here.

The error messages you got about exiting with no finished file suggest some sort of interruption during file handling, I believe, but might not be related to the other error message. I would check the graphics (right click on the app in the BOINC work tab) and see what the CPDN application seems to be doing. If it is running normally and the globe is as you would expect, then I would assume all is well and carry on, at least until somebody or something tells you otherwise. It would seem wise to retain a backup, though, in case it is a recurrent problem. I doubt that it is related to the server problems.

I assume you closed and restarted BOINC.
old_user2467

Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 13792 - Posted: 24 Jun 2005, 7:36:04 UTC

One of my boxes no longer behaves normally. After downloading a Hadsm_4.12 model, hadsm_4.12 crashes continuously, and after this has happened a few times the whole result crashes. This box has already completed one run successfully with an earlier version of Hadsm (I believe it was 4.04).

The system setup is the same as on my other hosts (WinXP Pro SP2, Norton AV 2004, BOINC 4.19, no other AV, malware, spyware, or anything else). I will watch this closely, and if things don't return to normal, I will unfortunately have to detach this box from cp.net.

Ciao
Andrew Hingston
Volunteer moderator

Joined: 17 Aug 04
Posts: 753
Credit: 9,804,700
RAC: 0
Message 13795 - Posted: 24 Jun 2005, 7:49:04 UTC - in response to Message 13792.  

> One of my boxes ....... continuously crashes hadsm_4.12.

Can you tell us what error messages you are getting? Is it always the same ones?
old_user2467

Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 13799 - Posted: 24 Jun 2005, 8:41:21 UTC - in response to Message 13795.  


> Can you tell us what error messages you are getting? Is it always the same
> ones?

Hi Andrew,

It's rather difficult at the moment because this box is located elsewhere. On Tuesday I can have a look at the stderr.txt, but I don't know how long information about crashes is kept in this log file. Looking at its results page (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=22924), though, the crashes seem to have been reported with exit code -5.

However, I have a self-written program to monitor my working boxes and their model states. Maybe I'll find the time to add a self-diagnostic event log that is reported to my server application in such cases, but this will take some time.

Ciao
Andrew Hingston
Volunteer moderator

Joined: 17 Aug 04
Posts: 753
Credit: 9,804,700
RAC: 0
Message 13801 - Posted: 24 Jun 2005, 9:45:31 UTC - in response to Message 13799.  

> they seem to have reported the exit code -5.

For some reason I can only see a reported error for one of your WUs, this one (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=910782). As you say, though, it is the dreaded -5, which is pretty useless diagnostically. It just means the application crashed, which you already knew. :(

The number of these from users in the northern hemisphere will probably rise markedly with the increasing heat. Once you have eliminated overheating, flakey memory, etc, then you are into the possibility of software conflict or incompatibility, so the chances of your tracking it down may not be high.

I agree that the change from Hadsm 4.04 to 4.12 is probably significant, so you may have to wait for a new version.
old_user11395

Joined: 3 Sep 04
Posts: 3
Credit: 796,077
RAC: 0
Message 13852 - Posted: 25 Jun 2005, 23:12:10 UTC
Last modified: 25 Jun 2005, 23:14:10 UTC

> I assume you closed and restarted BOINC.

Yeah, restarted the entire machine. Looked at the graphics & it's fully interactive and it's been working fine since. I'm reasonably confident my hardware is still in pristine working order; I built it with the finest parts available at the time: AthlonXP 3200+, Corsair XMS Pro DDR400 1GB (matched pair) 2-2-2-5-1T, ATI Radeon 9800 AGP 128MB, SB Live! Digital Platinum, Biostar M7NCD (nForce2 chipset) - no overclocking, strictly better performing components.

Thanks for the ideas, Andrew. At least for now things are ok despite the unexplainable app crash. :)

old_user23880
Volunteer tester

Joined: 10 Oct 04
Posts: 223
Credit: 4,664
RAC: 0
Message 13855 - Posted: 26 Jun 2005, 6:30:10 UTC

Hi Travis and Smudodd

Maybe you are both suffering from the occasional incompatibility of Athlons with cpdn boinc, something to do with how the processor handles the calculations. Most Athlons handle cpdn boinc perfectly, but when there's a problem, it's more often with an Athlon than a Pentium, and -5 seems to be the typical error code indicating this.

I gave up the struggle to make boinc cpdn work on my Athlon and moved back to classic cpdn, which works beautifully. If you want to do this, you'll have to wait till the Milton Keynes server is up again after the outage, then download classic from the Open Uni course link.

Classic gives you no boinc credit, but it runs at about the same speed, gives you graphics and is equally useful to the researchers.