climateprediction.net (CPDN) home page
Thread 'WU crashed, am investigating.'

Message boards : Number crunching : WU crashed, am investigating.

adrianxw
Joined: 31 Aug 04
Posts: 145
Credit: 2,080,724
RAC: 753
Message 44399 - Posted: 14 Jun 2012, 7:39:43 UTC
Last modified: 14 Jun 2012, 7:58:36 UTC

This morning, I was investigating the crash of a WU from RNA World. It was one of their XXL long-runner units, which don't checkpoint.

My CPDN WU also crashed during the night. Coincidence?

The CPDN log shows nothing useful, but the RNA World issue started at 01:38 with a "No heartbeat" message, followed by various stops and starts for an hour.

I have also installed BOINC on my wife's laptop. This morning, there was a flock of "upload pending" issues, which went away when I poked them.

I am wondering if something my ISP has done could be to blame. In January, we had to change from my homebrew setup, which has worked fine for years, to TDC's "Trio" pack, which includes a wired and wireless router.

What could be going on here?
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 44399

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,008,987
RAC: 21,524
Message 44400 - Posted: 14 Jun 2012, 8:16:07 UTC - in response to Message 44399.  

My first suspicion would be antivirus software. It often puts an exclusive lock on a file while scanning it, and a work unit can crash if BOINC tries to write to that file while the lock is held. Have you changed your AV software as part of your recent changes? Changes to how you connect to the internet are unlikely to have caused these problems. Excluding the BOINC data directories in your AV software will prevent this cause of the problem.

It is also worth always suspending and exiting BOINC via the File > Exit menu item before shutting down or hibernating the computer, as shutting down with BOINC still running can also cause crashes.
ID: 44400

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44401 - Posted: 14 Jun 2012, 8:38:15 UTC - in response to Message 44399.  

"No heartbeat" is reported when a running application loses contact with the core client for a certain period.
The climate models are not designed for lots of stopping and starting, which can crash a model if it occurs at a critical point.
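
To make the idea of a heartbeat timeout concrete, here is a minimal Python sketch of a generic heartbeat check. It is not BOINC's actual code: the class name, the method names, and the 30-second timeout are all assumptions chosen for illustration.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat and reports when the sender has gone quiet."""

    def __init__(self, timeout_seconds=30.0):
        # The timeout value is an illustrative assumption.
        self.timeout_seconds = timeout_seconds
        self.last_beat = None

    def beat(self, now):
        """Record that a heartbeat arrived at time `now` (in seconds)."""
        self.last_beat = now

    def is_lost(self, now):
        """True if a heartbeat was seen before but none within the timeout."""
        if self.last_beat is None:
            return False  # nothing received yet, so nothing to lose
        return now - self.last_beat > self.timeout_seconds


monitor = HeartbeatMonitor(timeout_seconds=30.0)
monitor.beat(now=0.0)
print(monitor.is_lost(now=10.0))  # still within the timeout
print(monitor.is_lost(now=45.0))  # more than 30 seconds of silence
```

Passing the time in as an argument keeps the sketch easy to test. A real monitor would read a clock instead.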


ID: 44401

adrianxw
Joined: 31 Aug 04
Posts: 145
Credit: 2,080,724
RAC: 753
Message 44402 - Posted: 14 Jun 2012, 9:00:56 UTC
Last modified: 14 Jun 2012, 9:25:14 UTC

I have Avast AV software on here and on my wife's machine. It has been that way for a long time (years), and I've not seen these problems before. I can see that today's is not the first CPDN WU I've lost; I just hadn't noticed before. There was another earlier this year, i.e. after the ISP switch, and none showing from before. It has been running CPDN since 2004.

The machines normally run 24/7 crunching BOINC; they don't shut down or hibernate. I have an Apache web server on this machine, which has also been problematic since January.

I can't see how the ISP switch could be causing the problem(s) I am seeing either, but the timing of all this is hard to ignore.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 44402

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,008,987
RAC: 21,524
Message 44403 - Posted: 14 Jun 2012, 9:27:34 UTC - in response to Message 44402.  

If you haven't done so, go to Tools > Computing preferences and make sure the "Leave applications in memory while suspended" box is ticked. If the software changes on your machine mean that something is using the processor intensively for short periods of time, that could be causing the problem.
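
For anyone who prefers to check this setting outside BOINC Manager, here is a minimal Python sketch. It assumes the preference is stored as a `<leave_apps_in_memory>` element (1 for ticked, 0 for unticked) in a preferences file such as global_prefs_override.xml in the BOINC data directory. The function name and the sample XML are illustrative only and are not taken from a real installation.

```python
import xml.etree.ElementTree as ET


def leaves_apps_in_memory(prefs_xml):
    """Return True/False for the setting, or None if it is missing or unreadable."""
    root = ET.fromstring(prefs_xml)
    node = root.find(".//leave_apps_in_memory")
    if node is None or node.text is None:
        return None
    try:
        # Treat any non-zero number as "ticked".
        return float(node.text.strip()) != 0.0
    except ValueError:
        return None


# Illustrative sample, not copied from a real installation.
sample_prefs = """
<global_preferences>
    <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
"""

print(leaves_apps_in_memory(sample_prefs))
```

To use it on a real machine, read the contents of the preferences file into a string and pass that to the function.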
ID: 44403

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 44418 - Posted: 16 Jun 2012, 22:27:01 UTC - in response to Message 44402.  

Did all of your BOINC tasks fail at the same time (you can check that in the BOINC data directory's stdoutdae.txt file)? Your computer is listed as running XP and I've occasionally had that happen on my XP system when something caused the IP stack to become corrupt.
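
One way to check whether the failures cluster in time is to scan the log with a short script. The Python sketch below assumes each log line looks like `14-Jun-2012 01:38:05 [Project name] message text`. The sample lines, the keywords, and the function names are hypothetical, so adjust them to whatever your own stdoutdae.txt actually contains.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "14-Jun-2012 01:38:05 [Project name] message text"
LINE_RE = re.compile(r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")


def find_events(lines, keywords):
    """Return sorted (timestamp, project, message) tuples for matching lines."""
    lowered = [k.lower() for k in keywords]
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed layout
        stamp, project, message = match.groups()
        if any(k in message.lower() for k in lowered):
            # %b expects English month abbreviations in the default locale.
            when = datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S")
            events.append((when, project, message))
    return sorted(events)


def within_window(events, minutes=5):
    """True if there are at least two events, all within `minutes` of each other."""
    if len(events) < 2:
        return False
    return events[-1][0] - events[0][0] <= timedelta(minutes=minutes)


# Hypothetical sample lines, for illustration only.
sample_log = [
    "14-Jun-2012 01:38:05 [RNA World] example_task_a: no heartbeat, exiting",
    "14-Jun-2012 01:39:10 [climateprediction.net] example_task_b: computation error",
    "14-Jun-2012 01:40:00 [---] Suspending computation",
]

events = find_events(sample_log, ["heartbeat", "error"])
for when, project, message in events:
    print(when, project, message)
print("Clustered:", within_window(events, minutes=5))
```

For a real check, replace `sample_log` with the lines read from stdoutdae.txt in the BOINC data directory.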
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 44418