Message boards : Number crunching : WU crashed, am investigating.
Author | Message |
---|---|
Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753 |
This morning, I was investigating the crash of a WU from RNA World. It was one of their XXL long-runner units, which don't checkpoint. My CPDN WU also crashed during the night. Coincidence? The CPDN log shows nothing useful, but the RNA issue started at 01:38 with a "No heartbeat" message, followed by various stops and starts for an hour. I have also installed BOINC on my wife's laptop. This morning, there was a flock of "upload pending" issues, which went away when I poked them. I am wondering if something my ISP has done could be to blame. In January, we had to change from my homebrew setup, which had worked fine for years, to TDC's "Trio" pack, which includes a wired and wireless router. What could be going on here? Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
My first suspicion would be antivirus software. AV software often puts an exclusive lock on a file while scanning it, and a work unit can crash if BOINC tries to write to that file while it is locked. Have you changed your AV software as part of your recent changes? Changes to how you connect to the internet are unlikely to have caused these problems. If you exclude the BOINC data directories in your AV software, that will prevent this cause of the problem. It is also worth always suspending and exiting BOINC via the File > Exit menu items before shutting down or hibernating the computer, as shutting down without doing so can also cause crashes. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
No heartbeat is when the manager loses contact with the core client for a certain period. Lots of stopping and starting is something that the climate models are not designed for, and which can crash a model if this occurs at a critical point. Backups: Here |
Send message Joined: 31 Aug 04 Posts: 145 Credit: 2,080,724 RAC: 753 |
I have Avast AV software on here and on my wife's machine. It has been that way for a long time (years), and I've not seen these problems before. I can see that today's is not the first CPDN WU that I've lost; I just hadn't noticed before. There was another earlier this year, i.e. after the IP switch, and none showing from before. It has been running CPDN since 2004. The machines normally run 24/7 crunching BOINC; they don't shut down or hibernate. I have an Apache webserver on this machine, which has also been problematic since January. I can't see how the IP switch could be causing the problem(s) I am seeing either, but the timing of all this is hard to ignore. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524 |
If you haven't done so, go to Tools > Computing preferences and make sure the "Leave applications in memory while suspended" box is ticked. If the software changes on your machine mean that something is using the processor intensively for short periods of time, that could be causing the problem. |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Did all of your BOINC tasks fail at the same time (you can check that in the BOINC data directory's stdoutdae.txt file)? Your computer is listed as running XP and I've occasionally had that happen on my XP system when something caused the IP stack to become corrupt. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
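One way to check whether tasks from different projects failed at the same moment is to scan stdoutdae.txt for failure-like messages and group them by time. The sketch below is only a rough illustration: it assumes each log line looks like `DD-Mon-YYYY HH:MM:SS [Project] message`, the keyword list is a guess at what to search for, and the sample lines are made up for the example and are not real BOINC output.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape: "01-Mar-2012 01:38:02 [Project] message text"
LINE_RE = re.compile(
    r"^(\d{1,2})-([A-Za-z]{3})-(\d{4}) (\d{2}):(\d{2}):(\d{2}) \[(.*?)\] (.*)$"
)
MONTHS = {m: i + 1 for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split())}

# Hypothetical keywords; adjust to the wording in your own log.
KEYWORDS = ("heartbeat", "exited", "error")


def find_events(lines):
    """Return (timestamp, project, message) for lines that mention a keyword."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group(2) not in MONTHS:
            continue
        day, mon, year, hh, mm, ss, project, message = m.groups()
        if any(k in message.lower() for k in KEYWORDS):
            when = datetime(int(year), MONTHS[mon], int(day),
                            int(hh), int(mm), int(ss))
            events.append((when, project, message))
    return events


def cluster(events, window=timedelta(minutes=5)):
    """Group events whose gap to the previous event is at most `window`."""
    clusters = []
    for when, project, _ in sorted(events, key=lambda e: e[0]):
        if clusters and when - clusters[-1]["end"] <= window:
            clusters[-1]["end"] = when
            clusters[-1]["projects"].add(project)
        else:
            clusters.append({"start": when, "end": when, "projects": {project}})
    return clusters


# Made-up sample lines, for illustration only (not real BOINC output).
SAMPLE = """\
01-Mar-2012 01:38:02 [Project A] example: no heartbeat
01-Mar-2012 01:39:40 [Project B] example: task exited
01-Mar-2012 09:15:00 [Project A] example: routine message
"""

if __name__ == "__main__":
    for c in cluster(find_events(SAMPLE.splitlines())):
        print(c["start"], sorted(c["projects"]))
```

To run it against a real log, replace `SAMPLE.splitlines()` with the lines read from stdoutdae.txt in the BOINC data directory. A cluster that lists more than one project suggests a machine-wide cause rather than a fault in a single work unit.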
©2024 cpdn.org