Message boards : Number crunching : 2 recent wah2 crashes - all tasks
Author | Message |
---|---|
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
I've been having a great run over the past months, but I've noticed 2 recent crashes where all 10 running tasks have bombed out: 4 Jul 21:15 and 26 Jun 11:57. These are mostly wah2 models, see the errors here (Comp ID 1290283). It's one thing for a model to crash, but to take out all the others is a bit of a feat. My first thought was computer error, but now I'm not so sure. Looking through the work units, only one of these has gone on to completion on any other PC so far. (I've also found a large number of PCs with very high failure rates, which I'll report in the appropriate place.) Some of the PCs are showing similar symptoms, in that there are multiple tasks in error at the same time stamp; check 1323410 and 1211978. This seems to be a bit too regular to be a coincidence. I assume the number of errors at each time depends on the number of tasks being run. Again, scanning through the work units, there seem to be a large number of wah2 failures, with only one or two PCs having (mixed) success. Example here. Although I now don't think it's relevant, I've run chkdsk on the BOINC hard drive (no errors reported), but no memory test as yet. I can't really run a soak test until Sat morning. So the questions are: 1. Is it likely that these are PC-based errors? I suggest not. 2. If yes, I can either run all the current tasks until they complete or explode, or stop processing until after the soak test. Suggestions? 3. Should we be excluding wah2 from the models to run? All this is very reminiscent of similar issues I had some moons ago. I spent ages trying to fault-find when it turned out to be a model issue. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've only had one failure of WaH2 in ages on my machines. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Hi Les, Until these two crashes I could say the same. If I exclude the latest failures, so far this year the PC has crunched 182 tasks successfully and only 6 have had errors, which I reckon is pretty reasonable. Hence the concern. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
Hi Martin, The "Unhandled Exception Record - Read attempt being denied" error couldn't be down to some change or upgrade to anti-virus software, could it? Of course, not having used Windows this century I could be talking out of my fundament on this. The other thought that occurs to me for multiple tasks being taken out at once is power spikes/outages. Obviously you would know if your own machine had suffered an outage, but it wouldn't be clear on machines not your own. Just been looking at the two machines you link to. 1323410 is a Linux box with missing libraries according to the error message. |
Send message Joined: 16 Oct 11 Posts: 254 Credit: 15,954,577 RAC: 0 |
Hi Martin, Not sure of any relationship, but I've discovered that I cannot run my Intel I7 CPU at 100% BOINC processing, or multiple CPDN jobs fail at the same time (or close to the same time). I don't know the root cause, but I have to run my I7 box allowing only a maximum of 6 CPUs active for BOINC processing. If I run at 100% using all 8 cores, CPDN jobs inevitably fail (but other projects such as SETI, MILKYWAY, EINSTEIN, etc. are not affected when running simultaneously). Thinking perhaps this problem might be occurring on your Intel I7 machine. Just a thought -- it certainly wouldn't explain the Linux machine failure. My machine running this way is ID 1266353; if you'd like to compare specs in more detail, please let me know. Art Masson St. Charles, IL |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I've been having a great run over the past months, but I've noticed 2 recent crashes where all the 10 running tasks have bombed out; 4 Jul 21:15 & 26 Jun 11:57. These are mostly wah2 models, see the errors here (Comp ID 1290283) It's one thing for a model crash, but to take out all the others is a bit of a feat. Do you have enough working memory? I once ran short, and I think it took out three work units at once. Also, check disk space; the WAH2 work units grow in the disk space they need. At the end of the run (usually about a week for the longest ones for me), they need double the disk space they needed when they started. It might lead to trouble if you don't have enough. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
I notice this machine http://climateapps2.oerc.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1324422 has some times when a bunch of tasks fail with the same time stamp, but it doesn't have a very high success rate overall anyway. I suspect the cause is that it has 8 cores and less than 8GB of RAM once graphics have taken their share. That shouldn't be an issue on your box, however, Martin, with 2GB/core. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Thanks everyone. As Dave noticed, plenty of memory. This is a Xeon workstation with 32 GB ECC RAM, and CPDN has its own 2 TB data hard drive. Art, I noticed your thread and decided this was a separate issue. I run 10 hyperthreaded tasks (out of a possible 16), as I decided long ago that this was the most effective combination, and it has never caused an issue in the past. Sometimes I suspend BOINC when I'm doing some particularly intensive work on the PC, but again there is no link with the crash times. I'll run a soak test over the weekend, but will be surprised if it throws anything up - running CPDN does a pretty good job as a soak test anyway. It could be a software update of course, but I always suspend and exit BOINC before doing any work. I have noticed a couple of big Norton updates coming through recently that require the system to be rebooted, but cannot recall if the times were coincident. Norton notifies that a restart is required, but what I don't know is how much installation work it has done in the background before the notification. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I have noticed a couple of big Norton updates coming through recently that require the system to be rebooted, but cannot recall if the times were coincident. Norton notifies that a restart is required, but what I don't know is how much installation work it has done in the background before the notification. I think the butler did it. It may not even be related to the updates, but just the normal operation of the AV as it inspects the files. Even the exclusions do not prevent all problems. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Now this is curious. After posting above, I converted my CPDN machine from Windows 7 64-bit to Windows 10 64-bit. There were no CPDN work units running at the time, and it was a clean install; I did not upgrade from Win7 to Win10. This machine runs 3 cores (i5-3550) on CPDN, and the other core supports a GTX 970 on Folding. It is normally very stable, has a UPS on it to protect against power outages, and has plenty of memory and disk space. Also, I had disabled Windows Defender (which takes a registry hack by the way), and there was no other security software. It is a dedicated machine that I monitor remotely over the LAN, so I don't do web browsing, etc. on it. Only one HadAM3P-HadRM3P Europe v7.28 work unit downloaded initially, and it failed after almost 4 hours. Then three more downloaded over the course of several hours, and they all failed at the same time. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1402819 Whether this is due to something on my machine or to the work units themselves, I don't know at this point. But the machine is now running only 8.12 Weather At Home 2 (wah2), which are generally more stable than the others. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
But was it with the lead pipe? ;-) Well, I did several stress tests over the weekend. Everything passed with flying colours. Over the years, Prime95 has come out as the favourite on the CPDN boards, so I started with that. Hmmm, looking at the memory use, it was only using about 6.7 GB. As I have 32 GB of RAM, testing only 25% of it doesn't seem like a good idea. So I looked about, decided on Aida64, and ran its stress test for 25 hours. Not only did it seem to use all available memory, but from web chatter it seems to be better suited to modern processors. E.g. see the wiki article here and note the comment from the Asus engineer about the Intel X79 system, which just happens to be mine. OK, it might not push the maths the same as Prime95, but the PC is used for loads of other work as well, all of which work the system in ways Prime95 doesn't. Just for good measure I ran heaps of other programs like Photoshop, Excel etc while Aida64 was testing. Nothing seemed to throw it, although things did get a little slow at times - no surprises on that one! Just to be sure I also did 4 hours with OCCT and another 4 hours with Prime95. So the upshot is that I'll keep the PC plodding on, cranking out the results. I still think something odd is happening though. |
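
For anyone wanting to check their own host for the pattern discussed in this thread (several tasks erroring at the same time stamp), here is a rough Python sketch. It assumes you have already copied the failed task names and their error times out of your results page by hand; the function names, the task names, the one-minute window and the placeholder year are all made up for illustration and are not part of BOINC or CPDN.

```python
from datetime import datetime, timedelta


def cluster_failures(failures, window=timedelta(minutes=1)):
    """Group (task_name, timestamp) pairs into clusters.

    A failure joins the current cluster when it happened no more than
    `window` after the previous failure; otherwise it starts a new cluster.
    """
    ordered = sorted(failures, key=lambda item: item[1])
    clusters = []
    for name, when in ordered:
        if clusters and when - clusters[-1][-1][1] <= window:
            clusters[-1].append((name, when))
        else:
            clusters.append([(name, when)])
    return clusters


def simultaneous_failures(failures, min_tasks=2, window=timedelta(minutes=1)):
    """Return only the clusters in which at least `min_tasks` tasks failed together."""
    return [c for c in cluster_failures(failures, window) if len(c) >= min_tasks]


# Hypothetical sample data. The two crash times come from the first post;
# the year is a placeholder (the thread does not give it) and the task
# names and the lone 1 Jul failure are invented.
YEAR = 2000
sample = [
    ("task_a", datetime(YEAR, 6, 26, 11, 57)),
    ("task_b", datetime(YEAR, 6, 26, 11, 57)),
    ("task_c", datetime(YEAR, 7, 1, 3, 10)),
    ("task_d", datetime(YEAR, 7, 4, 21, 15)),
    ("task_e", datetime(YEAR, 7, 4, 21, 15)),
    ("task_f", datetime(YEAR, 7, 4, 21, 15)),
]

for cluster in simultaneous_failures(sample):
    first_time = cluster[0][1]
    print(first_time.strftime("%d %b %H:%M"), len(cluster), "tasks failed together")
```

A cluster of two or more tasks points towards something that hit the whole machine at once (power, anti-virus, memory), while isolated failures look more like individual model crashes. It cannot tell you which of those causes it was.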
©2024 cpdn.org