Thread 'Curious about "Error while computing..."'

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 47797 - Posted: 17 Dec 2013, 14:32:26 UTC Sometimes I get a work unit, such as Workunit 8515857, that shows "Error while computing" status for several other users. I always allow my machine to attempt them, and it very often succeeds. It did for Workunit 8515857, for example, even though all the others before me failed in one way or another. Is this because the other users' computers are less reliable than mine? If we are running essentially the same program, with the same data, I would expect us all to fail or all to succeed. The difference I notice is that several of the failures were running various versions of Windows, though one was running Darwin, and I run Red Hat Enterprise Linux 6. Also, mine is an x86_64 machine with an Intel processor, and the others were either 32-bit or 64-bit machines. ID: 47797

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 47798 - Posted: 17 Dec 2013, 15:11:55 UTC It varies wildly. You can go to individuals and look at their computer or computers and see that nearly every model they run finishes successfully, except the ones that are sent to them with incorrectly set-up files. Then you can go to other individuals and see that 90% of their tasks fail on one or more computers. A lot has to do with how their computers are set up, how they use them, and how they have configured their BOINC preferences. We know that most of the CPDN models sent out don't like being interrupted at certain points. The more frequently they are interrupted, the more likely some failure is to occur. So, if BOINC is configured to remove the task from memory when suspended, or to suspend if CPU usage is higher than xx%, or if the computer is shut down or hibernated without cleanly exiting BOINC, then all those things increase the likelihood of task failures. Iain Inglis did an analysis of failures by processor and operating system several years ago. The results were in a thread on the old phpBB forum. After removing the computers with immediate failures of all tasks (obviously misconfigured computers), some configurations seemed more likely to succeed than others. However, even then, it had more to do with how the computer was configured and used than with whether it was an AMD or Intel machine running Linux, Windows or Darwin. ID: 47798

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192	Message 47799 - Posted: 17 Dec 2013, 17:57:34 UTC Last modified: 6 May 2014, 23:01:12 UTC Here's one version of the analysis geophi mentioned, for HADCM3N models. The first chart includes all models, of which only ~40% got to the first trickle (thicker blue line). The second chart looks at those models that submitted at least one trickle, of which just over 30% complete (again, thicker blue line). Platform-specific problems tend to come and go, so one platform might look bad for a particular batch of models and better at another time. For example, these charts show the devastating effect of the Mac permissions problem, which stops many Mac users getting to the first trickle; if they do get that far, however, they do relatively well. [Chart: Progress of HADCM3N Models] [Chart: Progress of HADCM3N Models (that submit at least one trickle)] ID: 47799
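As a rough worked example of how the two chart figures combine: if both charts cover the same batch of models, the overall completion rate is the product of the two fractions. The numbers below are the approximate values quoted in the post ("~40%" and "just over 30%", rounded here to 0.30), and the function name is made up for illustration, so treat the result as a ballpark only.

```python
# Approximate figures quoted in the post above; "just over 30%" is rounded
# down to 0.30, so the result is a ballpark estimate, not a measured value.
REACH_FIRST_TRICKLE = 0.40       # share of all models that reach the first trickle
COMPLETE_GIVEN_TRICKLE = 0.30    # share of those models that go on to complete


def overall_completion(p_trickle: float, p_complete_given_trickle: float) -> float:
    """Chain the two stages: P(complete) = P(trickle) * P(complete | trickle)."""
    return p_trickle * p_complete_given_trickle


if __name__ == "__main__":
    rate = overall_completion(REACH_FIRST_TRICKLE, COMPLETE_GIVEN_TRICKLE)
    print(f"Roughly {rate:.0%} of all models complete")
```

On these assumptions, only about one model in eight runs to completion.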

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 47802 - Posted: 17 Dec 2013, 19:09:32 UTC Nice charts. The really interesting one there is the 'Darwin' entry ... only a handful of Darwin boxes are correctly configured to be able to run CPDN, but when they do run past the first trickle, they're the most reliable. I'm a volunteer and my views are my own. ID: 47802

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 47812 - Posted: 18 Dec 2013, 16:09:25 UTC - in response to Message 47798. I see. For one thing, I see I was wise to draw no conclusions from the limited data I examined. I have the BOINC client set up to always leave tasks in memory. I have 8 GBytes of memory and 2 GBytes is surely all I really need. I could probably put 256 or 384 GBytes in the box if I were crazy enough to do that. I normally have about 40 to 50 megabytes swapped out. The BOINC processes do sometimes get suspended (they have a Linux nice value of 19), but my machine gets stopped only about once a month, when I need to reboot it to run Windows or to replace the Linux kernel. So mostly I do not experience problems like that, though I do notice them at times. Like this one: Task 15937757 Work Unit 8561112 Stderr starts out like this:
    <core_client_version>6.10.45</core_client_version>
    <![CDATA[
    <message>
    process exited with code 193 (0xc1, -63)
    </message>
    <stderr_txt>
    Suspended CPDN Monitor - Suspend request from BOINC...
    Suspended CPDN Monitor - Suspend request from BOINC...
    Suspended CPDN Monitor - Suspend request from BOINC...
and ends like this:
    Suspended CPDN Monitor - Suspend request from BOINC...
    Suspended CPDN Monitor - Suspend request from BOINC...
    CPDN Monitor - Quit request from BOINC...
    CPDN Monitor - Quit request from BOINC...
    Suspended CPDN Monitor - Suspend request from BOINC...
    Suspended CPDN Monitor - Suspend request from BOINC...
    20:05:15 (2239): No heartbeat from core client for 30 sec - exiting
    Suspended CPDN Monitor - No 'heartbeat' from BOINC...
    Suspended CPDN Monitor - Suspend request from BOINC...
    Signal 15 received, exiting...
    Called boinc_finish
    Signal 15 received, exiting...
    Called boinc_finish
    Suspended CPDN Monitor - Suspend request from BOINC...
    Suspended CPDN Monitor - Suspend request from BOINC...
    Suspended CPDN Monitor - Suspend request from BOINC...
    Signal 15 received, exiting...
    Called boinc_finish
    Suspended CPDN Monitor - Suspend request from BOINC...
    *** glibc detected *** ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu: double free or corruption (out): 0x090f2de0 ***
    ======= Backtrace: =========
    /lib/libc.so.6[0x6e3df1]
    /lib/libc.so.6[0x6e6531]
and more boring stuff not worth including here. The funny thing is that it seems to actually have completed successfully with all the trickles delivered. ID: 47812
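For anyone who wants to skim a long stderr like the one quoted above, here is a minimal Python sketch that counts the interesting events and shows how the three numbers in "code 193 (0xc1, -63)" relate to each other. The function names, the pattern table and the shortened sample text are all made up for illustration; they are based only on the lines quoted in the post, not on any official CPDN or BOINC log format.

```python
import re
from collections import Counter


def signed_byte(code: int) -> int:
    """Reinterpret the low 8 bits of an exit code as a signed value."""
    low = code & 0xFF
    return low - 256 if low >= 128 else low


# Hypothetical patterns, chosen only to match the lines quoted in the post.
PATTERNS = {
    "suspend_requests": re.compile(r"Suspend request from BOINC"),
    "quit_requests": re.compile(r"Quit request from BOINC"),
    "missed_heartbeats": re.compile(r"No heartbeat from core client"),
    "sigterm": re.compile(r"Signal 15 received"),
    "glibc_errors": re.compile(r"glibc detected"),
}


def summarize_stderr(text: str) -> Counter:
    """Count how often each pattern occurs anywhere in the stderr text."""
    counts = Counter()
    for name, pattern in PATTERNS.items():
        counts[name] = len(pattern.findall(text))
    return counts


# A shortened sample adapted from the quoted stderr, for demonstration only.
SAMPLE = """\
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
20:05:15 (2239): No heartbeat from core client for 30 sec - exiting
Signal 15 received, exiting...
Called boinc_finish
*** glibc detected *** double free or corruption (out): 0x090f2de0 ***
"""

if __name__ == "__main__":
    # 193 is 0xc1 in hex, and -63 when the same byte is read as signed.
    print(hex(193), signed_byte(193))
    print(dict(summarize_stderr(SAMPLE)))
```

The counts only say what happened and how often, not why; a run of suspend requests followed by a missed heartbeat and Signal 15 would fit the interruption pattern geophi describes, but that reading is an inference rather than something the log states.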