Both tasks crashed with no heartbeat problem.
Author | Message |
---|---|
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Both regional tasks on my netbook (1253464) have crashed with the no heartbeat problem. This happened while nothing else was running apart from the OS (Linux 3.2.0-51-generic-pae). I notice that both tasks come from work units where one other task is shown as also having errored out, and the third looks like it was resent at about the same time mine crashed. The work units are 8547638 and 8541844. Is this likely to be my computer, or something to do with the work units? I have restricted the netbook to regional models for now, but it has managed to complete one HADAM3CN model, which to me implies reasonable stability. |
Joined: 31 Mar 05 Posts: 44 Credit: 234,235 RAC: 0 |
For the past two days I have also had weird and not so wonderful things happening: computer errors, tasks ended without any credit claimed or otherwise, one new task downloaded. We will probably be told to grin and bear it... |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Taking those points in turn. "Computer errors": we'd have to look through them. Some computer errors can be resolved at your end; others are actually model errors. What I tend to do is look at the previous runs of that workunit: if they all crashed in the same way, it is not a problem at your end. If, however, a different user got further than you did on the same workunit, that indicates something happened at your end. "Tasks ended without any credit claimed": this will be fixed sooner or later; the credit processing hasn't been run yet. Once it has, everyone will suddenly catch up with all their credit (and end up with a crazily high RAC). "One new task downloaded": new work was generated last week, but it was all picked up over the weekend. Currently everyone is hunting for reissued tasks. "We will probably be told to grin and bear it...": yeah, that's about right :-) Going back to the original query about 'no heartbeat': this is a warning which appears when Boinc loses contact with the project application (i.e., the model in this case). If there are 100 consecutive 'no heartbeat' messages, Boinc itself will abort the job. This is very irritating, and several of us complained to Berkeley back when the heartbeat feature was first introduced. The heartbeat can be lost for several reasons: the app might be stuck on something, the PC might be busy doing something I/O intensive, or the network stack on your PC might have frozen. Here's a ticket from 6 years ago regarding the vulnerability of Boinc to network interruptions: https://boinc.berkeley.edu/trac/ticket/113# I'm a volunteer and my views are my own. |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Thanks Mike. I have got further than anyone else with these two work units so far, but both failing at exactly the same time (8.42.35) on the same day makes me suspect a power surge or drop, or something else affecting the computer. I am going to give it at least a week, possibly even a month, before I look at credits again! |
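
For anyone wanting to apply the rule of thumb from this thread to their own failures (compare your crash with the earlier runs of the same workunit), here is a minimal Python sketch. It is only an illustration of that heuristic, not anything BOINC or the project provides: the `Run` record, its `error_kind` and `progress` fields, and the `triage` function are all hypothetical names, and you would have to fill in the values by hand from the workunit pages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """One attempt at a workunit (hypothetical record, filled in by hand)."""
    error_kind: str   # free-text label for how it failed, e.g. "no heartbeat"
    progress: float   # fraction of the model completed before failing, 0.0 to 1.0


def triage(mine: Run, others: List[Run]) -> str:
    """Apply the thread's rule of thumb to one failed task.

    - every other run crashed the same way  -> probably the workunit
    - someone else got further than you did -> probably your machine
    - anything else                         -> not enough evidence
    """
    if not others:
        return "inconclusive"
    if all(o.error_kind == mine.error_kind for o in others):
        return "likely the workunit"
    if any(o.progress > mine.progress for o in others):
        return "likely your end"
    return "inconclusive"


# Example: you got furthest, and the other run failed differently.
print(triage(Run("no heartbeat", 0.6), [Run("other error", 0.2)]))
```

Note that getting further than everyone else, as in the post above, falls into the "inconclusive" branch: the heuristic alone cannot separate a local glitch such as a power dip from a workunit that fails late.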
©2024 cpdn.org