climateprediction.net (CPDN) home page
Thread 'Both tasks crashed with no heartbeat problem.'

Message boards : Number crunching : Both tasks crashed with no heartbeat problem.

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,009,815
RAC: 21,293
Message 46887 - Posted: 27 Aug 2013, 7:00:58 UTC

Both regional tasks on my netbook (1253464) have crashed with the no heartbeat problem. This happened while nothing else was running apart from the OS (Linux 3.2.0-51-generic-pae). I notice that both tasks come from work units where one other task is shown as also having errored out, and the third looks like it was resent at about the same time mine crashed. The work units are 8547638 and 8541844. Is this likely to be my computer, or something to do with the work units? I have restricted the netbook to regional models now, but it has managed to complete one HADAM3CN model, which to me implies reasonable stability.
Bellator
Joined: 31 Mar 05
Posts: 44
Credit: 234,235
RAC: 0
Message 46888 - Posted: 27 Aug 2013, 7:53:44 UTC

For the past two days I have also had weird and not so wonderful things happening: computer errors, tasks ended without any credit claimed or otherwise, one new task downloaded. We will probably be told to grin and bear it...
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46889 - Posted: 27 Aug 2013, 9:33:11 UTC - in response to Message 46888.  
Last modified: 27 Aug 2013, 10:11:08 UTC

For the past two days I have also had weird and not so wonderful things happening: computer errors, tasks ended without any credit claimed or otherwise, one new task downloaded. We will probably be told to grin and bear it...


* computer errors
We'd have to look through them. Some computer errors can be resolved at your end; some are actually model errors. What I tend to do is look at the previous runs of that workunit. If they all crashed in the same way, then it is not a problem at your end. If, however, a different user got further than you did on the same workunit, it indicates that something happened at your end.

* tasks ended without any credit claimed
This will be fixed sooner or later... the credit processing hasn't been run yet. Once it has, then everyone will suddenly catch up with all their credit (& end up with crazily high RAC)

* one new task downloaded
New work was generated last week, but it was all picked up over the weekend. Currently everyone is hunting for reissued tasks.

* We will probably be told to grin and bear it...
Yeah that's about right :-)
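
The workunit comparison described under "computer errors" above can be sketched roughly as follows. This is only an illustration of the rule of thumb, not a CPDN or BOINC tool: the function name `likely_cause` and the idea of measuring progress as a single number (say, timesteps completed) are assumptions made for the example.

```python
from typing import Sequence


def likely_cause(my_progress: int, other_progress: Sequence[int]) -> str:
    """Guess where a failure came from by comparing runs of one workunit.

    my_progress    -- how far my task got before failing (hypothetical unit)
    other_progress -- how far each other task on the same workunit got
    """
    if not other_progress:
        # Nobody else has run it yet, so there is nothing to compare with.
        return "inconclusive"
    if any(p > my_progress for p in other_progress):
        # Someone else got further: something probably happened at my end.
        return "local"
    if all(p == my_progress for p in other_progress):
        # Everyone failed at the same point: probably the model/workunit.
        return "workunit"
    # I got at least as far as everyone else, but not at the same point.
    return "inconclusive"


print(likely_cause(50, [50, 50]))
print(likely_cause(50, [80]))
print(likely_cause(50, [10, 20]))
```

Note that getting further than every other attempt, as in the last call, does not settle the question either way.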



Going back to the original query about 'no heartbeat': this is a warning which appears when BOINC loses contact with the project application (i.e., the model in this case). If there are 100 consecutive 'no heartbeat' messages, BOINC itself will abort the job. This is very irritating, and several of us complained to Berkeley back when the heartbeat feature was first introduced. The heartbeat can be lost for several reasons: the app might be stuck on something, the PC might be busy doing something I/O intensive, or the network stack on your PC might have frozen.
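
As a rough illustration of the counting behaviour described above, here is a toy model in Python. It is not BOINC's actual code: the class `HeartbeatMonitor`, its fields, and the `simulate` helper are all hypothetical, and the threshold of 100 is simply the figure quoted in this post.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class HeartbeatMonitor:
    """Toy model: abort after too many consecutive missed heartbeats."""

    max_consecutive_misses: int = 100  # figure quoted in the post (assumption)
    misses: int = 0
    aborted: bool = False

    def record(self, heartbeat_seen: bool) -> None:
        if self.aborted:
            return  # once aborted, nothing further changes
        if heartbeat_seen:
            self.misses = 0  # any heartbeat resets the run of misses
        else:
            self.misses += 1
            if self.misses >= self.max_consecutive_misses:
                self.aborted = True


def simulate(events: Iterable[bool]) -> HeartbeatMonitor:
    """Feed a sequence of seen/missed heartbeats through a fresh monitor."""
    monitor = HeartbeatMonitor()
    for seen in events:
        monitor.record(seen)
    return monitor


# 99 misses, one heartbeat, then 99 more misses: never 100 in a row.
print(simulate([False] * 99 + [True] + [False] * 99).aborted)
# 100 misses in a row: the job is aborted.
print(simulate([False] * 100).aborted)
```

The point of the sketch is that only an unbroken run of misses matters, so a brief stall (heavy I/O, say) is survivable, while a long freeze is not.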

(Here's a ticket from 6 years ago regarding the vulnerability of Boinc to network interruptions).
https://boinc.berkeley.edu/trac/ticket/113#
I'm a volunteer and my views are my own.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,009,815
RAC: 21,293
Message 46890 - Posted: 27 Aug 2013, 10:02:58 UTC

Thanks Mike,

I have got further than anyone else with these two work units so far, but both failing at exactly the same time (8.42.35) on the same day makes me suspect a power surge or drop, or something else affecting the computer.

I am going to give it at least a week, possibly even a month, before I look at credits again!
