climateprediction.net (CPDN) home page
Thread 'boinc deferring communication with project for 11 hours.....'

old_user40785

Joined: 27 Jan 05
Posts: 7
Credit: 17,864
RAC: 0
Message 11984 - Posted: 21 Apr 2005, 8:31:40 UTC

This computer was disconnected from the ISP for ~12 hrs and boinc went into hibernation.
After reconnection to the ISP, boinc still hibernates.
How can boinc be woken up to resume calculations?
boinc was running well up until this point.

System:
Linux 2.6.6, Debian/testing, athlon 1.2 GHz

the output:
:~/.boinc$boinc
2005-04-21 19:05:20 [---] Starting BOINC client version 4.19 for i686-pc-linux-gnu
2005-04-21 19:05:20 [climateprediction.net] Project prefs: no separate prefs for home; using your defaults
2005-04-21 19:05:20 [climateprediction.net] Host ID is 93024
2005-04-21 19:05:20 [---] General prefs: from climateprediction.net (last modified 2005-01-28 02:16:23)
2005-04-21 19:05:20 [---] General prefs: using separate prefs for home
2005-04-21 19:05:20 [climateprediction.net] Deferring communication with project for 11 hours, 19 minutes, and 55 seconds
2005-04-21 19:05:20 [climateprediction.net] Deferring communication with project for 11 hours, 19 minutes, and 55 seconds

ID: 11984
Arnaud

Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 11996 - Posted: 21 Apr 2005, 11:33:58 UTC

Hi,
Did you try to stop BOINC and start it again with:

./boinc -return_results_immediately or
./boinc -update_prefs [URL of the project]


Arnaud
ID: 11996
old_user40785

Joined: 27 Jan 05
Posts: 7
Credit: 17,864
RAC: 0
Message 12001 - Posted: 21 Apr 2005, 13:01:29 UTC - in response to Message 11996.  

> Hi,
> Did you try to stop BOINC and start it again with:
>
> ./boinc -return_results_immediately or
> ./boinc -update_prefs [URL of the project]
>
>
>

none of these commands worked.
what do you mean by the URL of the project?
I tried the URLs of the host, result, workunit and account. None of them worked.
I found this in the stderr out on the workunit page:

4.19
process got signal 11

3
11

No heartbeat from core client - exiting
No heartbeat from core client - exiting
No heartbeat from core client - exiting
......

Each time this is attempted, boinc decreases the time when it will restart, which by now is in ~7 hrs, so eventually it seems it will restart.

ID: 12001
Arnaud

Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 12007 - Posted: 21 Apr 2005, 13:50:32 UTC

I meant ./boinc -update_prefs http://climateprediction.net

You should check on the CP web site that your wu has not crashed, because the message in stderr out suggests that.
Note that I'm not a Linux geek, I just saw your message when browsing the forum.
Arnaud
ID: 12007
old_user40785

Joined: 27 Jan 05
Posts: 7
Credit: 17,864
RAC: 0
Message 12008 - Posted: 21 Apr 2005, 14:15:59 UTC - in response to Message 12007.  

> I meant ./boinc -update_prefs http://climateprediction.net

that gave the same result, and boinc went into hibernation.

this is the rest of the error message:

No heartbeat from core client - exiting
zip I/O error: No space left on device

zip error: Could not create output file (../3vjp_000202668_0_1.zip)



boinc ran out of disk space and I spent last night sorting it out
ID: 12008
Arnaud

Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 12009 - Posted: 21 Apr 2005, 14:59:04 UTC
Last modified: 21 Apr 2005, 15:03:19 UTC

Well, go to your account and check that you have enough disk space: at least 1 GB per CPDN Wu.
Since BOINC tried to create the zip files, your model has probably crashed and BOINC is trying to upload the results of the crashed Wu to the servers.

Did you try a reboot? Sometimes it solves problems when BOINC is stuck in a bad loop.
I have no other ideas :o(

If nothing works, I would suggest that you remove BOINC from your machine, reboot and install BOINC again: it is a drastic solution but I see no other one.
Or you can wait for someone else to give you better advice :o)

Note that the CPDN servers are not working very well at present and a lot of users are having difficulty contacting the schedulers and uploading their models.
4.12 models are also known to be unstable on Linux machines.
Arnaud
ID: 12009
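
For anyone following along, the "at least 1 GB per Wu" check above can also be done on the machine itself before starting the client. This is only a rough sketch: the helper name has_free_kb and the BOINC_DIR variable are made up for illustration, and the 1 GB threshold is simply the figure quoted in the post.

```shell
#!/bin/sh
# Hypothetical helper (not part of BOINC): succeed if the filesystem
# holding DIR has at least MIN_KB kilobytes available.
has_free_kb() {
    dir=$1
    min_kb=$2
    # df -P prints one POSIX-format line per filesystem; with -k the
    # fourth column of the second line is the available space in 1K blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$min_kb" ]
}

# BOINC_DIR is an assumed variable; it falls back to the current directory.
boinc_dir="${BOINC_DIR:-.}"

# 1048576 KB = 1 GB, the per-Wu minimum mentioned in the post.
if has_free_kb "$boinc_dir" 1048576; then
    echo "at least 1 GB free under $boinc_dir"
else
    echo "less than 1 GB free under $boinc_dir"
fi
```

Run it from the BOINC directory, or set BOINC_DIR first (for example BOINC_DIR=$HOME/.boinc). Note that this checks the real free space on the disk, which is separate from the disk limits set in the account preferences.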
old_user40785

Joined: 27 Jan 05
Posts: 7
Credit: 17,864
RAC: 0
Message 12018 - Posted: 21 Apr 2005, 21:31:39 UTC - in response to Message 12009.  

> Well, go in your account and check that you have enough disk space: at least 1
> GB / CPDN Wu

it now has 7.5 GB

> Did you try a reboot, sometimes it solves problems with BOINC when it is stuck
> in a bad loop.

rebooted and restarted BOINC...
BOINC stopped running the first Wu and is now running a new Wu.
The first Wu was about 90% complete and I'm keen to get it finished...

here is the url of first Wu, maybe that can help to restart it
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=503724

> If nothing works, I would suggest that you remove BOINC from your machine,
> reboot and install BOINC again: it is a drastic solution but I see no other
> one.

later...


> Or you can wait that someone else gives you a better advice :o)
>
> Note that the CPDN servers are not working very well presently and lot of
> users have difficulties to contact the schedulers and upload their models.
> 4.12 models are known to be instable on Linux machine too.

this is using 4.19/ hadsm3 version 4.13
output from this morning after starting new Wu:

russell@athlonbox:~/.boinc$ boinc
2005-04-22 07:56:08 [---] Starting BOINC client version 4.19 for i686-pc-linux-gnu
2005-04-22 07:56:08 [climateprediction.net] Project prefs: no separate prefs for home; using your defaults
2005-04-22 07:56:08 [climateprediction.net] Host ID is 93024
2005-04-22 07:56:08 [---] General prefs: from climateprediction.net (last modified 2005-01-28 02:16:23)
2005-04-22 07:56:08 [---] General prefs: using separate prefs for home
2005-04-22 07:56:08 [climateprediction.net] Resuming computation for result 39br_200173586_0 using hadsm3 version 4.13
Starting model in /mnt/hdd8/boinc/projects/climateprediction.net...
Created shared memory region key = 26015
Env Used=LD_LIBRARY_PATH=/mnt/hdd8/boinc/projects/climateprediction.net:/usr/local/lib:/usr/lib:/lib
Starting model ID 39br_200173586 Phase 1
Stack size=48.00 MB
Waiting for model startup, this may take a minute...
39br_200173586 - PH 1 TS 000289 - 07/12/1810 00:30 - H:M:S=0000:26:51 AVG= 5.57 DLT= 0.00

> I have no other ideas :o(
>

thanks for your help, BOINC is now working, just not on the 1st Wu, and restarted 2nd. It would be nice to know how to complete the 1st.

ID: 12018
Arnaud

Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 12025 - Posted: 22 Apr 2005, 4:50:00 UTC

Hi,
The first Wu is marked as "Outcome: Client error" on the web site.
Boinc will not start this wu again, unless you have a backup of the whole BOINC directory made before the wu crashed.
This is because all the information about the first Wu is contained in the XML files, especially client_state.xml.
As long as the first Wu is marked as an error in the XML files, boinc will not crunch it.

Arnaud
ID: 12025
old_user40785

Joined: 27 Jan 05
Posts: 7
Credit: 17,864
RAC: 0
Message 12032 - Posted: 22 Apr 2005, 8:57:47 UTC - in response to Message 12025.  

> Hi,
> THe first Wu is marked as "Outcome: Client error " on the web site.
> Boinc will not start this wu again, except if you have a backup of the whole
> BOINC directory made before the crash of the wu.
> This is because all the information about the first Wu are contained in the
> XML files especially client_state.xml.
> As long as the first Wu is marked as error in the XML files, boinc will not
> crunch it.
>
>

so how can the Wu be restored to health from where it left off?
or does it have to be restarted from 1810?
would editing the xml file do it?

it seems too shortsighted to have all this work just stopped because it ran out of disk space... phase 3 is almost finished, probably about 10% to go.

it's not like a disk crash, which is understandably difficult to recover from
(the disk was reporting 1.6 GB free before it failed, but that's another issue I need to take up with the ext3 forum)


ID: 12032
Arnaud

Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 12044 - Posted: 22 Apr 2005, 16:43:06 UTC
Last modified: 22 Apr 2005, 16:43:55 UTC

In fact, as far as I know (but I'm not sure), when you get a new Wu, the information about the old Wu is simply deleted from the xml files.
This is why I was asking whether you had a backup (the only way to get the xml files of the previous Wu back).
It is not possible to restart a Wu if its information is missing from the xml files, because there are MD5 signatures attached to the files and if these signatures are not OK, the model will be rejected by the upload servers.
Even if you could restart your model, you wouldn't be able to upload it.
As far as I can see, your Wu is lost.
For your present Wu, do a backup once a week, or just before a change of phase, so that this problem doesn't happen again.
Bye
Arnaud
ID: 12044
old_user40785

Joined: 27 Jan 05
Posts: 7
Credit: 17,864
RAC: 0
Message 12048 - Posted: 22 Apr 2005, 21:02:09 UTC - in response to Message 12044.  

> In fact, as far as I know but I'm not sure, when you have a new Wu, the
> informations about the old Wu are simply deleted from the xml files.

ouch

> This is why I was asking you if you had a backup (the only way to have the xml
> files of the previous Wu)

no backup, I didn't foresee this event, nor think that it was possible to fall over in such a way. If the boinc site had mentioned this, I could have prepared for it in advance. The data files are still there, ~550MB in a directory under ~/.boinc/projects/climateprediction.net/
and xml files with the same name in ~/.boinc/projects/

> It is not possible to restart a Wu if the information in the xml files are not
> present because there are MD5 signatures attached to the files and if these
> signature are not OK, the model will be rejected by the upload servers.

bummer

> Even if you could restart your model, you wouldn't be able to upload it.
> For me, your Wu is lost.

should the data in crashed project be wiped or wait in case a way is found to restore it? I'll wait....

> For your present Wu, do a backup once a week, or just before change of phase
> so as your problem doesn't happen again.

ok, will set up an rsync of the .boinc directory to another disk every 24 hrs
thanks

> Bye
>
ID: 12048
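
The daily rsync of the .boinc directory mentioned above could look roughly like this. It is a sketch, not a tested CPDN recipe: the function name backup_boinc and both paths are placeholders. It is probably safest to stop the client first, since copying files while the model is writing them can leave an inconsistent snapshot.

```shell
#!/bin/sh
# Sketch: mirror a BOINC directory to another disk with rsync.
# backup_boinc SRC DEST  (hypothetical helper; both paths are up to you)
backup_boinc() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -a preserves permissions, timestamps and symlinks and recurses.
    # The trailing slash on "$src/" copies the directory's contents
    # into DEST rather than creating DEST/<dirname>.
    rsync -a "$src/" "$dest/"
}

# Only run a backup when the script is called with two arguments,
# e.g.  sh backup_boinc.sh "$HOME/.boinc" /mnt/backup/boinc
if [ "$#" -eq 2 ]; then
    backup_boinc "$1" "$2"
fi
```

To run it every 24 hrs, a crontab entry along the lines of `0 3 * * * sh /path/to/backup_boinc.sh "$HOME/.boinc" /mnt/backup/boinc` would do (script location and destination are assumptions). Because each run overwrites the previous copy, a backup taken after a crash replaces the good one, so keeping more than one dated copy is worth considering.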
