Questions and Answers : Unix/Linux : boinc deferring communication with project for 11 hours.....
Author | Message |
---|---|
Send message Joined: 27 Jan 05 Posts: 7 Credit: 17,864 RAC: 0 |
This computer was disconnected from its ISP for ~12 hrs and BOINC went into hibernation. After reconnecting to the ISP, BOINC still hibernates. How can BOINC be woken up so that it resumes calculations? It was running well up until this point. System: Linux 2.6.6, Debian/testing, Athlon 1.2 GHz. The output: :~/.boinc$boinc 2005-04-21 19:05:20 [---] Starting BOINC client version 4.19 for i686-pc-linux-gnu 2005-04-21 19:05:20 [climateprediction.net] Project prefs: no separate prefs for home; using your defaults 2005-04-21 19:05:20 [climateprediction.net] Host ID is 93024 2005-04-21 19:05:20 [---] General prefs: from climateprediction.net (last modified 2005-01-28 02:16:23) 2005-04-21 19:05:20 [---] General prefs: using separate prefs for home 2005-04-21 19:05:20 [climateprediction.net] Deferring communication with project for 11 hours, 19 minutes, and 55 seconds 2005-04-21 19:05:20 [climateprediction.net] Deferring communication with project for 11 hours, 19 minutes, and 55 seconds |
Send message Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0 |
Hi, Did you try to stop BOINC and start it again with: ./boinc -return_results_immediately or ./boinc -update_prefs [URL of the project] Arnaud |
Send message Joined: 27 Jan 05 Posts: 7 Credit: 17,864 RAC: 0 |
> Hi, > Did you try to stop BOINC and start it again with: > > ./boinc -return_results_immediately or > ./boinc -update_prefs [URL of the project] > > > None of these commands worked. What do you mean by the URL of the project? I tried the URLs of the host, result, workunit and account; none of them worked. I found this on the workunit page in the stderr out: 4.19 process got signal 11 3 11 No heartbeat from core client - exiting No heartbeat from core client - exiting No heartbeat from core client - exiting ...... Each time this is attempted, BOINC decreases the time until it will restart, which is by now ~7 hrs, so it seems it will eventually restart. |
Send message Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0 |
I meant ./boinc -update_prefs http://climateprediction.net You should check on the CP web site whether your wu has crashed, because the message in stderr out suggests that it has. Note that I'm not a Linux geek, I just saw your message when browsing the forum. Arnaud |
Send message Joined: 27 Jan 05 Posts: 7 Credit: 17,864 RAC: 0 |
> I meant ./boinc -update_prefs http://climateprediction.net That gave the same result, and BOINC went into hibernation. This is the rest of the error message: No heartbeat from core client - exiting zip I/O error: No space left on device zip error: Could not create output file (../3vjp_000202668_0_1.zip) BOINC ran out of disk space, and I spent last night sorting it out. |
Send message Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0 |
Well, go to your account and check that you have enough disk space: at least 1 GB per CPDN Wu. Since BOINC tried to create the zip files, your model has probably crashed and BOINC is trying to upload the results of the crashed Wu to the servers. Did you try a reboot? Sometimes that solves problems with BOINC when it is stuck in a bad loop. I have no other ideas :o( If nothing works, I would suggest that you remove BOINC from your machine, reboot and install BOINC again: it is a drastic solution but I see no other one. Or you can wait for someone else to give you better advice :o) Note that the CPDN servers are not working very well at present, and a lot of users are having difficulty contacting the schedulers and uploading their models. 4.12 models are known to be unstable on Linux machines too. Arnaud |
Send message Joined: 27 Jan 05 Posts: 7 Credit: 17,864 RAC: 0 |
> Well, go in your account and check that you have enough disk space: at least 1 > GB / CPDN Wu It now has 7.5 GB. > Did you try a reboot, sometimes it solves problems with BOINC when it is stuck > in a bad loop. Rebooted and restarted BOINC... BOINC stopped running the first Wu and is now running a new Wu. The first Wu was about 90% complete and I am keen to get it finished... Here is the URL of the first Wu, maybe that can help to restart it: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=503724 > If nothing works, I would suggest that you remove BOINC from your machine, > reboot and install BOINC again: it is a drastic solution but I see no other > one. later... > Or you can wait that someone else gives you a better advice :o) > > Note that the CPDN servers are not working very well presently and lot of > users have difficulties to contact the schedulers and upload their models. > 4.12 models are known to be instable on Linux machine too. This is using 4.19 / hadsm3 version 4.13. Output from this morning after starting the new Wu: russell@athlonbox:~/.boinc$ boinc 2005-04-22 07:56:08 [---] Starting BOINC client version 4.19 for i686-pc-linux-gnu 2005-04-22 07:56:08 [climateprediction.net] Project prefs: no separate prefs for home; using your defaults 2005-04-22 07:56:08 [climateprediction.net] Host ID is 93024 2005-04-22 07:56:08 [---] General prefs: from climateprediction.net (last modified 2005-01-28 02:16:23) 2005-04-22 07:56:08 [---] General prefs: using separate prefs for home 2005-04-22 07:56:08 [climateprediction.net] Resuming computation for result 39br_200173586_0 using hadsm3 version 4.13 Starting model in /mnt/hdd8/boinc/projects/climateprediction.net... Created shared memory region key = 26015 Env Used=LD_LIBRARY_PATH=/mnt/hdd8/boinc/projects/climateprediction.net:/usr/local/lib:/usr/lib:/lib Starting model ID 39br_200173586 Phase 1 Stack size=48.00 MB Waiting for model startup, this may take a minute... 39br_200173586 - PH 1 TS 000289 - 07/12/1810 00:30 - H:M:S=0000:26:51 AVG= 5.57 DLT= 0.00 > I have no other ideas :o( > Thanks for your help. BOINC is now working, just not on the 1st Wu; it has started on a 2nd. It would be nice to know how to complete the 1st. |
Send message Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0 |
Hi, The first Wu is marked as "Outcome: Client error" on the web site. BOINC will not start this Wu again unless you have a backup of the whole BOINC directory made before the Wu crashed. This is because all the information about the first Wu is contained in the XML files, especially client_state.xml. As long as the first Wu is marked as an error in the XML files, BOINC will not crunch it. Arnaud |
Send message Joined: 27 Jan 05 Posts: 7 Credit: 17,864 RAC: 0 |
> Hi, > THe first Wu is marked as "Outcome: Client error " on the web site. > Boinc will not start this wu again, except if you have a backup of the whole > BOINC directory made before the crash of the wu. > This is because all the information about the first Wu are contained in the > XML files especially client_state.xml. > As long as the first Wu is marked as error in the XML files, boinc will not > crunch it. > > So how can the Wu be restored to health from where it left off? Or does it have to be restarted from 1810? Perhaps by editing the XML file? It seems too shortsighted to have all this work simply stopped because it ran out of disk space... phase 3 is almost finished, with probably about 10% to go. It's not like a disk crash, which is understandably difficult to recover from (the disk was reporting 1.6 GB free before it failed, but that's another issue I need to take up with the ext3 forum) |
Send message Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0 |
In fact, as far as I know (but I'm not sure), when you get a new Wu, the information about the old Wu is simply deleted from the XML files. This is why I was asking whether you had a backup (the only way to have the XML files of the previous Wu). It is not possible to restart a Wu if the information in the XML files is not present, because there are MD5 signatures attached to the files, and if these signatures are not OK, the model will be rejected by the upload servers. Even if you could restart your model, you wouldn't be able to upload it. In my opinion, your Wu is lost. For your present Wu, do a backup once a week, or just before a change of phase, so that this problem doesn't happen again. Bye Arnaud |
Send message Joined: 27 Jan 05 Posts: 7 Credit: 17,864 RAC: 0 |
> In fact, as far as I know but I'm not sure, when you have a new Wu, the > informations about the old Wu are simply deleted from the xml files. Ouch. > This is why I was asking you if you had a backup (the only way to have the xml > files of the previous Wu) No backup; I didn't foresee this event, nor did I think it was possible for it to fall over in such a way. If the BOINC site had mentioned this, I could have prepared for it in advance. The data files are still there, ~550 MB in a directory under ~/.boinc/projects/climateprediction.net/, with XML files of the same name in ~/.boinc/projects/ > It is not possible to restart a Wu if the information in the xml files are not > present because there are MD5 signatures attached to the files and if these > signature are not OK, the model will be rejected by the upload servers. Bummer. > Even if you could restart your model, you wouldn't be able to upload it. > For me, your Wu is lost. Should the data in the crashed project be wiped, or should I wait in case a way is found to restore it? I'll wait.... > For your present Wu, do a backup once a week, or just before change of phase > so as your problem doesn't happen again. OK, I will set up an rsync of the .boinc directory every 24 hrs to another disk. Thanks > Bye > |
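
The backup plan agreed at the end of the thread (an rsync of the .boinc directory to another disk every 24 hrs) could be sketched roughly as below. This is only an illustration, not an official BOINC or CPDN procedure: the function names free_kb and backup_boinc, the destination path in the usage note, and the 1 GB default threshold are all assumptions made for the example. The free-space check is there because the original failure was a full disk. It is probably safest to stop the BOINC client before copying, so that client_state.xml and the model files are captured in a consistent state.

```sh
#!/bin/sh
# Sketch only: the function names and the 1 GB default are assumptions,
# not BOINC defaults.

# Print the free space, in KB, on the filesystem that holds directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Mirror the BOINC directory $1 into $2. Refuse to run if the destination
# filesystem has less than $3 KB free (default 1048576 KB, i.e. 1 GB).
backup_boinc() {
    src=$1
    dest=$2
    min_kb=${3:-1048576}

    mkdir -p "$dest" || return 1
    avail=$(free_kb "$dest")
    if [ "$avail" -lt "$min_kb" ]; then
        echo "backup skipped: only ${avail} KB free on $dest" >&2
        return 1
    fi
    # -a keeps permissions and timestamps; --delete makes $dest an exact
    # mirror. The trailing slash on the source copies its contents.
    rsync -a --delete "$src"/ "$dest"/
}
```

It could be called as backup_boinc "$HOME/.boinc" /mnt/backup/boinc (the destination is hypothetical) from a daily cron job. Note that a single mirror is overwritten on every run, so it only helps if a crash is noticed before the next backup; keeping a few dated copies would be more forgiving.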
©2024 cpdn.org