Boinc Manager Unable to Connect With Client

GarryNearWinnipeg
Joined: 28 Jan 06
Posts: 17
Credit: 8,891,420
RAC: 504
Message 45643 - Posted: 11 Mar 2013, 23:27:01 UTC

Running hadam3p on XP. The current experiment started around Nov. 17/12 and is about 30% done. On Feb. 17 I had a blue screen caused by a defective DVD reader/writer. After recovery Boinc wouldn't run; I got the message "Boinc Manager is not able to connect to a Boinc client. Would you like to try to connect again? YES/NO." I assumed one or more files got corrupted in the crash, so I restored BOINC from my most recent backup and restarted Boinc. All was OK for about 3 weeks. Today when I started up the PC and opened Boinc I got the same error message again. This time there was no system crash and no apparent reason for the error at all: I did a normal suspend and exit last night, and startup today was normal. I did a restore from the latest backup and restarted Boinc but still got the same error. I tried "Select Computer", using both a blank field and "Localhost" for the name, but still got the same error. Is this experiment now "ndg"? Any suggestions for what to try now?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45644 - Posted: 12 Mar 2013, 0:24:44 UTC - in response to Message 45643.  
Last modified: 12 Mar 2013, 0:31:02 UTC

The reason for that message is that the gui (the manager) isn't able to connect to the core client, which is the part that does the actual work.
So the Manager can't "see" what the client is doing.

The worker may actually be running, along with the various cpdn programs. One way to check is with Windows Task Manager.
The manager is called boincmgr.exe, and the client is boinc.exe
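
For anyone who would rather script this check than open Task Manager, here is a minimal Python sketch. It assumes the Windows `tasklist` command is available on the machine (not every Windows edition ships it), and the function names are made up for illustration.

```python
import subprocess

# Process image names as given above: the client and the manager.
BOINC_IMAGES = {"boinc.exe", "boincmgr.exe"}


def find_boinc_processes(tasklist_output: str) -> set:
    """Return which BOINC image names appear in the text printed by `tasklist`."""
    found = set()
    for line in tasklist_output.splitlines():
        parts = line.split()
        # The image name is the first column of each process row.
        if parts and parts[0].lower() in BOINC_IMAGES:
            found.add(parts[0].lower())
    return found


def read_tasklist() -> str:
    """Run `tasklist` (Windows only) and return its text output."""
    result = subprocess.run(["tasklist"], capture_output=True, text=True, check=True)
    return result.stdout
```

On Windows, `find_boinc_processes(read_tasklist())` should contain "boinc.exe" if the client is running, even when the manager cannot connect to it.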

The reason that the 2 parts aren't talking is because something is in the way.

From Ageless on the BOINC/dev forums:
If not a firewall, then it's usually permission problems.

The 2 parts 'talk' on port 31416, and Windows has a habit of using this when it shouldn't. Updates for example can change things.
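
A rough way to see whether anything is answering on that port is to try a plain TCP connection to it. The Python sketch below does only that; the function name is hypothetical, and a successful connection shows that something is listening on the port, not that it is definitely the BOINC client.

```python
import socket


def can_connect(host: str = "127.0.0.1", port: int = 31416, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if can_connect():
        print("Something is listening on port 31416.")
    else:
        print("Nothing answered on port 31416; the client may be stopped or blocked.")
```

If this reports that nothing answered while boinc.exe is shown as running, a firewall or permission problem is a more likely suspect than a corrupted model.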

There are so many possibilities that a specific answer is difficult.

PS
If the model is hadcm3n_zgbn_1920_40_008254393_1 which was sent on 26 Nov 2012, then it IS still running, and trickled on the 10th of March.
Backups: Here

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,728,292
RAC: 3,041
Message 45645 - Posted: 12 Mar 2013, 0:30:37 UTC
Last modified: 12 Mar 2013, 0:41:02 UTC

In CPDN we normally say that task deadlines don't matter. However, the project team generating and using the HADCM3N models do actually have a tight project schedule, which is reflected in the BOINC deadlines. The task list for that computer (here) shows the task as having exceeded its deadline. The task has submitted three trickles since the deadline, so you could persist and try to rescue it, particularly as no-one else in the work unit seems to have finished it; in fact, it may be heading for a physics-related crash anyway, looking at other Windows and Linux attempts. If it's that late and there are questions about the model's integrity and likely fate, then in your position I would abandon it.

Since BOINC has had a few crashes it might be a good idea to do a project reset (or detach/re-attach) to clear out any corrupt files and accumulated junk.

There isn't much work at the moment, but tasks do appear from time to time - so you might have a bit of a wait until something else downloads.

[Edit: Oops, Les got there first, but with a different point ...]

GarryNearWinnipeg
Joined: 28 Jan 06
Posts: 17
Credit: 8,891,420
RAC: 504
Message 45647 - Posted: 12 Mar 2013, 16:01:28 UTC

Thank you both, Les and Iain. I just restarted Boinc and it immediately opened and did a CPU Benchmark run and now is apparently purring along. Thanks for the insight into the interoperational facets between manager and client, Les. It seems that one of the "many possibilities" that could interfere with the two connecting is most likely the explanation. Iain - yes, this experiment is 2 weeks overdue, but I read on another thread that this is fairly common with long experiments, that the warning message is coming from BOINC, not CP.NET, and that we should just ignore it and wait to see what develops. If you're right about it heading for a physics-related crash anyway, I should still get credits for as much as it accomplishes before it crashes, shouldn't I?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 45648 - Posted: 12 Mar 2013, 16:10:37 UTC - in response to Message 45647.  

Yes you will still get the credits for the trickles.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45650 - Posted: 12 Mar 2013, 19:20:03 UTC

As Dave said, you'll get the credits. But this particular type of model is, at present, being used for a project where the data is needed ASAP, and not 'years down the track'.
See the description of the experiment here: RAPID-RAPIT: What is the risk of thermohaline circulation collapse?, and a thread by the researchers here: Welcome to the RAPIT experiment!.

Also, the sooner these long models are completed the less time for them to suffer from hardware failure.

Another aspect that may need to be considered is that the project is tending towards data being collected and stored at the researchers' universities, which are in several places around the world. All of the models are now just part of a long time series, with each one being joined to the short-time-run model that provided the data to start it, and to the one that follows, for which it provides that starting data. This may well extend the total time period for an individual run to over a thousand years, unless the physics of it causes it to wobble off course and be terminated.

As such, long 'deadlines' may be a thing of the past, as the people involved won't want to wait years for one part of a thread to show up. They may just give up and re-submit the data to another computer to try and get it back sooner.


Backups: Here

GarryNearWinnipeg
Joined: 28 Jan 06
Posts: 17
Credit: 8,891,420
RAC: 504
Message 45656 - Posted: 13 Mar 2013, 14:37:34 UTC

Les, thanks for the insights into this model and the links to the big picture with RAPIT. Having the big picture obviously helps one appreciate one's own small contribution. I, too, only run CP during the day, so I am not providing trickles very quickly; however, I'll continue to add what I can. This is where the "big push" is currently, so abandoning this model now would be a negative action on my part.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 45657 - Posted: 13 Mar 2013, 16:04:23 UTC

As such, long 'deadlines' may be a thing of the past, as the people involved won't want to wait years for one part of a thread to show up. They may just give up and re-submit the data to another computer to try and get it back sooner.


I guess that means the question as to whether to abort or not depends on how far through the task you are. If over three quarters, I think I would continue. Less than half, I might think about aborting and hoping to get some of the regional models, which have longer deadlines and finish a lot faster. It also makes me think that in the long term, if I am to stay with the project, I may need to look at getting a faster box.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,728,292
RAC: 3,041
Message 45658 - Posted: 13 Mar 2013, 16:38:20 UTC - in response to Message 45657.  

... It also makes me think that in the long term if I am to stay with the project I may need to look at getting a faster box.

The RAPIT project is a departure from normal practice not only in its 'real' deadlines but in the tightness of those deadlines. Sit down with a pocket calculator for a few minutes and it readily becomes apparent that the ~90-day deadline can only be met by a computer that is running for a substantial fraction of the deadline period. My Mac mini, for example, completed its most recent HADCM3N in ~20% of the time available. In other words it would have to be running for about 5 hours a day. As it happens I don't use that machine much and my electricity is generated renewably, so I don't mind (see here for a comparison of marginal and total energy costs for HADSM3). However, the project's method is to use volunteers' 'spare' cycles: a utilisation of 20% is stretching that definition.
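
The pocket-calculator sum can be sketched in a few lines of Python. The ~90-day deadline and the ~20% figure come from the paragraph above; the function name is made up for illustration, and the result is an average that ignores downtime and other projects sharing the machine.

```python
def hours_per_day_needed(compute_days: float, deadline_days: float) -> float:
    """Average daily run time (hours) needed to fit `compute_days` of work into the deadline."""
    if deadline_days <= 0:
        raise ValueError("deadline_days must be positive")
    return 24.0 * compute_days / deadline_days


deadline_days = 90.0                 # approximate HADCM3N deadline quoted above
compute_days = 0.20 * deadline_days  # model finished in ~20% of the time available

hours = hours_per_day_needed(compute_days, deadline_days)
print(f"About {hours:.1f} hours per day")
```

A slower machine that needs, say, 45 days of computing for the same model would have to run about 12 hours a day on average to meet the same deadline.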

As ever, the moderators try to encourage the project to provide information that will enable volunteers to give 'informed consent', and the project staff have indeed updated quite a few of CPDN's information pages. If people know what they're about to tackle then that's about as good as it can get in BOINC.

GarryNearWinnipeg
Joined: 28 Jan 06
Posts: 17
Credit: 8,891,420
RAC: 504
Message 45671 - Posted: 18 Mar 2013, 20:36:08 UTC - in response to Message 45657.  

Hey, Dave - check your private mail in-box.

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46027 - Posted: 22 Apr 2013, 22:35:03 UTC

Hi All,

Apologies, I used to know all this stuff inside-out, but now I have a bunch of basic questions :-)



I'm just setting up my new PC (Win7 64-bit, Boinc 7.0.28 64-bit), and I've been having trouble with the Boinc manager talking to the boinc service. It all seems to work normally for a few hours, but then the manager locks up & comes up with the 'trying to connect to boinc, exit/retry' message. I never get it back after that point until I reboot.

Task manager shows that the service is running normally (but this means that I cannot cleanly shut down Boinc before shutting the PC down).

I tried the previous version, 6.12.34, also 64-bit, but it behaved in the same way.

Is there a better version? Would it be worth using the 32-bit Boinc instead? My old PC was running 6.10.18 on Win7 32-bit.

I wondered if it might be the MS firewall, but I don't see Boinc listed there at all. I don't really remember how to configure the firewall any more so I might be looking in the wrong place.


In terms of the hardware, the CPU is an Intel 3770K quad with hyperthreading enabled, 16GB RAM, water cooled, on a Gigabyte motherboard. It is running at 4.2GHz and is prime95 stable for well over 24 hours when running 8 worker threads.
Boinc is now running on its own 1TB HD, since I didn't want it using the main 3TB HD because the main HD is SSD-cached with ISRT (CPDN would eat the 64GB SSD in no time). However, when I was getting the problems described above it was still on the main HD.


-Cheers,

Mike
I'm a volunteer and my views are my own.
News and Announcements and FAQ

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46028 - Posted: 22 Apr 2013, 22:44:21 UTC - in response to Message 46027.  

...
I wondered if it might be the MS firewall, but I don't see Boinc listed there at all. I don't really remember how to configure the firewall any more so I might be looking in the wrong place.
...


Naturally I only spot the way to configure the firewall after I've posted!

Control panel / Windows firewall

'Allow a program or feature through windows firewall'

'Allow another program'

& Then browse to 'Program Files\Boinc', select each executable in turn, add, then give it permissions.
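
The same kind of rule can probably be added from an elevated command prompt with `netsh advfirewall firewall add rule`. The Python sketch below only builds and prints the commands and does not run them. The `C:` drive letter, the rule names and the helper function are assumptions for illustration, and whether an inbound program rule matches exactly what the control panel dialog creates is also an assumption.

```python
def firewall_allow_command(rule_name: str, program_path: str) -> list:
    """Build (but do not run) a netsh command that allows a program through Windows Firewall."""
    return [
        "netsh", "advfirewall", "firewall", "add", "rule",
        f"name={rule_name}",
        "dir=in",
        "action=allow",
        f"program={program_path}",
        "enable=yes",
    ]


# Assumed install location; adjust to wherever BOINC actually lives.
install_dir = r"C:\Program Files\BOINC"

for exe in ("boinc.exe", "boincmgr.exe"):
    command = firewall_allow_command(f"BOINC-{exe}", f"{install_dir}\\{exe}")
    print(command)
```

Each printed list could be passed to `subprocess.run` from an administrator session, but checking the result in the firewall control panel afterwards is still a good idea.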


Will see if that helps...

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 46029 - Posted: 22 Apr 2013, 23:22:16 UTC


Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46030 - Posted: 22 Apr 2013, 23:41:23 UTC

Hi Mike

I don't know if it's related or useful, but I get something similar with 32 bits on one of my machines. They're Q6600 running XP.

One uses 6.10.18, and on re-starting BOINC, it will lock up for a while.
If I'm in one of the tabs and try to switch to a different one too soon, I may get that tab to open, but it'll be blank. And then I usually get a BOINC pop up about not being able to connect. The trick is to wait for a while to allow the climate models to get going first.
Probably all cores very busy.

The one running 6.2.18 doesn't show this behaviour.


MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46031 - Posted: 22 Apr 2013, 23:59:58 UTC


Many thanks Les & Mo :-) Long time no see.


I see from that forum link (Boinc 7 FAQ) that trying to go back to Boinc 6 would have failed, perhaps that explains part of the trouble I was having.

In theory the firewall should be letting the two parts talk to each other now, I'll see how it reacts in the next few days. The 'connecting' message did just reappear, but only for a few seconds, whereas before it was sticking for hours (overnight / until I rebooted).

-Cheers,

Mike

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46053 - Posted: 26 Apr 2013, 0:28:25 UTC
Last modified: 26 Apr 2013, 0:30:22 UTC

Looks like it's solved - no disconnections from the client in the last few days, whereas before it would lose its connection within a few hours :-)



Models seem to be running well - 0.7 s/ts for a coupled model, whereas on my old PC it was around 2 s/ts. Not sure how it will react with more models of course (currently boinc is set up to use 6 of the 8 threads, and I have only managed to pick up 2 models so far, the rest is taken up with filler projects).

©2024 cpdn.org