climateprediction.net (CPDN) home page
Thread 'Project communication failed'


Message boards : Number crunching : Project communication failed


blyons123

Joined: 21 Sep 15
Posts: 8
Credit: 4,854,775
RAC: 0
Message 54739 - Posted: 2 Sep 2016, 15:08:58 UTC

I've been trying to upload for 2 days????

9/2/2016 9:17:32 AM | | Project communication failed: attempting access to reference site
9/2/2016 9:17:34 AM | | Internet access OK - project servers may be temporarily down.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 54740 - Posted: 2 Sep 2016, 20:06:37 UTC - in response to Message 54739.  

Models go to servers all over the world:
England, North America, Mexico, and Australia.
So you need to be more specific about which model type is having problems.

But if I have to guess, it may be Mexico. In which case there's a sticky post near the top of the Number crunching section, called Uploading Mexico models.

blyons123

Joined: 21 Sep 15
Posts: 8
Credit: 4,854,775
RAC: 0
Message 54858 - Posted: 28 Sep 2016, 13:37:42 UTC - in response to Message 54740.  

I don't remember what the project was. I unfortunately had to abort after a week to clear the queue.
astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 54863 - Posted: 28 Sep 2016, 21:22:59 UTC - in response to Message 54858.  

Indeed. 22Sep was a bad day for that machine -- nine crashes. (No MEX tasks among them.)

However, one item in "stderr" suggests a likely problem: All nine have pages of:
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
. . .

This indicates that your settings don't include a tick in the box to leave CPDN in memory when suspended (the machine has 16Meg. for 8 CPU threads). Sooner or later, all that swapping tends to bite CPDN because of the large number of incoming / outgoing files. We have long recommended leaving CPDN in memory when suspended.
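For anyone who prefers a file to the tick box, the same setting can probably be applied through a local preferences override. The sketch below only builds the XML text. It assumes the client reads a `global_prefs_override.xml` file from its data directory (the location varies by install) and that `leave_apps_in_memory` is the element behind the tick box. The function name `build_override` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_override(leave_in_memory=True):
    """Return XML text for a minimal BOINC preferences override.

    Assumption: the client honours a <leave_apps_in_memory> element
    inside <global_preferences> in global_prefs_override.xml.
    """
    root = ET.Element("global_preferences")
    # "1" keeps suspended tasks in RAM; "0" lets them be removed from memory.
    flag = ET.SubElement(root, "leave_apps_in_memory")
    flag.text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the fragment; saving it to the data directory is left to the user.
    print(build_override())
```

After saving the output into the data directory, the client should pick it up on restart, or when asked to re-read the override with `boinccmd --read_global_prefs_override`.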

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
DadX

Joined: 30 Aug 06
Posts: 27
Credit: 1,887,860
RAC: 1,613
Message 54865 - Posted: 29 Sep 2016, 17:08:15 UTC - in response to Message 54863.  

I was wondering about those "Suspended CPDN Monitor - Suspend request from BOINC..." messages. Even with successful completions I get pages and pages of them, and I do have LAIM checked. Why are they happening?

astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 54866 - Posted: 29 Sep 2016, 18:21:26 UTC - in response to Message 54865.  

Sorry, it's beyond my understanding. I was unaware of that problem with the box ticked. Anything I add would be pure conjecture. Hopefully, someone 'out there' knows that part of the BOINC code.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 54867 - Posted: 29 Sep 2016, 19:31:16 UTC

There's a preference to suspend when CPU usage is above a certain percentage, or perhaps "while computer is in use", depending on the BOINC version. These can cause a lot of suspends as well.
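A quick way to see whether either preference is active is to look at the preferences XML the client is using. This is a rough sketch, assuming the relevant elements are `suspend_cpu_usage` (with 0 meaning no limit) and `run_if_user_active`, sitting directly under the root of the file. The helper name `suspend_triggers` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def suspend_triggers(prefs_xml):
    """List the settings in a preferences XML string that can suspend tasks.

    Assumptions: <suspend_cpu_usage> holds a percentage (0 = no limit) and
    <run_if_user_active> is 0 when computing stops while the machine is in use.
    Only direct children of the root element are checked.
    """
    root = ET.fromstring(prefs_xml)
    found = []

    cpu = root.findtext("suspend_cpu_usage")
    if cpu is not None and float(cpu) > 0:
        found.append(f"suspends when non-BOINC CPU usage exceeds {float(cpu):g}%")

    active = root.findtext("run_if_user_active")
    if active is not None and active.strip() == "0":
        found.append("suspends while the computer is in use")

    return found


if __name__ == "__main__":
    sample = (
        "<global_preferences>"
        "<suspend_cpu_usage>25.000000</suspend_cpu_usage>"
        "<run_if_user_active>0</run_if_user_active>"
        "</global_preferences>"
    )
    for line in suspend_triggers(sample):
        print(line)
```

An empty result only means these two settings are not the cause. Other preferences, such as time-of-day limits, are not checked here.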
timoshea888

Joined: 29 Jun 15
Posts: 1
Credit: 99,611
RAC: 0
Message 54877 - Posted: 2 Oct 2016, 23:02:57 UTC

Hi all

I started 2 SAS50 jobs about 5.5 days ago. Several zip files have been sent back to the project servers with no obvious problem.

However, 3 zips have been stalled from the beginning, each retry stopping at the same percent progress (in one case, at a tantalising 98.14%).

The client has made dozens of attempts at uploading these zips, backing off for approximately 1 to 4 hours between attempts. The log file has many entries reporting transient HTTP errors and project communication failures.

Most zips get through with no error. The stalled zips suffer these errors several times every day.

I have BOINC client 7.6.31 (x64) running on Mint Linux with kernel version 4.4.0-21-generic.

Both of the SAS50 jobs are coming up for completion in less than 24 hours. I would like to know whether there's anything I can do to break the zip logjam, or whether there is a problem at the server end that needs to be fixed.

Thanks
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 54878 - Posted: 3 Oct 2016, 0:49:28 UTC - in response to Message 54877.  

Possibly a restart:

Suspend BOINC.
Wait a few seconds to allow the models to stop.
Exit BOINC.
Re-start BOINC.
Unsuspend BOINC.

The models should restart from the previous check point.
And the uploads "may" start uploading.
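On a Linux box without the Manager open, the steps above might be scripted roughly as follows. This is only a sketch: it assumes `boinccmd` is installed and can authenticate to the client (for example, by running it from the BOINC data directory), and that the client runs as a systemd service named `boinc-client`, which is a guess based on the Debian/Ubuntu/Mint packages. By default it only prints the commands.

```python
import subprocess
import time

# Assumption: the client is a systemd service called "boinc-client".
# Change this for other setups.
START_CLIENT = ["sudo", "systemctl", "start", "boinc-client"]


def restart_steps(settle_seconds=10):
    """Return the ordered steps as (kind, value) pairs.

    kind is "run" (value is an argv list) or "wait" (value is seconds).
    """
    return [
        ("run", ["boinccmd", "--set_run_mode", "never"]),  # suspend BOINC
        ("wait", settle_seconds),                          # let the models stop
        ("run", ["boinccmd", "--quit"]),                   # exit BOINC
        ("wait", settle_seconds),
        ("run", START_CLIENT),                             # re-start BOINC
        ("wait", settle_seconds),                          # let the client come up
        ("run", ["boinccmd", "--set_run_mode", "auto"]),   # unsuspend BOINC
    ]


def execute(steps, dry_run=True):
    """Print the commands (dry run) or actually run them in order."""
    for kind, value in steps:
        if kind == "wait":
            if not dry_run:
                time.sleep(value)
        elif dry_run:
            print(" ".join(value))
        else:
            subprocess.run(value, check=True)


if __name__ == "__main__":
    execute(restart_steps())  # dry run: prints the commands only
```

Calling `execute(restart_steps(), dry_run=False)` would run the sequence for real. The wait times are arbitrary and may need to be longer on a machine running many models.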

Models for different areas go to different servers, usually somewhere in the area being modelled.
So it's always a good idea to give a link to the models in question, especially when there's lots of computers and lots of running models.

I think that the SAMs go to South America, and the SAS to Africa, so different servers.


