Thread 'THREE CRASHES IN A ROW'

Message boards : Number crunching : THREE CRASHES IN A ROW

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 53297 - Posted: 23 Jan 2016, 2:54:56 UTC

Had 3 failures in rapid succession. Each started, ran for less than 2 minutes and crashed. As they are all "_2" tasks, they have already failed twice on other machines: hadam3p_pnw_a3pt_198612_15_302_010264399_2, hadam3p_pnw_a3ps_198612_15_302_010264398_2 and hadam3p_pnw_a3pm_198512_15_302_010264392_2.
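
The naming pattern in those task names can be read mechanically. The following is a minimal sketch, not an official CPDN tool. It assumes the trailing number counts earlier copies of the same workunit (as the post says) and that the field just before the workunit id is the batch number, which matches the "batch 306" task mentioned later in this thread. `TaskName` and `parse_task_name` are hypothetical names made up for illustration.

```python
from typing import NamedTuple


class TaskName(NamedTuple):
    workunit: str  # everything before the final underscore
    batch: int     # assumed: the field just before the workunit id
    attempt: int   # trailing index; 0 would be the first copy sent out


def parse_task_name(name: str) -> TaskName:
    """Split a task name such as 'hadam3p_pnw_a3pm_198512_15_302_010264392_2'."""
    workunit, _, attempt = name.rpartition("_")
    fields = workunit.split("_")
    # Assumption: the second-to-last field of the workunit name is the batch.
    batch = int(fields[-2])
    return TaskName(workunit, batch, int(attempt))


tasks = [
    "hadam3p_pnw_a3pt_198612_15_302_010264399_2",
    "hadam3p_pnw_a3ps_198612_15_302_010264398_2",
    "hadam3p_pnw_a3pm_198512_15_302_010264392_2",
]
for t in tasks:
    parsed = parse_task_name(t)
    print(parsed.workunit, "batch", parsed.batch, "earlier attempts:", parsed.attempt)
```

For the three tasks above this reports batch 302 and two earlier attempts each.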

The cause for the crash of all 3 would appear to be "INITTIME: Atmosphere basis time mismatch".

Stderr for hadam3p_pnw_a3pm_198512_15_302_010264392_2 follows:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
14:55:18 (5152): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_1.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_2.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_3.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_4.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_5.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_6.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_7.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_8.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_9.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_10.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_11.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_12.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_13.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_14.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_15.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_16.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>
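
A long result dump like the one above boils down to two facts: the model's crash message and a run of identical upload errors. As a rough sketch (the function name `summarize_result` and the returned keys are made up for illustration, and the patterns only cover the line shapes seen in this post), the dump can be condensed like this:

```python
import re
from collections import Counter


def summarize_result(text: str) -> dict:
    """Pull the crash reason and the upload errors out of a pasted result dump."""
    crash = re.search(r"^Model crashed: (.+)$", text, re.MULTILINE)
    files = re.findall(r"<file_name>([^<]+)</file_name>", text)
    codes = Counter(re.findall(r"<error_code>(-?\d+)", text))
    return {
        "crash_reason": crash.group(1).strip() if crash else None,
        "failed_uploads": len(files),
        "error_codes": dict(codes),
    }


# Shortened copy of the dump above (two of the sixteen upload errors).
sample = """<stderr_txt>
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_1.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_2.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>"""

print(summarize_result(sample))
```

Every upload error in the dump carries the same code, -161 (not found), presumably because the model stopped before any of the output zips were written.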

ID: 53297
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53298 - Posted: 23 Jan 2016, 3:01:59 UTC - in response to Message 53297.  

Hi Jim

INITTIME: etc, is a file mismatch in the data set, so not your computer's "fault".

ID: 53298
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 53299 - Posted: 23 Jan 2016, 8:30:27 UTC - in response to Message 53298.  

Hi Jim

INITTIME: etc, is a file mismatch in the data set, so not your computer's "fault".



I figured that. It would just be nice to get some new work where the files do match. Oh well, it is snowing here. They're saying 12 - 24 inches on the ground by Sunday morning.


ID: 53299
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 53301 - Posted: 23 Jan 2016, 16:32:14 UTC

You may have gotten one of their "tests" that they "announce" under "Recent CPDN Submissions" on the right of the following page:

http://www.climateprediction.net/
ID: 53301
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 53302 - Posted: 23 Jan 2016, 19:11:38 UTC - in response to Message 53301.  

If they were from a test batch then that batch definitely needs more work.

ID: 53302
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53305 - Posted: 23 Jan 2016, 20:41:16 UTC - in response to Message 53302.  

And now that they know that something's wrong with them, and what it is, they can fix it.
Welcome back to beta testing, Jim.


ID: 53305
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,008,987
RAC: 21,524
Message 53308 - Posted: 24 Jan 2016, 9:17:00 UTC

Where possible these tests will be carried out offline or on our cpdn-beta project, but some tests do need to be run on our main project.


And as the beta site is down, presumably all will be run on the main site until beta comes back up, if it does.
ID: 53308
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 53337 - Posted: 26 Jan 2016, 16:58:11 UTC
Last modified: 26 Jan 2016, 17:01:42 UTC

Task 19212446, hadam3p_anz_k3ab_201212_12_306_010266812_0 crashed before the first trickle. This is a batch 306 task.

Stderr follows:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
05:26:34 (7184): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
05:26:35 (7184): No heartbeat from core client for 30 sec - exiting
05:26:36 (7184): No heartbeat from core client for 30 sec - exiting
05:26:38 (7184): No heartbeat from core client for 30 sec - exiting
05:26:39 (7184): No heartbeat from core client for 30 sec - exiting
05:26:40 (7184): No heartbeat from core client for 30 sec - exiting
10:26:27 (7984): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
10:32:30 (1268): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=7736, selfPID=7736, iMonCtr=2
10:48:30 (5592): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
10:51:40 (9080): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
10:54:48 (5452): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
10:54:49 (5452): No heartbeat from core client for 30 sec - exiting
10:54:50 (5452): No heartbeat from core client for 30 sec - exiting
10:54:51 (5452): No heartbeat from core client for 30 sec - exiting
10:54:52 (5452): No heartbeat from core client for 30 sec - exiting
10:54:53 (5452): No heartbeat from core client for 30 sec - exiting
10:54:54 (5452): No heartbeat from core client for 30 sec - exiting
10:54:55 (5452): No heartbeat from core client for 30 sec - exiting
10:54:56 (5452): No heartbeat from core client for 30 sec - exiting
11:07:43 (7512): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
11:19:59 (8688): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
11:20:00 (8688): No heartbeat from core client for 30 sec - exiting
11:20:01 (8688): No heartbeat from core client for 30 sec - exiting
11:20:02 (8688): No heartbeat from core client for 30 sec - exiting
11:20:03 (8688): No heartbeat from core client for 30 sec - exiting
11:20:04 (8688): No heartbeat from core client for 30 sec - exiting
Global Worker:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=5932, iMonCtr=1
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=0, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6636, selfPID=8552, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
Called boinc_finish

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_1.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_2.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_3.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_4.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_5.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_6.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_7.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_8.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_9.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_10.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_11.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_12.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_13.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>
Trickle Click here
No trickles!
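
The stderr above repeats the same timeout line under several different process ids, which probably means the task was restarted a number of times before it gave up. A small sketch for grouping those lines by process id follows. The helper name `heartbeat_timeouts_by_pid` is hypothetical, and the pattern assumes the exact line shape shown in this post.

```python
import re

# Matches lines like: 05:26:34 (7184): No heartbeat from core client for 30 sec - exiting
LINE = re.compile(r"^(\d\d:\d\d:\d\d) \((\d+)\): No heartbeat from core client")


def heartbeat_timeouts_by_pid(stderr: str) -> dict:
    """Map each process id to the timestamps at which it reported a lost heartbeat."""
    by_pid = {}
    for line in stderr.splitlines():
        m = LINE.match(line.strip())
        if m:
            stamp, pid = m.groups()
            by_pid.setdefault(int(pid), []).append(stamp)
    return by_pid


# First few lines of the stderr above.
sample = """05:26:34 (7184): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
05:26:35 (7184): No heartbeat from core client for 30 sec - exiting
10:26:27 (7984): No heartbeat from core client for 30 sec - exiting
10:32:30 (1268): No heartbeat from core client for 30 sec - exiting"""

for pid, stamps in heartbeat_timeouts_by_pid(sample).items():
    print(pid, len(stamps), "timeout line(s), first at", stamps[0])
```

Each distinct process id in the result marks a separate run of the model that lost contact with the client.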


ID: 53337
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53340 - Posted: 26 Jan 2016, 20:11:02 UTC

As I understand it, the "heartbeat" is something created by the BOINC client, as it monitors the progress of tasks. This message is in turn displayed by the BOINC Manager as the "status" of the task.

If something gets in the way of this communication (via port 31416, I think), then the manager decides that the task is dead and tells the client to start collecting whatever error data it can and sending it back to the server.

The cause can be an antivirus program, a flurry of disk writes that makes the client wait a while, etc.

So whatever the cause, it's most likely your computer rather than the model.
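
The log line "No heartbeat from core client for 30 sec - exiting" describes a simple watchdog: a process notes when it last heard from its supervisor and gives up once the silence lasts too long. The following is a generic sketch of that idea, not BOINC's actual code. The class `HeartbeatWatchdog` and its methods are invented for illustration, and only the 30 second limit is taken from the log.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Tracks the time since the last heartbeat and reports a timeout."""

    def __init__(self, timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Record that a heartbeat just arrived."""
        self._last_beat = self._clock()

    def expired(self) -> bool:
        """True once no heartbeat has been seen for longer than timeout_s."""
        return self._clock() - self._last_beat > self.timeout_s


# A simulated clock so the example runs instantly instead of waiting 30 s.
now = [0.0]
dog = HeartbeatWatchdog(timeout_s=30.0, clock=lambda: now[0])

now[0] = 10.0
dog.beat()            # heartbeat received at t = 10
now[0] = 35.0
print(dog.expired())  # 25 s of silence, still within the limit
now[0] = 41.0
print(dog.expired())  # 31 s of silence, over the limit
```

Anything that stalls the side sending the heartbeats for longer than the limit, such as the antivirus scan or burst of disk writes mentioned above, would trip a watchdog like this even though the model itself is healthy.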

ID: 53340
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,978,383
RAC: 14,247
Message 53342 - Posted: 27 Jan 2016, 0:10:03 UTC - in response to Message 53340.  

I've got 5 from batch 306 waiting to start, one of which will start in the next day or so. I'll see how it (they) progress.
ID: 53342
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53344 - Posted: 27 Jan 2016, 3:21:33 UTC

The following on the heartbeat problem may be of interest. (I just did a search on heartbeat boinc)

From 2008:
Heartbeat replacement

From February 2015:
Network connection and No heartbeat message #113
This one has some interesting stuff way down.

It's possible that keeping the network connection turned off, and only enabling it now and then after checking that it's working, may fix it.
This is what I do.

ID: 53344
