Message boards : Number crunching : THREE CRASHES IN A ROW
Author | Message |
---|---|
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Had 3 failures in rapid succession. Each started, ran less than 2 minutes and crashed. As they are all "2" they have failed twice before on other machines.

hadam3p_pnw_a3pt_198612_15_302_010264399_2,
hadam3p_pnw_a3ps_198612_15_302_010264398_2,
hadam3p_pnw_a3pm_198512_15_302_010264392_2

The cause for the crash of all 3 would appear to be "INITTIME: Atmosphere basis time mismatch".

Stderr for hadam3p_pnw_a3pm_198512_15_302_010264392_2 follows:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
14:55:18 (5152): called boinc_finish(0)
</stderr_txt>
<message>
upload failure:
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_1.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_2.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_3.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_4.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_5.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_6.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_7.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_8.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_9.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_10.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_11.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_12.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_13.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_14.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_15.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_pnw_a3pm_198512_15_302_010264392_2_16.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
</message>
]]> |
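For anyone who wants to check a pasted stderr quickly, here is a rough sketch of how the two observations in the post above could be pulled out programmatically. It assumes, as the poster states, that the trailing number of a task name reflects how many times the work unit has already failed elsewhere. The function names `attempt_index` and `upload_errors` are made up for illustration and are not part of BOINC or CPDN.

```python
import re
from collections import Counter


def attempt_index(task_name: str) -> int:
    """Return the trailing number of a task name.

    Assumption (taken from the post, not from CPDN documentation):
    a name ending in _2 has already failed twice on other machines.
    """
    return int(task_name.rsplit("_", 1)[1])


def upload_errors(stderr_text: str) -> Counter:
    """Count the numeric <error_code> values in a pasted stderr dump."""
    codes = re.findall(r"<error_code>\s*(-?\d+)", stderr_text)
    return Counter(int(code) for code in codes)
```

Run against the dump above, `upload_errors` would report how many result files failed to upload and with which code, which is easier to scan than the repeated XML entries.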
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Jim, INITTIME: etc. is a file mismatch in the data set, so it is not your computer's "fault". |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Hi Jim I figured that. It would just be nice to get some new work where the files do match. Oh well, it is snowing here. They're saying 12 - 24 inches on the ground by Sunday morning. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
You may have gotten one of their "tests" that they "announce" under "Recent CPDN Submissions" on the right of the following page: http://www.climateprediction.net/ |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
If they were from a test batch then that batch definitely needs more work. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
And now that they know that something's wrong with them, and what it is, they can fix it. Welcome back to beta testing, Jim. |
Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
Where possible these tests will be carried out offline or on our cpdn-beta project, but some tests do need to be run on our main project. And as the beta site is down, presumably all tests will be run on the main site until (or unless) beta comes back up. |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Task 19212446, hadam3p_anz_k3ab_201212_12_306_010266812_0, crashed before the first trickle. This is a batch 306 task.

Stderr follows:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
05:26:34 (7184): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
05:26:35 (7184): No heartbeat from core client for 30 sec - exiting
05:26:36 (7184): No heartbeat from core client for 30 sec - exiting
05:26:38 (7184): No heartbeat from core client for 30 sec - exiting
05:26:39 (7184): No heartbeat from core client for 30 sec - exiting
05:26:40 (7184): No heartbeat from core client for 30 sec - exiting
10:26:27 (7984): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
10:32:30 (1268): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=7736, selfPID=7736, iMonCtr=2
10:48:30 (5592): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
10:51:40 (9080): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
10:54:48 (5452): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
10:54:49 (5452): No heartbeat from core client for 30 sec - exiting
10:54:50 (5452): No heartbeat from core client for 30 sec - exiting
10:54:51 (5452): No heartbeat from core client for 30 sec - exiting
10:54:52 (5452): No heartbeat from core client for 30 sec - exiting
10:54:53 (5452): No heartbeat from core client for 30 sec - exiting
10:54:54 (5452): No heartbeat from core client for 30 sec - exiting
10:54:55 (5452): No heartbeat from core client for 30 sec - exiting
10:54:56 (5452): No heartbeat from core client for 30 sec - exiting
11:07:43 (7512): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
11:19:59 (8688): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
11:20:00 (8688): No heartbeat from core client for 30 sec - exiting
11:20:01 (8688): No heartbeat from core client for 30 sec - exiting
11:20:02 (8688): No heartbeat from core client for 30 sec - exiting
11:20:03 (8688): No heartbeat from core client for 30 sec - exiting
11:20:04 (8688): No heartbeat from core client for 30 sec - exiting
Global Worker:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=5932, iMonCtr=1
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=0, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6636, selfPID=8552, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
Called boinc_finish
</stderr_txt>
<message>
upload failure:
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_1.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_2.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_3.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_4.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_5.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_6.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_7.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_8.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_9.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_10.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_11.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_12.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_anz_k3ab_201212_12_306_010266812_0_13.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error>
</message>
]]> |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
As I understand it, the "heartbeat" is something created by the BOINC client as it monitors the progress of tasks. This message is in turn displayed by the BOINC Manager as the "status" of the task. If something gets in the way of this communication (via port 31416, I think), then the manager decides that the task is dead and tells the client to start collecting whatever error data it can and sending it back to the server. The cause can be antivirus software, a flurry of disk writes that makes the client wait a while, etc. So whatever the cause, it's most likely your computer rather than the model. |
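To make the idea of a heartbeat timeout concrete, here is a minimal sketch of a watchdog that gives up when no heartbeat has arrived for 30 seconds, the figure shown in the "No heartbeat from core client for 30 sec" log lines. This is an illustration only and is not how BOINC or the CPDN monitor is actually implemented; the class `HeartbeatWatchdog` and its methods are hypothetical names.

```python
class HeartbeatWatchdog:
    """Track the last heartbeat and report when it is overdue.

    Illustrative only: times are plain numbers (seconds) passed in by
    the caller, so the behaviour can be checked without real waiting.
    """

    def __init__(self, start: float, timeout_s: float = 30.0):
        self.timeout_s = timeout_s   # 30 s matches the log message above
        self.last_beat = start       # treat start-up as the first heartbeat

    def beat(self, now: float) -> None:
        """Record that a heartbeat arrived at time `now`."""
        self.last_beat = now

    def expired(self, now: float) -> bool:
        """True if more than `timeout_s` seconds passed since the last beat."""
        return now - self.last_beat > self.timeout_s
```

With a start time of 0 and a heartbeat at 29 seconds, the watchdog would still be satisfied at 58 seconds but would report a timeout at 60 seconds. Anything that delays the heartbeat long enough, such as the antivirus scans or heavy disk writes mentioned above, would trip it in the same way.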
Joined: 22 Feb 06 Posts: 491 Credit: 30,978,383 RAC: 14,247 |
I've got 5 from batch 306 waiting to start, one of which will start in the next day or so. I'll see how it (they) progress. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The following on the heartbeat problem may be of interest. (I just did a search on "heartbeat boinc".) From 2008: "Heartbeat replacement". From February 2015: "Network connection and No heartbeat message #113". This one has some interesting stuff way down. It's possible that keeping the network connection turned off, and only enabling it now and then after checking that it's working, may fix it. This is what I do. |