Thread 'Model crashed: INITTIME: Atmosphere basis time mismatch'

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649
Message 48450 - Posted: 19 Mar 2014, 12:16:23 UTC Last modified: 19 Mar 2014, 12:20:30 UTC
New batch of EU models this morning, but they all fail within a few seconds with the INITTIME error. The major nuisance is that, with limited bandwidth, it takes minutes to download WUs that then crash after a few seconds of run time. Is the whole batch of several thousand likely to fail with this error?
Edit: I have now got at least one that has run for several minutes, so the fear that the whole batch is bad was not justified.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944
Message 48452 - Posted: 19 Mar 2014, 12:29:58 UTC - in response to Message 48450. Last modified: 19 Mar 2014, 12:33:47 UTC
I suspended one of my tasks to see whether the 1996 unit I had downloaded today would crash, and it didn't. I have also downloaded one 2013 model and one 2002 model. That seems a strange mix if they are all part of the batch released; I suspect they are not.
Edit: Two of them actually downloaded yesterday evening at some point.

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1084 Credit: 7,824,485 RAC: 4,956
Message 48454 - Posted: 19 Mar 2014, 12:47:21 UTC
There is a mix of EU models at the moment. Two I got this morning are reissued timeouts from 5 April 2013. Nothing to do with the flooding attribution at all.

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649
Message 48455 - Posted: 19 Mar 2014, 13:58:31 UTC
After more (slow) downloads: all the models I've got today are EU models, all 2013, and all but the first 3 started OK. The 3 with the INITTIME error on startup were named a4my, a4mz and arn1. The INITTIME error comes from some kind of inconsistent parameters in the WU, yes?

Bonsai911 Joined: 9 Sep 04 Posts: 228 Credit: 30,756,611 RAC: 3,303
Message 48456 - Posted: 19 Mar 2014, 14:58:38 UTC
Same error over 19 workunits:
<core_client_version>7.2.39</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>o32_A2_1984_2020_N96_f.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>so2dms_N96_2013_12_2015_02f.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>ancil_OSTIA_seaice_2014.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>ancil_OSTIA_SST_2014.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
</message>
]]>
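
Editor's sketch (not part of the original post): with 19 workunits reporting the same thing, it can help to pull out just the file names and error codes from a report like the one above. The Python below is a minimal illustration, assuming the report has been pasted in as plain text; the function name `failed_downloads` and the `REPORT` excerpt are made up for this example and are not part of BOINC.

```python
import re

# Excerpt of the error report quoted above (two of the four entries).
REPORT = """
<file_xfer_error>
<file_name>o32_A2_1984_2020_N96_f.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>ancil_OSTIA_SST_2014.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
"""

# One match per <file_xfer_error> block: file name, numeric code, description.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<text>[^)]*)\)</error_code>"
)


def failed_downloads(report):
    """Return (file_name, error_code, description) for each failed transfer."""
    return [
        (m["name"], int(m["code"]), m["text"])
        for m in XFER_ERROR.finditer(report)
    ]


for name, code, text in failed_downloads(REPORT):
    print(f"{name}: {code} ({text})")
```

A regular expression is used rather than an XML parser because the pasted report has no single root element, so it would not parse as a document as it stands.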

Niall Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0
Message 48457 - Posted: 19 Mar 2014, 15:16:34 UTC
Hadam3p_eu_qgp9_2004 (downloaded last night), a5qm_2013 and a5qc_2013 (downloaded this morning) are running normally so far. WU a5qk_2013 is ready to start when a core becomes available. Looks like a patchy glitch. HTH

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 48461 - Posted: 19 Mar 2014, 19:22:29 UTC
I've been gradually allowing new work over the last few hours, and all models are for 2013, a5 and a6 series. All except one are original. The resend ran for 6+ hours on the original computer.
Oops! The one that just finished downloading on this computer has just failed after 19 seconds. I'll log it, then upload and see what happened.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 48462 - Posted: 19 Mar 2014, 19:25:22 UTC - in response to Message 48456.
Bonsai,
That list says that you have BOINC 7.2.39. There's a post on BOINC/dev somewhere about that version having a bad bug. I think that it was something to do with file transfers. :(

Bonsai911 Joined: 9 Sep 04 Posts: 228 Credit: 30,756,611 RAC: 3,303
Message 48463 - Posted: 19 Mar 2014, 19:41:13 UTC
Changed to 7.2.42.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 48464 - Posted: 19 Mar 2014, 20:39:33 UTC
The failure was due to the INITTIME problem. The replacement is a4u9, so this batch is, so far: a4.., a5.., and a6..
And they're going fast.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
Message 48465 - Posted: 19 Mar 2014, 21:42:56 UTC - in response to Message 48461.
I got three this morning that are eu_a8 (2013) series. Each one has over 7 hours on it, so if they fail, they are not failing fast.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944
Message 48470 - Posted: 20 Mar 2014, 7:52:14 UTC
All of the recent tasks that crashed for me did so in under two minutes. On the graphics they never got to the stage of showing a running model. I have therefore suspended my running models to check the downloaded ones, and I now have two running and three which I believe to be good waiting to run. If I had waited for my current models to finish before testing, I would have missed the boat on those I do have.

mmonnin Joined: 28 May 17 Posts: 49 Credit: 17,332,112 RAC: 7,027
Message 63989 - Posted: 25 May 2021, 22:37:11 UTC
Reviving an old thread as it was the 1st result on Google. I am getting the error in the thread title on HadSM4 at N144 tasks: no CPU usage, then an abort after around a minute. The same PC completes N216 tasks.
https://www.cpdn.org/result.php?resultid=22071424
https://www.cpdn.org/result.php?resultid=22071421
I've got several more paused, as BOINC downloaded too many. I'd like to fix this if possible before resuming.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944
Message 63991 - Posted: 26 May 2021, 9:00:41 UTC - in response to Message 63989. Last modified: 26 May 2021, 9:04:16 UTC
This has been reported to the project. Looking through tasks from this batch, I have so far found one other with this type of crash and will let the project know. As of about 0100 hrs UTC there were only 46 of this batch running, so it is difficult to know yet how widespread the problem is, but having found a third one out of those 46, I suspect a problem with the ancillary files for the tasks.
Edit: As of 13 minutes ago, the batch has been paused while they do some checking. A subsequent batch that was about to go out has also been paused, as it is part of the same experiment.

mmonnin Joined: 28 May 17 Posts: 49 Credit: 17,332,112 RAC: 7,027
Message 63992 - Posted: 26 May 2021, 9:29:23 UTC - in response to Message 63991.
OK, thanks for checking. I resumed the other 4 tasks on that PC, since it seemed like a batch issue rather than missing libs or something on the PC. Same result.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944
Message 63994 - Posted: 26 May 2021, 13:53:46 UTC - in response to Message 63992.
They think the issue has been identified and will, I presume, either cancel the tasks and resend them with the correctly dated files or fix them in situ on the server. (I don't know if they can do that.)

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944
Message 63995 - Posted: 26 May 2021, 15:50:53 UTC - in response to Message 63994.
It looks like the tasks have been withdrawn, as there are only 12 waiting to go out on the server. Sarah thinks it is a start date issue on one or more of the files. They will reappear with this corrected later today or tomorrow, at a guess. Any sitting on machines now that have got past the one- or two-minute stage may well be OK, but if I had any in my queue I would be aborting them until the fixed ones come out.
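
Editor's sketch (not part of the original post): the rule of thumb above, that tasks which have got past the first minute or two are probably fine, can be written as a small triage step. This is only an illustration in Python; the `Task` record, the task names and the 120-second threshold are assumptions made for the example, and nothing here talks to BOINC.

```python
from dataclasses import dataclass

# Assumption: the post's "one or two minute stage", taken at its upper end.
STARTUP_WINDOW_S = 120


@dataclass
class Task:
    name: str          # hypothetical task name
    elapsed_s: float   # run time so far, in seconds


def triage(tasks, window_s=STARTUP_WINDOW_S):
    """Split tasks into (probably OK, candidates to abort) by elapsed time."""
    keep = [t.name for t in tasks if t.elapsed_s > window_s]
    suspect = [t.name for t in tasks if t.elapsed_s <= window_s]
    return keep, suspect


# Hypothetical queue: one long-running task, one not started, one barely started.
queue = [Task("task_a", 5400.0), Task("task_b", 0.0), Task("task_c", 25.0)]
keep, suspect = triage(queue)
print("probably OK:", keep)
print("candidates to abort:", suspect)
```

The threshold is deliberately a parameter: the crashes reported in this thread happened anywhere between a few seconds and about two minutes in.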

mmonnin Joined: 28 May 17 Posts: 49 Credit: 17,332,112 RAC: 7,027
Message 63997 - Posted: 26 May 2021, 22:13:03 UTC
Resends are still being sent out. It's a lot of downloading just to abort in a minute. And since there are no app selections here, I've got a ton of N216 work.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
Message 63998 - Posted: 27 May 2021, 0:00:07 UTC - in response to Message 63989.
Me, too. I got only one N144, and it bombed very fast. I got it yesterday (I think), but my client server did not try to run it until today.
Task: 22073399
Name: hadsm4_a1in_201310_6_907_012084122_0
Workunit: 12084122
Created: 25 May 2021, 11:15:17 UTC
Sent: 26 May 2021, 8:08:45 UTC
Report deadline: 8 May 2022, 13:28:45 UTC
Received: 26 May 2021, 14:57:57 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 22 (0x00000016) Unknown error code
Computer ID: 1511241
Run time: 25 sec
CPU time:
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 6.51 GFLOPS
Application version: UK Met Office HadSM4 at N144 resolution v8.02 i686-pc-linux-gnu
Peak working set size: 10.14 MB
Peak swap size: 16.77 MB
Peak disk usage: 0.02 MB
Stderr:
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(
09:56:54 (266358): called boinc_finish(22)
</stderr_txt>
]]>
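
Editor's sketch (not part of the original post): the stderr above shows the pattern to look for, namely six repeated "Model crashed" lines, the "too many model crashes" give-up message, and the exit code passed to boinc_finish. The Python below is a minimal illustration of summarising such a log, assuming the stderr text has been copied into a string; the function name `summarise` and the dictionary keys are made up for this example.

```python
import re

# The stderr text from the post above.
STDERR = """\
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(
09:56:54 (266358): called boinc_finish(22)
"""

CRASH = re.compile(r"Model crashed: (\w+): ")            # routine named in the crash line
FINISH = re.compile(r"called boinc_finish\((\d+)\)")     # exit code reported at the end


def summarise(text):
    """Count crash lines, list the routines named, and pick out the exit code."""
    routines = CRASH.findall(text)
    finish = FINISH.search(text)
    return {
        "crashes": len(routines),
        "routines": sorted(set(routines)),
        "gave_up": "too many model crashes" in text,
        "exit_code": int(finish.group(1)) if finish else None,
    }


print(summarise(STDERR))
```

Comparing these summaries across several failed tasks is a quick way to confirm they all stop at the same point, which is what suggested a batch problem rather than a problem with one machine.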

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944
Message 63999 - Posted: 27 May 2021, 5:08:34 UTC
I am told the script is fixed now. The batches involved will probably go out some time after the "9-5" staff arrive in Oxford.