climateprediction.net (CPDN) home page
Thread 'Model crashed: INITTIME: Atmosphere basis time mismatch'

Message boards : Number crunching : Model crashed: INITTIME: Atmosphere basis time mismatch

Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 48450 - Posted: 19 Mar 2014, 12:16:23 UTC
Last modified: 19 Mar 2014, 12:20:30 UTC

New batch of eu's this morning but they all fail in a few seconds with the INITTIME error. The major nuisance here is that with limited bandwidth it takes minutes to download WU's that then crash in a few seconds of run-time. Likely the whole batch of several thousand will fail with this error?

Edit: I have now got at least one that has run for several minutes, so the fear that the whole batch is bad was not justified.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 48452 - Posted: 19 Mar 2014, 12:29:58 UTC - in response to Message 48450.  
Last modified: 19 Mar 2014, 12:33:47 UTC

I suspended one of my tasks to see if the 1996 unit I had downloaded today would crash or not and it didn't. I have also downloaded one 2013 model and one 2002 model. Seems a strange mix if they are all part of the batch released - I suspect they are not.

Edit: Two of them actually downloaded yesterday evening at some point.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,824,485
RAC: 4,956
Message 48454 - Posted: 19 Mar 2014, 12:47:21 UTC

There is a mix of EU models at the moment. Two I got this morning are reissued timeouts from 5 April 2013. Nothing to do with the flooding attribution at all.

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 48455 - Posted: 19 Mar 2014, 13:58:31 UTC

After more (slow) downloads --
All the models I've got today are eu models, all 2013, all but the first 3 started OK.
The 3 with the INITTIME error on startup were named a4my, a4mz, arn1.

The INITTIME error is from some kind of inconsistent parameters in the WU, yes?

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,756,611
RAC: 3,303
Message 48456 - Posted: 19 Mar 2014, 14:58:38 UTC

Same error over 19 workunits:



<core_client_version>7.2.39</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>o32_A2_1984_2020_N96_f.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>so2dms_N96_2013_12_2015_02f.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>ancil_OSTIA_seaice_2014.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>ancil_OSTIA_SST_2014.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>

</message>
]]>
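
A rough sketch for pulling the failed file names out of a block like the one above. It uses a regular expression rather than an XML parser because the pasted text is not a well-formed XML document on its own. The function name `failed_downloads` and the trimmed `sample` string are illustrative assumptions, not part of BOINC.

```python
import re

# One <file_xfer_error> block: capture the file name and the numeric error code.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)[^<]*</error_code>"
)


def failed_downloads(stderr_text):
    """Return (file_name, error_code) pairs for every transfer error found."""
    return [
        (m.group("name"), int(m.group("code")))
        for m in XFER_ERROR.finditer(stderr_text)
    ]


# Two entries copied from the post above, trimmed for brevity.
sample = """
<file_xfer_error>
<file_name>o32_A2_1984_2020_N96_f.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>ancil_OSTIA_SST_2014.gz</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>
"""

for name, code in failed_downloads(sample):
    print(f"{name}: error {code}")
```

Listing the files this way makes it easier to see whether every failed workunit is missing the same input files, which would point at the download side rather than at the model itself.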

Niall
Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 48457 - Posted: 19 Mar 2014, 15:16:34 UTC

Hadam3p_eu_qgp9_2004 (d/l last night), a5qm_2013 and a5qc_2013 (d/l this morning) running normally, so far. WU a5qk_2013 ready to start when a core becomes available.

Looks like a patchy glitch.

HTH

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48461 - Posted: 19 Mar 2014, 19:22:29 UTC

I've been gradually allowing new work over the last few hours, and all models are for 2013, a5 and a6 series.
All except one are original. The resend ran for 6+ hours on the original computer.

Oops! The one that just finished downloading on this computer has just failed after 19 seconds.

I'll log it, then upload and see what happened.


Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48462 - Posted: 19 Mar 2014, 19:25:22 UTC - in response to Message 48456.  

Bonsai

That list says that you have BOINC 7.2.39

There's a post on BOINC/dev somewhere about that version having a bad bug. I think that it was something to do with file transfers. :(

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,756,611
RAC: 3,303
Message 48463 - Posted: 19 Mar 2014, 19:41:13 UTC

changed to 7.2.42

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48464 - Posted: 19 Mar 2014, 20:39:33 UTC

The failure was due to the INITTIME problem.
The replacement is a4u9, so this batch is, so far: a4.., a5.., and a6..

And they're going fast.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 48465 - Posted: 19 Mar 2014, 21:42:56 UTC - in response to Message 48461.  

I got three this morning that are eu_a8 (2013) series.
Each one has over 7 hours on it, so if they fail, they are not failing fast.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 48470 - Posted: 20 Mar 2014, 7:52:14 UTC

All of the recent tasks that crashed for me did so in under two minutes. On the graphics they never got to the stage of showing a running model. I have therefore suspended my running models to check the downloaded ones, and I now have two running and three, which I believe to be good, waiting to run. If I had waited for my current models to finish before testing, I would have missed the boat on those I do have.

mmonnin
Joined: 28 May 17
Posts: 49
Credit: 17,332,112
RAC: 7,027
Message 63989 - Posted: 25 May 2021, 22:37:11 UTC

Reviving an old thread as it was the 1st result on Google.

I am getting the thread title errors on HadSM4 at N144 tasks. No CPU usage then abort after around a minute. PC completes N216 tasks.
https://www.cpdn.org/result.php?resultid=22071424
https://www.cpdn.org/result.php?resultid=22071421

I've got several more paused as BOINC downloaded too many. I'd like to fix if possible before resuming.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 63991 - Posted: 26 May 2021, 9:00:41 UTC - in response to Message 63989.  
Last modified: 26 May 2021, 9:04:16 UTC

This has been reported to the project. Looking through tasks from this batch, I have so far found one other with this type of crash and will let the project know. As of about 01:00 UTC there were only 46 of this batch running, so it is difficult to know yet how widespread the problem is, but having found a third one out of those 46, I suspect a problem with the ancillary files for the tasks.

Edit: As of 13 minutes ago, the batch has been paused while they do some checking. A subsequent batch that was about to go out has also been paused, as it is part of the same experiment.

mmonnin
Joined: 28 May 17
Posts: 49
Credit: 17,332,112
RAC: 7,027
Message 63992 - Posted: 26 May 2021, 9:29:23 UTC - in response to Message 63991.  

Ok thanks for checking. I resumed the other 4 tasks on that PC since it seemed like batch issues vs missing libs or something on the PC. Same result.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 63994 - Posted: 26 May 2021, 13:53:46 UTC - in response to Message 63992.  

They think the issue has been identified and will, I presume, either cancel the tasks and resend them with the correctly dated files or fix them in situ on the server. (I don't know if they can do that.)

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 63995 - Posted: 26 May 2021, 15:50:53 UTC - in response to Message 63994.  

It looks like they have been withdrawn, as there are only 12 waiting to go out on the server. Sarah thinks it is a start date issue on one or more of the files. They will reappear with this corrected later today or tomorrow, at a guess. Any sitting on machines now that have gotten past the one- or two-minute stage may well be OK, but if I had any in my queue I would be aborting them until the fixed ones come out.

mmonnin
Joined: 28 May 17
Posts: 49
Credit: 17,332,112
RAC: 7,027
Message 63997 - Posted: 26 May 2021, 22:13:03 UTC

Resends are still being sent out. It's a lot of downloading just to abort in a minute. And since there are no app selections here, I've got a ton of N216 work.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 63998 - Posted: 27 May 2021, 0:00:07 UTC - in response to Message 63989.  

Me, too. I got only one N144, and it bombed very fast. I got it yesterday (I think), but my client server did not try to run it until today.

Task 22073399
Name 	hadsm4_a1in_201310_6_907_012084122_0
Workunit 	12084122
Created 	25 May 2021, 11:15:17 UTC
Sent 	26 May 2021, 8:08:45 UTC
Report deadline 	8 May 2022, 13:28:45 UTC
Received 	26 May 2021, 14:57:57 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	22 (0x00000016) Unknown error code
Computer ID 	1511241
Run time 	25 sec
CPU time 	
Validate state 	Invalid
Credit 	0.00
Device peak FLOPS 	6.51 GFLOPS
Application version 	UK Met Office HadSM4 at N144 resolution v8.02
i686-pc-linux-gnu
Peak working set size 	10.14 MB
Peak swap size 	16.77 MB
Peak disk usage 	0.02 MB
Stderr 	

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>

Model crashed: INITTIME: Atmosphere basis time mismatch                                                                                                                                                                                                                        tmp/xnnuj.pipe_dummy                                                            

Model crashed: INITTIME: Atmosphere basis time mismatch                                                                                                                                                                                                                        



tmp/xnnuj.pipe_dummy                                                            

Model crashed: INITTIME: Atmosphere basis time mismatch                                                                                                                                                                                                                        tmp/xnnuj.pipe_dummy                                                            

Model crashed: INITTIME: Atmosphere basis time mismatch                                                                                                                                                                                                                        tmp/xnnuj.pipe_dummy                                                            

Model crashed: INITTIME: Atmosphere basis time mismatch                                                                                                                                                                                                                        tmp/xnnuj.pipe_dummy                                                            

Model crashed: INITTIME: Atmosphere basis time mismatch                                                                                                                                                                                                                        tmp/xnnuj.pipe_dummy                                                            
Sorry, too many model crashes! :-(
09:56:54 (266358): called boinc_finish(22)

</stderr_txt>
]]>
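
A minimal sketch for summarising a stderr dump like the one above, assuming the crash reason is whatever follows `Model crashed:` up to the run of padding spaces or the end of the line. The names `crash_reasons` and `PAD`, and the shortened `sample` text, are illustrative assumptions and not part of the CPDN application.

```python
import re
from collections import Counter

# Reason text runs from "Model crashed:" to the first run of 2+ whitespace
# characters (the log pads the message with spaces) or to the end of the line.
CRASH = re.compile(r"Model crashed:\s*(.+?)(?=\s{2,}|$)", re.MULTILINE)


def crash_reasons(stderr_text):
    """Count how many times each 'Model crashed:' reason appears."""
    return Counter(m.group(1) for m in CRASH.finditer(stderr_text))


PAD = " " * 40  # stand-in for the long padding in the real log
sample = (
    f"Model crashed: INITTIME: Atmosphere basis time mismatch{PAD}tmp/xnnuj.pipe_dummy\n"
    "\n"
    f"Model crashed: INITTIME: Atmosphere basis time mismatch{PAD}\n"
    "Sorry, too many model crashes! :-(\n"
)

for reason, count in crash_reasons(sample).items():
    print(f"{count} x {reason}")
```

A single repeated reason, as in this log, suggests the model hit the same startup check on every restart attempt rather than failing in different places.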

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 63999 - Posted: 27 May 2021, 5:08:34 UTC

I am told the script is fixed now. The batches involved will probably go out some time after the "9-5" staff arrive in Oxford.
