climateprediction.net (CPDN) home page
Thread 'Latest HADCM3S WU's Crashing'

Message boards : Number crunching : Latest HADCM3S WU's Crashing

Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,299,683
RAC: 10,424
Message 50123 - Posted: 11 Sep 2014, 13:23:06 UTC

I returned to the Project about a month ago after a 3 year absence. Using what is, by today's standards, a relatively outdated PC until I get round to building another, I've run a few regionals for PNW and am now running some ANZ's. I also completed some HADCM3S's without problems a week or so back, but from the latest batch of those WU's put on the server yesterday, my machine has rapidly crashed 4. Checking the history of the WU's, all 4 seem to have crashed on multiple machines, some of which have run many other model types to completion. My failure mode is shown below:

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
The device does not recognize the command.
(0x16) - exit code 22 (0x16)
</message>
<stderr_txt>

Model crashed: INITTIME: Atmosphere basis time mismatch

Model crashed: INITTIME: Atmosphere basis time mismatch

Model crashed: INITTIME: Atmosphere basis time mismatch

Model crashed: INITTIME: Atmosphere basis time mismatch

Model crashed: INITTIME: Atmosphere basis time mismatch

Model crashed: INITTIME: Atmosphere basis time mismatch
Sorry, too many model crashes! :-(
09:25:21 (3020): called boinc_finish

</stderr_txt>
]]>

I'm not aware of the development history of the HADCM3S WU's but is there a set of bad parameters still in some of these?
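For anyone comparing failures across several tasks, here is a minimal Python sketch that counts the "Model crashed" lines in a pasted stderr excerpt and groups them by reason. It is not a CPDN or BOINC tool: the function name `summarize_crashes` and the `SAMPLE` text are illustrative assumptions, and the pattern is based only on the log lines shown in this thread.

```python
import re
from collections import Counter

# Matches lines such as
# "Model crashed: INITTIME: Atmosphere basis time mismatch"
# (pattern inferred from the excerpts in this thread, not from any spec).
CRASH_RE = re.compile(r"^\s*Model crashed:\s*(?P<reason>.+?)\s*$")


def summarize_crashes(stderr_text: str) -> Counter:
    """Count 'Model crashed' lines in a stderr excerpt, grouped by reason."""
    reasons = Counter()
    for line in stderr_text.splitlines():
        match = CRASH_RE.match(line)
        if match:
            reasons[match.group("reason")] += 1
    return reasons


# Hypothetical sample, shortened from the excerpt above.
SAMPLE = """\
Model crashed: INITTIME: Atmosphere basis time mismatch

Model crashed: INITTIME: Atmosphere basis time mismatch
Sorry, too many model crashes! :-(
09:25:21 (3020): called boinc_finish
"""

if __name__ == "__main__":
    for reason, count in summarize_crashes(SAMPLE).items():
        print(f"{count} x {reason}")
```

If every crash line in a task carries the same reason, as here, that points at the task's inputs rather than at an intermittent fault on the machine.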

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50125 - Posted: 11 Sep 2014, 14:34:25 UTC

There is a problem with these work units on some Windows machines. I don't think anyone has yet worked out why it affects some machines and not others, though stopping machines, rather than letting them run 24/7 to completion, increases the chances of the tasks crashing out. They seem to be rock solid on Linux machines, surviving even power outages. This is also discussed in another thread.

Andrew Sanchez
Joined: 28 May 14
Posts: 34
Credit: 705,936
RAC: 0
Message 50126 - Posted: 11 Sep 2014, 14:48:54 UTC

The last 2 WU's I received said "computation error". I don't think they even started running; it seemed like a problem with the download. If someone wants to tell me how to get the details of the error, I will try to post them here. Go easy on me though, I'm not a computer whiz, so please explain step by step.

Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,299,683
RAC: 10,424
Message 50127 - Posted: 11 Sep 2014, 14:49:19 UTC - in response to Message 50125.  

Yes, I'd noticed that issue when searching for info. Had my machine not successfully completed some of the HADCM3S's from the batch loaded onto the server at the end of August, I would have put it down to that problem and loaded up the Linux VirtualBox on the same machine to see what happened.

The same 4 WU's which failed on mine today also had multiple failures on other users' machines, and all but one of those are Windows systems. One is a Linux system, but that seems to have an issue of its own, as many other WU's had failed on it too.

Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,299,683
RAC: 10,424
Message 50128 - Posted: 11 Sep 2014, 14:55:31 UTC - in response to Message 50126.  

The last 2 WU's I received said "computation error". I don't think they even started running; it seemed like a problem with the download. If someone wants to tell me how to get the details of the error, I will try to post them here. Go easy on me though, I'm not a computer whiz, so please explain step by step.


Yours has the same error. If you click on the relevant Task ID details in the task list for your PC, then click on the '+' box after 'Stderr', it will open the Stderr list and show the details. Here is one of yours.

HTH

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 50132 - Posted: 11 Sep 2014, 19:43:06 UTC

Welcome back, Pete! Good to see you here again.

My machines, all various Intel quads with various Windoze OS, have been among the "lucky" ones. They completed hundreds of HadCM3s tasks with only occasional crashes (after ~8 seconds), a single-digit percentage. Apparently, part of the release was misconfigured by the scientist(s). Luck of the draw for some of us.

In addition to the ongoing 32-bit library problem with 64-bit Linux installations, there was something about BOINC 'service' installations in Windows. Not sure whether the latter was a problem for HadCM3s tasks. (Dodgy old memory...)

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

Andrew Sanchez
Joined: 28 May 14
Posts: 34
Credit: 705,936
RAC: 0
Message 50145 - Posted: 12 Sep 2014, 13:09:15 UTC - in response to Message 50128.  
Last modified: 12 Sep 2014, 13:10:29 UTC

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)
</message>
<stderr_txt>
MonID=19054, UMPID=19059, RM3PID=0

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
MonID=19054, UMPID=19061, RM3PID=0

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
MonID=19054, UMPID=19063, RM3PID=0

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
MonID=19054, UMPID=19065, RM3PID=0

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
MonID=19054, UMPID=19070, RM3PID=0

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
MonID=19054, UMPID=19072, RM3PID=0

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
14:18:00 (19054): called boinc_finish

</stderr_txt>
]]>

Ah, OK. Thanks for showing me how to find that info. So this is a problem with the models, not my PC?

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,841,902
RAC: 5,047
Message 50147 - Posted: 12 Sep 2014, 16:51:52 UTC - in response to Message 50145.  

... Model crashed: INITTIME: Atmosphere basis time mismatch ... So this is a problem with the models, not my pc?

Yes. It's a model configuration error: every model in the work unit should fail in the same way.
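That rule of thumb can be written down as a small check. The sketch below is only an illustration under stated assumptions: `TaskResult` and `looks_like_config_error` are hypothetical names, and the attempt data is invented for the example rather than read from the CPDN server.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    """One attempt at a work unit (hypothetical record, not a BOINC type)."""
    host_os: str                 # e.g. "Windows" or "Linux"
    crash_reason: Optional[str]  # None means the attempt completed


def looks_like_config_error(results: List[TaskResult]) -> bool:
    """True if every attempt failed, and all failed with the same reason."""
    reasons = {r.crash_reason for r in results}
    return len(results) > 0 and None not in reasons and len(reasons) == 1


# Invented example: three hosts, identical failure on each.
attempts = [
    TaskResult("Windows", "INITTIME: Atmosphere basis time mismatch"),
    TaskResult("Windows", "INITTIME: Atmosphere basis time mismatch"),
    TaskResult("Linux", "INITTIME: Atmosphere basis time mismatch"),
]

if __name__ == "__main__":
    print(looks_like_config_error(attempts))
```

A work unit where some attempts complete, or where the reasons differ between hosts, would not pass this check and is more likely to be a host-side problem.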

Andrew Sanchez
Joined: 28 May 14
Posts: 34
Credit: 705,936
RAC: 0
Message 50150 - Posted: 12 Sep 2014, 18:59:59 UTC - in response to Message 50147.  

I see. Well, the task I just got seems to be running fine right now. I hope it stays that way. XD

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 50153 - Posted: 12 Sep 2014, 23:32:25 UTC

My machines are now bitten by the:
Model crashed: INITTIME: Atmosphere basis time mismatch
bug.

However, I'm unable to pin it to particular model years -- for example, many 1991 tasks fail but some run okay. Curious.

Perhaps Andrew Sanchez and I are seeing similar effects.

Andrew Sanchez
Joined: 28 May 14
Posts: 34
Credit: 705,936
RAC: 0
Message 50156 - Posted: 13 Sep 2014, 4:10:50 UTC - in response to Message 50153.  
Last modified: 13 Sep 2014, 4:12:19 UTC

I spoke too soon. :/ The task that I thought was going OK errored out.

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
The device does not recognize the command.
(0x16) - exit code 22 (0x16)
</message>
<stderr_txt>

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
06:08:08 (1616): called boinc_finish

</stderr_txt>
]]>

I'm having the same problem as I was with the previous tasks.

Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,299,683
RAC: 10,424
Message 50157 - Posted: 13 Sep 2014, 7:58:58 UTC - in response to Message 50132.  

Welcome back, Pete! Good to see you here again.

My machines, all various Intel quads with various Windoze OS, have been among the "lucky" ones. They completed hundreds of HadCM3s tasks with only occasional crashes (after ~8 seconds), a single-digit percentage. Apparently, part of the release was misconfigured by the scientist(s). Luck of the draw for some of us.

In addition to the ongoing 32-bit library problem with 64-bit Linux installations, there was something about BOINC 'service' installations in Windows. Not sure whether the latter was a problem for HadCM3s tasks. (Dodgy old memory...)


Hi Jim, good to be back and to see some interesting analyses of real, recent situations being done now. Over the last 3 years, with other things going on and a long daily work commute that began back then, I found I no longer had the time to play a part in the model testing and analysis, and I didn't want to crunch just for credit's sake.

The Windows v Linux issue with these CM3S's is an interesting one. The ones that failed on mine seemed to be failing on every other machine too, but those machines were all running Windows, with one exception. The Linux exception did not look representative though, as it had crashed many other WU's as well.

Now, after one CM3S crash yesterday, I got a WU that is holding and is now more than 50% complete, so there seems to be more to this than just a Windows OS issue.

brown
Joined: 24 Feb 06
Posts: 10
Credit: 10,142,658
RAC: 0
Message 50158 - Posted: 13 Sep 2014, 9:34:40 UTC - in response to Message 50157.  

I have also found that I am getting compute errors on about 75% of these new work units, although about 25% do seem to complete. Those that fail do so immediately, and I have noticed that the same work units have failed over and over when others attempted them.
I am on Windows 7 64-bit.

Billy Ewell 1931
Joined: 14 Aug 06
Posts: 22
Credit: 6,525,043
RAC: 8,826
Message 50165 - Posted: 14 Sep 2014, 4:26:08 UTC

My last 41 tasks failed due to "computation error" after approximately 25 to 70 seconds on an Intel i7 64, causing me to hit the "no new tasks" button about 10 hours ago.

An Intel i3 64 machine is processing 4 tasks simultaneously and has been doing so for days on end without failure, but the tasks in the hopper were downloaded days ago. This machine is normally operated about 12 hours daily because of heat considerations, so I have many weeks ahead to process the older CM3S's.

Obviously I am being repetitious in saying there is a serious problem with the newer tasks.

Bill, Austin, Texas, USA

Jonathan Miller
Joined: 27 Jul 12
Posts: 21
Credit: 269,602
RAC: 0
Message 50171 - Posted: 15 Sep 2014, 8:03:02 UTC

Hi Pete,

I spoke to the scientists on Friday, and they said that they had probably made a mistake with the parameters. There are 5 sets of parameters for each restart file, and they think that 3 or possibly 4 out of 5 may suffer from this error.
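As a rough check on what that implies, the sketch below works out the share of tasks expected to fail. It assumes, and this is an assumption rather than anything the scientists stated, that tasks are spread evenly across the 5 parameter sets and that every task using a bad set fails. The names are illustrative only.

```python
from fractions import Fraction

SETS_PER_RESTART_FILE = 5   # from the post above
SUSPECT_BAD_SETS = (3, 4)   # "3 or possibly 4 out of 5"


def expected_failure_fraction(bad_sets: int,
                              total_sets: int = SETS_PER_RESTART_FILE) -> Fraction:
    """Share of tasks expected to fail, assuming tasks are spread evenly
    over the parameter sets and every task on a bad set fails."""
    if not 0 <= bad_sets <= total_sets:
        raise ValueError("bad_sets must be between 0 and total_sets")
    return Fraction(bad_sets, total_sets)


if __name__ == "__main__":
    for bad in SUSPECT_BAD_SETS:
        share = expected_failure_fraction(bad)
        print(f"{bad} bad sets of {SETS_PER_RESTART_FILE}: "
              f"about {float(share):.0%} of tasks fail")
```

Under those assumptions the expected failure rate is 60% to 80%, which brackets the roughly 75% failure rate reported earlier in this thread.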

I will speak to Andy and the scientists about recalling and re-issuing this batch.

Jonathan

CPDN Sys-Admin

Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,299,683
RAC: 10,424
Message 50172 - Posted: 15 Sep 2014, 10:01:58 UTC - in response to Message 50171.  
Last modified: 15 Sep 2014, 10:08:49 UTC

Hi Jonathan

Thanks for the update, that explains it.

My system successfully completed the 6th WU from the current batch, which it downloaded on Friday, following 5 crashes over Thu/Fri. It has since crashed another 10, all with the same error. I have now deselected the HADCM3S WU's for the time being until everything is OK. I'll just wait to see if I'm lucky enough to find one of the odd ones occasionally returned from the other WU sets.

Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,299,683
RAC: 10,424
Message 50203 - Posted: 16 Sep 2014, 11:16:46 UTC

My latest CM3S, the first one I downloaded from the smaller (older?) set on the server late last night, presumably after the troublesome ones had been removed for correction, is holding up and running well.