Message boards : Number crunching : Latest HADCM3S WU's Crashing
Author | Message |
---|---|
Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424 |
I returned to the project about a month ago after a 3-year absence. I'm using a PC that is relatively outdated by today's standards until I get round to building another; I've run a few regionals for PNW and am now running some ANZs. I also completed some HADCM3S's without problems a week or so back, but from the latest batch of those WUs put on the server yesterday, my machine has rapidly crashed 4. Checking the history of the WUs, all 4 seem to have suffered crashes on multiple machines, some of which have run many other types to completion. My failure mode is shown: <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> The device does not recognize the command. (0x16) - exit code 22 (0x16) </message> <stderr_txt> Model crashed: INITTIME: Atmosphere basis time mismatch Model crashed: INITTIME: Atmosphere basis time mismatch Model crashed: INITTIME: Atmosphere basis time mismatch Model crashed: INITTIME: Atmosphere basis time mismatch Model crashed: INITTIME: Atmosphere basis time mismatch Model crashed: INITTIME: Atmosphere basis time mismatch Sorry, too many model crashes! :-( 09:25:21 (3020): called boinc_finish </stderr_txt> ]]> I'm not aware of the development history of the HADCM3S WUs, but is there a set of bad parameters still in some of these? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
There is a problem with these work units on some Windows machines. I don't think anyone has yet worked out why it affects some machines and not others, though stopping machines rather than letting them run 24/7 to completion increases the chances of them crashing out. They seem to be rock solid on Linux machines, surviving even power outages. This is also discussed in another thread. |
Send message Joined: 28 May 14 Posts: 34 Credit: 705,936 RAC: 0 |
The last 2 wus i received said "computation error". I don't think they even started running, it seemed like a problem with the download. If someone wants to tell me how to get the details of the error i will try and post them here. Go easy on me though, i'm not a computer wiz, explain step by step. |
Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424 |
Yes, I'd noticed that issue when searching through for info. If mine hadn't successfully completed some of the HADCM3S's in the batch loaded onto the server at the end of August, I would have put it down to that problem and loaded up the Linux VirtualBox on the same machine to see what happened. The same 4 WUs which failed on mine today also had multiple failures on other users' machines, and all but one of those are Windows systems. One is a Linux system, but that seems to have an issue of its own, as many other WUs had failed on it too. |
Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424 |
The last 2 wus i received said "computation error". I don't think they even started running, it seemed like a problem with the download. If someone wants to tell me how to get the details of the error i will try and post them here. Go easy on me though, i'm not a computer wiz, explain step by step. Yours has the same error. If you click on the relevant Task ID details on your PC Task list, then click on the '+' box after 'Stderr', it will open the Stderr list and show the details. Here is one of yours. HTH |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Welcome back, Pete! Good to see you here again. My machines, all various Intel quads with various Windoze OS, have been among the "lucky" ones. They completed hundreds of HadCM3s tasks with occasional crashes (after ~8 seconds), single digit percentage. Apparently, part of the release was misconfigured by the scientist(s). Luck of the draw for some of us. In addition to the ongoing 32-bit library problem with 64-bit Linux installations, there was something about boinc 'service' installations in Windows. Not sure whether the latter was a problem for HadCM3s tasks. (Dodgy old memory...) "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 28 May 14 Posts: 34 Credit: 705,936 RAC: 0 |
<core_client_version>7.2.42</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234) </message> <stderr_txt> MonID=19054, UMPID=19059, RM3PID=0 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 MonID=19054, UMPID=19061, RM3PID=0 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 MonID=19054, UMPID=19063, RM3PID=0 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 MonID=19054, UMPID=19065, RM3PID=0 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 MonID=19054, UMPID=19070, RM3PID=0 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 MonID=19054, UMPID=19072, RM3PID=0 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 Sorry, too many model crashes! :-( 14:18:00 (19054): called boinc_finish </stderr_txt> ]]> Ah ok. Thanks for showing me how to find that info. So this is a problem with the models, not my PC? |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,826,970 RAC: 5,066 |
... Model crashed: INITTIME: Atmosphere basis time mismatch ... So this is a problem with the models, not my pc? Yes. It's a model configuration error: every model in the work unit should fail in the same way. |
Send message Joined: 28 May 14 Posts: 34 Credit: 705,936 RAC: 0 |
I see. Well, the task i just got seems to be running fine right now, i hope it stays that way. XD |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
My machines are now bitten by the "Model crashed: INITTIME: Atmosphere basis time mismatch" bug. However, I'm unable to pin it to particular model years -- for example, many 1991 tasks fail but some run okay. Curious. Perhaps Andrew Sanchez and I are seeing similar effects. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 28 May 14 Posts: 34 Credit: 705,936 RAC: 0 |
I spoke too soon. :/ The task that I thought was going ok erred. <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> The device does not recognize the command. (0x16) - exit code 22 (0x16) </message> <stderr_txt> Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048 Sorry, too many model crashes! :-( 06:08:08 (1616): called boinc_finish </stderr_txt> ]]> I'm having the same problem as I was with the previous tasks. |
Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424 |
Welcome back, Pete! Good to see you here again. Hi Jim, good to be back and to see some interesting analyses of real, recent situations being done now. Over the last 3 years, with other things going on and a long daily work commute that began back then, I found I no longer had the time to play a part in the model testing and analysis, and I didn't want to crunch just for credit's sake. The Windows v Linux issue with these CM3S's is an interesting one. The ones that failed on mine seemed to be failing on every other machine also, but those were all running Windows with one exception. The Linux exception did not look representative though, as it had crashed many other WUs also. Now, after one CM3S crash yesterday, I got a WU that is holding and is now more than 50% complete, so there seems to be more than just a Windows OS issue with them. |
Send message Joined: 24 Feb 06 Posts: 10 Credit: 10,142,658 RAC: 0 |
I have also found that I am getting compute errors on about 75% of these new work units, although 25% do seem to complete. Those that fail do so immediately, and I have noticed that the same work units have also failed repeatedly when others attempted them. I am on Win 7 64. |
Send message Joined: 14 Aug 06 Posts: 22 Credit: 6,514,274 RAC: 10,511 |
My last 41 tasks failed due to "computer error" after approximately 25 to 70 seconds on an Intel i7 64, causing me to hit the "no new tasks" button about 10 hours ago. An Intel i3 64 machine is processing 4 tasks simultaneously and has been for days on end without failure, but the tasks in the hopper were downloaded days ago. This machine is normally operated about 12 hours daily because of heat considerations, so I have many weeks ahead to process the older CM3s'. Obviously I am being repetitious in saying there is a serious problem with the newer tasks. Bill, Austin, Texas USA |
Send message Joined: 27 Jul 12 Posts: 21 Credit: 269,602 RAC: 0 |
Hi Pete, I spoke to the scientists on Friday, and they said that they had probably made a mistake with the parameters. There are 5 sets of parameters for each restart file, and they think that 3 or possibly 4 out of 5 may suffer from this error. I will speak to Andy and the scientists about recalling and re-issuing this batch. Jonathan CPDN Sys-Admin |
Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424 |
Hi Jonathan, thanks for the update, that explains it. My system successfully completed the 6th from the current batch that it downloaded on Friday, following 5 crashes over Thu/Fri. It has then crashed another 10 since, all with the same error. I have now deselected the HADCM3S WUs for the time being until everything is OK. I'll just wait to see if I'll be lucky and find one of the odd ones occasionally returned from the other WU sets. |
Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424 |
My latest CM3S, the first one I downloaded from the smaller (older?) set on the server late last night, presumably after the troublesome ones had been removed for correction, is holding up and running well. |
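
For anyone who wants to check whether a failed task shows the same signature discussed in this thread, the following is a minimal Python sketch that counts the "Model crashed" messages in a pasted Stderr text and reports whether the task gave up with "too many model crashes". The function name `summarise_stderr` and the regular expression are assumptions made for illustration only; they are not part of BOINC or the CPDN software. The sample text is copied from the first post above.

```python
import re
from collections import Counter

# Hypothetical pattern: "Model crashed: <ROUTINE>: <reason>", where the reason
# ends at the next known token seen in the logs posted in this thread.
CRASH_RE = re.compile(
    r"Model crashed:\s*(?P<routine>\w+):\s*(?P<reason>.+?)"
    r"(?=\s*(?:tmp/pipe_dummy|MonID=|Model crashed:|Sorry,|$))"
)


def summarise_stderr(stderr_text):
    """Return (crash_counts, gave_up) for a task's Stderr text.

    crash_counts maps (routine, reason) to the number of times it appeared.
    gave_up is True if the wrapper reported too many model crashes.
    """
    crash_counts = Counter(
        (m.group("routine"), m.group("reason"))
        for m in CRASH_RE.finditer(stderr_text)
    )
    gave_up = "too many model crashes" in stderr_text
    return crash_counts, gave_up


# Sample Stderr text as pasted in the first post (flattened onto one line).
sample = (
    "Model crashed: INITTIME: Atmosphere basis time mismatch " * 6
    + "Sorry, too many model crashes! :-( 09:25:21 (3020): called boinc_finish"
)

counts, gave_up = summarise_stderr(sample)
for (routine, reason), n in counts.items():
    print(f"{routine}: {reason} (seen {n} times)")
print("Task gave up:", gave_up)
```

If every crash in the output names the same routine and reason, that matches the configuration error described earlier in the thread rather than a fault on the individual PC.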
©2024 cpdn.org