Thread 'HadCM3 short - errors galore'

Author	Message
Bektor Send message Joined: 4 Sep 04 Posts: 1 Credit: 144,946 RAC: 0	Message 50279 - Posted: 23 Sep 2014, 21:55:33 UTC I am not alone, in fact everyone gets errors in Workunit ID: 9128168 9130447 9056558 9129138 9128812 Regards Tommy ID: 50279 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50285 - Posted: 24 Sep 2014, 6:23:44 UTC - in response to Message 50279. I am not alone, in fact everyone gets errors in Workunit ID: 9128168 Model crashed: INITTIME: Atmosphere basis time mismatch Looking at the first work unit you quote, all of its tasks have the above error, which I think, from memory, is some sort of configuration error in the tasks. At least they only run for 30 seconds or so before falling over. I also noted on the first one that they were failing on both Windows and Linux machines. I am sure the rest are similar, though I didn't check them. ID: 50285 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 50293 - Posted: 24 Sep 2014, 12:51:15 UTC Errors Galore, is she any relation to Pussy Galore? ;) ID: 50293 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50294 - Posted: 24 Sep 2014, 13:06:17 UTC - in response to Message 50293. Or He? ID: 50294 · Reply Quote

Bellator Send message Joined: 31 Mar 05 Posts: 44 Credit: 234,235 RAC: 0	Message 50295 - Posted: 24 Sep 2014, 14:28:01 UTC Having given up on CPDN (temporarily, I hope), I tried running SETI. Same problem! Is all of BOINC infected? ID: 50295 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50297 - Posted: 24 Sep 2014, 14:33:34 UTC - in response to Message 50295. SETI would be a different problem. The cm3s problems may be another manifestation of the exe file getting truncated by a virus checker. ID: 50297 · Reply Quote

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038	Message 50299 - Posted: 24 Sep 2014, 17:55:10 UTC - in response to Message 50295. [Bellator wrote:]Having given up on CPDN (temporarily, I hope), I tried running SETI. Same problem! Is all of BOINC infected? Your HADCM3S models show as "aborted by user", though one does report a download problem. The models that were running appear to have been running well. Why did you abort them? ID: 50299 · Reply Quote

w1hue Send message Joined: 31 Aug 05 Posts: 20 Credit: 1,969,695 RAC: 0	Message 50302 - Posted: 24 Sep 2014, 22:00:39 UTC Last modified: 24 Sep 2014, 22:01:45 UTC My last five WUs were "UK Met Office HadCM3 short v7.24" and they all suffered a "computation error" after running about 45 minutes with CPU time shown as only a few seconds. (This was on my machine "LEPC".) I can't find any error messages -- but maybe I don't know where to look. The only things that look like errors are several messages of the form "23-Sep-2014 23:13:40 [climateprediction.net] Output file hadcm3s_2bi1_2001_2_009022880_4_2.zip for task hadcm3s_2bi1_2001_2_009022880_4 absent" in the file "stdoutdae.txt" located in ". . . /Application Data/BOINC". And for the record, I have NO PROBLEMS running Asteroids, LHC, World Community or Cosmology CPU tasks or Seti, Einstein or Milkyway GPU tasks. ID: 50302 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50303 - Posted: 24 Sep 2014, 23:24:52 UTC - in response to Message 50302. Error messages for each model are on the server page for that model. So ... Go to your account page. To get there: A) Click on your name alongside one of your posts, or B) Click on Your account in the blue menu to the left, or C) Click on the Your account button in the tasks tab of your BOINC manager. Towards the bottom of the screenful of data is: Tasks, with View to the right of there. Click on View. On the next page, the first column is the names of the tasks/models that you've run. Click on the one in which you're interested. Go down to Stderr, and click on the + symbol to expand the list. ************ Output file absent ... means that the model never got to the point where it created that file. But BOINC has a list of files that it's supposed to upload, so IT'S the one complaining there. And for the record ... Only relevant if those projects also compile their programs with FORTRAN. :) ID: 50303 · Reply Quote

w1hue Send message Joined: 31 Aug 05 Posts: 20 Credit: 1,969,695 RAC: 0	Message 50314 - Posted: 26 Sep 2014, 4:53:03 UTC - in response to Message 50303. Error messages for each model, are on the server page for each of the models. . . . Thanks for the info. All of the crashed WUs show similar error info: <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> The device does not recognize the command. (0x16) - exit code 22 (0x16) </message> <stderr_txt> Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Sorry, too many model crashes! :-( 10:28:25 (3708): called boinc_finish </stderr_txt> ]]> I hope this is useful information. In the meantime, I am not accepting any more "climateprediction.net" work. ID: 50314 · Reply Quote
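For anyone who wants to summarise a Stderr listing like the one above rather than read it by eye, here is a minimal sketch in Python. It assumes the Stderr text has been copied from the task page into a string with one message per line (the forum flattens it onto a single line); the function name count_crashes and the sample text are made up for illustration and are not part of BOINC or the CPDN application.

```python
from collections import Counter

PREFIX = "Model crashed:"

def count_crashes(stderr_text):
    """Count 'Model crashed:' messages in a Stderr listing, grouped by reason.

    Assumption: each message sits on its own line, as on the task page.
    """
    counts = Counter()
    for line in stderr_text.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            reason = line[len(PREFIX):].strip()
            counts[reason] += 1
    return counts

# Hypothetical sample, shaped like the listing quoted in the post above.
sample = """Model crashed: ATM_DYN : INVALID THETA DETECTED.
tmp/pipe_dummy 2048
Model crashed: ATM_DYN : INVALID THETA DETECTED.
tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
"""

for reason, n in count_crashes(sample).items():
    print(f"{n} x {reason}")
```

A listing dominated by a single reason, as here, fits a problem inside the work unit itself rather than a scatter of unrelated machine faults.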

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 50315 - Posted: 26 Sep 2014, 6:26:37 UTC - in response to Message 50314. Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Sorry, too many model crashes! :-( 10:28:25 (3708): called boinc_finish </stderr_txt> ]]> I hope this is useful information. In the meantime, I am not accepting any more "climateprediction.net" work. It is useful. "Invalid Theta" means that there was an internal problem in the WU itself (such as producing an unrealistic climate in which the oceans boil off or the Earth's surface melts). The good news is that it means there is nothing wrong with your computer and you did nothing wrong while running to cause the crash. ID: 50315 · Reply Quote

Professor Desty Nova Send message Joined: 19 Sep 04 Posts: 92 Credit: 2,013,293 RAC: 392	Message 50316 - Posted: 26 Sep 2014, 6:41:11 UTC - in response to Message 50315. Last modified: 26 Sep 2014, 6:44:07 UTC Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Sorry, too many model crashes! :-( 10:28:25 (3708): called boinc_finish </stderr_txt> ]]> I hope this is useful information. In the meantime, I am not accepting any more "climateprediction.net" work. It is useful. "Invalid Theta" means that there was an internal problem in the WU itself (such as producing an unrealistic climate in which the oceans boil off or the Earth's surface melts). The good news is that it means there is nothing wrong with your computer and you did nothing wrong while running to cause the crash. Strangely, I picked up a WU that had crashed with lots of "Model crashed: ATM_DYN : INVALID THETA DETECTED." on another computer, and mine completed it without problem. Luck? Professor Desty Nova Researching Karma the Hard Way ID: 50316 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50318 - Posted: 26 Sep 2014, 8:02:55 UTC - in response to Message 50316. Possibly different processors (Intel/AMD). Their maths routines are slightly different, so they produce slightly different results. If the researchers study the results down fine enough, that would give them even more info. i.e. That data set is VERY borderline. ID: 50318 · Reply Quote

Pete B Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424	Message 50319 - Posted: 26 Sep 2014, 8:19:25 UTC - in response to Message 50316. Strangely, I picked up a WU that had crashed with lots of "Model crashed: ATM_DYN : INVALID THETA DETECTED." on another computer, and mine completed it without problem. Luck? In the days when we used to run 200 year long HADCM3 models that could take a month or more to get through, it was usual to make backups fairly regularly so if a run crashed, it could be re-run from a short time before the crash. Often, it would crash again with the same fault, usually the one quoted above, indicating close to the edge parameter sets. I used to be running both Intel and AMD CPU machines and there was more than one occasion where repeated crashes on one CPU could be got through by transferring the model (the complete BOINC backup) to the other CPU machine, running it past the crash point successfully and either continuing on that machine or transferring it back to the other and finishing successfully. As Les says, hopefully, that kind of situation would have given the researchers some valuable data about the state of that particular model and parameter sets used. I tried to look at your models to pick out the one you were referring to and see what it had crashed on before and what you were running but your computer/s is/are hidden so it couldn't be done. ID: 50319 · Reply Quote
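The "complete BOINC backup" described above can be sketched roughly as follows. This is an illustration only, not a CPDN-provided tool: the function name backup_boinc_data and both directory arguments are hypothetical, and it assumes the BOINC client has been shut down first so that no model files change while the copy is being made.

```python
import shutil
import time
from pathlib import Path

def backup_boinc_data(data_dir, backup_root):
    """Archive the whole BOINC data directory into a timestamped .tar.gz.

    Assumptions: the BOINC client is stopped before this is called, and
    backup_root lives outside data_dir so the archive does not include itself.
    """
    data_dir = Path(data_dir).resolve()
    backup_root = Path(backup_root).resolve()
    if backup_root == data_dir or data_dir in backup_root.parents:
        raise ValueError("backup_root must be outside the BOINC data directory")

    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = backup_root / f"boinc-backup-{stamp}"

    # shutil.make_archive appends the extension and returns the archive path.
    archive = shutil.make_archive(str(base), "gztar", root_dir=str(data_dir))
    return Path(archive)

# Example call (both paths are placeholders, adjust to your own setup):
# backup_boinc_data("/path/to/boinc/data", "/path/to/backups")
```

Restoring is the reverse: stop the client, unpack the archive over an empty data directory (on the same machine or on the other one), and restart.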

Professor Desty Nova Send message Joined: 19 Sep 04 Posts: 92 Credit: 2,013,293 RAC: 392	Message 50320 - Posted: 26 Sep 2014, 9:10:27 UTC Last modified: 26 Sep 2014, 9:16:39 UTC It's this one: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9164996 I guess I've been lucky with my computers. Most of my crashes are because of power failures (when the batches are not faulty). I now make backups when climateprediction have longer models, that are more susceptible to sudden shut down. Professor Desty Nova Researching Karma the Hard Way ID: 50320 · Reply Quote

Pete B Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424	Message 50322 - Posted: 26 Sep 2014, 10:16:09 UTC - in response to Message 50320. It's this one: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9164996 I guess I've been lucky with my computers. Most of my crashes are because of power failures (when the batches are not faulty). I now make backups when climateprediction have longer models, that are more susceptible to sudden shut down. One of the other PCs seems to have crashed just about everything (wrongly set up) and should probably be flagged up as a misconfigured PC. The other PC seems to have crashed CM3S's, even during periods the batches generally run well, but has completed other WU's. From what I've read on here and looking at the crash details, it's probably not being run continuously and this WU doesn't seem to like being stopped. Each one has crashed after different run times so the actual PC setup is probably OK. ID: 50322 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50324 - Posted: 26 Sep 2014, 10:56:26 UTC Last modified: 26 Sep 2014, 10:57:56 UTC Interested to know if anyone else is getting this with the short models. I noticed disk usage seemed to be getting a bit high for BOINC and on checking, the last 4 short models hadn't cleaned up after themselves though they had sent all zips and cleared from Tasks In Progress view. If others have had this is it only on nix boxen or a global issue? Edit: Just to be completely clear, this is not crashed tasks leaving their detritus on my disk which I know is a problem but models that have finished without a hitch other than having to wait to report/upload zips on some occasions. ID: 50324 · Reply Quote

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038	Message 50325 - Posted: 26 Sep 2014, 12:42:38 UTC - in response to Message 50314. [w1hue wrote:]Model crashed: ATM_DYN : INVALID THETA DETECTED. FYI: Potential temperature ID: 50325 · Reply Quote
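For reference, the textbook definition of potential temperature (the "theta" in the error message) is given below. This is the standard meteorological formula, not something taken from the HadCM3 source; the reference pressure and the dry-air exponent are the conventional values, and how the model decides that a theta value is "invalid" is not stated in this thread.

```latex
\theta = T \left( \frac{p_0}{p} \right)^{R_d / c_p},
\qquad p_0 = 1000\ \mathrm{hPa},
\qquad \frac{R_d}{c_p} \approx 0.286
```

Here T is the air temperature in kelvin at pressure p, so theta is the temperature a parcel would have if brought adiabatically to the reference pressure.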

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50326 - Posted: 26 Sep 2014, 13:33:41 UTC - in response to Message 50324. Also interested to know if others running on linux have had short models clear up as normal and if so are they using packaged BOINC or not. ID: 50326 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 50327 - Posted: 27 Sep 2014, 3:02:12 UTC - in response to Message 50324. Interested to know if anyone else is getting this with the short models. I noticed disk usage seemed to be getting a bit high for BOINC and on checking, the last 4 short models hadn't cleaned up after themselves though they had sent all zips and cleared from Tasks In Progress view. If others have had this is it only on nix boxen or a global issue? Edit: Just to be completely clear, this is not crashed tasks leaving their detritus on my disk which I know is a problem but models that have finished without a hitch other than having to wait to report/upload zips on some occasions. Most of the hadcm3s (short) models complete OK on my Ubuntu trusty machines (and the one #! wheezy machine). The models that complete and upload OK always leave about 800 megabytes behind. The ones that fail leave about 450 megabytes behind. The ones that fail download leave nothing behind :) ID: 50327 · Reply Quote
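To check for leftovers like these, a small Python sketch such as the following can total up what each finished model has left on disk. It rests on assumptions that are not confirmed in this thread: that the leftovers are directories inside the climateprediction.net project folder, and that their names start with "hadcm3s_" like the task names quoted earlier. The function names and the example path are hypothetical.

```python
import os

def dir_size_bytes(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; ignore it
    return total

def leftover_report(project_dir, prefix="hadcm3s_"):
    """Map each sub-directory whose name starts with prefix to its size in bytes."""
    report = {}
    for entry in os.scandir(project_dir):
        if entry.is_dir(follow_symlinks=False) and entry.name.startswith(prefix):
            report[entry.name] = dir_size_bytes(entry.path)
    return report

# Example call (placeholder path, adjust to your own BOINC installation):
# for name, size in leftover_report("/path/to/boinc/projects/climateprediction.net").items():
#     print(f"{name}: {size / 1e6:.0f} MB")
```

Only remove a directory by hand once the matching task no longer appears in the BOINC manager's task list.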