Message boards : Number crunching : HadCM3 short - errors galore
Author | Message |
---|---|
Send message Joined: 4 Sep 04 Posts: 1 Credit: 144,946 RAC: 0 |
I am not alone, in fact everyone gets errors in Workunit ID: 9128168 9130447 9056558 9129138 9128812 Regards Tommy |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I am not alone, in fact everyone gets errors in Model crashed: INITTIME: Atmosphere basis time mismatch Looking at the first work unit you quote, all of its tasks show the above error, which from memory is some sort of configuration error in the tasks. At least they only run for 30 seconds or so before falling over. I also noted on the first one that the tasks were failing on both Windows and Linux machines. I am sure the rest are similar, though I didn't check them. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Errors Galore, is she any relation to Pussy Galore? ;) |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Or He? |
Send message Joined: 31 Mar 05 Posts: 44 Credit: 234,235 RAC: 0 |
Having given up on CPDN (temporarily, I hope), I tried running SETI. Same problem! Is all of BOINC infected? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
SETI would be a different problem. The CM3S problems may be another manifestation of the exe file getting truncated by a virus checker. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
[Bellator wrote:]Having given up on CPDN (temporarily, I hope), I tried running SETI. Same problem! Is all of BOINC infected? Your HADCM3S models show as "aborted by user", though one does report a download problem. The models that were running appear to have been running well. Why did you abort them? |
Send message Joined: 31 Aug 05 Posts: 20 Credit: 1,969,695 RAC: 0 |
My last five WUs were "UK Met Office HadCM3 short v7.24" and they all suffered a "computation error" after running for about 45 minutes, with CPU time shown as only a few seconds (on my machine "LEPC"). I can't find any error messages, but maybe I don't know where to look. The only things that look like errors are several messages of the form "23-Sep-2014 23:13:40 [climateprediction.net] Output file hadcm3s_2bi1_2001_2_009022880_4_2.zip for task hadcm3s_2bi1_2001_2_009022880_4 absent" in the file "stdoutdae.txt" located in ". . . /Application Data/BOINC". And for the record, I have NO PROBLEMS running Asteroids, LHC, World Community or Cosmology CPU tasks or SETI, Einstein or Milkyway GPU tasks. |
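For anyone who wants to pull those "absent" lines out of the BOINC client log without scrolling through it, here is a minimal Python sketch. The pattern is modelled only on the single line quoted in the post above, so other BOINC versions may format the message differently. The function name `find_absent_outputs` and the second sample line are made up for illustration.

```python
import re

# Pattern modelled on the one log line quoted in the post above.
# Assumption: timestamp, [project], then the "Output file ... absent" text.
ABSENT_RE = re.compile(
    r"^(?P<when>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] "
    r"Output file (?P<file>\S+) for task (?P<task>\S+) absent\s*$"
)


def find_absent_outputs(lines):
    """Return (timestamp, project, file, task) for each 'absent' message."""
    hits = []
    for line in lines:
        m = ABSENT_RE.match(line.strip())
        if m:
            hits.append((m["when"], m["project"], m["file"], m["task"]))
    return hits


# The first line is copied from the post; the second is invented filler
# to show that unrelated messages are skipped.
sample = [
    "23-Sep-2014 23:13:40 [climateprediction.net] Output file "
    "hadcm3s_2bi1_2001_2_009022880_4_2.zip for task "
    "hadcm3s_2bi1_2001_2_009022880_4 absent",
    "23-Sep-2014 23:13:41 [climateprediction.net] Some other message",
]

hits = find_absent_outputs(sample)
for hit in hits:
    print(hit)
```

To run it against a real log, pass an open file object instead of `sample`; the location of the log varies between installations, so no path is assumed here.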
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Error messages for each model, are on the server page for each of the models. So ... Go to your account page. To get there: A) Click on your name alongside one of your posts, or B) Click on Your account in the blue menu to the left, or C) Click on the Your account button in the tasks tab of your BOINC manager. Towards the bottom of the screenful of data is: Tasks, with View to the right of there. Click on View. On the next page, the first column is the names of the tasks/models that you've run. Click on the one in which you're interested. Go down to Stderr, and click on the + symbol to expand the list. ************ Output file absent ...means that the model never got to the point where it created that file. But BOINC has a list of files that it's supposed to upload, so BOINC is the one complaining there. And for the record ...Only relevant if those projects also compile their programs with FORTRAN. :) |
Send message Joined: 31 Aug 05 Posts: 20 Credit: 1,969,695 RAC: 0 |
Error messages for each model, are on the server page for each of the models. Thanks for the info. All of the crashed WUs show similar error info: <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> The device does not recognize the command. (0x16) - exit code 22 (0x16) </message> <stderr_txt> Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Sorry, too many model crashes! :-( 10:28:25 (3708): called boinc_finish </stderr_txt> ]]> I hope this is useful information. In the meantime, I am not accepting any more "climateprediction.net" work. |
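If you have several failed tasks to compare, a short Python sketch like the one below can tally the "Model crashed" reasons in a copied Stderr section. It is only an illustration: `summarize_stderr` is a hypothetical helper, and the regular expression assumes the messages look like the ones quoted in this thread (a reason followed by `tmp/pipe_dummy`, another crash message, the "Sorry" line, or a line end).

```python
import re
from collections import Counter

# Capture the text after "Model crashed:" up to the next known marker.
# Assumption: markers are the ones visible in the stderr quoted above.
CRASH_RE = re.compile(
    r"Model crashed:\s*(.*?)\s*(?=tmp/pipe_dummy|Model crashed:|Sorry,|\n|$)"
)


def summarize_stderr(text):
    """Count each crash reason and note whether the model gave up."""
    reasons = Counter(CRASH_RE.findall(text))
    gave_up = "too many model crashes" in text
    return reasons, gave_up


# Shortened copy of the stderr text from the post (two crashes, not six).
sample = (
    "Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 "
    "Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 "
    "Sorry, too many model crashes! :-( "
)

reasons, gave_up = summarize_stderr(sample)
for reason, count in reasons.items():
    print(f"{count} x {reason}")
print("gave up:", gave_up)
```

The same function also picks up the `INITTIME: Atmosphere basis time mismatch` message mentioned earlier in the thread, since it only relies on the `Model crashed:` prefix.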
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Sorry, too many model crashes! :-( 10:28:25 (3708): called boinc_finish </stderr_txt> ]]> I hope this is useful information. In the meantime, I am not accepting any more "climateprediction.net" work. It is useful. "Invalid Theta" means that there was an internal problem in the WU itself (such as producing an unrealistic climate, i.e. the oceans boil off or the Earth's surface melts). The good news is that it means there is nothing wrong with your computer and you did nothing wrong while running it to cause the crash. |
Send message Joined: 19 Sep 04 Posts: 92 Credit: 2,013,293 RAC: 392 |
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048 Strangely, I picked up a WU that had crashed with lots of "Model crashed: ATM_DYN : INVALID THETA DETECTED." on another computer, and mine completed it without problem. Luck? Professor Desty Nova Researching Karma the Hard Way |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Possibly different processors (Intel/AMD). Their maths routines are slightly different, so they produce slightly different results. If the researchers study the results down fine enough, that would give them even more info. i.e. That data set is VERY borderline. |
Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424 |
Strangely, I picked up a WU that had crashed with lots of "Model crashed: ATM_DYN : INVALID THETA DETECTED." on another computer, and mine completed it without problem. Luck? In the days when we used to run 200 year long HADCM3 models that could take a month or more to get through, it was usual to make backups fairly regularly so if a run crashed, it could be re-run from a short time before the crash. Often, it would crash again with the same fault, usually the one quoted above, indicating close to the edge parameter sets. I used to be running both Intel and AMD CPU machines and there was more than one occasion where repeated crashes on one CPU could be got through by transferring the model (the complete BOINC backup) to the other CPU machine, running it past the crash point successfully and either continuing on that machine or transferring it back to the other and finishing successfully. As Les says, hopefully, that kind of situation would have given the researchers some valuable data about the state of that particular model and parameter sets used. I tried to look at your models to pick out the one you were referring to and see what it had crashed on before and what you were running but your computer/s is/are hidden so it couldn't be done. |
Send message Joined: 19 Sep 04 Posts: 92 Credit: 2,013,293 RAC: 392 |
It's this one: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9164996 I guess I've been lucky with my computers. Most of my crashes are because of power failures (when the batches are not faulty). I now make backups when climateprediction has longer models, which are more susceptible to sudden shutdowns. Professor Desty Nova Researching Karma the Hard Way |
Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424 |
It's this one: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9164996 One of the other PCs seems to have crashed just about everything (wrongly set up) and should probably be flagged up as a misconfigured PC. The other PC seems to have crashed CM3S's, even during periods the batches generally run well, but has completed other WU's. From what I've read on here and looking at the crash details, it's probably not being run continuously and this WU doesn't seem to like being stopped. Each one has crashed after different run times so the actual PC setup is probably OK. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Interested to know if anyone else is getting this with the short models. I noticed disk usage seemed to be getting a bit high for BOINC and, on checking, the last 4 short models hadn't cleaned up after themselves, though they had sent all zips and cleared from the Tasks In Progress view. If others have had this, is it only on nix boxen or a global issue? Edit: Just to be completely clear, this is not crashed tasks leaving their detritus on my disk, which I know is a problem, but models that have finished without a hitch other than having to wait to report/upload zips on some occasions. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Also interested to know if others running on linux have had short models clear up as normal and if so are they using packaged BOINC or not. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Interested to know if anyone else is getting this with the short models. I noticed disk usage seemed to be getting a bit high for BOINC and on checking, the last 4 short models hadn't cleaned up after themselves though they had sent all zips and cleared from Tasks In Progress view. If others have had this is it only on nix boxen or a global issue? Most of the hadcm3s (short) models complete OK on my Ubuntu Trusty machines (and the one #! wheezy machine). The models that complete and upload OK always leave about 800 megabytes behind. The ones that fail leave about 450 megabytes behind. The ones that fail download leave nothing behind :) |
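To put numbers on what finished models leave behind, something like the following Python sketch can total the size of each subdirectory under a folder you point it at. Nothing here is BOINC-specific: `dir_size_bytes` and `leftover_report` are illustrative names, and the demo builds a throwaway directory with `tempfile` rather than guessing where your BOINC data lives.

```python
import os
import tempfile


def dir_size_bytes(root):
    """Sum the sizes of all regular files below `root` (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def leftover_report(parent):
    """Map each immediate subdirectory of `parent` to its size in bytes."""
    return {
        entry.name: dir_size_bytes(entry.path)
        for entry in os.scandir(parent)
        if entry.is_dir(follow_symlinks=False)
    }


# Demo on a temporary directory; "example_task" is a made-up name.
with tempfile.TemporaryDirectory() as parent:
    task_dir = os.path.join(parent, "example_task")
    os.mkdir(task_dir)
    with open(os.path.join(task_dir, "leftover.bin"), "wb") as fh:
        fh.write(b"\0" * 1000)
    report = leftover_report(parent)

for name, size in sorted(report.items()):
    print(f"{name}: {size / 1e6:.3f} MB")
```

Running `leftover_report` on the project's data folder before and after a model finishes would show whether the roughly 800 MB and 450 MB figures above hold on other machines.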
©2024 cpdn.org