climateprediction.net (CPDN) home page
Thread 'HadCM3 short - errors galore'

Bektor

Joined: 4 Sep 04
Posts: 1
Credit: 144,946
RAC: 0
Message 50279 - Posted: 23 Sep 2014, 21:55:33 UTC

I am not alone, in fact everyone gets errors in
Workunit ID:
9128168
9130447
9056558
9129138
9128812


Regards
Tommy

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50285 - Posted: 24 Sep 2014, 6:23:44 UTC - in response to Message 50279.  

[Bektor wrote:]I am not alone, in fact everyone gets errors in
Workunit ID:
9128168



Model crashed: INITTIME: Atmosphere basis time mismatch

Looking at the first work unit you quote, every task in it has the above error, which I think, from memory, is some sort of configuration error in the tasks. At least they only run for 30 seconds or so before falling over. I also noted that the tasks in the first one were failing on both Windows and Linux machines. I expect the rest are similar, though I didn't check them.

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 50293 - Posted: 24 Sep 2014, 12:51:15 UTC

Errors Galore, is she any relation to Pussy Galore? ;)


Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50294 - Posted: 24 Sep 2014, 13:06:17 UTC - in response to Message 50293.  

Or He?

Bellator
Joined: 31 Mar 05
Posts: 44
Credit: 234,235
RAC: 0
Message 50295 - Posted: 24 Sep 2014, 14:28:01 UTC

Having given up on CPDN (temporarily, I hope), I tried running SETI. Same problem! Is all of BOINC infected?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50297 - Posted: 24 Sep 2014, 14:33:34 UTC - in response to Message 50295.  

SETI would be a different problem. The CM3S problems may be another manifestation of the exe file getting truncated by a virus checker.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,825,314
RAC: 4,915
Message 50299 - Posted: 24 Sep 2014, 17:55:10 UTC - in response to Message 50295.  

[Bellator wrote:]Having given up on CPDN (temporarily, I hope), I tried running SETI. Same problem! Is all of BOINC infected?

Your HADCM3S models show as "aborted by user", though one does report a download problem. The models that were running appear to have been running well. Why did you abort them?

w1hue
Joined: 31 Aug 05
Posts: 20
Credit: 1,969,695
RAC: 0
Message 50302 - Posted: 24 Sep 2014, 22:00:39 UTC
Last modified: 24 Sep 2014, 22:01:45 UTC

My last five WUs were "UK Met Office HadCM3 short v7.24" and they all suffered a "computation error" after running about 45 minutes with CPU time shown as only a few seconds. (On my machine "LEPC")

I can't find any error messages -- but maybe I don't know where to look. The only things that look like errors are several messages of the form "23-Sep-2014 23:13:40 [climateprediction.net] Output file hadcm3s_2bi1_2001_2_009022880_4_2.zip for task hadcm3s_2bi1_2001_2_009022880_4 absent" in the file "stdoutdae.txt" located in ". . . /Application Data/BOINC".

And for the record, I have NO PROBLEMS running Asteroids, LHC, World Community or Cosmology CPU tasks or Seti, Einstein or Milkyway GPU tasks.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50303 - Posted: 24 Sep 2014, 23:24:52 UTC - in response to Message 50302.  

Error messages for each model are on the server page for that model.

So ...
Go to your account page.

To get there:
A) Click on your name alongside one of your posts, or
B) Click on Your account in the blue menu to the left, or
C) Click on the Your account button in the tasks tab of your BOINC manager.

Towards the bottom of the screenful of data is:
Tasks, with View to the right of there.
Click on View.

On the next page, the first column is the names of the tasks/models that you've run.
Click on the one in which you're interested.
Go down to Stderr, and click on the + symbol to expand the list.

************

Output file absent ...
means that the model never got to the point where it created that file. But BOINC has a list of files that it's supposed to upload, so IT'S the one complaining there.
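
A minimal sketch, assuming the BOINC client log has already been read into a string, of pulling out every "Output file ... absent" message so the affected tasks can be listed in one go. The pattern is built only from the message format quoted in this thread; the function name is hypothetical and not part of BOINC.

```python
import re

# Pattern derived from the log line quoted earlier in this thread.
# It is an assumption that other client versions word the message identically.
ABSENT_RE = re.compile(
    r"Output file (?P<file>\S+) for task (?P<task>\S+) absent"
)

def find_absent_outputs(log_text):
    """Return (task, file) pairs for every 'Output file ... absent' message."""
    return [(m.group("task"), m.group("file"))
            for m in ABSENT_RE.finditer(log_text)]

# Sample line copied from the post above.
sample = (
    "23-Sep-2014 23:13:40 [climateprediction.net] Output file "
    "hadcm3s_2bi1_2001_2_009022880_4_2.zip for task "
    "hadcm3s_2bi1_2001_2_009022880_4 absent\n"
)

for task, missing in find_absent_outputs(sample):
    print(task, "is missing", missing)
```

Each match only says that BOINC expected the file; the reason the model stopped is still on the task's Stderr page.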

And for the record ...
Only relevant if those projects also compile their programs with FORTRAN. :)



w1hue
Joined: 31 Aug 05
Posts: 20
Credit: 1,969,695
RAC: 0
Message 50314 - Posted: 26 Sep 2014, 4:53:03 UTC - in response to Message 50303.  

[Les Bayliss wrote:]Error messages for each model are on the server page for that model.
. . .

Thanks for the info.

All of the crashed WUs show similar error info:

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
The device does not recognize the command.
(0x16) - exit code 22 (0x16)
</message>
<stderr_txt>

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
10:28:25 (3708): called boinc_finish

</stderr_txt>
]]>
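
A minimal sketch, assuming the stderr text has been copied from the task page into a string, of tallying the crash reasons in output like the above. The function name and the shortened sample are illustrative only; nothing here is part of BOINC or the CPDN application.

```python
from collections import Counter

# Prefix taken from the stderr output quoted above.
PREFIX = "Model crashed:"

def count_crash_reasons(stderr_text):
    """Count each distinct 'Model crashed:' reason in a stderr dump."""
    reasons = Counter()
    for line in stderr_text.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            # Keep only the text after the prefix as the reason.
            reasons[line[len(PREFIX):].strip()] += 1
    return reasons

# Shortened sample in the same shape as the output above.
sample = """
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
"""

for reason, n in count_crash_reasons(sample).items():
    print(n, "x", reason)
```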


I hope this is useful information. In the meantime, I am not accepting any more "climateprediction.net" work.


JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 50315 - Posted: 26 Sep 2014, 6:26:37 UTC - in response to Message 50314.  

[w1hue wrote:]Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
10:28:25 (3708): called boinc_finish

</stderr_txt>
]]>

I hope this is useful information. In the meantime, I am not accepting any more "climateprediction.net" work.

It is useful. "Invalid Theta" means that there was an internal problem in the WU itself (such as producing an unrealistic climate, e.g. the oceans boil off or the Earth's surface melts). The good news is that it means there is nothing wrong with your computer and you did nothing wrong while running it to cause the crash.



Professor Desty Nova
Joined: 19 Sep 04
Posts: 92
Credit: 2,013,293
RAC: 392
Message 50316 - Posted: 26 Sep 2014, 6:41:11 UTC - in response to Message 50315.  
Last modified: 26 Sep 2014, 6:44:07 UTC

[w1hue wrote:]Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
10:28:25 (3708): called boinc_finish

</stderr_txt>
]]>

I hope this is useful information. In the meantime, I am not accepting any more "climateprediction.net" work.

[JIM wrote:]It is useful. "Invalid Theta" means that there was an internal problem in the WU itself (such as producing an unrealistic climate, e.g. the oceans boil off or the Earth's surface melts). The good news is that it means there is nothing wrong with your computer and you did nothing wrong while running it to cause the crash.



Strangely, I picked up a WU that had crashed with lots of "Model crashed: ATM_DYN : INVALID THETA DETECTED." on another computer, and mine completed it without problem. Luck?


Professor Desty Nova
Researching Karma the Hard Way

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50318 - Posted: 26 Sep 2014, 8:02:55 UTC - in response to Message 50316.  

Possibly different processors (Intel/AMD).
Their maths routines are slightly different, so they produce slightly different results.

If the researchers study the results finely enough, that would give them even more info, i.e. that this data set is VERY borderline.



Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,299,683
RAC: 10,424
Message 50319 - Posted: 26 Sep 2014, 8:19:25 UTC - in response to Message 50316.  

[Professor Desty Nova wrote:]Strangely, I picked up a WU that had crashed with lots of "Model crashed: ATM_DYN : INVALID THETA DETECTED." on another computer, and mine completed it without problem. Luck?


In the days when we used to run 200-year-long HADCM3 models that could take a month or more to get through, it was usual to make backups fairly regularly so that if a run crashed, it could be re-run from a short time before the crash. Often, it would crash again with the same fault, usually the one quoted above, indicating close-to-the-edge parameter sets.

I used to be running both Intel and AMD CPU machines and there was more than one occasion where repeated crashes on one CPU could be got through by transferring the model (the complete BOINC backup) to the other CPU machine, running it past the crash point successfully and either continuing on that machine or transferring it back to the other and finishing successfully.

As Les says, hopefully, that kind of situation would have given the researchers some valuable data about the state of that particular model and parameter sets used.

I tried to look at your models to pick out the one you were referring to and see what it had crashed on before and what you were running, but your computers are hidden, so it couldn't be done.

Professor Desty Nova
Joined: 19 Sep 04
Posts: 92
Credit: 2,013,293
RAC: 392
Message 50320 - Posted: 26 Sep 2014, 9:10:27 UTC
Last modified: 26 Sep 2014, 9:16:39 UTC

It's this one: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9164996

I guess I've been lucky with my computers. Most of my crashes are because of power failures (when the batches are not faulty). I now make backups when climateprediction.net has longer models, which are more susceptible to a sudden shutdown.


Professor Desty Nova
Researching Karma the Hard Way

Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,299,683
RAC: 10,424
Message 50322 - Posted: 26 Sep 2014, 10:16:09 UTC - in response to Message 50320.  

[Professor Desty Nova wrote:]It's this one: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9164996

I guess I've been lucky with my computers. Most of my crashes are because of power failures (when the batches are not faulty). I now make backups when climateprediction.net has longer models, which are more susceptible to a sudden shutdown.


One of the other PCs seems to have crashed just about everything (wrongly set up) and should probably be flagged up as a misconfigured PC.

The other PC seems to have crashed CM3S models, even during periods when the batches generally ran well, but has completed other WUs. From what I've read on here and from looking at the crash details, it's probably not being run continuously, and this WU doesn't seem to like being stopped. Each one has crashed after a different run time, so the actual PC setup is probably OK.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50324 - Posted: 26 Sep 2014, 10:56:26 UTC
Last modified: 26 Sep 2014, 10:57:56 UTC

Interested to know if anyone else is getting this with the short models. I noticed disk usage seemed to be getting a bit high for BOINC and, on checking, the last 4 short models hadn't cleaned up after themselves, though they had sent all zips and cleared from the Tasks In Progress view. If others have had this, is it only on *nix boxen or a global issue?

Edit: Just to be completely clear, this is not crashed tasks leaving their detritus on my disk, which I know is a problem, but models that have finished without a hitch other than having to wait to report/upload zips on some occasions.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,825,314
RAC: 4,915
Message 50325 - Posted: 26 Sep 2014, 12:42:38 UTC - in response to Message 50314.  

[w1hue wrote:]Model crashed: ATM_DYN : INVALID THETA DETECTED.

FYI: THETA is potential temperature.
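
For reference, the standard textbook definition of potential temperature is given below. How the model decides that a value of theta is "invalid" is not stated in this thread, so this is the general formula only, not the check the model performs.

```latex
\theta = T \left( \frac{p_0}{p} \right)^{R / c_p}
```

Here $T$ is the absolute temperature of the air parcel, $p$ its pressure, $p_0$ a reference pressure (conventionally 1000 hPa), $R$ the gas constant of air and $c_p$ its specific heat capacity at constant pressure; for dry air $R / c_p \approx 0.286$.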

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50326 - Posted: 26 Sep 2014, 13:33:41 UTC - in response to Message 50324.  

Also interested to know if others running on Linux have had short models clear up as normal and, if so, whether they are using packaged BOINC or not.

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 50327 - Posted: 27 Sep 2014, 3:02:12 UTC - in response to Message 50324.  

[Dave Jackson wrote:]Interested to know if anyone else is getting this with the short models. I noticed disk usage seemed to be getting a bit high for BOINC and, on checking, the last 4 short models hadn't cleaned up after themselves, though they had sent all zips and cleared from the Tasks In Progress view. If others have had this, is it only on *nix boxen or a global issue?

Edit: Just to be completely clear, this is not crashed tasks leaving their detritus on my disk, which I know is a problem, but models that have finished without a hitch other than having to wait to report/upload zips on some occasions.


Most of the hadcm3s (short) models complete OK on my Ubuntu Trusty machines (and the one #! Wheezy machine).
The models that complete and upload OK always leave about 800 megabytes behind.
The ones that fail leave about 450 megabytes behind.
The ones that fail download leave nothing behind :)
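
A minimal sketch, assuming leftovers sit in subdirectories of the CPDN project directory inside the BOINC data directory, of measuring how much each one holds. The directory location differs between installs and is not given in this thread, so the path is left for the reader to supply; the function names are hypothetical.

```python
import os

def dir_size_bytes(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # skip symlinks to avoid double counting
                total += os.path.getsize(full)
    return total

def leftover_report(project_dir):
    """Map each immediate subdirectory name to its size in megabytes."""
    report = {}
    for entry in os.scandir(project_dir):
        if entry.is_dir(follow_symlinks=False):
            report[entry.name] = dir_size_bytes(entry.path) / 1e6
    return report

# Usage (the path is an assumption; substitute your own project directory):
#   for name, mb in sorted(leftover_report("/path/to/project/dir").items()):
#       print(f"{mb:8.1f} MB  {name}")
```

Comparing the names in the report against the tasks still listed in BOINC Manager is one way to see which directories belong to models that have already finished.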
