Thread 'Reporting - Errors while computing -'

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 45625 - Posted: 8 Mar 2013, 13:36:27 UTC

Like Les, I don't think the Africa models will come onto this main project for a while yet. The main researcher for this Africa project is Friederike Otto in Oxford. She's German and has been in the UK about five years. She's just very recently been told that the funding for the Africa project has come through. Myles and Friederike are planning a meeting soon in Nairobi to discuss the project with African researchers.

The ANZ research will as far as I know be carried out in Hobart, Tasmania. Myles has also been there to discuss this project with the Australians.

I get the impression that most of the research using the regional models concentrates on attribution studies trying to calculate whether and to what extent climate change is responsible for particular weather phenomena. I wouldn't be at all surprised if the Australians want to look at whether climate change has played a part in causing the atrocious hot summers and drought there in the last few years, or whether it's just freak random bad luck to be expected from time to time.

Things have come a long way since the days when we only had one type of model and all the research was carried out in Oxford.
Cpdn news
ID: 45625
old_user671679
Joined: 30 Jan 12
Posts: 38
Credit: 10,197,388
RAC: 0
Message 45626 - Posted: 8 Mar 2013, 22:40:16 UTC

Thanks guys, that brings a lot of things into focus. I'm looking forward to working on ANZ.
ID: 45626
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45631 - Posted: 10 Mar 2013, 11:25:08 UTC

hadcm3n_o44g_2140_40_008281590

I just downloaded this model: UK Met Office Coupled Model Full Resolution Ocean v6.07, year 2140.

This model was created 13 Jan 2013 23:46:25 UTC.

As you can see, the two other computers both crashed this model after 20 trickles

with a "STWORK : I/O error - PP fixed length header" error.

<core_client_version>7.0.28</core_client_version>

<![CDATA[

<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>

<Snip>
.....
</Snip>

Model crashed: STWORK : I/O error - PP fixed length header tmp/pipe_dummy 2048

BUFFIN: C I/O Error feof - Unit 64 - Return code = 16
BUFFIN: C I/O Error feof - Unit 66 - Return code = 16
BUFFIN: C I/O Error feof - Unit 67 - Return code = 16
BUFFIN: C I/O Error feof - Unit 68 - Return code = 16
BUFFIN: C I/O Error feof - Unit 69 - Return code = 16
BUFFIN: C I/O Error feof - Unit 67 - Return code = 16

<Snip>
.....
</Snip>

Sorry, too many model crashes! :-(
Called boinc_finish

</stderr_txt>
]]>

I haven't started crunching this model yet,

so should I abort it? (hadcm3n_o44g_2140_40_008281590)

Also, I'm just curious:

what does the error "Model crashed: STWORK : I/O error - PP fixed length header tmp/pipe_dummy 2048" mean?

On this, my fastest computer,

I'm only running one project, Climate Prediction.net, 24/7/365, with no GPU apps.

8 physical CPUs, no hyper-threading.

So I now have 8 UK Met Office Coupled Model Full Resolution Ocean v6.07 models running 24/7, very nicely so far:

one each for the years 1960, 1940 and 1920, and five for the year 1880.

8 models crunching nicely with no problems so far.
ID: 45631
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 45632 - Posted: 10 Mar 2013, 13:31:08 UTC

The 2140 models are crashing at 75%, if they haven't already crashed for some other reason before that point. I would definitely abort it. I'm not sure what the error means, but usually when a bad batch goes out, some ancillary file isn't set up right, so the models can't continue past the common point where they all crash.
ID: 45632
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 45633 - Posted: 10 Mar 2013, 15:09:49 UTC

This model doesn't belong to a whole defective batch; I've looked at several workunits before it and several after, and a good proportion are completing. However, your workunit isn't the only instance of this error:

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8432723

Here three tasks from the same WU have crashed at the same point, though all three computers seem to be decent model crunchers. (One of those computers has BOINC 5.10.something. Should we try to contact this member? I'm surprised that BOINC 5 can still crunch these models, though I bet the owner can't see the graphics.)

So I think the same thing would happen to your model, particularly as you have the same OS as the computers that have already had that crash.

I don't know what the error means except to say that when you see the model trying to recover 5 times then crashing the 6th time (as is the case here) it often means the problem lies within the model. The models are fortunately designed not to try to recover indefinitely. If you're watching the graphics when this sort of thing happens you see the globe window go black for a second. The model restarts from the last timestep, then the same thing happens again at exactly the same timestep. On the sixth crash it doesn't loop back and try again.
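The retry pattern described above (several recoveries, then giving up) can be spotted mechanically in a task's stderr. What follows is only a rough sketch, not anything the project itself uses: the function names are made up for illustration, and the limit of six crashes is taken from the observation in this post rather than from the model's source code.

```python
import re
from collections import Counter

# Matches lines such as:
#   Model crashed: STWORK : I/O error - PP fixed length header tmp/pipe_dummy 2048
CRASH_RE = re.compile(r"^Model crashed:\s*(.+)$", re.MULTILINE)


def summarise_crashes(stderr_text):
    """Return (total crash count, Counter of distinct crash messages)."""
    messages = [m.strip() for m in CRASH_RE.findall(stderr_text)]
    return len(messages), Counter(messages)


def looks_deterministic(stderr_text, limit=6):
    """True if the model hit the (assumed) crash limit with one identical message.

    The same message repeated every time hints that the fault lies in the
    model or its input files rather than in the host machine.
    """
    total, by_message = summarise_crashes(stderr_text)
    return total >= limit and len(by_message) == 1
```

Pasting the text between the stderr_txt tags of a failed task into `looks_deterministic` gives a quick first guess; a mix of different crash messages would point more towards the machine than the model.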

STWORK is in uppercase. This often indicates a fault within the model (cf REPLANCA).

This isn't very scientific I'm afraid, just things I've noticed.

I've found a very old file from the National Centre for Atmospheric Science which appears to be part of the design for the Unified Model (which is what all our models are based on) for the Met Office:

http://cms.ncas.ac.uk/code_browsers/UM4.5/UMbrowser/html_code/UM/STWORK1A.F.html

STWORK seems to introduce a subroutine and at the end there's a possible error message very similar to yours. But we mods aren't model programmers.
Cpdn news
ID: 45633
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 45635 - Posted: 10 Mar 2013, 16:52:41 UTC

Mo,

Every single 2140 task I've seen has crashed at 75% or before. Yes, there are work units near the one that contains Byron's listed task that haven't had all their tasks crash, but they are not 2140 work units.

Byron, abort it if you haven't already.
ID: 45635
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 45636 - Posted: 10 Mar 2013, 22:26:52 UTC

You're right. The batch contains models for several different 40-year periods and it's just the 2140 WUs that generate this error.
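Since the start year is part of the task name, failures can be grouped by 40-year period without opening each work unit. The sketch below assumes the hadcm3n names seen in this thread follow the pattern model_id_startyear_length_serial, with an optional trailing replica number; that reading of the fields, and the field names themselves, are guesses from the examples, not a documented format.

```python
import re
from collections import defaultdict

# Assumed layout: hadcm3n_o44g_2140_40_008281590[_0]
NAME_RE = re.compile(
    r"^(?P<model>[a-z0-9]+)_(?P<run_id>[a-z0-9]{4})_(?P<start_year>\d{4})_"
    r"(?P<length_years>\d+)_(?P<serial>\d+)(?:_(?P<replica>\d+))?$"
)


def parse_task_name(name):
    """Split a task name into its assumed fields, or return None if it does not fit."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    replica = m.group("replica")
    return {
        "model": m.group("model"),
        "run_id": m.group("run_id"),
        "start_year": int(m.group("start_year")),
        "length_years": int(m.group("length_years")),
        "serial": m.group("serial"),
        "replica": int(replica) if replica is not None else None,
    }


def failures_by_start_year(tasks):
    """Count ok/error tasks per start year from (task_name, status) pairs."""
    counts = defaultdict(lambda: {"ok": 0, "error": 0})
    for name, status in tasks:
        info = parse_task_name(name)
        if info is None:
            continue  # name does not match the assumed pattern
        key = "ok" if status == "Completed" else "error"
        counts[info["start_year"]][key] += 1
    return dict(counts)
```

For example, `parse_task_name("hadcm3n_o44g_2140_40_008281590")["start_year"]` gives 2140. Names with an extra region field, such as the hadam3p_pnw ones, do not fit this pattern and are skipped.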
Cpdn news
ID: 45636
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 45642 - Posted: 11 Mar 2013, 10:21:32 UTC - in response to Message 45633.  

... (One of those computers has BOINC 5.10.something. Should we try to contact this member? I'm surprised that BOINC 5 can still crunch these models, though I bet the owner can't see the graphics.) ...

There is at least one good reason for continuing with the 5-series BOINC, which is that it runs on some of Microsoft's server operating systems. However, this particular user has a range of machines with varying BOINC versions running on mostly Windows XP, with a couple of old-ish servers too. It appears to be a choice.
ID: 45642
skgiven
Joined: 5 Jun 06
Posts: 28
Credit: 2,790,048
RAC: 0
Message 45659 - Posted: 15 Mar 2013, 12:05:38 UTC - in response to Message 45642.  
Last modified: 15 Mar 2013, 12:33:15 UTC

hadcm3n_zl88_1960_40_008321064_0 8472199 24 Feb 2013 17:06:24 UTC 15 Mar 2013 0:48:30 UTC Completed 788,994.61 786,033.00 --- 11,819.52 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_3i54_1980_40_008320817_0 8471952 24 Feb 2013 16:05:43 UTC 15 Mar 2013 4:45:39 UTC Completed 794,894.29 597,384.40 --- 11,508.48 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_3g30_1980_40_008320815_0 8471950 24 Feb 2013 16:05:43 UTC 2 Mar 2013 13:22:18 UTC Error while computing 451,201.48 393,044.10 6,220.80 6,220.80 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_3msh_1980_40_008320807_0 8471942 24 Feb 2013 16:05:43 UTC 4 Mar 2013 0:17:33 UTC Error while computing 517,030.49 312,994.20 6,842.88 6,842.88 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_zfuw_1920_40_008320605_0 8471740 24 Feb 2013 15:04:21 UTC 4 Mar 2013 0:18:10 UTC Error while computing 484,250.75 452,475.50 6,842.88 6,842.88 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_4jjh_1940_40_008303591_1 8454726 23 Feb 2013 15:31:44 UTC 4 Mar 2013 0:18:10 UTC Error while computing 556,161.88 417,148.50 8,087.04 8,087.04 UK Met Office Coupled Model Full Resolution Ocean v6.07

I noticed a Windows pop-up error with these. Basically it asks whether you want to close the app!
The two models that completed also hit these pop-ups, but I exited BOINC, closed the error message and restarted the system, and the WUs eventually completed. Any chance we could get a BOINC setting to let tasks run continuously until they complete? Trying to run 7 or 8 models at once probably isn't the wisest, so I run other projects when crunching for climate, but BOINC keeps jumping from project to project, even with a low cache and switch between apps set to 999 min.

Sorry, too many model crashes! :-(

BOINC-wide I'm seeing a big increase in task failures, which I attribute to Windows. So are these crashes related to the app or to Windows?

PS. BOINC 5.10 is only required for domain controllers; it's not needed for member servers. (DCs don't have local accounts, which later BOINC versions use.)
ID: 45659
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 45660 - Posted: 15 Mar 2013, 13:42:13 UTC

The failures are either download errors or models that have a physics error at some point - i.e. invalid theta or negative pressure. The worrying thing about the physics errors is that the models continue. As far as I know the models are not adaptive: they propagate a single state in time increments. If that state ever becomes invalid it should stay invalid. So how do these models get past the physics error? A possible reason is that the hardware is failing, causing the model to crash, restart and continue with the state propagated correctly (or not obviously incorrectly).
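The reasoning here can be illustrated with a toy example. This is not the Unified Model, just a made-up deterministic update with a validity check (the numbers and function names are invented for the sketch). It shows why a model that propagates a single state in fixed time increments should hit the same physics error at the same step every time it restarts from the same checkpoint.

```python
def step(state, dt=1.0):
    """Toy deterministic update standing in for one model timestep."""
    theta, pressure = state
    return (theta + 0.5 * dt, pressure - 30.0 * dt)


def is_valid(state):
    """Stand-in for the physics checks (invalid theta, negative pressure)."""
    theta, pressure = state
    return theta > 0.0 and pressure > 0.0


def run_from(checkpoint, n_steps):
    """Propagate from a checkpoint.

    Returns (step number of the first invalid state or None, last state reached).
    """
    state = checkpoint
    for i in range(1, n_steps + 1):
        state = step(state)
        if not is_valid(state):
            return i, state
    return None, state


checkpoint = (280.0, 100.0)
first_attempt = run_from(checkpoint, 10)
second_attempt = run_from(checkpoint, 10)
# Both attempts fail at the same step with the same state.
print(first_attempt == second_attempt, first_attempt)
```

If a real model fails once and then sails past the same point on the retry, the update was not reproduced exactly the first time, which is consistent with a transient hardware fault rather than bad physics in the model.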

Unfortunately, the recent models are ahead of the others in their work units or don't have like-for-like comparisons, so it isn't possible to check for parallel physics errors. The pop-up errors are also a sign of a machine problem. (I once had a berserk printer driver that caused constant pop-ups; most remain undiagnosed.)
ID: 45660
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45668 - Posted: 17 Mar 2013, 17:20:39 UTC

I have downloaded the following Model:

name - - - - - - hadcm3n_zl8m_1920_40_008280645

application - - UK Met Office Coupled Model Full Resolution Ocean

created - - - - 29 Dec 2012 15:07:36 UTC

As you can see, all three of the other computers crashed this model, after varying numbers of trickles and with different <stderr_txt> messages:

computer # 1 ... - Windows 7

Exiting with 10 Trickles Received ...
<![CDATA[
<message>
- exit code 193 (0xc1)
</message>

<stderr_txt>

Signal 11 received, exiting...
Called boinc_finish

</stderr_txt>
]]>

computer # 2 - Windows 8

Exiting with 1 Trickles Received ...
<core_client_version>7.0.44</core_client_version>

<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>

Model crashed: INITDUMP: Wrong no of ocean prognostic fields tmp/pipe_dummy 2048
Model crashed: INITDUMP: Wrong no of ocean prognostic fields tmp/pipe_dummy 2048
Model crashed: INITDUMP: Wrong no of ocean prognostic fields tmp/pipe_dummy 2048
Model crashed: INITDUMP: Wrong no of ocean prognostic fields tmp/pipe_dummy 2048
Model crashed: INITDUMP: Wrong no of ocean prognostic fields tmp/pipe_dummy 2048
Model crashed: INITDUMP: Wrong no of ocean prognostic fields tmp/pipe_dummy 2048

Sorry, too many model crashes! :-(
Called boinc_finish

</stderr_txt>
]]>

computer # 3 - Linux - 3.8.2-206.fc18.x86_64

Exiting with 10 Trickles Received ...

<core_client_version>7.0.29</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>

<stderr_txt>

SIGSEGV: segmentation violation
Stack trace (14 frames):
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x80b80df]
[0xf779c400]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x806c0d5]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x806e5f2]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8072509]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8077f47]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x80781a3]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8053e1b]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8057bc4]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x804f232]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8050491]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805112c]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805137a]

/lib/libc.so.6(__libc_start_main+0xf5)[0xf744f865]

Exiting...

</stderr_txt>
]]>
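The three numbers in "process exited with code 193 (0xc1, -63)" are the same byte written three ways: decimal, hex, and as a signed 8-bit value. A small sketch of the arithmetic (the helper name is made up for illustration):

```python
def describe_exit_code(code):
    """Return the exit code as (unsigned byte, hex string, signed 8-bit value)."""
    unsigned = code & 0xFF  # keep only the low 8 bits
    # Values of 128 and above wrap to negative when read as a signed byte.
    signed = unsigned - 256 if unsigned >= 128 else unsigned
    return unsigned, hex(unsigned), signed


print(describe_exit_code(193))  # (193, '0xc1', -63)
print(describe_exit_code(22))   # (22, '0x16', 22)
```

So the Windows 7 log ("exit code 193 (0xc1)") and the Linux log above are reporting the same exit code.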

I have not started to crunch this model yet.

Is it worth spending some of my CPU cycles to see how far I get?

Or is it just a waste of CPU cycles, and should I just abort this model?

Thanks in advance

Byron
ID: 45668
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 45669 - Posted: 17 Mar 2013, 20:26:57 UTC
Last modified: 17 Mar 2013, 20:27:56 UTC

Byron:

In your position I would continue. Each of the three crashes has a log filled with numerous suspends and then slightly strange errors. The model may do better on a machine that provides a less stressed environment. (And my apologies in advance if it turns out badly!)

Iain
ID: 45669
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45670 - Posted: 18 Mar 2013, 0:22:22 UTC - in response to Message 45669.  
Last modified: 18 Mar 2013, 1:13:58 UTC

Iain:

Thank you very kindly for replying to my post. Yes, I agree with you:
I will let this model continue to run. When I crunch a model like this,
I know that models like this [UK Met Office Coupled Model Full Resolution Ocean]
do not like to be interrupted. So I run my computer 24/7 and I do not suspend or exit BOINC
until the model has completed, usually approx. 21 days at 24/7 for my computer.
And not to worry if it turns out badly. No big deal :)

Byron
ID: 45670
james
Joined: 15 Dec 06
Posts: 13
Credit: 2,539,487
RAC: 0
Message 45732 - Posted: 28 Mar 2013, 0:00:44 UTC

A UK Met Office Coupled Model Full Resolution Ocean 7.07
completed (100%) approx. 5 hrs ago. It is now running "high priority".

Question: How much longer should it run?
ID: 45732
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 45733 - Posted: 28 Mar 2013, 0:05:32 UTC - in response to Message 45732.  

A UK Met Office Coupled Model Full Resolution Ocean 7.07
completed (100%) approx. 5 hrs ago. It is now running "high priority".

Question: How much longer should it run?

You will probably find that stopping and starting BOINC will convince the model to finish. All the trickles have been logged, so it has certainly done all that the project requires it to do.
ID: 45733
james
Joined: 15 Dec 06
Posts: 13
Credit: 2,539,487
RAC: 0
Message 45735 - Posted: 28 Mar 2013, 14:08:34 UTC - in response to Message 45733.  
Last modified: 28 Mar 2013, 14:09:06 UTC

Thanks, Iain. Good as done.
ID: 45735
Ray Murray
Joined: 7 Aug 04
Posts: 50
Credit: 548,730
RAC: 0
Message 45895 - Posted: 11 Apr 2013, 21:13:58 UTC

Strange -226 error I've not had before, with this WU. It seems it can't access the lock file, saying something else is using it, but it's the only CPDN WU on this machine since my EU finished yesterday.

<message>
too many boinc_temporary_exit()s
</message>

and a whole stack of:

04:30:07 (2840): Can't acquire lockfile (32) - waiting 35s
04:30:42 (2840): Can't acquire lockfile (32) - exiting
04:30:42 (2840): Error: The process cannot access the file because it is being used by another process. (0x20)

before it gave up and phoned home the error.

Thing is, it's still running [scratches head]
It's not reporting checkpoints to BOINC (I've got task_debug set in cc_config) and progress is stuck, although it's still working in the graphics. It must be writing to the data out folder, because exiting BOINC and even restarting the machine resumes it without losing any time. [Further head scratching]
It's due to trickle within the hour, so we'll see what happens then.
At least it got far enough to send the 75% decadal trickle, and these have been known to be twitchy about this point. I suspect I'll have to euthanase it.
ID: 45895
Ray Murray
Joined: 7 Aug 04
Posts: 50
Credit: 548,730
RAC: 0
Message 45898 - Posted: 11 Apr 2013, 23:11:13 UTC
Last modified: 11 Apr 2013, 23:22:48 UTC

Just missed the edit deadline, so here's the update:

The trickle went up fine and registered. Further digging shows that it re-downloaded some files (atmos, ocean, etc.) on restart of the machine, and also a lot of

task] Process for hadcm3n_3jqp_1940_40_008265630_1 exited, exit code 0, task state 1
11-Apr-2013 19:42:22 [climateprediction.net] [task] task called temporary_exit(600.000000, )
11-Apr-2013 19:42:22 [climateprediction.net] [task] task_state=UNINITIALIZED for hadcm3n_3jqp_1940_40_008265630_1 from handle_temporary_exit
11-Apr-2013 19:42:22 [climateprediction.net] Task hadcm3n_3jqp_1940_40_008265630_1 exited with zero status but no 'finished' file
11-Apr-2013 19:42:22 [climateprediction.net] If this happens repeatedly you may need to reset the project.
11-Apr-2013 19:42:22 [climateprediction.net] [task] task_state=UNINITIALIZED for hadcm3n_3jqp_1940_40_008265630_1 from handle_premature_exit

Not just for CPDN but other projects as well, and it has even generated new computer IDs on a couple of projects. I therefore suspected that it was a BOINC problem and have reinstalled it.
Checkpoints are now showing in BOINC and progress is up to level with the graphics, so I'm just going to pretend it's like having deployed a backup, similar to what we often had to do with the BBC models all those years ago, although I won't really know if it's worked until it gets to the final uploads (5 days away).
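For anyone wanting to check their own event log for the same symptoms, here is a rough sketch that counts the two message types quoted above. It only assumes the line layout visible in this post (timestamp, project in square brackets, then the text); the function name and the choice of what to count are mine, not part of BOINC.

```python
import re
from collections import Counter

# Layout as quoted in this post: "11-Apr-2013 19:42:22 [project] text"
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] (?P<text>.*)$"
)
NO_FINISHED_RE = re.compile(
    r"^Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def scan_event_log(lines):
    """Count temporary exits per project and premature exits per task."""
    temp_exits = Counter()  # keyed by project name
    premature = Counter()   # keyed by task name
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a timestamped event log line
        text = m.group("text")
        if "task called temporary_exit(" in text:
            temp_exits[m.group("project")] += 1
        nf = NO_FINISHED_RE.match(text)
        if nf is not None:
            premature[nf.group("task")] += 1
    return temp_exits, premature
```

Reading the log is then a matter of passing an open file to `scan_event_log`; the log file's location depends on the BOINC data directory on the machine in question. A pile of temporary exits across several projects at once, as seen here, points at BOINC or the machine rather than at any one model.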

And finally the smoking gun:
From wading through stdoutdae and stdoutdae.old, there are no CPDN checkpoints after a machine restart that followed a Windows update.
BEWARE WINDOWS UPDATE
Strange that other projects weren't affected until after the restart to try to fix CPDN, but all's well since the reinstall of BOINC, even if it did cost me a nearly finished T4T WU.

Off to bed happy I've sorted it and found the cause.
ID: 45898
KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 45903 - Posted: 12 Apr 2013, 1:36:43 UTC

hello

not sure if this will help in troubleshooting, but here is what i have:


4/11/2013 9:21:55 PM climateprediction.net Giving up on download of hadam3p_pnw_c1zs_1959_1_007935543.zip: file not found


frank
ID: 45903
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45905 - Posted: 12 Apr 2013, 4:46:52 UTC - in response to Message 45903.  

Frank

That model is from 18 April 2012, so the files will no longer be on the servers.


ID: 45905