Message boards : Number crunching : Reporting - Errors while computing -
Author | Message |
---|---|
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Like Les, I don't think the Africa models will come onto this main project for a while yet. The main researcher for this Africa project is Friederike Otto in Oxford. She's German and has been in the UK about five years. She's just very recently been told that the funding for the Africa project has come through. Myles and Friederike are planning a meeting soon in Nairobi to discuss the project with African researchers. The ANZ research will, as far as I know, be carried out in Hobart, Tasmania. Myles has also been there to discuss this project with the Australians. I get the impression that most of the research using the regional models concentrates on attribution studies, trying to calculate whether and to what extent climate change is responsible for particular weather phenomena. I wouldn't be at all surprised if the Australians want to look at whether climate change has played a part in causing the atrocious hot summers and drought there in the last few years, or whether it's just freak random bad luck to be expected from time to time. Things have come a long way since the days when we only had one type of model and all the research was carried out in Oxford. Cpdn news |
Send message Joined: 30 Jan 12 Posts: 38 Credit: 10,197,388 RAC: 0 |
Thanks guys, that brings a lot of things into focus. I'm looking forward to working on ANZ. |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
I have just downloaded the model hadcm3n_o44g_2140_40_008281590 (UK Met Office Coupled Model Full Resolution Ocean v6.07, year 2140). It was created on 13 Jan 2013 at 23:46:25 UTC. As you can see, the two other computers have both crashed this model after 20 trickles with the error "STWORK : I/O error - PP fixed length header". <core_client_version>7.0.28</core_client_version> I haven't started crunching this model yet, so should I abort it? Also, I'm just curious: what does the error "Model crashed: STWORK : I/O error - PP fixed length header tmp/pipe_dummy 2048 ..." mean? On this, my fastest computer, I'm running only one project, Climate Prediction.net, 24/7/365, with no GPU apps. It has 8 physical CPUs and no hyper-threading. So I now have 8 UK Met Office Coupled Model Full Resolution Ocean v6.07 models running 24/7, very nicely so far: one each for the years 1960, 1940 and 1920, and five for the year 1880. All 8 models are crunching with no problems so far. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
The 2140 models are crashing at 75%, if they haven't already crashed for some other reason by that point. I would definitely abort it. I'm not sure what the error means, but usually when a bad batch goes out, an ancillary file is not set up correctly, so the models cannot continue past the common point at which they crash. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
This model doesn't belong to a whole defective batch; I've looked at several workunits before it and several after, and a good proportion are completing. However, your workunit isn't the only instance of this error: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8432723 Here three tasks from the same WU have crashed at the same point, though all three computers seem to be decent model crunchers. (One of those computers has BOINC 5.10.something. Should we try to contact this member? I'm surprised that BOINC 5 can still crunch these models, though I bet the owner can't see the graphics.) So I think the same thing would happen to your model, particularly as you have the same OS as the computers that have already had that crash. I don't know what the error means except to say that when you see the model trying to recover 5 times then crashing the 6th time (as is the case here) it often means the problem lies within the model. The models are fortunately designed not to try to recover indefinitely. If you're watching the graphics when this sort of thing happens you see the globe window go black for a second. The model restarts from the last timestep, then the same thing happens again at exactly the same timestep. On the sixth crash it doesn't loop back and try again. STWORK is in uppercase. This often indicates a fault within the model (cf REPLANCA). This isn't very scientific I'm afraid, just things I've noticed. I've found a very old file from the National Centre for Atmospheric Science which appears to be part of the design for the Unified Model (which is what all our models are based on) for the Met Office: http://cms.ncas.ac.uk/code_browsers/UM4.5/UMbrowser/html_code/UM/STWORK1A.F.html STWORK seems to introduce a subroutine and at the end there's a possible error message very similar to yours. But we mods aren't model programmers. Cpdn news |
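The recovery behaviour described above (restart, fail again at the same timestep, give up on the sixth crash) can be pictured as a bounded retry loop. This is only an illustrative Python sketch, not the model's actual code: `ModelCrash`, `MAX_RECOVERIES` and `run_with_bounded_recovery` are hypothetical names, and the limit of five is simply the number observed in this thread.

```python
class ModelCrash(Exception):
    """Hypothetical error raised when a model timestep fails."""


MAX_RECOVERIES = 5  # assumption: five recoveries, as observed in this thread


def run_with_bounded_recovery(step, first_timestep, last_timestep):
    """Run step(t) for each timestep, retrying a failed timestep.

    Returns the number of recoveries used. Re-raises ModelCrash once the
    run has crashed MAX_RECOVERIES + 1 times in total.
    """
    recoveries = 0
    t = first_timestep
    while t <= last_timestep:
        try:
            step(t)
            t += 1
        except ModelCrash:
            if recoveries == MAX_RECOVERIES:
                raise  # sixth crash: stop instead of looping forever
            recoveries += 1  # retry the same timestep
    return recoveries
```

A fault that lies within the model fails identically on every retry, so the sixth crash ends the run; a transient fault is absorbed by one of the retries and the run carries on.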
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Mo, Every single 2140 task I've seen has crashed at 75% or before. Yes there are work units near the one that contains Byron's listed task, that haven't had all tasks crash, but they are not 2140 work units. Byron, abort it if you haven't already. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
You're right. The batch contains models for several different 40-year periods and it's just the 2140 WUs that generate this error. Cpdn news |
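Spotting that only one start year is failing is easier if crashes are grouped by the year embedded in the task name. A minimal Python sketch, assuming hadcm3n names follow the layout seen in this thread (model, run id, start year, length in years, numeric id, optional task suffix); the helper names are hypothetical and other model types may use a different layout.

```python
from collections import Counter


def start_year(task_name):
    """Return the start year from a name like 'hadcm3n_o44g_2140_40_008281590'.

    Assumed layout: model, run id, start year, length in years, numeric id,
    and optionally a trailing task suffix such as '_0'.
    """
    fields = task_name.split("_")
    return int(fields[2])


def crashes_by_start_year(results):
    """Count crashed tasks per start year.

    `results` is an iterable of (task_name, crashed) pairs.
    """
    counts = Counter()
    for name, crashed in results:
        if crashed:
            counts[start_year(name)] += 1
    return counts
```

For the task discussed here, `start_year("hadcm3n_o44g_2140_40_008281590")` gives 2140.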
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
"... (One of those computers has BOINC 5.10.something. Should we try to contact this member? I'm surprised that BOINC 5 can still crunch these models, though I bet the owner can't see the graphics.) ..." There is at least one good reason for continuing with the 5-series BOINC, which is that it runs on some of Microsoft's server operating systems. However, this particular user has a range of machines with varying BOINC versions running on mostly Windows XP, with a couple of old-ish servers too. It appears to be a choice. |
Send message Joined: 5 Jun 06 Posts: 28 Credit: 2,790,048 RAC: 0 |
hadcm3n_zl88_1960_40_008321064_0 8472199 24 Feb 2013 17:06:24 UTC 15 Mar 2013 0:48:30 UTC Completed 788,994.61 786,033.00 --- 11,819.52 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_3i54_1980_40_008320817_0 8471952 24 Feb 2013 16:05:43 UTC 15 Mar 2013 4:45:39 UTC Completed 794,894.29 597,384.40 --- 11,508.48 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_3g30_1980_40_008320815_0 8471950 24 Feb 2013 16:05:43 UTC 2 Mar 2013 13:22:18 UTC Error while computing 451,201.48 393,044.10 6,220.80 6,220.80 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_3msh_1980_40_008320807_0 8471942 24 Feb 2013 16:05:43 UTC 4 Mar 2013 0:17:33 UTC Error while computing 517,030.49 312,994.20 6,842.88 6,842.88 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_zfuw_1920_40_008320605_0 8471740 24 Feb 2013 15:04:21 UTC 4 Mar 2013 0:18:10 UTC Error while computing 484,250.75 452,475.50 6,842.88 6,842.88 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_4jjh_1940_40_008303591_1 8454726 23 Feb 2013 15:31:44 UTC 4 Mar 2013 0:18:10 UTC Error while computing 556,161.88 417,148.50 8,087.04 8,087.04 UK Met Office Coupled Model Full Resolution Ocean v6.07
I noticed a Windows pop-up error with these. Basically it's asking whether you want to close the app! The two models that completed also encountered these, but I exited from BOINC, closed the error message and restarted the system, and the WUs completed, eventually. Any chance we could get a BOINC setting to allow tasks to run continuously until they complete? Trying to run 7 or 8 models probably isn't the wisest, so I run other projects when crunching for climate, but BOINC keeps jumping from project to project, even with a low cache and "switch between apps" set to 999 min. Sorry, too many model crashes! :-( BOINC-wide I'm seeing a big increase in task failures, something I attribute to Windows. So are these crashes related to the app or to Windows? PS. 5.10 is only required for domain controllers; it's not needed for member servers. (DCs don't have local accounts, which subsequent BOINC versions use.) |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
The failures are either download errors or models that have a physics error at some point - i.e. invalid theta or negative pressure. The worrying thing about the physics errors is that the models continue. As far as I know the models are not adaptive: they propagate a single state in time increments. If that state ever becomes invalid it should stay invalid. So how do these models get past the physics error? A possible reason is that the hardware is failing, causing the model to crash, restart and continue with the state propagated correctly (or not obviously incorrectly). Unfortunately, the recent models are ahead of the others in their work units or don't have like-for-like comparisons, so it isn't possible to check for parallel physics errors. The pop-up errors are also a sign of a machine problem. (I once had a berserk printer driver that caused constant pop-ups; most remain undiagnosed.) |
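The argument above rests on determinism, and a toy example makes it concrete. This is a minimal Python sketch under a stated assumption: the update rule is purely deterministic, so replaying from the same checkpoint on the same machine reproduces the same state. `advance`, `run_from` and `differs_on_replay` are hypothetical names, and the update rule is an arbitrary stand-in, not anything from the real model.

```python
def advance(state):
    """Toy deterministic update: the next state depends only on the current one."""
    return (state * 1103515245 + 12345) % 2**31


def run_from(checkpoint, steps):
    """Propagate a single state forward `steps` increments from a checkpoint."""
    state = checkpoint
    for _ in range(steps):
        state = advance(state)
    return state


def differs_on_replay(checkpoint, steps, observed):
    """True if a replay from the checkpoint disagrees with an earlier result.

    With a deterministic model a disagreement cannot come from the model
    itself, which points at the environment (for example failing hardware).
    """
    return run_from(checkpoint, steps) != observed
```

If a restart gets past a timestep that previously produced an invalid state, the two runs disagreed somewhere, which is why the suspicion falls on the machine rather than the physics.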
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
I have downloaded the following model:
name: hadcm3n_zl8m_1920_40_008280645
application: UK Met Office Coupled Model Full Resolution Ocean
created: 29 Dec 2012 15:07:36 UTC
As you can see, all three of the other computers have crashed this model after various numbers of trickles and with various <stderr_txt> messages:
Computer 1 ... (Windows 7): Exiting with 10 trickles received ... <![CDATA[
Computer 2 (Windows 8): Exiting with 1 trickle received ... <core_client_version>7.0.44</core_client_version>
Computer 3 (Linux 3.8.2-206.fc18.x86_64): Exiting with 10 trickles received ... <core_client_version>7.0.29</core_client_version>
I have not started to crunch this model yet. Is it worth spending some of my CPU cycles to see how far I get, or is it just a waste of CPU cycles and should I abort this model? Thanks in advance, Byron |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
Byron: In your position I would continue. Each of the three crashes has a log filled with numerous suspends and then slightly strange errors. The model may do better on a machine that provides a less stressed environment. (And my apologies in advance if it turns out badly!) Iain |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
Iain: Thank you very kindly for replying to my post. Yes, I agree with you; I will let this model continue to run. I know that models like this (UK Met Office Coupled Model Full Resolution Ocean) do not like to be interrupted, so I run my computer 24/7 and do not suspend or exit BOINC until the model has completed, usually approx. 21 days at 24/7 for my computer. And not to worry if it turns out badly, no big deal :) Byron |
Send message Joined: 15 Dec 06 Posts: 13 Credit: 2,539,487 RAC: 0 |
A UK MET Office Coupled Model Full Resolution Ocean 7.07 completed (100%) approx. 5 hrs ago. It is now running "high priority". Question: how much longer should it run? |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
"A UK MET Office Coupled Model Full Resolution Ocean 7.07" You will probably find that stopping and starting BOINC will convince the model to finish. All the trickles have been logged, so it has certainly done all that the project requires it to do. |
Send message Joined: 15 Dec 06 Posts: 13 Credit: 2,539,487 RAC: 0 |
Thanks, Iain. Good as done. |
Send message Joined: 7 Aug 04 Posts: 50 Credit: 548,730 RAC: 0 |
A strange -226 error that I've not had before with this WU. It seems it can't access the lock file because something else is using it, but it's the only CPDN WU on this machine since my EU finished yesterday.
<message> too many boinc_temporary_exit()s </message>
and a whole stack of:
04:30:07 (2840): Can't acquire lockfile (32) - waiting 35s
04:30:42 (2840): Can't acquire lockfile (32) - exiting
04:30:42 (2840): Error: The process cannot access the file because it is being used by another process. (0x20)
before it gave up and phoned home the error. Thing is, it's still running [scratches head]. It's not reporting checkpoints to BOINC (I've got task_debug set in cc_config) and progress is stuck, although it is still working in the graphics. It's writing data to the data out folder, as exiting BOINC and even restarting the machine restarts it without losing any time. [Further head scratching] It's due to trickle within the hour, so we'll see what happens then. At least it got far enough to send the 75% decadal trickle, and these have been known to be twitchy around this point. I suspect I'll have to euthanise it. |
Send message Joined: 7 Aug 04 Posts: 50 Credit: 548,730 RAC: 0 |
Just missed the edit deadline, so here's the update: the trickle went up fine and registered. Further digging shows that it re-downloaded some files (atmos, ocean, etc.) on restart of the machine, and the log also has a lot of entries like these:
[task] Process for hadcm3n_3jqp_1940_40_008265630_1 exited, exit code 0, task state 1
11-Apr-2013 19:42:22 [climateprediction.net] [task] task called temporary_exit(600.000000, )
11-Apr-2013 19:42:22 [climateprediction.net] [task] task_state=UNINITIALIZED for hadcm3n_3jqp_1940_40_008265630_1 from handle_temporary_exit
11-Apr-2013 19:42:22 [climateprediction.net] Task hadcm3n_3jqp_1940_40_008265630_1 exited with zero status but no 'finished' file
11-Apr-2013 19:42:22 [climateprediction.net] If this happens repeatedly you may need to reset the project.
11-Apr-2013 19:42:22 [climateprediction.net] [task] task_state=UNINITIALIZED for hadcm3n_3jqp_1940_40_008265630_1 from handle_premature_exit
This happened not just for CPDN but for other projects as well, and it has even generated new computer IDs on a couple of projects. I therefore suspected a BOINC problem and have reinstalled it. Checkpoints are now showing in BOINC and progress is up to level with the graphics, so I'm just going to pretend it is like having deployed a backup, similar to what we often had to do with the BBC models all those years ago, although I won't really know if it's worked until it gets to the final uploads (5 days away). And finally the smoking gun: from wading through stdoutdae and stdoutdae.old, there are no CPDN checkpoints after a machine restart that followed a Windows update. BEWARE WINDOWS UPDATE. Strange that other projects weren't affected until after the restart to try to fix CPDN, but all's well since the reinstall of BOINC, even if it did cost me a nearly finished T4T WU. Off to bed, happy I've sorted it and found the cause. |
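Wading through a long client log by hand is tedious, so here is a rough Python sketch of how the temporary-exit entries quoted above could be picked out automatically. It assumes each entry has the form "DD-Mon-YYYY HH:MM:SS [project] message" as in the quoted lines; `LINE` and `temporary_exits` are hypothetical names, and month abbreviations are parsed in the current locale, so an English locale is assumed.

```python
import re
from datetime import datetime

# Matches lines such as:
# 11-Apr-2013 19:42:22 [climateprediction.net] [task] task called temporary_exit(600.000000, )
LINE = re.compile(
    r"^(\d{1,2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[([^\]]+)\] (.*)$"
)


def temporary_exits(lines):
    """Return (timestamp, project) for each line that logs a temporary exit."""
    events = []
    for line in lines:
        match = LINE.match(line.strip())
        if match and "temporary_exit(" in match.group(3):
            stamp = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events
```

Passing the lines of a log file to `temporary_exits` gives a list that can be counted per project, which would show quickly whether the exits are confined to CPDN or spread across projects.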
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
Hello, not sure if this will help in troubleshooting, but here is what I have:
4/11/2013 9:21:55 PM climateprediction.net Giving up on download of hadam3p_pnw_c1zs_1959_1_007935543.zip: file not found
Frank |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Frank, that model is from 18 April 2012, so the files will no longer be on the servers. |
©2024 cpdn.org