Thread 'HADCM Geo-Eng Iceworld ?'

Author	Message
JimMcCarthy_StellarSolns Joined: 3 Sep 08 Posts: 23 Credit: 42,176,240 RAC: 10,608	Message 36121 - Posted: 11 Feb 2009, 0:35:06 UTC
I'm starting this new thread since the existing "Iceworld" number-crunching thread(s) pertain to HadSM3/HadSM3H models, and I need to report "Iceworld"-like behavior with a HadCM3-ivolc model (hadcm3ivolc_l4h1_2000_08_26002275_2) on one of my computers. Here's some hopefully relevant information:
1. Model/ResultID ... follow link to Task ID 8149927.
2. ABOUT 5 HOURS AGO, the graphic display for this model reported the current timestep as 959,639 of 2,073,960 (Phase 1 of 1), with a corresponding model date and time of 13/06/2056 11:30. Currently (5 HOURS LATER), the status is timestep 939,635 with a corresponding date and time of 13/06/2056 09:30, indicating it's gone backwards slightly since I last checked it. Via private message, I learned from Mo.v that: "If HADCMs get stuck they're supposed to repeat crunching from the last checkpoint (6-day period), then if that fails to get through the problem point, repeat the last model month, then from the start of the model year. It could be that your model is still trying these loops and may yet succeed in getting through its computational nightmare. If the looping process fails to save a HADCM they're supposed to crash. It's a long time since we had a report of a looper." The last trickle was reported on 28-Jan-2009 (but this machine is only connected to the internet one or two days per week), at timestep 1,425,600 (which doesn't make sense ... except maybe if it's looped way way back?). For what it may be worth, I have a WinZIP backup archive of my BOINC data (work) folders from 27-Jan-2009, so I could restore this model from that date.
3. 5 HOURS AGO, the globe graphic reported Hours Elapsed of 2298:48:08 (8.62 s/TS). Currently (5 HOURS LATER) the Hours Elapsed are 2305:25:50. Consistent with HadSM3 "Iceworld" behavior, these s/TS values are now much greater (slower) than the value reported during the last trickle on 28-Jan-2009 (which back then was 4.893 s/TS), when the model was working "normally".
4. The temperature display of the globe graphic is blue. Clouds only appear surrounding the coastline of Antarctica and over the north pole.
5. Which processor/CPU (e.g. Intel, AMD)? My processor is GenuineIntel, Intel(R) Xeon(TM) CPU 2.80GHz [x86 Family 15 Model 4 Stepping 10]. Operating system is Microsoft Windows XP Professional x86 Edition, Service Pack 3, (05.01.2600.00).
6. Overclocking? No. However, on this same machine I've had 2 previous HadSM3 models turn into iceworlds.
I won't abort the model or reset it back to the 27-Jan-2009 backup until after receiving replies in this forum indicating some consensus recommendation.
Regards, --Jim
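The s/TS figures quoted in the post above can be checked with a little arithmetic. The sketch below assumes (it is an inference from the numbers, not something the thread states) that the graphic's s/TS value is simply elapsed time divided by the current timestep; the function names are made up for illustration. Under that assumption the first reading reproduces the quoted 8.62 s/TS, and a timestep counter that moves backwards while elapsed time keeps growing pushes the figure up further.

```python
def parse_elapsed(hms: str) -> int:
    """Convert an 'H:MM:SS' string (hours may exceed 24) into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_per_timestep(elapsed: str, timestep: int) -> float:
    """Assumed definition: total elapsed seconds divided by current timestep."""
    return parse_elapsed(elapsed) / timestep


# Readings quoted in the post above.
earlier = seconds_per_timestep("2298:48:08", 959_639)
later = seconds_per_timestep("2305:25:50", 939_635)
baseline = 4.893  # s/TS at the last healthy trickle, as quoted

print(f"earlier reading: {earlier:.2f} s/TS")
print(f"later reading:   {later:.2f} s/TS")
print(f"slowdown vs healthy baseline: {earlier / baseline:.2f}x")
```

Because the timestep is the denominator, any model that loops back without resetting its elapsed time will look slower by this measure, whether or not each individual timestep really takes longer to compute.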

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 36122 - Posted: 11 Feb 2009, 11:39:19 UTC Last modified: 11 Feb 2009, 11:42:24 UTC
Thanks for the very detailed information, Jim. If this model is a HADCM looper of the type we've seen fairly frequently in the past it should by now have tried three loops of increasing length (the longest of which would be from the start of the previous December) and either got through the bad moment or crashed. But Jim's model doesn't fit the expected behaviour of HADCM loopers:
* the graphics are blue like a HADSM iceworld, whereas with HADCM loopers one would expect normal graphics while it performs the loops
* the timesteps Jim reports indicate it's gone back further than the beginning of its model year
The only model we can compare it with in that workunit hasn't trickled for over a week and was then a good way behind Jim's anyway. The only known way to get HADCM loopers that don't resolve themselves past the loop point is to transfer a backup made when the model was still healthy from AMD to Intel or vice-versa. But a number of people think this extraordinary solution keeps alive an abnormal model that should be allowed to crash or be aborted. I think all Jim's computers are Intels anyway, so the extraordinary solution isn't an option. Last year I think there was a report (maybe on Beta) of a model going back further than the start of its model year. Can anyone remember what caused that? Jim's model has been battling with its problem for nearly two weeks. The model is unlikely to resolve its problem spontaneously now. Does anyone think Jim should try restoring his backup made one day before the last healthy trickle, or abort the model? Is there any data Jim could collect before a restore or abort?
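The looper behaviour described above (retry from the last checkpoint, then from the start of the model month, then from the start of the model year, then crash) is an escalating-rollback pattern. The following is only an illustrative sketch of that pattern, not the model's actual recovery code: the level names, the `recover` function and the `rerun_from` callback are all hypothetical.

```python
from typing import Callable, Sequence

# Rollback points in the order described in the thread (names are illustrative).
ROLLBACK_LEVELS: Sequence[str] = (
    "last_checkpoint",
    "start_of_model_month",
    "start_of_model_year",
)


def recover(rerun_from: Callable[[str], bool]) -> str:
    """Try each rollback point in turn, shortest loop first.

    `rerun_from(level)` is a hypothetical callback that re-runs the model from
    the given point and returns True if it got past the problem timestep.
    Returns the level that succeeded, or "crash" if every loop failed.
    """
    for level in ROLLBACK_LEVELS:
        if rerun_from(level):
            return level
    return "crash"
```

A model following this pattern should never end up earlier than the start of its current model year, which is why the timesteps reported in this thread do not fit the usual looper profile.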

Richard Haselgrove Joined: 1 Jan 07 Posts: 1061 Credit: 36,716,561 RAC: 8,355	Message 36123 - Posted: 11 Feb 2009, 11:50:46 UTC
Well, Jim should obviously take a second independent backup of the task in its current state - taking care not to over-write the 27-Jan-2009 backup, of course - in case any further questions come to light after he takes whatever course of action we decide here.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 36124 - Posted: 11 Feb 2009, 12:07:50 UTC Last modified: 11 Feb 2009, 12:17:12 UTC
The current timestep Jim's seeing, approx 939,635, doesn't correspond to the model date in June 2056 that he's seeing simultaneously. The timestep corresponds to a model date about 17 years earlier, about 2040. The timesteps he's seeing, 939,635 to 959,639, include a much earlier trickle point. They're not all within one model year. But the earlier trickles must have all included good data because the model's graphs look fine.

JimMcCarthy_StellarSolns Joined: 3 Sep 08 Posts: 23 Credit: 42,176,240 RAC: 10,608	Message 36125 - Posted: 11 Feb 2009, 17:01:47 UTC
As I stated earlier, this computer is only connected to the internet once or twice per week (it's a "spare" computer I have at work and watch over now and then). As I'm not using it routinely, I've made it available for visitor/guest use, and since it dual boots Windows and Linux, colleagues and guest users can shut it down and reboot it at will (whether or not all Windows shutdowns are "graceful", I cannot say for sure, but it's likely BOINC manager is still running the 4 tasks when/if even a "graceful" Windows shutdown is requested).
Looking at the other CPDN models running on this computer, I see trickle uploads on 28-Jan-2009, 03-Feb-2009, 06-Feb-2009, and 10-Feb-2009, suggesting whatever anomaly may have sent this model into a tailspin happened between the 28-Jan trickle and the next network-open opportunity on 03-Feb. As the trickle dates are GMT and the date I quoted for my backup was local time (GMT -8 hours), I expect these were nearly coincident in time (i.e., the backup is not a day earlier than the last trickle -- sorry for the confusion).
My sporadic CPDN maintenance routine "usually" goes like this ...
1. Re-open network connection to internet (requires manual authentication to get through a company firewall).
2. In BOINC manager, change from "Network Activity Suspended" to one of the other choices.
3. Prompt BOINC manager to communicate over network.
4. Wait while any pending uploads/downloads occur, monitoring the BOINC manager "Transfers" tab.
5. When completed, "Suspend" all tasks.
6. Shut down connected client, and wait for CPU time reported for all tasks to stop increasing.
7. Close down BOINC manager window, and close (Exit) the BOINC taskbar icon.
8. Back up entire BOINC data directory tree using WinZIP.
9. Shut down the machine, reboot it, and log in.
10. Open BOINC manager and "Resume" all tasks.
11. Suspend network activity (since firewall authentication will expire after a few hours of inactivity and in any case overnight).
(If I don't have time at any particular moment to do a backup [or it's only been a few days since the last backup], I just do steps 1 -- 4 and 11 to allow network communication to occur.)
Is this a reasonably good step-by-step procedure for me to be using on this machine?
Given the chance for less graceful (as far as BOINC is concerned) shutdowns in between my sporadic checking-in on things, and the big negative delta-time-step between the 28-Jan trickle (timestep 1,425,600) and my belated discovery yesterday of the "iceworld" graphic appearance (timestep 939,635 reported), I'd be inclined to restore this model task from the 27-Jan-2009 backup (GMT 28-Jan-2009) I made just after that trickle, and see what happens. Presumably, if this model task had been healthy on 03-Feb-2009, it would have reported a trickle while I had the network open on that date, so if the restored model task encounters the same anomaly, the repeated failure ought to be apparent in less than a week.
Maybe one "lesson learned" here (particularly if this model task runs fine once restored from backup) is that I need to post instructions near this machine for colleagues / visitors, "Please follow these steps to shut down Windows XP and reboot" ... if steps 5 and 6 from my recipe should be performed before every otherwise "graceful" shutdown of Windows? (Unfortunately that also means I must rely on colleagues / visitors / guest-users to perform step 10 whenever Windows is rebooted, otherwise days could go by with the computer running but model tasks remaining in the "suspended" state ... assuming there's no automatic "resume tasks on restart" option in BOINC manager I don't know about?)
Thanks for any additional advice or comments, -- Jim
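Step 8 of the routine above (archiving the BOINC data directory once the client has been shut down) could be scripted along the following lines. This is a minimal sketch using only the Python standard library rather than WinZIP; the function name, archive naming scheme and example paths are assumptions, not anything BOINC provides. It stamps each archive with the UTC time, which sidesteps the local-time/GMT confusion mentioned above, and it refuses to overwrite an existing backup, in line with Richard's advice earlier in the thread. It assumes the backup folder lies outside the data directory and that BOINC has already been fully exited so no files are locked.

```python
import zipfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def backup_boinc_data(data_dir: Path, backup_dir: Path,
                      now: Optional[datetime] = None) -> Path:
    """Zip every file under data_dir into a new UTC-stamped archive."""
    now = now or datetime.now(timezone.utc)
    backup_dir.mkdir(parents=True, exist_ok=True)
    archive = backup_dir / f"boinc-data-{now:%Y%m%dT%H%M%SZ}.zip"
    # Mode "x" raises FileExistsError instead of overwriting an older backup.
    with zipfile.ZipFile(archive, mode="x",
                         compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(data_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to data_dir so a restore recreates the tree.
                zf.write(path, path.relative_to(data_dir).as_posix())
    return archive


# Hypothetical usage (paths are examples only; run after exiting BOINC):
# backup_boinc_data(Path(r"C:\BOINC_data"), Path(r"D:\boinc_backups"))
```

Keeping every archive rather than rotating a single file is what makes it possible to fall back to an older backup if the newest one turns out to have captured an already-damaged model.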

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 36126 - Posted: 11 Feb 2009, 19:34:46 UTC - in response to Message 36122.
Hi, this is a different Jim. About the other Jim's looping model: I am one of those people who tried to save a looping HadCM3 model by transferring it from an Intel to an AMD processor. While it solved the immediate problem, the looping problem recurred every few years. I ultimately had to abort the WU. Still, since he is in 2056, it might be worth a try. The 2009 backup is worth a try also, even if he does lose 40 years! And Jim, for heaven's sake make backups more often. I've learned the hard way to make one every morning.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 36128 - Posted: 12 Feb 2009, 12:06:24 UTC Last modified: 12 Feb 2009, 12:07:04 UTC
That looks to me like a good routine, JimMc. I don't think your model is either a typical iceworld or a typical looper. It could be that if the computer was turned off without exiting from BOINC, the model's computation has been thrown into chaos. I'd try restoring that backup (after backing up the model in its current state) and if you see any further signs of model madness, abort it. It would indeed be a good idea to stick a Post-it beside this computer's screen with, at the very least, instructions for exiting from BOINC before shutting the computer down. If you refer to the BOINC icon as the 'fried-egg-on-grill' everyone should recognise it. If you have BOINC installed as a service, make sure you tell people to carry out the 2-stage BOINC exit as explained at the end of this post.

JimMcCarthy_StellarSolns Joined: 3 Sep 08 Posts: 23 Credit: 42,176,240 RAC: 10,608	Message 36130 - Posted: 13 Feb 2009, 10:23:19 UTC
Mo.v -- Following the link you provided, I find:
Q. How do I shut BOINC down now? Exiting BOINC Manager, BOINC and the applications keep on running! And how do I get it running afterwards?
A. There are a couple of ways to shut down the service. The easiest method to shut down BOINC and get it running again is to go through BOINC Manager->Advanced view->Advanced->Shut down connected client. This will shut down the service. Next you go File->Exit to close down BOINC Manager. To start BOINC back up, go Start->Programs->BOINC->BOINC Manager. This will start up BOINC Manager, which in turn starts the service.
This seems to suggest (or at least, doesn't recommend against the idea) that the "Shut down connected client" (my step #6) can be done without first "Suspend"ing all model tasks running on the machine (my step #5). Sorry, the "wait for CPU time to stop accumulating" belongs attached to step #5, not to step #6 as I wrote [too quickly] in my earlier post -- i.e., I "suspend" the tasks then wait to confirm they all stop accumulating CPU before going on to step #6 and shutting down the connected client.
Omitting step #5 from the Post-It-Note(TM) instructions would retire my concern about the computer (and BOINC service) restarting whenever someone reboots it back to Windows, but with the CPDN model tasks then being left for days in the "suspended" state (if nobody is available or remembers to perform step #10) ... but is omitting step #5 okay, based on the experience of those reading this thread?
All-caps JIM says he performs backups every day ... assuming he shuts down the BOINC service completely before initiating backups (necessary in my experience to avoid WinZIP not being able to access certain files in the BOINC data directory "locked" by the service whenever it's running), does he "suspend" all the CPDN model tasks before shutting down BOINC for each daily backup?
Thanks much, -- Jim
P.S. Apologies if this post seems somewhat tangential to the thread topic, but my gut feeling matches Mo.v's that what sent this model into chaos was most likely a not-sufficiently-graceful shutdown (abrupt turn off?) of the computer, and so it's relevant what steps belong on the Post-It-Note(TM), as a bare minimum....

JimMcCarthy_StellarSolns Joined: 3 Sep 08 Posts: 23 Credit: 42,176,240 RAC: 10,608	Message 36233 - Posted: 26 Feb 2009, 23:11:28 UTC
Just an update:
1) It took me an extra week to find time to get back to this model and take any action (so it continued to run for an extra week since my last report to this thread), and when I did so I found the model was still (a week later) stuck in a loop around timestep 939,635 --- i.e., it had not managed to work its way out of whatever trap it had fallen into.
2) Looking at the time-stamp of my backup from 27-Jan-2009, compared to the GMT time-stamp of the last successful trickle upload, I could not convince myself any evidence existed that the model worked normally after that backup (which also involved a system shutdown, although supposedly a 'graceful' one). So I became suspicious that the 27-Jan-2009 backup might not be trustworthy for this CPDN model task. And indeed when I did restore it from the 27-Jan-2009 backup, it went into "iceworld" mode almost immediately after resuming the model (within a handful of timesteps).
3) Therefore I decided instead to restore this model task from an earlier backup (about 1 week older), which I did on 17-Feb-2009. I'm happy to report back now that the model has been crunching normally for the last week or so and has now progressed beyond the point (last valid trickle reported was at timestep 1,425,600) around where the problem occurred previously, and is once again reporting new trickle uploads. As far as I can tell, all looks normal. The BOINC manager reports I still have another 710 CPU hours to completion, but with any luck this model task will indeed finish successfully (eventually).
Cheers, -- Jim

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 36234 - Posted: 26 Feb 2009, 23:22:30 UTC
I think the CPDN researchers are very fortunate to have members who make such imaginative efforts to get as many models as possible to completion.