Thread: 'Maximum CPU time exceeded' crash fixed

Author	Message
old_user22652 Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0	Message 33479 - Posted: 20 Apr 2008, 16:52:15 UTC
Checking my computers the other day, I found one model had crashed after 2276 hours of processing, approx. 87% complete. I restored a backup from the previous day, but it crashed again at the same point, with the message “exceeded CPU time limit.” My thoughts were those of Douglas Adams’ bowl of petunias: “Oh no, not again!”
I managed to find a post from mo.v from 7/5/07 and followed the advice: find <rsc_fpops_bound> in client_state.xml, and double the number found there. It was, as she said, “surprisingly easy”. Well, the model is now processing well past 2276 hrs, and looking to complete in around two weeks. Many thanks, mo.v; petunias nothing.
John GW3PRV
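For anyone who would rather script the edit than do it by hand, here is a minimal sketch of the same fix: double the <rsc_fpops_bound> value for one workunit only. It is not an official tool. The function name, the name_fragment parameter and the sample values are made up for illustration, and it assumes each workunit sits in its own <workunit>...</workunit> block with the bound written as a plain decimal number. As far as I understand it, BOINC should be shut down completely before client_state.xml is changed, and a backup copy of the file is a sensible precaution.

```python
import re

# Matches one <rsc_fpops_bound> element and captures its numeric value.
BOUND_RE = re.compile(r"(<rsc_fpops_bound>)\s*([0-9.eE+\-]+)\s*(</rsc_fpops_bound>)")
# Matches one whole <workunit> block (assumed layout, see the note above).
WORKUNIT_RE = re.compile(r"<workunit>.*?</workunit>", re.DOTALL)


def double_fpops_bound(state_text, name_fragment):
    """Return state_text with rsc_fpops_bound doubled in every
    <workunit> block whose text contains name_fragment."""

    def double_value(match):
        doubled = float(match.group(2)) * 2
        return f"{match.group(1)}{doubled:f}{match.group(3)}"

    def fix_block(match):
        block = match.group(0)
        if name_fragment not in block:
            return block  # leave other workunits untouched
        return BOUND_RE.sub(double_value, block)

    return WORKUNIT_RE.sub(fix_block, state_text)


# Hypothetical sample; real files contain many more elements and far larger numbers.
sample = (
    "<workunit>\n<name>wu_a</name>\n"
    "<rsc_fpops_bound>2100.000000</rsc_fpops_bound>\n</workunit>\n"
    "<workunit>\n<name>wu_b</name>\n"
    "<rsc_fpops_bound>4200.000000</rsc_fpops_bound>\n</workunit>\n"
)
print(double_fpops_bound(sample, "wu_a"))
```

To apply it to a real file, read client_state.xml into a string, pass it through double_fpops_bound with part of the workunit name, and write the result back. Working on the text rather than re-serialising the XML leaves everything else in the file byte-for-byte as it was.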

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 33481 - Posted: 20 Apr 2008, 17:51:20 UTC
Twin congratulations on having a backup, and also getting it working again :-) I take it this was a model which had migrated from a slower computer to a faster one?
I'm a volunteer and my views are my own.

old_user22652 Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0	Message 33482 - Posted: 20 Apr 2008, 19:05:41 UTC - in response to Message 33481.
No, just a boring old Athlon 4000. I've become very enthusiastic about backups recently, having had several crashes in the last week or so. Perhaps "paranoid" would be a better word.

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 33483 - Posted: 20 Apr 2008, 21:40:18 UTC
Curious... were the benchmarks on that PC ever unusually high?

Iain Inglis Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317	Message 33484 - Posted: 20 Apr 2008, 21:52:14 UTC
Could this be another side-effect of the 8-zip 160-year models? This looks to be the one: 7123896, and it's the right date ...

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 33485 - Posted: 20 Apr 2008, 23:15:41 UTC
Hi GW, I'm glad that the method helped save your model too. When this happened to a model of mine some time after moving it to a much faster computer, I was lucky that Thyme Lawn immediately diagnosed the cause and that I'd also been making backups.
There was a bunch of 160-year HADCM workunits issued about 4 months ago where the fpops limit was preset lower than usual. Most of these models will be perfectly OK and will complete within the lower limit set. But if any of them, for example, loop a few extra years, or crash for other reasons and have to crunch extra years after being restored, or run on a computer that is frequently turned off, sending the model back to the last checkpoint, they may get precariously close to the preset maximum processing time allowed.
GW, would you mind changing the title of your thread to something like "'Maximum CPU time exceeded' crash fixed" so the topic becomes easier to find? (I don't mean to take away your success!)
Cpdn news

old_user22652 Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0	Message 33490 - Posted: 21 Apr 2008, 6:24:31 UTC - in response to Message 33485.
Iain has identified the right model. It has always run on this computer; I haven’t noticed any abnormal benchmarks.
Mo.v: the model is stopped every day to allow a backup to be made; I stop it without reference to the checkpoint counter, so I guess it may have had extra processing to do. The number in <rsc_fpops_bound> began with 21, which I doubled to 42. (Douglas Adams again?) Would this be one of the models you referred to?
John.

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 33492 - Posted: 21 Apr 2008, 8:03:32 UTC - in response to Message 33484. Last modified: 21 Apr 2008, 12:10:55 UTC
Iain Inglis wrote (Message 33484): "Could this be another side-effect of the 8-zip 160-year models? ..."
That's an interesting point, we could be seeing a lot more of these over the next few months ... it may be worth a few extra stickies?
old_user22652 wrote (Message 33490): "Iain has identified the right model. It has always run on this computer; I haven’t noticed any abnormal benchmarks. ..."
I think Iain has identified the cause - there were a bunch of models produced which thought they were 80-year models when they were really 160-year models. So they had the lower limits on CPU usage and so forth. Fortunately, being coupled models, most of the important data is uploaded in the trickles as the model runs, so reaching the end, while ideal, is not critical.
-- Edit: Fixed attribution :-)

old_user428438 Joined: 1 Feb 07 Posts: 26 Credit: 885,216 RAC: 0	Message 33498 - Posted: 21 Apr 2008, 10:14:11 UTC - in response to Message 33492. Last modified: 21 Apr 2008, 10:14:55 UTC
MikeMarsUK wrote (Message 33492, as first posted): "That's an interesting point, we could be seeing a lot more of these over the next few months ... it may be worth a few extra stickies? ... I think Les has identified the cause - there were a bunch of models produced which thought they were 80-year models when they were really 160-year models. So they had the lower limits on CPU usage and so forth. Fortunately, being coupled models, most of the important data is uploaded in the trickles as the model runs, so reaching the end, while ideal, is not critical."
I have one of these models here, which I suspended several weeks ago. I had edited the client_state file as documented here and it happily crunched to the 80-year mark and returned 8 zip files. Shortly thereafter it reset to zero CPU time. Since there was a wingman who seemed to be progressing satisfactorily, I suspended that WU, allowed a shorter model to crunch through, and then started on another 160-year model.
It now appears that my co-cruncher on this model may have hit a problem, since his last trickle was on 20 March (see here). Once I have completed my current model (in just over a month), I could go back to this one. But I guess that, since the time was reset, that would give 35 days of duplicate trickles before it reaches the point at which it reset, and then I find out if it resets again!!
My current target is to have at least as many successful runs as failed runs on my account, so I am looking for any thoughts on anything further that I can check to increase the likelihood that my next attempt with this model will succeed.
F.

Iain Inglis Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317	Message 33500 - Posted: 21 Apr 2008, 11:45:51 UTC - in response to Message 33498. Last modified: 21 Apr 2008, 11:47:24 UTC
old_user428438 wrote (Message 33498): "... My current target is to have at least as many successful runs as failed runs on my account, so I am looking for any thoughts on anything further that I can check to increase the likelihood that my next attempt with this model will succeed. F."
The situation is somewhat complicated by another bug - the contraction of some model error reports. So, the model with most progress in that work unit has crashed with a 22 error, which could be PC-induced but could also be 'negative pressure' etc. - the necessary text is missing. The machine running that model is a Core 2 Duo and was running another long model in tandem, which finished on 9 April; that user also uploads trickles in batches, but has downloaded another long model. I doubt the crashed model will be finished.
From your point of view, the model is going to be a failure unless you carry on with it. The model can clearly get further, as someone on the same processor/operating-system combination did. However, whether a 'negative pressure' or such awaits just short of completion is unclear.
There have been occasions where the CPU time has spuriously reset to zero, but the model has carried on as normal (though usually on Linux); the only consequence of that is a spurious seconds/timestep calculation. Someone else may have an idea what would cause a real rewind to the start (in a coupled model).

old_user22652 Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0	Message 33504 - Posted: 21 Apr 2008, 14:12:03 UTC - in response to Message 33492.
From MikeMarsUK: “That's an interesting point, we could be seeing a lot more of these over the next few months ...”
Here’s another, which again crashed at 87%:
21/04/2008 14:46:54|climateprediction.net|Aborting task hadcm3istd_9am9_1920_160_15921938_5: exceeded CPU time limit 7618534.482759
The <rsc_fpops_bound> number again began with 21, which I’ve doubled; we’ll see how it goes.
John
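To put that log line in perspective, here is a small sketch that pulls the task name and limit out of the message and converts the limit to hours. It assumes the figure is in seconds, which the thread does not state, and the regular expression is written for this one message format only. On that reading the limit works out to about 2,116 hours, roughly 88 days of CPU time. My understanding, not confirmed anywhere in this thread, is that the client derives the limit by dividing <rsc_fpops_bound> by the host's benchmarked speed, which would explain both why doubling the bound doubles the allowance and why the moderators asked about benchmarks and faster machines.

```python
import re

# The message quoted in the post above, with the column separators restored.
LOG_LINE = (
    "21/04/2008 14:46:54|climateprediction.net|Aborting task "
    "hadcm3istd_9am9_1920_160_15921938_5: exceeded CPU time limit 7618534.482759"
)

# Pattern written for this one message format (an assumption, not a BOINC spec).
LIMIT_RE = re.compile(r"Aborting task (\S+): exceeded CPU time limit ([0-9.]+)")


def parse_limit(line):
    """Return (task_name, limit) from a CPU-time-limit abort message."""
    match = LIMIT_RE.search(line)
    if match is None:
        raise ValueError("not a CPU-time-limit message")
    return match.group(1), float(match.group(2))


task, limit_seconds = parse_limit(LOG_LINE)
limit_hours = limit_seconds / 3600  # assumes the limit is given in seconds

print(f"{task}: limit of about {limit_hours:.0f} h ({limit_hours / 24:.0f} days)")
# If the limit scales linearly with the bound, doubling the bound doubles it:
print(f"after doubling the bound: about {2 * limit_hours:.0f} h")
```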

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 33508 - Posted: 21 Apr 2008, 17:57:55 UTC
Thanks for the news of that extra model, which I found.
On your list of computers (what a lot there are!) http://climateapps2.oucs.ox.ac.uk/cpdnboinc/hosts_user.php?userid=22652 I think that on computers #7, 8 and 11 there are also 160-year models with the incorrect fpops number. These are all models that were created on 20 Dec or 2 Jan, though you probably downloaded them on later dates.
You could wait to see whether they crash or not, or you could edit their fpops number now if you prefer. Then you will be almost certain that they won't crash.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 33509 - Posted: 21 Apr 2008, 18:06:21 UTC
Hi again GW. On your computer #10 you have Task/result #7181198, which was created on 11 January: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7181198
I think the fpops value for this model will be normal because it was created later and is part of a different batch.

old_user22652 Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0	Message 33510 - Posted: 21 Apr 2008, 18:43:46 UTC - in response to Message 33508.
Many thanks, Mo. In fact I have already hunted out the ones with low fpops numbers and edited them pre-emptively.
John.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 33511 - Posted: 21 Apr 2008, 19:21:49 UTC
Preemptive editing is the way to go. Many thanks for all your very patient help, which has been useful to us in identifying which batches of models are affected.
Was the model that I mentioned 2 posts above (created 11 Jan) normal?

old_user22652 Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0	Message 33513 - Posted: 21 Apr 2008, 20:22:48 UTC - in response to Message 33511.
Sorry, I meant to mention that one. Yes, quite normal: fpops number began with 42.
John