Message boards : Number crunching : 'Maximum CPU time exceeded' crash fixed
Author | Message |
---|---|
Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0 |
Checking my computers the other day, I found one model had crashed after 2276 hours of processing, approx. 87% complete. I restored a backup from the previous day, but it crashed again at the same point, with the message “exceeded CPU time limit.” My thoughts were those of Douglas Adams’ bowl of petunias: “Oh no, not again!” I managed to find a post from mo.v from 7/5/07 and followed the advice: find <rsc_fpops_bound> in client_state.xml, and double the number found there. It was, as she said, “surprisingly easy”. Well, the model is now processing well past 2276 hrs, and looking to complete in around two weeks. Many thanks, mo.v; petunias nothing. John GW3PRV |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Twin congratulations on having a backup, and also getting it working again :-) I take it this was a model which had migrated from a slower computer to a faster one? I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0 |
No, just a boring old Athlon 4000. I've become very enthusiastic about backups recently, having had several crashes in the last week or so. Perhaps "paranoid" would be a better word. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Curious... were the benchmarks on that PC ever unusually high? I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
Could this be another side-effect of the 8-zip 160-year models? This looks to be the one: 7123896, and it's the right date ... |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi GW, I'm glad that the method helped save your model too. When this happened to a model of mine some time after moving it to a much faster computer, I was lucky that Thyme Lawn immediately diagnosed the cause and that I'd also been making backups. There was a bunch of 160-year HADCM workunits issued about 4 months ago where the fpops limit was preset lower than usual. Most of these models will be perfectly OK and will complete within the lower limit set. But if any of them, for example, loop a few extra years, or crash for other reasons and have to crunch extra years after being restored, or the computer is frequently turned off, sending the model back to the last checkpoint, they may get precariously close to the preset maximum processing time allowed. GW, would you mind changing the title of your thread to something like "'Maximum CPU time exceeded' crash fixed" so the topic becomes easier to find? (I don't mean to take away your success!) Cpdn news |
Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0 |
Iain has identified the right model. It has always run on this computer; I haven’t noticed any abnormal benchmarks. Mo.v: the model is stopped every day to allow a backup to be made; I stop it without reference to the checkpoint counter, so I guess it may have had extra processing to do. The number in <rsc_fpops_bound> began with 21, which I doubled to 42. (Douglas Adams again?) Would this be one of the models you referred to? John. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
“Could this be another side-effect of the 8-zip 160-year models?” That's an interesting point, we could be seeing a lot more of these over the next few months ... it may be worth a few extra stickies? “Iain has identified the right model. It has always run on this computer; I haven’t noticed any abnormal benchmarks.” I think Iain has identified the cause: there were a bunch of models produced which thought they were 80-year models when they were really 160-year models. So they had the lower limits on CPU usage and so forth. Fortunately, being coupled models, most of the important data is uploaded in the trickles as the model runs, so reaching the end, while ideal, is not critical. -- Edit: Fixed attribution :-) I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 1 Feb 07 Posts: 26 Credit: 885,216 RAC: 0 |
“Could this be another side-effect of the 8-zip 160-year models?” I have one of these models here which I suspended several weeks ago. I had edited the client_state file as documented here and it happily crunched to the 80-year mark and returned 8 zip files. Shortly thereafter it reset to zero CPU time. Since there was a wingman who seemed to be progressing satisfactorily, I suspended that WU and allowed a shorter model to crunch through and then started on another 160-year model. It now appears that my co-cruncher on this model may have hit a problem, since his last trickle was on 20 March (see here). Once I have completed my current model (in just over a month), I could go back to this one, but I guess that, since the time was reset, that would give 35 days of duplicate trickles before it reaches the point at which it reset, and then I find out if it resets again!! My current target is to have at least as many successful runs as failed runs on my account, so I am looking for any thoughts on anything further that I can check to increase the likelihood that my next attempt with this model will succeed. F. |
Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
“... My current target is to have at least as many successful runs as failed runs on my account so I am looking for any thoughts on anything further that I can check to increase the likelihood that my next attempt with this model will succeed.” The situation is somewhat complicated by another bug - the contraction of some model error reports. So, the model with most progress in that work unit has crashed with a 22 error, which could be PC-induced but could also be 'negative pressure' etc. - the necessary text is missing. The machine running that model is a Core 2 Duo and was running another long model in tandem, which finished on 9 April; that user also uploads trickles in batches, but has downloaded another long model. I doubt the crashed model will be finished. From your point of view, the model is going to be a failure unless you carry on with it. The model can clearly get further, as someone on the same processor/operating-system combination did. However, whether a 'negative pressure' or similar error awaits just short of completion is unclear. There have been occasions where the CPU time has spuriously reset to zero, but the model has carried on as normal (though usually on Linux); the only consequence of that is a spurious seconds/timestep calculation. Someone else may have an idea what would cause a real rewind to the start (in a coupled model). |
Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0 |
From MikeMars UK: “That's an interesting point, we could be seeing a lot more of these over the next few months ...” Here’s another, which again crashed at 87%: 21/04/2008 14:46:54|climateprediction.net|Aborting task hadcm3istd_9am9_1920_160_15921938_5: exceeded CPU time limit 7618534.482759 The <rsc_fpops_bound> number again began with 21, which I’ve doubled; we’ll see how it goes. John |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Thanks for the news of that extra model, which I found. On your list of computers (what a lot there are!) http://climateapps2.oucs.ox.ac.uk/cpdnboinc/hosts_user.php?userid=22652 I think that on computers #7, 8 and 11 there are also 160-year models with the incorrect fpops number. These are all models that were created on 20 Dec or 2 Jan, though you probably downloaded them on later dates. You could wait to see whether they crash or not, or you could edit their fpops number now if you prefer. Then you will be almost certain that they won't crash. Cpdn news |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi again GW. On your computer #10 you have Task/result #7181198, which was created on 11 January: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7181198 I think the fpops value for this model will be normal because it was created later and is part of a different batch. Cpdn news |
Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0 |
Many thanks, Mo. In fact I have already hunted out the ones with low fpops numbers and edited them pre-emptively. John. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Preemptive editing is the way to go. Many thanks for all your very patient help which has been useful to us in identifying which batches of models are affected. Was the model that I mentioned 2 posts above (created 11 Jan) normal? Cpdn news |
Joined: 3 Oct 04 Posts: 39 Credit: 13,172,838 RAC: 0 |
Sorry, I meant to mention that one. Yes, quite normal: fpops number began with 42. John |
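
The fix discussed in this thread (find <rsc_fpops_bound> in client_state.xml and double it) can also be sketched as a small script. This is only an illustration, not an official BOINC or CPDN tool. It assumes each model sits in a `<workunit>` block containing a `<name>` and an `<rsc_fpops_bound>` element; the workunit name and the value 21000000000000000.000000 below are hypothetical, chosen only because the thread says the number "began with 21". The function name `double_fpops_bound` is made up for this example.

```python
import re

# Hypothetical excerpt of a client_state.xml file (names and values invented).
SAMPLE_STATE = """<client_state>
<workunit>
    <name>example_160yr_model</name>
    <rsc_fpops_bound>21000000000000000.000000</rsc_fpops_bound>
</workunit>
<workunit>
    <name>some_other_model</name>
    <rsc_fpops_bound>42000000000000000.000000</rsc_fpops_bound>
</workunit>
</client_state>
"""

WORKUNIT_RE = re.compile(r"<workunit>.*?</workunit>", re.DOTALL)
BOUND_RE = re.compile(r"(<rsc_fpops_bound>)([^<]+)(</rsc_fpops_bound>)")


def double_fpops_bound(state_xml: str, workunit_name: str) -> str:
    """Return state_xml with <rsc_fpops_bound> doubled for one workunit only."""

    def double_value(match: re.Match) -> str:
        # Keep the fixed-point layout used in the sample (six decimals).
        doubled = float(match.group(2)) * 2
        return f"{match.group(1)}{doubled:f}{match.group(3)}"

    def fix_block(match: re.Match) -> str:
        block = match.group(0)
        # Leave every workunit except the named one untouched.
        if f"<name>{workunit_name}</name>" not in block:
            return block
        return BOUND_RE.sub(double_value, block, count=1)

    return WORKUNIT_RE.sub(fix_block, state_xml)


if __name__ == "__main__":
    print(double_fpops_bound(SAMPLE_STATE, "example_160yr_model"))
```

To use something like this on a real file, you would read client_state.xml into a string, pass it through the function, and write it back. As with the manual edit, it seems safest to shut BOINC down completely and keep a backup copy first, since the running client may overwrite the file.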
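
For a rough idea of how much headroom doubling gives, the figures quoted in the thread can be run through some simple arithmetic. The limit of 7618534.482759 seconds and the 87% completion point come from John's second crash report. Everything else is an assumption: that the CPU time limit scales in direct proportion to <rsc_fpops_bound>, and that the model needs roughly the same CPU time per percent for the rest of the run. The function names are invented for this sketch.

```python
SECONDS_PER_HOUR = 3600.0


def limit_in_hours(limit_seconds: float) -> float:
    """Convert a CPU time limit from seconds to hours."""
    return limit_seconds / SECONDS_PER_HOUR


def projected_total_hours(cpu_seconds_so_far: float, fraction_done: float) -> float:
    """Estimate total CPU hours, assuming a constant rate of progress."""
    return cpu_seconds_so_far / fraction_done / SECONDS_PER_HOUR


def has_headroom(limit_seconds: float, fraction_done: float,
                 bound_multiplier: float = 2.0) -> bool:
    """True if the scaled limit exceeds the projected total CPU time.

    Assumes the limit grows in proportion to rsc_fpops_bound and that the
    model hit the old limit at fraction_done.
    """
    new_limit = limit_seconds * bound_multiplier
    projected_total = limit_seconds / fraction_done
    return new_limit > projected_total


if __name__ == "__main__":
    old_limit = 7618534.482759  # seconds, from the log line in the thread
    done = 0.87                 # crashed at about 87% complete

    print(f"old limit:       {limit_in_hours(old_limit):.1f} h")
    print(f"projected total: {projected_total_hours(old_limit, done):.1f} h")
    print(f"doubled limit:   {limit_in_hours(old_limit * 2):.1f} h")
    print(f"enough headroom: {has_headroom(old_limit, done)}")
```

Under those assumptions the old limit is about 2116 hours, the full run would need about 2432 hours, and the doubled limit is about 4233 hours, which leaves a wide margin for extra years crunched after restores.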