climateprediction.net (CPDN) home page
Thread "'Maximum CPU time exceeded' crash fixed"

old_user22652

Joined: 3 Oct 04
Posts: 39
Credit: 13,172,838
RAC: 0
Message 33479 - Posted: 20 Apr 2008, 16:52:15 UTC

Checking my computers the other day, I found one model had crashed after 2276 hours of processing, approx. 87% complete.
I restored a backup from the previous day, but it crashed again at the same point, with the message “exceeded CPU time limit.” My thoughts were those of Douglas Adams’ bowl of petunias: “Oh no, not again!”
I managed to find a post from mo.v from 7/5/07 and followed the advice: find <rsc_fpops_bound> in client_state.xml and double the number found there. It was, as she said, “surprisingly easy”.
Well, the model is now processing well past 2276 hrs, and looking to complete in around two weeks. Many thanks, mo.v; petunias nothing.
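
For anyone who would rather script the change than hand-edit, here is a minimal Python sketch of the same edit. It assumes client_state.xml keeps each model in a <workunit> block with <name> and <rsc_fpops_bound> children and writes the value as a plain decimal. The workunit name, the sample text and the function name double_fpops_bound are made up for illustration. Stop the BOINC client and keep a backup copy before touching the real file.

```python
import re

def double_fpops_bound(state_text: str, workunit_name: str) -> str:
    """Return state_text with <rsc_fpops_bound> doubled for one named workunit."""
    # Assumed layout: one <workunit>...</workunit> block per model.
    block_re = re.compile(r"<workunit>.*?</workunit>", re.DOTALL)
    bound_re = re.compile(r"(<rsc_fpops_bound>)([^<]+)(</rsc_fpops_bound>)")

    def fix_block(match):
        block = match.group(0)
        if f"<name>{workunit_name}</name>" not in block:
            return block  # leave every other workunit untouched

        def double(m):
            # Keep the six-decimal style used in the sample below.
            return f"{m.group(1)}{float(m.group(2)) * 2:.6f}{m.group(3)}"

        return bound_re.sub(double, block, count=1)

    return block_re.sub(fix_block, state_text)

# Hypothetical fragment, not copied from a real client_state.xml.
sample = """<client_state>
<workunit>
    <name>example_wu</name>
    <rsc_fpops_bound>2100000000.000000</rsc_fpops_bound>
</workunit>
</client_state>"""

print(double_fpops_bound(sample, "example_wu"))
```

Matching on the workunit name first means other models in the same file keep their original bound.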

John GW3PRV
ID: 33479
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 33481 - Posted: 20 Apr 2008, 17:51:20 UTC


Twin congratulations on having a backup, and also getting it working again :-)

I take it this was a model which had migrated from a slower computer to a faster one?


I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 33481
old_user22652
Joined: 3 Oct 04
Posts: 39
Credit: 13,172,838
RAC: 0
Message 33482 - Posted: 20 Apr 2008, 19:05:41 UTC - in response to Message 33481.  

No, just a boring old Athlon 4000. I've become very enthusiastic about backups recently, having had several crashes in the last week or so. Perhaps "paranoid" would be a better word.
ID: 33482
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 33483 - Posted: 20 Apr 2008, 21:40:18 UTC


Curious... were the benchmarks on that PC ever unusually high?

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 33483
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 33484 - Posted: 20 Apr 2008, 21:52:14 UTC

Could this be another side-effect of the 8-zip 160-year models?

This looks to be the one: 7123896, and it's the right date ...
ID: 33484
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33485 - Posted: 20 Apr 2008, 23:15:41 UTC

Hi GW

I'm glad that the method helped save your model too. When this happened to a model of mine some time after moving it to a much faster computer, I was lucky that Thyme Lawn immediately diagnosed the cause and that I'd also been making backups.

There was a bunch of 160-year HADCM workunits issued about 4 months ago where the fpops limit was preset lower than usual. Most of these models will be perfectly OK and will complete within the lower limit set, but if any of them for example loop a few extra years, or crash for other reasons and have to crunch extra years after being restored, or the computer is frequently turned off sending the model back to the last checkpoint, they may get precariously close to the preset maximum processing time allowed.
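
A rough illustration of why both the preset number and the speed of the computer matter. The sketch assumes (this thread does not state it) that the client derives the CPU time limit by dividing the workunit's <rsc_fpops_bound> by the host's floating-point benchmark. Every number and name below is hypothetical and chosen only to show the proportions.

```python
def cpu_time_limit_seconds(rsc_fpops_bound: float, benchmark_flops: float) -> float:
    """Assumed rule: limit = fpops bound / benchmarked floating-point speed."""
    return rsc_fpops_bound / benchmark_flops

# Hypothetical values, not taken from any real workunit or host.
bound = 2.0e18      # floating-point operations allowed for the workunit
benchmark = 2.0e9   # host benchmark in operations per second

limit = cpu_time_limit_seconds(bound, benchmark)
doubled = cpu_time_limit_seconds(2 * bound, benchmark)       # after editing the bound
faster_host = cpu_time_limit_seconds(bound, 2 * benchmark)   # same model, faster PC

print(f"original limit:   {limit / 3600:.0f} h")
print(f"bound doubled:    {doubled / 3600:.0f} h")
print(f"benchmark doubled: {faster_host / 3600:.0f} h")
```

Under that assumption, doubling the bound doubles the allowance, while a higher benchmark shrinks it, which would explain why a model moved to a faster computer can hit the limit early.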

GW, would you mind changing the title of your thread to something like "'Maximum CPU time exceeded' crash fixed" so the topic becomes easier to find? (I don't mean to take away your success!)

Cpdn news
ID: 33485
old_user22652
Joined: 3 Oct 04
Posts: 39
Credit: 13,172,838
RAC: 0
Message 33490 - Posted: 21 Apr 2008, 6:24:31 UTC - in response to Message 33485.  

Iain has identified the right model. It has always run on this computer; I haven’t noticed any abnormal benchmarks.

Mo.v: the model is stopped every day to allow a backup to be made; I stop it without reference to the checkpoint counter, so I guess it may have had extra processing to do. The number in <rsc_fpops_bound> began with 21 which I doubled to 42. (Douglas Adams again?)

Would this be one of the models you referred to?

John.
ID: 33490
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 33492 - Posted: 21 Apr 2008, 8:03:32 UTC - in response to Message 33484.  
Last modified: 21 Apr 2008, 12:10:55 UTC

Could this be another side-effect of the 8-zip 160-year models?
...


That's an interesting point, we could be seeing a lot more of these over the next few months ... it may be worth a few extra stickies?

Iain has identified the right model. It has always run on this computer; I haven’t noticed any abnormal benchmarks.
...


I think Iain has identified the cause - there were a bunch of models produced which thought they were 80-year models when they were really 160-year models, so they had the lower limits on CPU usage and so forth. Fortunately, being coupled models, most of the important data is uploaded in the trickles as the model runs, so reaching the end, while ideal, is not critical.

-- Edit:

Fixed attribution :-)
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 33492
old_user428438
Joined: 1 Feb 07
Posts: 26
Credit: 885,216
RAC: 0
Message 33498 - Posted: 21 Apr 2008, 10:14:11 UTC - in response to Message 33492.  
Last modified: 21 Apr 2008, 10:14:55 UTC

Could this be another side-effect of the 8-zip 160-year models?
...


That's an interesting point, we could be seeing a lot more of these over the next few months ... it may be worth a few extra stickies?

Iain has identified the right model. It has always run on this computer; I haven’t noticed any abnormal benchmarks.
...


I think Les has identified the cause - there were a bunch of models produced which thought they were 80-year models when they were really 160 year models. So they had the lower limits on cpu usage and so forth. Fortunately being coupled models most of the important data is uploaded in the trickles as the model runs, so reaching the end, while ideal, is not critical.

I have one of these models here which I suspended several weeks ago. I had edited the client_state file as documented here and it happily crunched to the 80-year mark and returned 8 zip files. Shortly thereafter it reset to zero CPU time. Since there was a wingman who seemed to be progressing satisfactorily, I suspended that WU and allowed a shorter model to crunch through and then started on another 160-year model.

It now appears that my co-cruncher on this model may have hit a problem since his last trickle was on 20 March (see here).

Once I have completed my current model (in just over a month), I could go back to this one but I guess that, since the time was reset, that would give 35 days of duplicate trickles before it reaches the point at which it reset and then I find out if it resets again!!

My current target is to have at least as many successful runs as failed runs on my account, so I am looking for any thoughts on anything further that I can check to increase the likelihood that my next attempt with this model will succeed.

F.
ID: 33498
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 33500 - Posted: 21 Apr 2008, 11:45:51 UTC - in response to Message 33498.  
Last modified: 21 Apr 2008, 11:47:24 UTC

... My current target is to have at least as many successful runs as failed runs on my account, so I am looking for any thoughts on anything further that I can check to increase the likelihood that my next attempt with this model will succeed.

F.

The situation is somewhat complicated by another bug - the contraction of some model error reports. So, the model with most progress in that work unit has crashed with a 22 error, which could be PC-induced but could also be 'negative pressure' etc. - the necessary text is missing. The machine running that model is a Core 2 Duo and was running another long model in tandem, which finished on 9 April; that user also uploads trickles in batches, but has downloaded another long model. I doubt the crashed model will be finished.

From your point of view, the model is going to be a failure unless you carry on with it. The model can clearly get further, as someone on the same processor/operating-system combination did. However, whether a 'negative pressure' or such awaits just short of completion is unclear.

There have been occasions where the CPU time has spuriously reset to zero, but the model has carried on as normal (though usually on Linux); the only consequence of that is a spurious seconds/timestep calculation. Someone else may have an idea what would cause a real rewind to the start (in a coupled model).
ID: 33500
old_user22652
Joined: 3 Oct 04
Posts: 39
Credit: 13,172,838
RAC: 0
Message 33504 - Posted: 21 Apr 2008, 14:12:03 UTC - in response to Message 33492.  


From MikeMarsUK: “That's an interesting point, we could be seeing a lot more of these over the next few months ...”

Here’s another, which again crashed at 87%:

21/04/2008 14:46:54|climateprediction.net|Aborting task hadcm3istd_9am9_1920_160_15921938_5: exceeded CPU time limit 7618534.482759

The <rsc_fpops_bound> number again began with 21, which I’ve doubled; we’ll see how it goes.
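
To put that limit into more familiar units, here is a small sketch that pulls the task name and the limit out of the log line above and converts the seconds to hours and days. The regular expression is fitted only to this one message format, and the function name parse_abort is hypothetical.

```python
import re

LOG_LINE = (
    "21/04/2008 14:46:54|climateprediction.net|Aborting task "
    "hadcm3istd_9am9_1920_160_15921938_5: exceeded CPU time limit 7618534.482759"
)

def parse_abort(line: str):
    """Return (task_name, limit_in_seconds), or None if the line does not match."""
    m = re.search(r"Aborting task (\S+): exceeded CPU time limit ([\d.]+)", line)
    if m is None:
        return None
    return m.group(1), float(m.group(2))

task, limit_s = parse_abort(LOG_LINE)
hours = limit_s / 3600   # seconds to hours
days = hours / 24        # hours to days of continuous crunching

print(f"{task}: limit of about {hours:.0f} h ({days:.0f} days)")
```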

John
ID: 33504
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33508 - Posted: 21 Apr 2008, 17:57:55 UTC

Thanks for the news of that extra model, which I found.

On your list of computers (what a lot there are!)

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/hosts_user.php?userid=22652
I think that on computers #7, 8 and 11 there are also 160-year models with the incorrect fpops number. These are all models that were created on 20 Dec or 2 Jan, though you probably downloaded them on later dates.

You could wait to see whether they crash or not, or you could edit their fpops number now if you prefer. Then you will be almost certain that they won't crash.
Cpdn news
ID: 33508
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33509 - Posted: 21 Apr 2008, 18:06:21 UTC

Hi again GW

On your computer #10 you have Task/result #7181198 which was created on 11 January:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7181198
I think the fpops value for this model will be normal because it was created later and is part of a different batch.
Cpdn news
ID: 33509
old_user22652
Joined: 3 Oct 04
Posts: 39
Credit: 13,172,838
RAC: 0
Message 33510 - Posted: 21 Apr 2008, 18:43:46 UTC - in response to Message 33508.  

Many thanks, Mo. In fact I have already hunted out the ones with low fpops numbers and edited them pre-emptively.

John.
ID: 33510
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33511 - Posted: 21 Apr 2008, 19:21:49 UTC

Preemptive editing is the way to go.

Many thanks for all your very patient help which has been useful to us in identifying which batches of models are affected. Was the model that I mentioned 2 posts above (created 11 Jan) normal?


Cpdn news
ID: 33511
old_user22652
Joined: 3 Oct 04
Posts: 39
Credit: 13,172,838
RAC: 0
Message 33513 - Posted: 21 Apr 2008, 20:22:48 UTC - in response to Message 33511.  

Sorry, I meant to mention that one. Yes, quite normal: fpops number began with 42.

John
ID: 33513
