Thread 'Model ran fine up to 53% then ran into trouble'

old_user204722

Joined: 22 Oct 06
Posts: 2
Credit: 1,107,647
RAC: 0
Message 32996 - Posted: 16 Mar 2008, 14:11:57 UTC

Hi all,

I've run several models successfully in the past. However, hadcm3iozn_cpzl_2000_80_135899450_3 started off like all the others and ran to 53%.

It would continue to run, but it would not progress beyond a certain point in model time.

My stats for CPDN went from a 50 degree slope to flatline and stayed there for about 2 months. I had to abort this work unit.

Questions include:

- Who else has had this kind of problem?
- How should I have handled this (e.g. notify the BOINC team and send the work-in-progress files) so the root cause could be addressed?
- Would it have been better to figure out how to restart the unit from the beginning?
- Do I just abort and let it get another unit?
Iain Inglis

Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 32998 - Posted: 16 Mar 2008, 16:02:10 UTC

Paul,

Could you post the result id or unhide your PC? We'll then be able to have a look at the model in more detail.

Iain
old_user204722

Joined: 22 Oct 06
Posts: 2
Credit: 1,107,647
RAC: 0
Message 33006 - Posted: 17 Mar 2008, 21:24:33 UTC - in response to Message 32998.  

> Paul,
>
> Could you post the result id or unhide your PC? We'll then be able to have a look at the model in more detail.
>
> Iain



Here's the work unit (unit ID is 6106824)
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6106824

And here's the host ID (493017)
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=493017


Is there an easy way to restart a work unit from the start (for example, by deleting some key status file and restarting the app)? I had a power failure at the wrong time last Dec and it might have gotten into a state it couldn't recover from.

If I recall correctly, it doesn't (or rather didn't) compute past 10:30am somewhere in the year 2043.


Good luck.

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 33008 - Posted: 17 Mar 2008, 21:59:53 UTC


> Is there an easy way to restart a work unit from the start


Yes. From a backup of the BOINC folder that was made BEFORE the problem.


Backups: Here
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33009 - Posted: 17 Mar 2008, 23:11:01 UTC
Last modified: 17 Mar 2008, 23:12:49 UTC

That task last trickled on 27 Dec and you aborted it yesterday, Paul.

The best way to avoid this sort of thing in future is to choose a method for regularly backing up the contents of your BOINC folder. Then, if a model goes wrong, you can restore a backup and crunch from that point. Choose a method from the README about backups (link in my signature). I use Les's easy manual method, which only takes a few minutes.

Nobody wants to have to go back to the start of a problem model, but with regular backups you can restore and crunch again from the last backup point, i.e. not much computer time is wasted.
Cpdn news
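
For anyone who would rather script the backup than copy the folder by hand, here is a minimal sketch in Python. It is not the method from the README. The function names and the backup folder naming are made up for illustration, the location of the BOINC folder is something you have to supply yourself, and it assumes the BOINC client has been stopped first so that files are not changing in the middle of the copy.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_dir(data_dir: Path, backup_root: Path) -> Path:
    """Copy data_dir into a new timestamped folder under backup_root.

    Assumption: the BOINC client is not running while this copies.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"no such directory: {data_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"  # hypothetical naming scheme
    # copytree creates the target (and any missing parents) and raises an
    # error if the target already exists, so an older backup is never overwritten.
    shutil.copytree(data_dir, target)
    return target


def restore_boinc_dir(backup: Path, data_dir: Path) -> None:
    """Replace data_dir with the contents of an earlier backup."""
    if not backup.is_dir():
        raise FileNotFoundError(f"no such backup: {backup}")
    # Remove the damaged folder only after confirming the backup exists.
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup, data_dir)
```

Call backup_boinc_dir with the path of your own BOINC folder and a backup location on a regular schedule, and call restore_boinc_dir with the folder it returned if a model later goes wrong. Anything crunched after the backup was taken is lost on restore, so the more often you back up, the less computer time is wasted.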
Iain Inglis

Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 33010 - Posted: 18 Mar 2008, 0:51:51 UTC - in response to Message 33006.  
Last modified: 18 Mar 2008, 0:52:37 UTC

> ... I had a power failure at the wrong time last Dec and it might have gotten into a state it couldn't recover from.
>
> If I recall correctly, it doesn't (or rather didn't) compute past 10:30am somewhere in the year 2043. ...

Actually, Paul, it may be a duff model. The next model to yours in the work unit has stuck at the same point (6106824). It may be a coincidence, of course.

So, it may be that you couldn't have done anything differently. These later 5.44 models aren't supposed to loop, so people don't look for it any more ...
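
Since nobody routinely looks for stuck models, a simple check can help spot one sooner than two months in. The sketch below is only an illustration, not a CPDN or BOINC tool. It assumes you keep your own list of (time, fraction done) readings for a task, for example noted down from the progress figure your BOINC client displays; the function name and the thresholds are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple

# (time the reading was taken, fraction of the task completed, 0.0 to 1.0)
Sample = Tuple[datetime, float]


def is_stalled(samples: Sequence[Sample], window: timedelta,
               min_gain: float = 0.0001) -> bool:
    """Return True if progress rose by less than min_gain over at least `window`."""
    if len(samples) < 2:
        return False
    ordered = sorted(samples)
    latest_time, latest_progress = ordered[-1]
    # Walk back to the most recent reading that is at least `window` old
    # and compare its progress with the latest reading.
    for when, progress in reversed(ordered[:-1]):
        if latest_time - when >= window:
            return latest_progress - progress < min_gain
    return False  # not enough history yet to judge
```

With a window of about a week, a task sitting at 53% would be flagged long before the stats graph makes the flatline obvious. A flagged task is only a hint to go and look: a suspended or waiting task would also show no progress.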