climateprediction.net (CPDN) home page
Thread 'Modelcrash?'

transient

Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 32493 - Posted: 7 Feb 2008, 6:19:30 UTC

Can someone explain to me what happened with this one?

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7122032

I fully expected it to run for 160 years. This was one of those models that showed a time estimate as if it were an 80-year model, so I edited client_state.xml. Now the manager is downloading models like there is no tomorrow. It has downloaded 3 models in the last 2 days, all 160-year models as far as I can see.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 32494 - Posted: 7 Feb 2008, 6:46:10 UTC


You have a quad core computer.
BOINC has now loaded the 4 processors with 1 model each. This is normal.
The TCM option in prefs can give either an 80-year or a 160-year model, at random.

I can't see anything wrong with the indicated model.
What do you think is wrong with it, and why did you edit the xml file?

transient

Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 32506 - Posted: 7 Feb 2008, 15:09:22 UTC - in response to Message 32494.  


The model in question looked to me like a 160-year model. It quit in 1980, after 60 years. Even if it had been an 80-year model, it quit early. I saw no signs of the model going wrong. The reason I did the edit is that the model initially showed a completion time as if it were an 80-year model. I edited the 2 locations as indicated in the thread about this problem. A wrongly edited client_state file is the only reason I can think of for the model to quit early. I waited a few days for the stderr output to appear on the webpage. When that didn't happen, I decided to post about it.
transient

Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 32507 - Posted: 7 Feb 2008, 15:12:42 UTC - in response to Message 32494.  


I do have a quad core, but it is not dedicated solely to CPDN and I don't crunch 24/7. I estimate 4 models will take me 2 years. I would have preferred that one model at a time was crunched. I hope I will not have memory issues when BOINC decides to run more than one or two models at the same time.
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,714,904
RAC: 8,478
Message 32508 - Posted: 7 Feb 2008, 15:42:27 UTC - in response to Message 32493.  

... Now the manager is downloading models like there is no tomorrow ...

You need to set 'no new tasks' for CPDN while you're working out what's going on.

If you had made a mistake editing client_state.xml, I would have expected it to show up pretty much instantly when you restarted BOINC.
transient

Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 32510 - Posted: 7 Feb 2008, 17:27:11 UTC
Last modified: 7 Feb 2008, 17:32:10 UTC

The model in question was downloaded on the 27th of December. I edited the client_state file sometime at the beginning of January.
I don't think BOINC will download more than the current 3 models. The LTD for CPDN was a relatively large negative number; now it is positive.
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32513 - Posted: 7 Feb 2008, 20:22:39 UTC
Last modified: 7 Feb 2008, 20:45:55 UTC

Hi Transient

This is an unfortunate situation. When you first posted and referred to the crashed model's 'time estimate', several members probably thought you were referring to its deadline and didn't immediately understand your problem.

I think you may have edited the figure in the client_state.xml file wrongly and this crashed the model. Only one number needed to be edited.

Luckily the data you sent for those first 60 model years will be used by the researchers. Even the 60 years means a lot of crunching, so thanks for your contribution.

To clarify a bit what the HADCM model names mean: your crashed model was called

hadcm3istd_01h7_1920_160_05922419_0

'1920' means the model starting year. All HADCM models that begin in 1920 last 160 years. '160' means the number of model years. So the 80-year models are called '......2000_80_......'.
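As a rough illustration of the naming scheme just described, here is a minimal Python sketch that pulls the start year and run length out of a task name. It assumes the fields are always separated by underscores and that the start year and number of model years are the third and fourth fields, as in the one name quoted above; the function and field names are made up for this example, and the meaning of the remaining fields is not covered here.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    model: str        # first field, e.g. "hadcm3istd"
    start_year: int   # model starting year
    run_years: int    # number of model years to simulate
    end_year: int     # derived: start_year + run_years


def parse_model_name(name: str) -> ModelName:
    """Split a HADCM task name on underscores and read the two year fields.

    Assumption: field positions follow the single example in this thread.
    """
    parts = name.split("_")
    if len(parts) < 4:
        raise ValueError(f"unexpected task name: {name!r}")
    start_year = int(parts[2])
    run_years = int(parts[3])
    return ModelName(parts[0], start_year, run_years, start_year + run_years)


info = parse_model_name("hadcm3istd_01h7_1920_160_05922419_0")
print(info.start_year, info.run_years, info.end_year)  # 1920 160 2080
```

By this reading, a model that stopped in 1980 had completed 60 of its 160 model years, which matches what was reported earlier in the thread.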

Yes, some HADCM 160-year models issued in December contained a mistaken figure in their client_state.xml file. They mistakenly got the same figure as 80-year models. However, we didn't recommend that everybody should edit the file themselves, because even the mistaken lower figure will normally allow these models to complete. They just have a considerably reduced margin to allow for problems with completion. These problems would probably only occur if

* the model crashed and had to be restored from a pretty old backup, meaning lots of extra years of computing (CPU hours)

* the model was transferred to a faster computer

* the model looped and recovered several times, requiring a lot of extra crunching time (CPU hours)

If in one of these rare cases the mistaken figure does cause a model crash, the model can still (as usual) be restored from a backup made before the crash. The xml file can be edited then before crunching any further.

Transient, if you decide that you've downloaded more long models than you want your computer to have, you could abort one or two of the new ones.

If you have a backup from before you made the (probably wrong) edit, you could restore it and crunch on. My correction: no, that was not a good idea, as you'd have to repeat about 3 weeks of crunching.


mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32514 - Posted: 7 Feb 2008, 20:34:03 UTC
Last modified: 7 Feb 2008, 20:43:38 UTC

The model that crashed is the 4th one here:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=486191

It shows zero CPU time, which suggests to me that your edit caused the crash, though not immediately. The date and time of the crash don't show.

Transient, your computer has enough RAM to crunch 4 of these HADCM models simultaneously. Each HADCM model needs 512 MB of RAM. But if you get Vista you'll need more.
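The arithmetic behind that statement, as a small Python sketch. It uses only the 512 MB per model figure given above; the function name is made up for this example, and the result does not include whatever the operating system and other programs need on top.

```python
MB_PER_MODEL = 512  # per-model RAM figure quoted in this post


def ram_needed_mb(models: int, mb_per_model: int = MB_PER_MODEL) -> int:
    """RAM needed by the models alone, excluding the OS and other programs."""
    return models * mb_per_model


print(ram_needed_mb(4))  # 2048, i.e. 2 GB for four models on a quad core
```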
transient

Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 32534 - Posted: 8 Feb 2008, 17:33:37 UTC

I don't have a backup for the crashed model. I don't consider myself knowledgeable enough (or I might just be too lazy :D ) to back up models and restore them without affecting the other projects. I read the tutorial on that subject and decided to leave it alone.

I'll have to treat it as a learning experience. Do NOT mess with client_state.xml if you do not know exactly what you're doing.
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32536 - Posted: 8 Feb 2008, 18:07:21 UTC

I wish you better luck with the other models.

The day you want to learn how to back up the complete contents of the BOINC folder, go to the README about backups and try the first method, explained step by step by Les. It really is easy.
transient

Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 32543 - Posted: 9 Feb 2008, 9:37:37 UTC

The backup is easy; it is restoring models that can get tricky.
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32547 - Posted: 9 Feb 2008, 13:24:26 UTC
Last modified: 9 Feb 2008, 13:30:15 UTC

When you restore the contents of the BOINC folder there aren't many extra steps compared with backing up.

* In BOINC manager suspend all work in progress. Close the BOINC manager window with the X button.

* Exit from BOINC manager by right-clicking on the BOINC icon & selecting Exit.

* Go to your BOINC folder, probably C:\Program Files\BOINC.

* Double-click on the BOINC folder to open up its contents.

* Now the apparently scary bit. You have to empty the BOINC folder to make room for what you're going to restore. Edit > Select all > Edit > Delete. Everything disappears. What you've deleted will now be in the Recycle bin, and in a worst-case scenario (e.g. the restore of your backup didn't work) you could send the BOINC files in the Recycle bin back to where they came from, and they'd work again. So you never empty the Recycle bin until the restore is up and running.

* Go back one page so you see the BOINC folder in the list again.

* Keep that window open, make it half-size.

* In a new window, go to your backup, wherever you saved it.

* Double-click on the backup to open its contents.

* Edit > Select all > Edit > Copy

* Make this second window half-size.

* Take your mouse cursor over to the first window, right-click on the BOINC folder and in the menu that opens up, select Paste.

* When all the files have finished transferring, close the 2 windows.

* Start > Programs > click on the BOINC shortcut to start BOINC up again. You'll need to open the BOINC manager if it doesn't open up automatically, then resume tasks.
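For anyone who would rather script it, the same restore can be sketched in Python. This is only an illustration of the steps above, not an official CPDN or BOINC tool: the function name and both paths are hypothetical, BOINC must already be fully shut down before running it, and instead of deleting to the Recycle bin it moves the current folder aside so it can be put back if the restore fails.

```python
import shutil
from pathlib import Path


def restore_boinc_folder(boinc_dir: Path, backup_dir: Path) -> Path:
    """Replace boinc_dir with a copy of backup_dir.

    Returns the path the previous folder was moved to, so it can be
    renamed back if the restored copy does not work.
    """
    if not backup_dir.is_dir():
        raise FileNotFoundError(f"backup not found: {backup_dir}")

    # Safety net: keep the current contents instead of deleting them.
    set_aside = boinc_dir.with_name(boinc_dir.name + ".before_restore")
    if set_aside.exists():
        raise FileExistsError(f"remove or rename {set_aside} first")

    if boinc_dir.exists():
        shutil.move(str(boinc_dir), str(set_aside))

    # Copy the whole backup into place under the original folder name.
    shutil.copytree(backup_dir, boinc_dir)
    return set_aside


# Example call (hypothetical paths, adjust to your own machine):
# restore_boinc_folder(Path(r"C:\Program Files\BOINC"), Path(r"D:\Backups\BOINC"))
```

Only delete the set-aside folder once BOINC has restarted and the restored models are crunching normally.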


old_user428438

Joined: 1 Feb 07
Posts: 26
Credit: 885,216
RAC: 0
Message 32552 - Posted: 9 Feb 2008, 23:48:56 UTC - in response to Message 32547.  
Last modified: 9 Feb 2008, 23:54:44 UTC


This assumes you are running only CPDN. It's when you are running multiple projects that it gets "tricky".

F.
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32553 - Posted: 10 Feb 2008, 0:38:08 UTC
Last modified: 10 Feb 2008, 0:58:00 UTC

Yes, for multi-project crunchers it's a problem. In the README about backups there's a method for selectively restoring a single climate model from an earlier backup without needing to delete the current contents of the BOINC folder. It's meticulously explained and it worked for PeteB, who wrote the post. But it does look rather complicated.

If your climate model crashes when you have dozens or hundreds of short tasks from other projects waiting to be crunched, you have to choose whether to continue crunching and abandon the climate model, or restore a backup and abandon the tasks from other projects.

An alternative method after a climate model crash is to set all the projects to No new tasks, crunch the short tasks until they're all completed and reported, and then restore the backup. After the restore, the BOINC manager Tasks window would probably again be full of short tasks from other projects. You could abort all these tasks except the climate model, because you would already have completed them. This would be my method of choice because I would keep a 'clean' record on all the projects and it's technically simple. You can choose your own best moment to restore the backup, up to about a month after the climate model crash. (After 6 weeks of no trickles the CPDN server might send the model to another computer.)
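The first step of that method (No new tasks on every project) can also be done from the command line instead of the BOINC manager. The sketch below is an assumption-laden illustration: it presumes a BOINC version that ships the boinccmd tool, that boinccmd is on the PATH, and that you substitute your own attached project URLs; the function names and the example URL are made up here.

```python
import subprocess
from typing import Iterable, List


def no_new_tasks_command(project_url: str, allow: bool = False) -> List[str]:
    """Build the boinccmd call that stops (or re-allows) new work for one project."""
    operation = "allowmorework" if allow else "nomorework"
    return ["boinccmd", "--project", project_url, operation]


def set_no_new_tasks(project_urls: Iterable[str]) -> None:
    """Set No new tasks for each project; raises if boinccmd reports an error."""
    for url in project_urls:
        subprocess.run(no_new_tasks_command(url), check=True)


# Example (hypothetical URL; use the URLs of the projects you are attached to):
# set_no_new_tasks(["http://example.org/some_project/"])
print(no_new_tasks_command("http://example.org/some_project/"))
```

Once the short tasks have all been completed and reported, the backup can be restored, and the same helper with allow=True lets the projects fetch work again.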

For multi-project crunchers with more than one computer the easiest solution is to reserve a computer just for CPDN models and only make BOINC folder backups on this computer.

Some of the longer QMC tasks are worth backing up part way through.




©2024 cpdn.org