Message boards : Number crunching : Modelcrash?
Author | Message |
---|---|
Joined: 3 Oct 06 Posts: 43 Credit: 8,017,057 RAC: 0 |
Can someone explain to me what happened with this one? http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7122032 I fully expected it to run for 160 years. This was one of those models which showed a time estimate of 80 years. I edited the client_state.xml. Now the manager is downloading models like there is no tomorrow. It downloaded 3 models in the last 2 days, all 160-year models as far as I can see. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
You have a quad core computer. BOINC has now loaded the 4 processors with 1 model each. This is normal. The TCM option in prefs can give either an 80-year or a 160-year model, at random. I can't see anything wrong with the indicated model. What do you think is wrong with it, and why did you edit the xml file? |
Joined: 3 Oct 06 Posts: 43 Credit: 8,017,057 RAC: 0 |
The model in question looked to me like a 160-year model. It quit in 1980 after 60 years. Even if it had been an 80-year model, it quit early. I saw no signs of the model going wrong. The reason I did the edit is that the model initially showed a completion time as if it were an 80-year model. I edited the 2 locations as indicated in the thread about this problem. A wrongly edited client_state file is the only reason I can think of for the model to quit early. I waited a few days for the stderr output to appear on the webpage. When that didn't happen I decided to post about it. |
Joined: 3 Oct 06 Posts: 43 Credit: 8,017,057 RAC: 0 |
I do have a quad core but it is not dedicated solely to CPDN and I don't crunch 24/7. I estimate 4 models will take me 2 years. I would have preferred that one model at a time was crunched. I hope I will not have memory issues when BOINC decides to run more than one or two models at the same time. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,728,179 RAC: 7,202 |
"... Now the manager is downloading models like there is no tomorrow ..." You need to set 'no new tasks' for CPDN while you're working out what's going on. If you had made a mistake editing client_state.xml, I would have expected it to show up pretty much instantly when you restarted BOINC. |
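For anyone who prefers the command line to the BOINC manager, the 'no new tasks' setting mentioned above can probably also be applied with the `boinccmd` tool that ships with BOINC. The following is only a sketch: it assumes `boinccmd` is on your PATH and that your client version accepts the `--project <URL> nomorework` operation, and the project URL shown is a placeholder, not the real CPDN address. The function only builds the command; it does not run it.

```python
import shlex


def no_new_tasks_command(project_url, boinccmd="boinccmd"):
    """Build the boinccmd argument list that asks the local client
    to stop fetching new work for one project (assumed syntax)."""
    return [boinccmd, "--project", project_url, "nomorework"]


# Placeholder URL: use the project URL exactly as your BOINC manager shows it.
cmd = no_new_tasks_command("http://example.org/your-project/")
print(shlex.join(cmd))
```

To actually apply it you would pass the list to `subprocess.run(cmd, check=True)` while the BOINC client is running. The matching operation to let work flow again is, as far as I know, `allowmorework`.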
Joined: 3 Oct 06 Posts: 43 Credit: 8,017,057 RAC: 0 |
The model in question was downloaded on the 27th of December. I edited the client_state file sometime at the beginning of January. I don't think BOINC will download more than the current 3 models. The LTD (long-term debt) for CPDN was a relatively large negative number; now it is positive. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Transient
This is an unfortunate situation. When you first posted and referred to the crashed model's 'time estimate', several members probably thought you were referring to its deadline and didn't immediately understand your problem. I think you may have edited the figure in the client_state.xml file wrongly and this crashed the model. Only one number needed to be edited. Luckily the data you sent for those first 60 model years will be used by the researchers. Even the 60 years means a lot of crunching, so thanks for your contribution.
To clarify a bit what the HADCM model names mean: your crashed model was called hadcm3istd_01h7_1920_160_05922419_0
* '1920' means the model starting year. All HADCM models that begin in 1920 last 160 years.
* '160' means the number of model years. So the 80-year models are called '......2000_80_......'.
Yes, some HADCM 160-year models issued in December contained a mistaken figure in their client_state.xml file. They mistakenly got the same figure as 80-year models. However, we didn't recommend that everybody should edit the file themselves because even the mistaken lower figure will normally allow these models to complete. They just have a considerably reduced margin to allow for problems with completion. These problems would probably only occur if
* the model crashed and had to be restored from a pretty old backup, meaning lots of extra years of computing (CPU hours)
* the model was transferred to a faster computer
* the model looped and recovered several times, requiring a lot of extra crunching time (CPU hours)
If in one of these rare cases the mistaken figure does cause a model crash, the model can still (as usual) be restored from a backup made before the crash. The xml file can then be edited before crunching any further.
Transient, if you decide that you've downloaded more long models than you want your computer to have, you could abort one or two of the new ones.
If you have a backup from before you made the (probably wrong) edit, you could restore it and crunch on. My correction - no, that was not a good idea - you'd have to repeat about 3 weeks of crunching. Cpdn news |
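The naming scheme described above can be checked with a few lines of Python. This is only a sketch: it assumes that the start year and the number of model years are always the third and fourth underscore-separated fields, which is inferred from the single name quoted in this thread, and the function name is made up for the example.

```python
from typing import NamedTuple


class ModelSpan(NamedTuple):
    start_year: int
    model_years: int


def parse_hadcm_name(name):
    """Pull the start year and run length out of a HADCM task name.

    Assumes the layout <app>_<id>_<start year>_<model years>_..., as in
    the example from the thread; other names may not follow it.
    """
    fields = name.split("_")
    return ModelSpan(start_year=int(fields[2]), model_years=int(fields[3]))


span = parse_hadcm_name("hadcm3istd_01h7_1920_160_05922419_0")
end_year = span.start_year + span.model_years   # 1920 + 160 = 2080
completed_years = 1980 - span.start_year        # the model quit in 1980

print(span)                                          # ModelSpan(start_year=1920, model_years=160)
print(end_year)                                      # 2080
print(f"{completed_years / span.model_years:.1%}")   # 37.5%
```

So the crashed model got through 60 of its 160 model years, a little over a third of the full run.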
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
The model that crashed is the 4th one here: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=486191 It shows zero CPU time, which suggests to me that your edit caused the crash, though not immediately. The date and time of the crash don't show. Transient, your computer has enough RAM to crunch 4 of these HADCM models simultaneously. Each HADCM model needs 512 MB of RAM. But if you get Vista you'll need more. Cpdn news |
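The memory arithmetic behind that statement is simple enough to sketch. The 512 MB per model comes from the post above; the `reserved_mb` parameter (memory kept back for the operating system and other programs) is an assumption added for illustration, and the thread gives no figure for it or for Vista's extra needs.

```python
MB_PER_HADCM_MODEL = 512  # per-model figure quoted in the thread


def ram_needed_mb(models):
    """RAM needed to run the given number of HADCM models at once."""
    return models * MB_PER_HADCM_MODEL


def max_models(ram_mb, reserved_mb=0):
    """How many HADCM models fit once reserved_mb is set aside.

    reserved_mb is a hypothetical allowance, not a number from the thread.
    """
    return max(0, (ram_mb - reserved_mb) // MB_PER_HADCM_MODEL)


print(ram_needed_mb(4))                    # 2048
print(max_models(4096, reserved_mb=1024))  # 6
```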
Joined: 3 Oct 06 Posts: 43 Credit: 8,017,057 RAC: 0 |
I don't have a backup for the crashed model. I don't consider myself to be knowledgeable enough (Or I might just be too lazy. :D ) to back up models and restore them without affecting the other projects. I read the tutorial on that subject and I decided to leave that alone. I'll have to treat it as a learning experience. Do NOT mess with client_state.xml if you do not know exactly what you're doing. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I wish you better luck with the other models. The day you want to learn how to back up the complete contents of the BOINC folder, go to the README about backups and try the first method, explained step by step by Les. It really is easy. Cpdn news |
Joined: 3 Oct 06 Posts: 43 Credit: 8,017,057 RAC: 0 |
The backup is easy; it is restoring models that can get tricky. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
When you restore the contents of the BOINC folder there aren't many extra steps compared with backing up.
* In BOINC manager suspend all work in progress. Close the BOINC manager window with the X.
* Exit from BOINC manager by right-clicking on the BOINC icon & selecting Exit.
* Go to your BOINC folder, probably C:\Program Files\BOINC.
* Double-click on the BOINC folder to open up its contents.
* Now the apparently scary bit. You have to empty the BOINC folder to make room for what you're going to restore. Edit > Select all > Edit > Delete. Everything disappears. What you've deleted will now be in the Recycle bin and in a worst-case scenario (eg the restore of your backup didn't work) you could send the BOINC files in the Recycle bin back to where they came from, and they'd work again. So never empty the Recycle bin until the restore is up and running.
* Go back one page so you see the BOINC folder in the list again.
* Keep that window open, make it half-size.
* In a new window, go to your backup, wherever you saved it.
* Double-click on the backup to open its contents.
* Edit > Select all > Edit > Copy
* Make this second window half-size.
* Take your mouse cursor over to the first window, right-click on the BOINC folder and in the menu that opens up, select Paste.
* When all the files have finished transferring, close the 2 windows.
* Start > Programs > click on the BOINC shortcut to start BOINC up again. You'll need to open the BOINC manager if it doesn't open up automatically, then resume tasks. Cpdn news |
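For those who would rather script the file-copying part of the steps above, here is a rough Python sketch. It is not an official CPDN or BOINC procedure, and the function and folder names are made up for the example. It assumes BOINC has already been fully exited (the script does not check this), and instead of deleting the current folder it renames it, so the renamed copy plays the role of the Recycle bin in the manual method. The demonstration at the bottom runs in a throwaway temporary directory, not on a real BOINC installation.

```python
import shutil
import tempfile
import time
from pathlib import Path


def restore_boinc_folder(boinc_dir, backup_dir):
    """Swap a backup copy into place; return the set-aside folder (or None)."""
    boinc_dir, backup_dir = Path(boinc_dir), Path(backup_dir)
    if not backup_dir.is_dir():
        raise FileNotFoundError(f"backup folder not found: {backup_dir}")

    set_aside = None
    if boinc_dir.exists():
        # Rename rather than delete, so the current files can be put back
        # if the restored backup turns out not to work.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        set_aside = boinc_dir.with_name(f"{boinc_dir.name}.before-restore-{stamp}")
        boinc_dir.rename(set_aside)

    # copytree creates boinc_dir itself, so it must not exist at this point.
    shutil.copytree(backup_dir, boinc_dir)
    return set_aside


# Dry run with stand-in folders in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "BOINC"
    backup = Path(tmp) / "BOINC-backup"
    live.mkdir()
    backup.mkdir()
    (live / "client_state.xml").write_text("current state")
    (backup / "client_state.xml").write_text("backed-up state")

    old = restore_boinc_folder(live, backup)
    print((live / "client_state.xml").read_text())  # backed-up state
    print((old / "client_state.xml").read_text())   # current state
```

As with the Recycle bin, only remove the set-aside folder once the restored models are up and running again.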
Joined: 1 Feb 07 Posts: 26 Credit: 885,216 RAC: 0 |
"When you restore the contents of the BOINC folder there aren't many extra steps compared with backing up." This assumes you are running only CPDN. It's when you are running multiple projects that it gets "tricky". F. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Yes, for multi-project crunchers it's a problem. In the README about backups there's a method for selectively restoring a single climate model from an earlier backup without needing to delete the current contents of the BOINC folder. It's meticulously explained and it worked for PeteB who wrote the post. But it does look rather complicated.
If your climate model crashes when you have dozens or hundreds of short tasks from other projects waiting to be crunched, you have to choose whether to continue crunching and abandon the climate model, or restore a backup and abandon the tasks from other projects.
An alternative method after a climate model crash is to set all the projects to No new tasks, crunch the short tasks until they're all completed and reported, and then restore the backup. After the restore the BOINC manager Tasks window would probably again be full of short tasks from other projects. You could abort all these tasks except the climate model because you would already have completed them. This would be my method of choice because I would keep a 'clean' record on all the projects and it's technically simple. You can choose your own best moment to restore the backup - up to about a month after the climate model crash. (After 6 weeks of no trickles the CPDN server might send the model to another computer.)
For multi-project crunchers with more than one computer the easiest solution is to reserve a computer just for CPDN models and only make BOINC folder backups on this computer. Some of the longer QMC tasks are worth backing up part way through. Cpdn news |
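To put rough dates on that restore window, a small calculation helps. This sketch treats "about a month" as 30 days and counts both limits from the date of the last trickle, assuming the crash happened at about the same time; both are simplifications, and the example date is invented.

```python
from datetime import date, timedelta


def restore_window(last_trickle, safe_days=30, reissue_weeks=6):
    """Return (comfortable restore-by date, date the server might reissue).

    safe_days=30 stands in for "about a month"; reissue_weeks=6 is the
    no-trickle period mentioned in the thread.
    """
    restore_by = last_trickle + timedelta(days=safe_days)
    possible_reissue = last_trickle + timedelta(weeks=reissue_weeks)
    return restore_by, possible_reissue


# Invented example date, not taken from the thread.
restore_by, possible_reissue = restore_window(date(2010, 1, 10))
print(restore_by)        # 2010-02-09
print(possible_reissue)  # 2010-02-21
```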
©2024 cpdn.org