climateprediction.net (CPDN) home page
Thread 'Wish: checkpoint on exit'

Questions and Answers : Wish list : Wish: checkpoint on exit

old_user101069
Joined: 4 Oct 05
Posts: 12
Credit: 610,967
RAC: 0
Message 20937 - Posted: 1 Mar 2006, 21:25:53 UTC

It seems that removing the WU from memory (e.g. when BOINC shuts down, or if you do not have "Leave applications in memory while preempted?" selected) does not cause CPDN to write a checkpoint, so the model restarts from the last checkpoint.

My request is that this be corrected by letting the app write a checkpoint at an arbitrary point whenever BOINC requests that the app exit.
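The wish above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the CPDN model or the BOINC API actually works: `run_model`, `write_checkpoint`, and the `quit_requested` callback are hypothetical names, and the "model" is just a step counter.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run_model(path, total_steps, checkpoint_every, quit_requested):
    """Advance a toy model; checkpoint on schedule and also when asked to quit.

    quit_requested is a hypothetical callable standing in for whatever
    mechanism the client uses to ask the app to exit.
    """
    step = 0
    if os.path.exists(path):
        # Resume from the last checkpoint if one exists.
        with open(path) as handle:
            step = json.load(handle)["step"]
    while step < total_steps:
        step += 1  # one timestep of (pretend) science
        if quit_requested(step):
            # The wished-for extra checkpoint, written just before exiting.
            write_checkpoint(path, {"step": step})
            return step
        if step % checkpoint_every == 0:
            # The regular scheduled checkpoint.
            write_checkpoint(path, {"step": step})
    return step
```

With a checkpoint written on exit, a restart picks up at the step where the app was stopped rather than at the last scheduled checkpoint, so no timesteps are repeated.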

The CPU cycles wasted when the model reverts to the last checkpoint are not huge. On average (checkpointing every 3 days, divided by 2), 72 timesteps are lost. For my PC, I estimate that this amounts to about 5 minutes every 4 hours of work, or a 2% loss. That's not terribly significant, but over the life of a model it adds up to almost a day.
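The arithmetic behind that estimate can be written out as a short Python sketch. The 144-timestep checkpoint interval is inferred from "72 timesteps" being half an interval; the 5 minutes and 4 hours are the poster's own estimates for one PC, and the variable names are made up for illustration.

```python
# Inferred: 72 timesteps lost on average implies 144 timesteps per checkpoint.
timesteps_per_checkpoint = 144

# A restart lands, on average, halfway between two checkpoints.
expected_lost_timesteps = timesteps_per_checkpoint / 2

# Poster's estimates: about 5 minutes redone for every 4 hours of work.
lost_minutes_per_restart = 5
run_minutes_between_restarts = 4 * 60

loss_fraction = lost_minutes_per_restart / run_minutes_between_restarts

print(expected_lost_timesteps)        # 72.0
print(round(loss_fraction * 100, 1))  # 2.1 (percent)
```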

Another benefit is that users with a very stable system could reduce the number of disk writes using the BOINC preference "Write to disk at most every". (Maybe. I'm not sure whether that only refers to BOINC system writes...)
ID: 20937
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 21049 - Posted: 4 Mar 2006, 16:30:00 UTC

Yes, agreed. Unfortunately BOINC is written by different people from CPDN, so they can't directly influence BOINC's features...
I'm a volunteer and my views are my own.
ID: 21049
old_user101069
Joined: 4 Oct 05
Posts: 12
Credit: 610,967
RAC: 0
Message 21110 - Posted: 6 Mar 2006, 23:56:00 UTC - in response to Message 21049.  

> Yes, agreed. Unfortunately BOINC is written by different people from CPDN, so they can't directly influence BOINC's features...


A preference for how often the app writes to disk is already a feature of BOINC, and the writing of checkpoints is not related to BOINC at all, as it is done by the science app directly.

Since discovering the 8 key, I have noticed that the sulfur models checkpoint more often, but that has the downside of waiting for the disk more often.
ID: 21110
astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 21116 - Posted: 7 Mar 2006, 4:54:45 UTC - in response to Message 21110.  
Last modified: 7 Mar 2006, 5:05:10 UTC

> A preference for how often the app writes to disk is already a feature of BOINC, and the writing of checkpoints is not related to BOINC at all, as it is done by the science app directly.

> Since discovering the 8 key, I have noticed that the sulfur models checkpoint more often, but that has the downside of waiting for the disk more often.


If memory serves, Carl posted that CPDN writes based on Model requirements, not the boinc preference option.

Sulphur/Sulfur Models checkpoint every 144 Time Steps and I think that is (was?) hard-coded into the count-down. (That made for "Whoops" reactions in Spinup [six-day checkpoints, different TS/day count] and I gave up on using it.) SC checkpoints are just before 0030 on the first/fourth/seventh/tenth/... of each month.
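The numbers quoted for the Sulphur models can be cross-checked with a bit of arithmetic, sketched here in Python. The 3-day interval is read off the "first/fourth/seventh/tenth" pattern, and the 30-day month is purely an assumption made for illustration; neither the constants nor the names come from the model code.

```python
TIMESTEPS_PER_CHECKPOINT = 144   # from the post above
MODEL_DAYS_PER_CHECKPOINT = 3    # inferred from first/fourth/seventh/tenth
DAYS_PER_MONTH = 30              # assumption for illustration only

# 144 timesteps spread over 3 model days.
timesteps_per_day = TIMESTEPS_PER_CHECKPOINT // MODEL_DAYS_PER_CHECKPOINT

# Length of one timestep in model minutes.
minutes_per_timestep = 24 * 60 // timesteps_per_day

# Days of the month on which a checkpoint would fall under these assumptions.
checkpoint_days = list(range(1, DAYS_PER_MONTH + 1, MODEL_DAYS_PER_CHECKPOINT))

print(timesteps_per_day)     # 48
print(minutes_per_timestep)  # 30
print(checkpoint_days)       # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
```

A 30-minute timestep would be consistent with checkpoints landing "just before 0030", i.e. around the first timestep of the model day.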


Edit: Given that the "leave in memory" option overcomes the problem, I doubt there will be a change to the code. If set to leave in memory, it might still get swapped out if the OS wants the memory; if so, nothing is lost and the Model will continue from its last Time Step when swapped in again. (Plenty of available memory should eliminate most, if not all, swap.)

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 21116
old_user101069
Joined: 4 Oct 05
Posts: 12
Credit: 610,967
RAC: 0
Message 21166 - Posted: 9 Mar 2006, 2:54:19 UTC - in response to Message 21116.  


> Sulphur/Sulfur Models checkpoint every 144 Time Steps and I think that is (was?) hard-coded into the count-down. (That made for "Whoops" reactions in Spinup [six-day checkpoints, different TS/day count] and I gave up on using it.) SC checkpoints are just before 0030 on the first/fourth/seventh/tenth/... of each month.

> Edit: Given that the "leave in memory" option overcomes the problem, I doubt there will be a change to the code. If set to leave in memory, it might still get swapped out if the OS wants the memory; if so, nothing is lost and the Model will continue from its last Time Step when swapped in again. (Plenty of available memory should eliminate most, if not all, swap.)


The leave in memory option doesn't overcome the problem; it just alleviates it a bit. This was taken into account in the calculations that gave me the 2% loss. I run CPDN at 25% priority, which gives me about 4 hours of run time between reboots on average.

Good to know that the countdown might not be accurate. That also explains why it doesn't seem to coincide with disk activity.

Also, I thought hibernating (suspend to disk) instead of shutting down might fix the problem, but it doesn't.

Anyway, thanks for your help. Like I said, if it's just under two percent, that's not so bad, but it's still worthy of a wish to me. I will also take this up on the BOINC forums, to see if they can come up with a solution.
ID: 21166
