climateprediction.net (CPDN) home page
Thread 'Work done reverted back to Zero'

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 30696 - Posted: 24 Sep 2007, 21:59:04 UTC

Thanks for that. Given that the box's Fedora 7 installation has a life expectancy of about two weeks, I think I'll hold off until openSuSE 10.3 is installed, then implement the fruits of your work.

Interesting article.

Thanks again.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 31328 - Posted: 12 Nov 2007, 0:51:35 UTC

A bit of an update on the problem I had back in September.
That model is now up to 96.8% and looks like it will complete, despite the CPU time information being incorrect. No other problems with this model.

But alas, the second model of the same type that I was running (hadcm3inct_cmf7_1920_160_55869263) has now suffered from the same malady as the first.
I checked the Boinc progress on my computer and found that the counters for Time processed, Time to completion and Time left had reset to zero.
After it had worked for a bit longer and the checkpoint was picked up again, the Percentage done came back to where it was before, but the time counters stayed at the reset values, just like the first one did.

I have updated to 5.10.21 since the first report, and all had been running fine until this happened again.

It is only happening to CPDN, so perhaps on these extra-long WUs Boinc manager is losing track of things?
This computer runs 10 projects and CPDN is the only one doing crazy things.

You can see in the slot information I copied below where the model time goes from 1,179 hours down to 0.00 hours.
It seems to happen when switching from one project to another, when the shared memory has to be released. Could this be the problem?

So now I have two models that have decided to reset their stats for no apparent reason, with the CPDN model chugging on as if nothing happened.

I still suspect a BOINC problem with monitoring over extended periods of time, or a problem with at least one of the other projects that I run. I recall now that I added a couple back about September.


hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1802737 A - 19/06/1990 00:30 - H:M:S=1179:14:14 AVG= 2.35 DLT= 1.00
hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803169 A - 25/06/1990 00:30 - H:M:S=1179:29:51 AVG= 2.35 DLT= 1.00
Suspended CPDN Monitor - Quit request from BOINC...
Cleaning up graphics data...
Detaching shared memory...
shmget: No such file or directory
Beginning work on result hadcm3inct_cmf7_1920_160_55869263_1...
Starting model in /home/ggoninan/BOINC/projects/climateprediction.net...
Created shared memory region key = 173205 of size 655060 bytes (version 602)
Sorry, BOINC could not open shared graphics library!
Starting model ID hadcm3inct_cmf7_1920_160_55869263 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (100663296 bytes) - retval=0
Executing program hadcm3transum_5.44_i686-pc-linux-gnu 173205
Program launched with process id # 13819
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803169 A - 25/06/1990 00:30 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00
scan: cpdnout11.zip
scan: init_data.xml
scan: ozone_hadcm3_1900.gz
scan: DMSallNH3SO21900.gz
scan: cpdnout9.zip
scan: cpdnout13.zip
scan: hdz2hdck_0308_nickfluxcorr.anc.gz
scan: cpdnout15.zip
scan: hadcm3inct_cmf7_1920_160_55869263.zip
scan: volc_v00.gz
scan: cpdnout5.zip
scan: cpdnout16.zip
scan: hadcm3trans_5.41_i686-pc-linux-gnu
scan: hadcm3transse_5.41_i686-pc-linux-gnu.zip
scan: stderr.txt
scan: 1040_flux_corr.anc.gz
scan: ghg_cntrl.gz
scan: SULPC_OXIDANTS_19_A2_1990.mod.gz
scan: hadcm3trans_5.41_i686-pc-linux-gnu.so
scan: 1040_ocean.year.gz
scan: hadcm3transdata_5.41_i686-pc-linux-gnu.zip
scan: cpdnout4.zip
scan: spec3a_sw_3_asol2b_hadcm3.gz
scan: boinc_ufs_cpdnout2.zip
scan: cpdnout7.zip
scan: NAT_VOLC.gz
scan: yafbg.astart.gz
scan: cpdnout3.zip
scan: spec3a_lw_3_asol2c_hadcm3.gz
scan: cpdnout14.zip
scan: boinc_ufs_cpdnout3.zip
scan: boinc_ufs_cpdnout5.zip
scan: boinc_ufs_cpdnout1.zip
scan: SULPC_OXIDANTS_19_A2_1990.gz
scan: cpdnout2.zip
scan: cpdnout6.zip
scan: boinc_ufs_cpdnout6.zip
scan: cpdnout1.zip
scan: cpdnout8.zip
scan: cpdnout10.zip
scan: solar_v00.gz
scan: boinc_lockfile
scan: hadcm3transum_5.41_i686-pc-linux-gnu
scan: cpdnout12.zip
scan: boinc_ufs_cpdnout4.zip
hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803601 A - 01/07/1990 00:30 - H:M:S=0000:17:16 AVG= 0.00 DLT= 0.96
hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1804033 A - 07/07/1990 00:30 - H:M:S=0000:34:20 AVG= 0.00 DLT= 0.99
hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1804465 A - 13/07/1990 00:30 - H:M:S=0000:51:31 AVG= 0.00 DLT= 0.00
hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1804897 A - 19/07/1990 00:30 - H:M:S=0001:08:35 AVG= 0.00 DLT= 1.00
Resuming CPDN!
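
One way to pin down exactly where the counter drops is to scan the status lines in the slot output. The following is a minimal Python sketch, assuming the line layout shown in the log above; the function name `find_counter_resets` and the regular expression are illustrative only and are not part of BOINC or the CPDN application.

```python
import re

# Matches the model status lines in the slot output, e.g.
# "<model> - PH 1 TS 1803169 A - 25/06/1990 00:30 - H:M:S=1179:29:51 AVG= 2.35 DLT= 1.00"
STATUS = re.compile(
    r"^(?P<model>\S+) - PH (?P<phase>\d+) TS (?P<ts>\d+) \S+ - "
    r"\d{2}/\d{2}/\d{4} \d{2}:\d{2} - "
    r"H:M:S=(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2})"
)


def find_counter_resets(lines):
    """Return (timestep, seconds_before, seconds_after) for every drop in H:M:S."""
    resets = []
    previous = None
    for line in lines:
        match = STATUS.match(line.strip())
        if not match:
            continue  # skip "scan:", shared-memory and other non-status lines
        elapsed = int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
        if previous is not None and elapsed < previous:
            resets.append((int(match["ts"]), previous, elapsed))
        previous = elapsed
    return resets


# Three lines copied from the log above: before, at, and after the reset.
sample = [
    "hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803169 A - 25/06/1990 00:30 - H:M:S=1179:29:51 AVG= 2.35 DLT= 1.00",
    "hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803169 A - 25/06/1990 00:30 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00",
    "hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803601 A - 01/07/1990 00:30 - H:M:S=0000:17:16 AVG= 0.00 DLT= 0.96",
]

for timestep, before, after in find_counter_resets(sample):
    print(f"Counter reset at TS {timestep}: {before} s -> {after} s")
```

Note that the timestep (TS) is the same on both sides of the reset, which fits the observation that the model itself carries on from its checkpoint and only the time counter is lost.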

DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 31329 - Posted: 12 Nov 2007, 1:58:14 UTC - in response to Message 31328.  

My only suggestion at this point is to set NNT (No New Tasks) for the CPDN project. Let the task that's almost done finish. Abort the bad one. Report all results. Reset the project. Upgrade BOINC to the latest *stable* Linux version, 5.10.21. Then try to download a new task for CPDN.

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 31914 - Posted: 29 Dec 2007, 22:31:04 UTC

A further update.
WU 6602199 has finished successfully.
The time reported for completion is about 6,305,222 seconds (give or take a few) short of the actual time taken.
Other than that, it all went well and had no other issues.
You can see in the result output where Boinc Manager reset all counters back to zero, but the WU kept going and was granted full credit for the result.

Very strange.
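
For scale, the shortfall quoted above can be converted into days and hours. A small Python sketch (the helper name `split_seconds` is made up for illustration):

```python
def split_seconds(total):
    """Break a number of seconds into (days, hours, minutes, seconds)."""
    days, rest = divmod(total, 86400)   # 86,400 seconds in a day
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return days, hours, minutes, seconds


print(split_seconds(6_305_222))  # (72, 23, 27, 2)
```

That is roughly 73 days, or about 1,751 hours, of CPU time missing from the reported figure.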

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 31918 - Posted: 30 Dec 2007, 1:50:48 UTC


Congratulations! It was a long, determined effort.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32000 - Posted: 4 Jan 2008, 12:54:39 UTC

I had a look at Conan's BOINC messages for this model and noticed this group, which is repeated at intervals:

Detaching shared memory...
shmget: No such file or directory
Beginning work on result hadcm3inct_cn6q_1920_160_45870254_4...
Starting model in /home/ggoninan/BOINC/projects/climateprediction.net...
Created shared memory region key = 172920 of size 655060 bytes (version 602)
Sorry, BOINC could not open shared graphics library!


Does this indicate that Conan's computer ran into the problem of not being able to get shared memory? Mike maybe suspected this way back up this thread when he suggested not keeping the model in memory while suspended. I thought this shmget problem was only likely to occur on Macs.

If anyone thinks it would be relevant or useful for Conan, I can copy a recent post about this by Charlie Fenton on the boinc_alpha mailing list (Charlie writes the BOINC code for Mac). It's a very clear explanation of the shared memory problem and contains some practical advice.
Cpdn news

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32003 - Posted: 4 Jan 2008, 16:40:05 UTC


I think it might mean that the screensaver / show graphics might not work, since the science app wouldn't be able to pass the frame back to Boinc for display. But I'm not sure.
I'm a volunteer and my views are my own.
News and Announcements and FAQ

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 32685 - Posted: 21 Feb 2008, 10:19:01 UTC

You wouldn't read about it (well, you will after I finish writing this).
It has happened again, this time to a WU whose stats had already reset once. Now they have reset again.

The percentage done did not change (around 69%), but hours processed and hours to go did: hours done reset to zero, and hours to go reset to a new value about 100 hours less than before.

It may have something to do with this message:
(This all appears to have happened as the WU was getting its information ready to send a trickle-up message, and I recall this is when the problem happened last time as well.)


2008-02-20 01:43:05 [climateprediction.net] Task hadcm3inct_cmf7_1920_160_55869263_1 exited with zero status but no 'finished' file
2008-02-20 01:43:05 [climateprediction.net] If this happens repeatedly you may need to reset the project.
2008-02-20 01:43:05 [climateprediction.net] Restarting task hadcm3inct_cmf7_1920_160_55869263_1 using hadcm3i version 544
2008-02-20 01:43:08 [climateprediction.net] Sending scheduler request: To send trickle-up message
2008-02-20 01:43:08 [climateprediction.net] (not requesting new work or reporting completed tasks)
2008-02-20 01:43:13 [climateprediction.net] Scheduler RPC succeeded [server version 509]

It appears that the WU started again from the last checkpoint, and in the process Boinc Manager reset the time counters while the progress stayed the same.

I did not notice, the last time it did this, whether the WU 'restarted' or 'resumed'. If it restarted, that is why the counters reset. If it resumed, then was it a Boinc Manager thing?

The WU is still going and should have trickled again since this hiccup, but it remains a mystery.

I am unsure whether I have changed the Boinc Client version since the last time this happened.
I still think it is a Boinc thing, as no other project has had any trouble.
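
To check whether earlier resets coincided with the same restart message, the client messages can be searched for it. This is a minimal Python sketch, assuming the messages have been saved as plain text in the layout quoted above (timestamp, project in square brackets, then the text); `unexpected_restarts` is a hypothetical helper name, not a BOINC tool.

```python
import re

# Matches the "exited with zero status" message in the layout quoted above.
RESTART = re.compile(
    r"^(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] "
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def unexpected_restarts(lines):
    """Return (timestamp, project, task) for each unexpected-exit message."""
    hits = []
    for line in lines:
        match = RESTART.match(line.strip())
        if match:
            hits.append((match["when"], match["project"], match["task"]))
    return hits


# Two of the messages quoted above; only the first should be reported.
messages = [
    "2008-02-20 01:43:05 [climateprediction.net] Task hadcm3inct_cmf7_1920_160_55869263_1 exited with zero status but no 'finished' file",
    "2008-02-20 01:43:08 [climateprediction.net] Sending scheduler request: To send trickle-up message",
]

for when, project, task in unexpected_restarts(messages):
    print(f"{when}  {project}  {task}")
```

Comparing these timestamps with the points where the time counters dropped would show whether a 'restarted' task, rather than a 'resumed' one, lines up with each reset.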

old_user428438
Joined: 1 Feb 07
Posts: 26
Credit: 885,216
RAC: 0
Message 32697 - Posted: 21 Feb 2008, 20:00:17 UTC - in response to Message 32685.  

I had exactly the same message a couple of days ago, also on a 160-year model (hadcm3istd_4439_1920_160_15921780_5), but it also set my "Progress" back to 0%. It had uploaded 83 trickles, so it was just over halfway through, and I don't fancy waiting another 40 days for it to catch up with itself. I am crunching CPDN on only 1 core of my quaddy, and the other 3 WUs that have been downloaded are relative "quickies". Since I have a wingman with a faster box who was already a few years ahead of me, I have suspended that WU (and set "No New Tasks") and will crunch the other 3 tasks, by which time he should just about have completed it. If he returns an error in the meantime, I will return to it on completion of any of my "shorties".

I made no changes whatsoever to my machine that could have caused this error - its occurrence is a total mystery.

F.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 32700 - Posted: 22 Feb 2008, 3:19:23 UTC

Fred,

Your Quad had 37 Models. As best I can tell, two short Runs of that lot completed successfully. An 80-year Run, though logged a success, is short some Trickles.

How far overclocked is your machine?

Whether overclocked or not, some stability tests are due: hours of Memtest86+, plus a full day of Prime95 Torture Test (four simultaneous copies).

Your results are not usual for such a machine.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

old_user428438
Joined: 1 Feb 07
Posts: 26
Credit: 885,216
RAC: 0
Message 32701 - Posted: 22 Feb 2008, 8:49:13 UTC - in response to Message 32700.  
Last modified: 22 Feb 2008, 9:09:17 UTC

Thanks for the concern. I am hoping, one day, to have more completed models than failed ones, but I won't hold my breath waiting for that - perhaps a dream more than an expectation!

I had not noticed that the 80-year run was short on trickles; I just celebrated a "Success". I guess there is nothing I can do about that now.

I can trace pretty well all of the failed WUs to specific events in my overclocking adventures on my earlier E6400 or my current Q6600 setup. I run at least 2 complete cycles of Memtest and 8 hours of Prime95 v25.x (i.e. on all 4 cores) after any hardware change and continuously monitor core temps, but occasionally things still happen: e.g. I temporarily attach a laptop SATA HD to copy some data and the machine won't boot, with the NB VERY hot, and I eventually have to replace the MoBo; or Windoze decides it is corrupted after a reboot triggered by an AV update; etc. These events, and others like them, have required a reload of or re-attach to Boinc, and have resulted in the error reports, or in WUs that the system thinks I have but that my machine has lost.

I was getting quite excited, relatively speaking, at the prospect of completing a 160-year model - the machine has been totally stable at 3.336GHz with core temps of 51C since Xmas - and having got past halfway I was beginning to feel I was on the downhill. Then it errors and restarts from zero!

Still, another 4 days should see me through the HADAM3 model I am currently crunching and I guess I should schedule a day out to re-run Memtest and Prime95 to be on the safe side (and blow out the dust bunnies while I am at it).

F.

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32702 - Posted: 22 Feb 2008, 13:33:59 UTC


When I'm overclocking I use 24 hours of Prime95 (one copy per core). My Q6600 took a lot to get completely stable; it had to go down a long way from being 'nearly stable' to being 'entirely stable'. The AMDs I've overclocked were much easier, since the grey area between stable and unstable was much narrower.

I'm a volunteer and my views are my own.
News and Announcements and FAQ

old_user428438
Joined: 1 Feb 07
Posts: 26
Credit: 885,216
RAC: 0
Message 32703 - Posted: 22 Feb 2008, 14:24:06 UTC - in response to Message 32702.  


Fair enough. I guess my machine is now booked for a thorough health check early next week.

F.

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 33177 - Posted: 1 Apr 2008, 14:22:50 UTC

Well I can forget about a happy ending for this WU.
Upgrading Boinc to the latest version for Linux, 5.10.45, trashed all work on all projects and created 3 duplicate computers.

A bit crappy for a latest release that has been tested.

I tried restoring the folder but it still died, with 5.10.45 wiping all data in the Boinc folder no matter what I do. I tried rolling back to an older version of Boinc and then rolling forward again, but it still wiped everything.

Climate downloaded another WU, but I aborted it as I am now very disheartened with the whole thing.

I only had about 100 hours to go, despite Boinc resetting my stats on that WU twice, and I was looking forward to it finishing.

I may be back, but I don't know when; we will have to see.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33181 - Posted: 1 Apr 2008, 16:18:39 UTC

The whole question of BOINC downloads and upgrades for Linux is a mess as far as I can see. Not just the stability of the BOINC version, but the whole question of how to get it going on all the different distros. If you do crunch a climate model again, it might be worthwhile in future to hold off upgrading BOINC until you've completed the last model on the machine and have nothing or nearly nothing from other projects.

I do sympathise. You actually have a very good model completion record on both computers, so you've contributed a lot to the project.
Cpdn news

old_user170894
Joined: 3 Mar 06
Posts: 96
Credit: 353,185
RAC: 0
Message 33184 - Posted: 1 Apr 2008, 22:19:07 UTC - in response to Message 33181.  
Last modified: 1 Apr 2008, 22:20:53 UTC

The whole question of BOINC downloads and upgrades for Linux is a mess as far as I can see. Not just the stability of the BOINC version, but the whole question of how to get it going on all the different distros.


It's not as bad as it seems, but it's definitely not as easy as with Windows. If there were people dedicated to the job of building BOINC install/update packages it would be great, but the manpower, expertise or dedication seems not to be there.

If you do crunch a climate model again, it might be worthwhile in future holding off upgrading BOINC until you've completed the last model on the machine and have nothing or nearly nothing from other projects.


Personally, I wouldn't dream of updating Linux or BOINC while a CPDN model is running. If it's absolutely necessary (and having the latest eye candy or a strictly-for-convenience feature doesn't = necessary), then I make redundant backups and have a rock-solid plan for rolling back the update if it causes problems. Also, I will not run CPDN in parallel with other projects. If I start a model then it runs 24/7 to completion, with no preempting by other projects, because the sooner you get them done the fewer headaches you have. Then I put CPDN on the back burner for a few months and help other projects. Why take unnecessary risks that have zero payoff?
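
As a rough illustration of the backup step, here is a minimal Python sketch that archives a BOINC data directory before an upgrade. It assumes the BOINC client has been shut down first, so that no files change mid-copy, and that the backup folder lives outside the BOINC directory. The function name `backup_boinc_dir` and the example paths are hypothetical.

```python
import tarfile
import time
from pathlib import Path


def backup_boinc_dir(boinc_dir, backup_root):
    """Write a timestamped .tar.gz of boinc_dir into backup_root; return its path."""
    boinc_dir = Path(boinc_dir)
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)

    # A timestamp in the name lets several backups sit side by side,
    # which is what makes them "redundant".
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = backup_root / f"{boinc_dir.name}-{stamp}.tar.gz"

    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the BOINC folder.
        tar.add(str(boinc_dir), arcname=boinc_dir.name)
    return archive


# Example call with made-up paths; run it only after stopping the client:
# backup_boinc_dir("/home/user/BOINC", "/home/user/boinc-backups")
```

Rolling back would then mean stopping the client, moving the damaged folder aside and unpacking the archive in its place. Whether the project server still accepts results from the restored state is a separate question that this sketch does not address.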



Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 33238 - Posted: 7 Apr 2008, 9:01:33 UTC - in response to Message 33184.  

Thanks mo.v and Dagorath,
I will keep this in mind if I decide to come back and have another go.
