Message boards : Number crunching : Work done reverted back to Zero
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Thanks for that. Given that the box's Fedora7 installation life expectancy is about two weeks, I think I'll hold off until openSuSE 10.3 is installed, then implement the fruits of your work. Interesting article. Thanks again. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
A bit of an update on the problem I had back in September. That model is now up to 96.8% and looks like it will complete despite the CPU time information being incorrect. No other problems with this model. But alas, the second model of the same type that I was running (hadcm3inct_cmf7_1920_160_55869263) has now suffered from the same malady as the first. I checked the Boinc progress on my computer and found that the counters had reset to zero for Time processed, Time to completion and Time left. After it worked a bit longer and the checkpoint picked up again, the Percentage done came back to where it was before, but the time counters stayed at the reset values, just like the first one did. I have updated to 5.10.21 since the first report and all had been running fine until this happened again. It is only happening to CPDN, so perhaps on these extra-long WUs Boinc Manager is losing track of things? This computer runs 10 projects and CPDN is the only one doing crazy things. You can see in the slot information I copied below where the model time goes from 1,179 hours down to 0.00 hours. It seems to happen on switching from one project to another, when the shared memory has to be released; could this be the problem? So now I have two models that have decided to reset their stats for no apparent reason, with the CPDN model chugging on as if nothing happened. I still suspect a BOINC problem with monitoring over extended periods of time, or a problem with at least one of the other projects that I run. I recall now that I added a couple back around September. hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1802737 A - 19/06/1990 00:30 - H:M:S=1179:14:14 AVG= 2.35 DLT= 1.00 hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803169 A - 25/06/1990 00:30 - H:M:S=1179:29:51 AVG= 2.35 DLT= 1.00 Suspended CPDN Monitor - Quit request from BOINC... Cleaning up graphics data... Detaching shared memory... 
shmget: No such file or directory Beginning work on result hadcm3inct_cmf7_1920_160_55869263_1... Starting model in /home/ggoninan/BOINC/projects/climateprediction.net... Created shared memory region key = 173205 of size 655060 bytes (version 602) Sorry, BOINC could not open shared graphics library! Starting model ID hadcm3inct_cmf7_1920_160_55869263 Phase 1 Getting pthread attributes - retval=0 Setting pthread size (100663296 bytes) - retval=0 Executing program hadcm3transum_5.44_i686-pc-linux-gnu 173205 Program launched with process id # 13819 Climate model starting - use graphics to monitor progress. Or visit the website to see the graphs for this run. hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803169 A - 25/06/1990 00:30 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00 scan: cpdnout11.zip scan: init_data.xml scan: ozone_hadcm3_1900.gz scan: DMSallNH3SO21900.gz scan: cpdnout9.zip scan: cpdnout13.zip scan: hdz2hdck_0308_nickfluxcorr.anc.gz scan: cpdnout15.zip scan: hadcm3inct_cmf7_1920_160_55869263.zip scan: volc_v00.gz scan: cpdnout5.zip scan: cpdnout16.zip scan: hadcm3trans_5.41_i686-pc-linux-gnu scan: hadcm3transse_5.41_i686-pc-linux-gnu.zip scan: stderr.txt scan: 1040_flux_corr.anc.gz scan: ghg_cntrl.gz scan: SULPC_OXIDANTS_19_A2_1990.mod.gz scan: hadcm3trans_5.41_i686-pc-linux-gnu.so scan: 1040_ocean.year.gz scan: hadcm3transdata_5.41_i686-pc-linux-gnu.zip scan: cpdnout4.zip scan: spec3a_sw_3_asol2b_hadcm3.gz scan: boinc_ufs_cpdnout2.zip scan: cpdnout7.zip scan: NAT_VOLC.gz scan: yafbg.astart.gz scan: cpdnout3.zip scan: spec3a_lw_3_asol2c_hadcm3.gz scan: cpdnout14.zip scan: boinc_ufs_cpdnout3.zip scan: boinc_ufs_cpdnout5.zip scan: boinc_ufs_cpdnout1.zip scan: SULPC_OXIDANTS_19_A2_1990.gz scan: cpdnout2.zip scan: cpdnout6.zip scan: boinc_ufs_cpdnout6.zip scan: cpdnout1.zip scan: cpdnout8.zip scan: cpdnout10.zip scan: solar_v00.gz scan: boinc_lockfile scan: hadcm3transum_5.41_i686-pc-linux-gnu scan: cpdnout12.zip scan: boinc_ufs_cpdnout4.zip 
hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803601 A - 01/07/1990 00:30 - H:M:S=0000:17:16 AVG= 0.00 DLT= 0.96 hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1804033 A - 07/07/1990 00:30 - H:M:S=0000:34:20 AVG= 0.00 DLT= 0.99 hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1804465 A - 13/07/1990 00:30 - H:M:S=0000:51:31 AVG= 0.00 DLT= 0.00 hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1804897 A - 19/07/1990 00:30 - H:M:S=0001:08:35 AVG= 0.00 DLT= 1.00 Resuming CPDN! |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
My only suggestion at this point is to set NNT (No New Tasks) for the CPDN project. Let the task that's almost done finish. Abort the bad one. Report all results. Reset the project. Upgrade BOINC to the latest *stable* Linux version, 5.10.21. Try to download a new task for CPDN. |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
A further update. The WU 6602199 has finished successfully. The time reported for completion is 6,305,222 seconds (plus or minus a few) short of the actual time taken. Other than that it all went well and had no other issues. You can see in the result output where Boinc Manager reset all counters back to zero, but the WU kept on going and was granted full credit for the result. Very strange. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Congratulations! It was a long, determined, effort. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I had a look at Conan's BOINC messages for this model and noticed this group which is repeated at intervals: Detaching shared memory... shmget: No such file or directory Beginning work on result hadcm3inct_cn6q_1920_160_45870254_4... Starting model in /home/ggoninan/BOINC/projects/climateprediction.net... Created shared memory region key = 172920 of size 655060 bytes (version 602) Sorry, BOINC could not open shared graphics library! Does this indicate that Conan's computer ran into the problem of not being able to get shared memory? Mike maybe suspected this way back up this thread when he suggested not to keep the model in memory while suspended. I thought this shmget problem was only likely to occur on Macs. If anyone thinks it would be relevant or useful for Conan, I can copy a recent post about this by Charlie Fenton on the boinc_alpha mailing list (Charlie writes the BOINC code for Mac). It's a very clear explanation of the shared memory problem and contains some practical advice. Cpdn news |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I think it might mean that the screensaver / show graphics might not work, since the science app wouldn't be able to pass the frame back to Boinc for display. But I'm not sure. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
You wouldn't read about it (well, you will after I finish writing this). It has happened again, this time during a WU whose stats had already reset once; now they have reset again. The percentage done did not change (around 69%) but hours processed and hours to go changed: hours done reset to zero and hours to go reset to a new value about 100 hours less than before. It may have something to do with this message (this all appears to have happened as the WU was getting its information ready to send a trickle-up message, and I recall this is when the problem happened last time as well): 2008-02-20 01:43:05 [climateprediction.net] Task hadcm3inct_cmf7_1920_160_55869263_1 exited with zero status but no 'finished' file 2008-02-20 01:43:05 [climateprediction.net] If this happens repeatedly you may need to reset the project. 2008-02-20 01:43:05 [climateprediction.net] Restarting task hadcm3inct_cmf7_1920_160_55869263_1 using hadcm3i version 544 2008-02-20 01:43:08 [climateprediction.net] Sending scheduler request: To send trickle-up message 2008-02-20 01:43:08 [climateprediction.net] (not requesting new work or reporting completed tasks) 2008-02-20 01:43:13 [climateprediction.net] Scheduler RPC succeeded [server version 509] It appears that the WU started again from the last checkpoint and in the process Boinc Manager reset the time counters while the progress stayed the same. I did not notice last time whether the WU 'restarted' or 'resumed'. If it restarted, then that is why the counters reset. If it resumed, then it was a Boinc Manager thing? The WU is still going and should have trickled again since this hiccup, but it remains a mystery. I am unsure if I have changed the Boinc Client version since the last time this happened. I still think it is a Boinc thing as no other project has had any trouble. |
Send message Joined: 1 Feb 07 Posts: 26 Credit: 885,216 RAC: 0 |
You wouldn't read about it (well you will after I finish writing this). I had exactly the same message a couple of days ago, also on a 160-year model (hadcm3istd_4439_1920_160_15921780_5), but it also set my "Progress" back to 0%. It had uploaded 83 trickles, so was just over halfway through, and I don't fancy waiting another 40 days for it to catch up with itself. I am crunching CPDN on only 1 core of my quaddy and the other 3 WUs that have been downloaded are relative "quickies", so, since I had a wingman with a faster box who was already a few years ahead of me, I have suspended that WU (and set "No New Tasks") and will crunch the other 3 tasks, by which time he should just about have completed it. If he returns an error in the meantime, then I will return to it on completion of any of my "shorties". I made no changes whatsoever to my machine that could have caused this error; its occurrence is a total mystery. F. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Fred, Your Quad had 37 Models. As best I can tell, two short Runs of that lot completed successfully. An 80-year Run, though logged as a success, is short some Trickles. How far overclocked is your machine? Whether overclocked or not, some stability tests are due: hours of Memtest-86+, plus a full day of Prime95 Torture Test (four simultaneous copies). Your results are not usual for such a machine. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 1 Feb 07 Posts: 26 Credit: 885,216 RAC: 0 |
Fred, Thanks for the concern. I am hoping, one day, to have more completed models than failed ones but won't hold my breath waiting for that - perhaps a dream more than an expectation! I had not noticed that the 80-year run was short on trickles, just celebrated a "Success", but I guess there is nothing I can do about that now. I can trace pretty well all of the failed WUs to specific events in my overclocking adventures on my earlier E6400 or my current Q6600 setup. I do run at least 2 complete cycles of Memtest and 8 hours of Prime95 v25.x (i.e. all 4 cores) after any hardware changes and continuously monitor core temps, but occasionally things still happen - e.g. temporarily attach a laptop SATA HD to copy some data and the machine won't boot, NB VERY hot, eventually have to replace MoBo; or Windoze decides it is corrupted after a reboot triggered by an AV update; etc. These events, and others similar, have required a re-load of or re-attach to Boinc and have resulted in the error reports, or in WUs that the system thinks I have but my machine has lost. I was getting quite excited, relatively speaking, at the prospect of completing a 160-year model - the machine has been totally stable at 3.336GHz with core temps of 51C since Xmas - and having got past halfway I was beginning to feel I was on the downhill. Then it errors and restarts from zero! Still, another 4 days should see me through the HADAM3 model I am currently crunching, and I guess I should schedule a day out to re-run Memtest and Prime95 to be on the safe side (and blow out the dust bunnies while I am at it). F. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
When I'm overclocking I use 24 hours of Prime95 (one per core). My Q6600 took a lot to get it completely stable; it had to go down a long way from being 'nearly stable' to being 'entirely stable'. The AMDs I've overclocked were much easier, since the grey area between stable and unstable was much narrower. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Send message Joined: 1 Feb 07 Posts: 26 Credit: 885,216 RAC: 0 |
Fair enough. I guess my machine is now booked for a thorough health check early next week. F. |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
Well, I can forget about a happy ending for this WU. Upgrading Boinc to the latest version for Linux, 5.10.45, trashed all work on all projects and created 3 duplicate computers. A bit crappy for a latest release that has been tested. I tried restoring the folder but it still died, with 5.10.45 wiping all data in the Boinc folder no matter what I do. I tried rolling back to an older version of Boinc and then rolling forward again, but it still wiped everything. Climate downloaded another WU but I aborted it as I am now very disheartened with the whole thing. I only had about 100 hours to go, despite Boinc resetting my stats on that WU twice, and I was looking forward to it finishing. I may be back but I don't know when; we will have to see. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
The whole question of BOINC downloads and upgrades for Linux is a mess as far as I can see. Not just the stability of the BOINC version, but the whole question of how to get it going on all the different distros. If you do crunch a climate model again, it might be worthwhile in future holding off upgrading BOINC until you've completed the last model on the machine and have nothing or nearly nothing from other projects. I do sympathise. You actually have a very good model completion record on both computers so you've contributed a lot to the project. Cpdn news |
Send message Joined: 3 Mar 06 Posts: 96 Credit: 353,185 RAC: 0 |
The whole question of BOINC downloads and upgrades for Linux is a mess as far as I can see. Not just the stability of the BOINC version, but the whole question of how to get it going on all the different distros. It's not as bad as it seems but it's definitely not as easy as with Windows. If there were people dedicated to the job of building BOINC install/update packages it would be great, but the manpower, expertise or dedication seems not to be there. If you do crunch a climate model again, it might be worthwhile in future holding off upgrading BOINC until you've completed the last model on the machine and have nothing or nearly nothing from other projects. Personally, I wouldn't dream of updating Linux or BOINC while a CPDN model is running. If it's absolutely necessary (and having the latest eye candy, or a feature that is strictly for convenience, doesn't = necessary) then I make redundant backups and have a rock-solid plan for rolling back the update if it causes problems. Also, I will not run CPDN in parallel with other projects. If I start a model then it runs 24/7 to completion, with no preempting by other projects, because the sooner you get them done the fewer headaches you have. Then I put CPDN on the back burner for a few months and help other projects. Why take unnecessary risks that have zero payoff? |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
The whole question of BOINC downloads and upgrades for Linux is a mess as far as I can see. Not just the stability of the BOINC version, but the whole question of how to get it going on all the different distros. Thanks mo.v and Dagorath, I will keep this in mind if I decide to come back and have another go. |
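
The model status lines quoted earlier in the thread carry a cumulative `H:M:S` counter, so the reset Conan describes shows up as that counter dropping between two consecutive status lines. A minimal Python sketch of spotting it, assuming one status line per input line and the same layout as the samples posted in the thread. The names `parse_status` and `find_resets` are made up for illustration and are not part of BOINC or the CPDN application.

```python
import re

# Matches e.g. "... TS 1803169 A - 25/06/1990 00:30 - H:M:S=1179:29:51 AVG= 2.35 ..."
# Captures the timestep number and the elapsed hours, minutes and seconds.
STATUS_RE = re.compile(r"TS\s+(\d+)\s+\w+\s+-.*?H:M:S=(\d+):(\d+):(\d+)")


def parse_status(line):
    """Return (timestep, elapsed_seconds) for a status line, or None otherwise."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    timestep, hours, minutes, seconds = (int(g) for g in match.groups())
    return timestep, hours * 3600 + minutes * 60 + seconds


def find_resets(lines):
    """Return (timestep, seconds_before, seconds_after) for each counter drop."""
    resets = []
    previous = None
    for line in lines:
        parsed = parse_status(line)
        if parsed is None:
            continue  # monitor messages, "scan:" lines, etc.
        timestep, elapsed = parsed
        if previous is not None and elapsed < previous[1]:
            resets.append((timestep, previous[1], elapsed))
        previous = parsed
    return resets


if __name__ == "__main__":
    sample = [
        "hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1802737 A - 19/06/1990 00:30 - H:M:S=1179:14:14 AVG= 2.35 DLT= 1.00",
        "hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803169 A - 25/06/1990 00:30 - H:M:S=1179:29:51 AVG= 2.35 DLT= 1.00",
        "Suspended CPDN Monitor - Quit request from BOINC...",
        "hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803169 A - 25/06/1990 00:30 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00",
        "hadcm3inct_cmf7_1920_160_55869263 - PH 1 TS 1803601 A - 01/07/1990 00:30 - H:M:S=0000:17:16 AVG= 0.00 DLT= 0.96",
    ]
    print(find_resets(sample))  # -> [(1803169, 4246191, 0)]
```

On the sample, the timestep is unchanged across the restart (1803169 both times) while the elapsed time falls from 1179:29:51 to zero, which matches the observation that the model resumed from its checkpoint and only the time accounting was lost.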
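
The client message quoted in the thread ("exited with zero status but no 'finished' file") is itself advised to be a reason to reset the project if it repeats, so counting how often it appears per task may help decide when that point has been reached. A small Python sketch under the assumption that the message log has been saved as plain text with one message per line, in the wording shown in the thread. The function name `count_unclean_exits` is hypothetical.

```python
import re
from collections import Counter

# Wording taken from the messages quoted in the thread. \W* tolerates plain
# or backslash-escaped quotes around the word "finished".
UNCLEAN_EXIT_RE = re.compile(
    r"Task (\S+) exited with zero status but no \W*finished\W* file"
)


def count_unclean_exits(lines):
    """Count 'exited with zero status' messages per task name."""
    counts = Counter()
    for line in lines:
        match = UNCLEAN_EXIT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    log = [
        "2008-02-20 01:43:05 [climateprediction.net] Task hadcm3inct_cmf7_1920_160_55869263_1 exited with zero status but no 'finished' file",
        "2008-02-20 01:43:05 [climateprediction.net] If this happens repeatedly you may need to reset the project.",
        "2008-02-20 01:43:05 [climateprediction.net] Restarting task hadcm3inct_cmf7_1920_160_55869263_1 using hadcm3i version 544",
    ]
    for task, n in count_unclean_exits(log).items():
        print(task, n)
```

How many repeats count as "repeatedly" is not stated in the thread, so any threshold applied to these counts would be a judgment call.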
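
Dagorath's advice to make backups before any BOINC upgrade can be sketched as a timestamped archive of the BOINC data directory. This is a hedged Python example using only the standard library. The function name `backup_boinc_dir` and the example paths are hypothetical, and it assumes the BOINC client has been stopped first so that files are not changing while they are copied, and that the destination lies outside the directory being archived.

```python
import tarfile
import time
from pathlib import Path


def backup_boinc_dir(data_dir, dest_dir):
    """Archive data_dir into dest_dir as a timestamped .tar.gz; return its path.

    Assumes the BOINC client is not running and dest_dir is not inside data_dir.
    """
    data_dir = Path(data_dir)
    dest_dir = Path(dest_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"no such directory: {data_dir}")
    dest_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{data_dir.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store entries under the directory's own name so the archive
        # extracts into a single top-level folder.
        tar.add(data_dir, arcname=data_dir.name)
    return archive


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever BOINC keeps its data.
    print(backup_boinc_dir(Path.home() / "BOINC", Path.home() / "boinc-backups"))
```

Rolling back would then mean stopping the client, moving the damaged directory aside, and extracting the archive in its place. Whether a restored directory is accepted by a newer client is not something this sketch can guarantee, as Conan's experience with 5.10.45 suggests.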