Message boards : Number crunching : Work done reverted back to Zero
Author | Message |
---|---|
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
One of my work units, 68% complete with over 1,700 hours done, has just gone back to 0.095% and zero hours done (according to the BOINC Manager). I will keep my fingers crossed that it sorts itself out, as I did not back it up (I have not lost credit on it, though). The work unit is a 'hadcm3inct_cn6q' type. I restarted the manager and it started running from the 0.00% mark; it still shows 68.65% completed and now has 217 hours to go. It was showing over 1,700 hours done with 700-odd hours to go. Maybe I should reboot the machine? No other work units are affected (I have 3 other CPDN work units plus Cosmology, Rosetta, Einstein and The Lattice Project all running at the moment, and none of them have done this). The BOINC Manager is currently showing 4 hours done and 216 hours to go, with the 68.65% still there. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
You mean this model? http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6602199 It's trickled again a few hours ago, so it must have recovered from whatever happened and got back from 0% to 68% again. And the only thing you did to help it was exit from BOINC and restart the manager? But just look at the speed of the latest trickle!!! You'll need to keep an eye on what its next trickle shows. I've no idea what happened here or how the model recovered. Conan, is there any way you could reserve one computer for CPDN only and keep your other project WUs on the other machine? This would make it worthwhile to make backups of the CPDN BOINC folder on just one computer. I know that your credits earned are never at risk from a model crash, but making backups maximises the chance of all the models completing. Cpdn news |
Send message Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
I've seen a few machines (all Linux) where the seconds per timestep cycled in a sawtooth fashion through unrealistic values. The owner of one of these machines told me that he thought the clock on his machine was broken. |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
No, mo.v, I am unable to do that, as I now only have the one machine physically near me (my only other active machine is 9 hours away and does not run CPDN). It seems to be a BOINC Manager thing, as I have now noticed that, for the second time, one of my 4 cores has shut down and only three keep working. There are no error messages, and restarting the manager gets all 4 going again. I updated to 5.10.8 recently, but this may have been a bad move. As it was 3 months old I thought it might be stable, but perhaps it is not? The WU stats seem to show the BOINC Manager figures for CPU time and sec/timestep, but the CPDN servers are showing the correct number of timesteps. The percent done is staying correct (possibly due to the CPDN servers), but the other figures are screwed up. |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
"You mean this model?" That speed of 0.0079 sec/TS is crazy. I bet BOINC Manager lost track of the model and reset its counters, but the model kept its checkpoint files OK. So when the model restarted, it updated the BOINC state file, moving the counters back to where they were. Sounds like a problem with BOINC, not CPDN. I'd say it was very good luck that the model recovered OK. I expect the next trickle to be back to normal speed. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Do you have 'keep in memory' set to yes or no? If no, then turning it on may help. Sometimes, if a model goes out of memory at the wrong point, it resets itself. Have you tried running a stability check on the PC for 24 hours or so? (mprime is the Linux version of Prime95.) Note that you'll need to run 4 copies, one for each core. When you say 'one of the cores shuts down', what does this mean? Does the job still appear in ps / top? What is its status if it does? What does the BOINC Manager show for that task? I'm a volunteer and my views are my own. News and Announcements and FAQ |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
What I mean is that instead of 4 jobs processing at the same time, one stops and only 3 are going: 3 CPUs are running at 100% and 1 is at 0%. Although this has happened twice in the last 2 days, I think this time it may have been related to Rosetta locking up, as I set my preferences to run for 21,000 seconds (6 hours) but the last job went for over 28,000 seconds (nearly 8 hours). So BOINC 5.10.8 or Rosetta may have problems. CPDN is still running fine (I have 4 jobs going at once); just one job has its data a bit incorrect. I will let it run and see what the end result may be, as I have not lost the WU yet and the percent done and credits earned are still intact. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I'd recommend doing the stability test even if the PC has been stable for a long time. There are several things which can affect your PC, such as dust on fans and heatsinks, fan bearings wearing, power supplies wearing out, and so forth. What was the status of the 0% job in ps / top? I.e., perhaps D or S. |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
Sorry Mike, but I didn't check it at the time. I just noticed that only 3 jobs were processing, not 4, and instead of investigating I just restarted the manager. |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
If you have "keep in memory" = no, then Rosetta *WILL* have problems in Linux when BOINC suspends it. The Rosetta app is just broken with this feature. Although CPDN is much more stable than Rosetta, you will still lose progress since the last checkpoint when BOINC suspends the app. |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
This is the section of my slot folder relating to the problem WU:
Resuming CPDN!
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2836081 A - 01/05/2030 00:30 - H:M:S=1757:12:33 AVG= 2.23 DLT= 1.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2836513 A - 07/05/2030 00:30 - H:M:S=1757:26:17 AVG= 2.23 DLT= 1.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2836945 A - 13/05/2030 00:30 - H:M:S=1757:40:13 AVG= 2.23 DLT= 0.99
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2837377 A - 19/05/2030 00:30 - H:M:S=1757:53:55 AVG= 2.23 DLT= 1.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2837809 A - 25/05/2030 00:30 - H:M:S=1758:07:32 AVG= 2.23 DLT= 1.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2838241 A - 01/06/2030 00:30 - H:M:S=1758:21:28 AVG= 2.23 DLT= 0.99
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2838673 A - 07/06/2030 00:30 - H:M:S=1758:35:21 AVG= 2.23 DLT= 1.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2839105 A - 13/06/2030 00:30 - H:M:S=1758:49:20 AVG= 2.23 DLT= 0.99
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2839537 A - 19/06/2030 00:30 - H:M:S=1759:03:21 AVG= 2.23 DLT= 1.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2839969 A - 25/06/2030 00:30 - H:M:S=1759:17:09 AVG= 2.23 DLT= 0.99
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2840401 A - 01/07/2030 00:30 - H:M:S=1759:31:08 AVG= 2.23 DLT= 1.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2840833 A - 07/07/2030 00:30 - H:M:S=1759:46:09 AVG= 2.23 DLT= 2.00
Suspended CPDN Monitor - Quit request from BOINC...
Cleaning up graphics data...
Detaching shared memory...
shmget: No such file or directory
Beginning work on result hadcm3inct_cn6q_1920_160_45870254_4...
Starting model in /home/ggoninan/BOINC/projects/climateprediction.net...
Created shared memory region key = 172920 of size 655060 bytes (version 602)
Sorry, BOINC could not open shared graphics library!
Starting model ID hadcm3inct_cn6q_1920_160_45870254 Phase 1
Getting pthread attributes - retval=0
Setting pthread size (100663296 bytes) - retval=0
Executing program hadcm3transum_5.44_i686-pc-linux-gnu 172920
Program launched with process id # 785
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2840833 A - 07/07/2030 00:30 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00
scan: cpdnout11.zip
scan: init_data.xml
scan: boinc_ufs_cpdnout9.zip
scan: ozone_hadcm3_1900.gz
scan: DMSallNH3SO21900.gz
scan: cpdnout9.zip
scan: cpdnout13.zip
scan: cpdnout15.zip
scan: boinc_ufs_cpdnout10.zip
scan: volc_v00.gz
scan: cpdnout5.zip
scan: cpdnout16.zip
scan: boinc_ufs_cpdnout8.zip
scan: hadcm3trans_5.41_i686-pc-linux-gnu
scan: hadcm3transse_5.41_i686-pc-linux-gnu.zip
scan: stderr.txt
scan: ghg_cntrl.gz
scan: SULPC_OXIDANTS_19_A2_1990.mod.gz
scan: hadcm3trans_5.41_i686-pc-linux-gnu.so
scan: hadcm3transdata_5.41_i686-pc-linux-gnu.zip
scan: cpdnout4.zip
scan: spec3a_sw_3_asol2b_hadcm3.gz
scan: boinc_ufs_cpdnout2.zip
scan: hadcm3inct_cn6q_1920_160_45870254.zip
scan: cpdnout7.zip
scan: NAT_VOLC.gz
scan: yafbg.astart.gz
scan: cpdnout3.zip
scan: spec3a_lw_3_asol2c_hadcm3.gz
scan: cpdnout14.zip
scan: boinc_ufs_cpdnout3.zip
scan: boinc_ufs_cpdnout5.zip
scan: boinc_ufs_cpdnout1.zip
scan: SULPC_OXIDANTS_19_A2_1990.gz
scan: 1002_flux_corr.anc.gz
scan: cpdnout2.zip
scan: cpdnout6.zip
scan: 1002_ocean.year.gz
scan: boinc_ufs_cpdnout6.zip
scan: cpdnout1.zip
scan: cpdnout8.zip
scan: cpdnout10.zip
scan: solar_v00.gz
scan: boinc_lockfile
scan: boinc_ufs_cpdnout7.zip
scan: hadcm3transum_5.41_i686-pc-linux-gnu
scan: cpdnout12.zip
scan: hfh3hdck_0308_nickfluxcorr.anc.gz
scan: boinc_ufs_cpdnout4.zip
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2841265 A - 13/07/2030 00:30 - H:M:S=0000:18:17 AVG= 0.00 DLT= 1.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2841697 A - 19/07/2030 00:30 - H:M:S=0000:36:44 AVG= 0.00 DLT= 0.98
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2842129 A - 25/07/2030 00:30 - H:M:S=0000:55:06 AVG= 0.00 DLT= 0.98
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2842561 A - 01/08/2030 00:30 - H:M:S=0001:13:41 AVG= 0.00 DLT= 1.99
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2842993 A - 07/08/2030 00:30 - H:M:S=0001:29:49 AVG= 0.00 DLT= 1.00
You can see how the BOINC Manager stats all reverted to zero, but the WU kept on going as if nothing had happened. Mo.v, Mark, would you know the reason for this? The WU is still going along OK. (I took a screenshot but have no idea how to transfer the image to this forum via Linux; I tried a number of programs but none work. If anyone knows, please let me know.) So the BOINC Manager is showing CPU TIME = 42:31:17, PROGRESS = 70.410%, TO COMPLETION = 204:06:11. The CPU time should be showing well over 1,800 hours, not 42. Progress has moved on from the 68% in the first post to over 70% now. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I've never before seen anything resembling this. It could be that the only thing that reset itself to zero is the CPU time. On the web page for this model http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6602199 I see that the sec/TS are now being calculated as if you'd reached this point in the model after only about 3 days of computing, i.e. since the CPU time reset. The extract from the slots folder seems to me to show the correct increments between checkpoints. From the point of view of the credits you'll get, this makes no difference, because you receive a fixed number for every trickle you produce. I can't think of anything you can do about this CPU time reset, or that would be worth attempting. These incorrect sec/TS speeds now showing make no difference to the calculations for the model itself, nor do the CPU time elapsed or time to completion. I think it's not long since the model passed a 10-year zip file upload point. Did you notice whether the file was produced and uploaded? I still think it would be a good idea for this computer to run the stability test Mike mentioned. If the computer comes through the test perfectly on all cores, you'd have total confidence in it again. The alternative is to wait and see whether the same thing happens again. When a core just shuts itself down, the likelihood of its model crashing is probably about 1 in 20. My old computer shuts itself down with increasing frequency, though it continues running electrically. It's a hardware problem: it's the CPU. |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
Simply amazing and strange. My only guess at this point is a flaky BOINC version. I think I tried to update to 5.10.8, 5.10.10, and 5.10.20 all at different points, but the programs had issues running on FC7. The manager and the daemon had communication/launching/terminating problems. You may have discovered a bug. BTW, are you running the 32-bit or 64-bit version of BOINC 5.10.8? |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
"Simply amazing and strange. My only guess at this point is a flaky BOINC version. I think I tried to update to 5.10.8, 5.10.10, and 5.10.20 all at different points, but the programs had issues running on FC7. The manager and the daemon had communication/launching/terminating problems. You may have discovered a bug." The 32-bit and FC3. If it is a bug, I don't want it. To get a running cr/h on this project, I convert the number of processed seconds into hours and divide that into the amount of credit granted. I will now have trouble doing that, as the seconds counter has restarted from zero, so I will have to estimate the time from the last normal sec/TS and work from that till it finishes. I only have 29% left to go, as it has moved on to 71% done now. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
My 2+ day old Q6600 G0 has this problem in Fedora7, too -- no graphics:
This is my first encounter with Fedora, and I wondered whether there is some hitch with dropping FC and merging the lot into one piece in Fedora7. Conan's issue in FC3 sets that thought to rest. Tried ATI's latest Linux driver for the graphics card: no joy. Haven't looked into it further because, as soon as openSuSE 10.3 is available (supposedly early October), F7 gets unceremoniously dumped. (It was a stopgap experiment -- because I didn't want to make another openSuSE 10.2 installation, with its benighted ZENworks.) If the issue persists... In the interim, I'm using checkpoint information from run_client messages for timing backups (not an issue with the short Slab intervals, but worth the bother with a pair of SAP runs in the mix). Which shared graphics library? If anyone has a thought about this... "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
"The 32-bit and FC3." I ran 5.8.16 on FC4 without a problem, and I run FC7 now with the same version (32-bit BOINC on both systems). That's what I recommend you stick with, especially for CPDN. At least I know that version works. I'm running 1 coupled and 1 slab model now without any problems.... |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
"My 2+ day old Q6600 G0 has this problem in Fedora7, too -- no graphics:" Astro, would you run this command for me at a terminal window (and post the output)? rpm -qa|grep xorg-x11-d|sort Also, what is the model of your ATI card? AGP or PCI or PCI-express? |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Wilco, in the morning. Meanwhile, the graphics card is a Gigabyte ATI Radeon HD 2400 Pro. It's PCI-E. (I go with low-end graphics cards because I have zero interest in games. [Computer, or otherwise.] This is by far the most 'capable' card in any of my six CPDN boxes -- the others, all ATI Radeon, being four 300's and one 1300.) This card was chosen because it claims to be 'Vista ready' --> whatever that means. (Against my better judgment, I decided to put Vista on the Q6600 because I prefer my boxes to be dual-boot. I should have gone for another XP license!) Off to bed... G'night all. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Well, I was overly optimistic about getting to it in the morning!
[jim@localhost ~]$ rpm -qa|grep xorg-x11-d|sort
xorg-x11-drivers-7.2-6.fc7
xorg-x11-drv-acecad-1.1.0-3.fc7
xorg-x11-drv-aiptek-1.0.1-3.fc7
xorg-x11-drv-apm-1.1.1-3.fc7
xorg-x11-drv-ark-0.6.0-3.fc7
xorg-x11-drv-ast-0.81.0-4.fc7
xorg-x11-drv-ati-6.6.3-4.fc7
xorg-x11-drv-calcomp-1.1.0-2.fc7
xorg-x11-drv-chips-1.1.1-3.fc7
xorg-x11-drv-cirrus-1.1.0-3.fc7
xorg-x11-drv-citron-2.2.0-2.fc7
xorg-x11-drv-digitaledge-1.1.0-2.fc7
xorg-x11-drv-dmc-1.1.0-3.fc7
xorg-x11-drv-dummy-0.2.0-3.fc7
xorg-x11-drv-dynapro-1.1.0-3.fc7
xorg-x11-drv-elographics-1.1.0-2.fc7
xorg-x11-drv-evdev-1.1.2-3.fc7
xorg-x11-drv-fbdev-0.3.1-2.fc7
xorg-x11-drv-fpit-1.1.0-2.fc7
xorg-x11-drv-glint-1.1.1-5.fc7
xorg-x11-drv-hyperpen-1.1.0-3.fc7
xorg-x11-drv-i128-1.2.0-5.fc7
xorg-x11-drv-i740-1.1.0-3.fc7
xorg-x11-drv-i810-2.0.0-4.fc7
xorg-x11-drv-jamstudio-1.1.0-2.fc7
xorg-x11-drv-keyboard-1.1.0-3.fc7
xorg-x11-drv-magellan-1.1.0-2.fc7
xorg-x11-drv-magictouch-1.0.0.5-3.fc7
xorg-x11-drv-mga-1.4.6.1-3.fc7
xorg-x11-drv-microtouch-1.1.0-2.fc7
xorg-x11-drv-mouse-1.2.1-2.fc7
xorg-x11-drv-mutouch-1.1.0-3.fc7
xorg-x11-drv-nouveau-2.0.96-2.fc7
xorg-x11-drv-nv-2.0.96-2.fc7
xorg-x11-drv-palmax-1.1.0-2.fc7
xorg-x11-drv-penmount-1.1.0-3.fc7
xorg-x11-drv-rendition-4.1.3-3.fc7
xorg-x11-drv-s3-0.5.0-3.fc7
xorg-x11-drv-s3virge-1.9.1-3.fc7
xorg-x11-drv-savage-2.1.2-3.fc7
xorg-x11-drv-siliconmotion-1.5.1-1.fc7
xorg-x11-drv-sis-0.9.3-2.fc7
xorg-x11-drv-sisusb-0.8.1-5.fc7
xorg-x11-drv-spaceorb-1.1.0-2.fc7
xorg-x11-drv-summa-1.1.0-2.fc7
xorg-x11-drv-tdfx-1.3.0-4.fc7
xorg-x11-drv-tek4957-1.1.0-2.fc7
xorg-x11-drv-trident-1.2.3-4.fc7
xorg-x11-drv-tseng-1.1.0-5.fc7
xorg-x11-drv-ur98-1.1.0-2.fc7
xorg-x11-drv-v4l-0.1.1-8.fc7
xorg-x11-drv-vesa-1.3.0-8.fc7
xorg-x11-drv-via-0.2.2-1.fc7
xorg-x11-drv-vmmouse-12.4.0-2.1
xorg-x11-drv-vmware-10.14.1-1.fc7
xorg-x11-drv-void-1.1.0-4.fc7
xorg-x11-drv-voodoo-1.1.0-4.fc7 |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
Astro, I think you have all the RPMs. AMD *just* released a new driver that, according to this article: http://www.phoronix.com/scan.php?page=article&item=821&num=1 ...should give you OpenGL support (for the first time) on Linux. Here are the driver links for 32-bit and 64-bit Linux. Choose wisely: http://ati.amd.com/support/drivers/linux/linux-radeonhdd.html http://ati.amd.com/support/drivers/linux64/linux64-radeonhdd.html I recommend giving the new driver a try. |
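
For anyone who wants to check their own slot log for the same symptom, here is a rough Python sketch. It is not part of BOINC or CPDN. It assumes the progress lines look like the ones Conan posted ("... TS <timestep> A - <model date> - H:M:S=<hours>:<min>:<sec> ..."), and it flags any point where the H:M:S counter drops while the timestep does not go backwards. All the function names are made up for illustration, and the credit figure you pass to credit_per_hour is whatever your own results page shows. Adding back the time lost at each reset gives an estimate of the real CPU time, which is the number the cr/h calculation described in the thread needs.

```python
import re

# Matches progress lines of the shape quoted in this thread, e.g.
# "... - PH 1 TS 2840833 A - 07/07/2030 00:30 - H:M:S=1759:46:09 AVG= 2.23 DLT= 2.00"
# Assumption: other model types or client versions may log a different layout.
PROGRESS_RE = re.compile(r"TS\s+(\d+)\s+A\s+-.*?H:M:S=(\d+):(\d{2}):(\d{2})")

# Four lines copied from the slot-folder extract posted above.
SAMPLE = """\
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2840401 A - 01/07/2030 00:30 - H:M:S=1759:31:08 AVG= 2.23 DLT= 1.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2840833 A - 07/07/2030 00:30 - H:M:S=1759:46:09 AVG= 2.23 DLT= 2.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2840833 A - 07/07/2030 00:30 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00
hadcm3inct_cn6q_1920_160_45870254 - PH 1 TS 2841265 A - 13/07/2030 00:30 - H:M:S=0000:18:17 AVG= 0.00 DLT= 1.00
"""


def parse_progress(text):
    """Return a list of (timestep, cpu_seconds) tuples in log order."""
    entries = []
    for match in PROGRESS_RE.finditer(text):
        timestep, hours, minutes, seconds = (int(g) for g in match.groups())
        entries.append((timestep, hours * 3600 + minutes * 60 + seconds))
    return entries


def find_resets(entries):
    """Indexes of entries where CPU time fell although the timestep did not go back."""
    return [
        i
        for i in range(1, len(entries))
        if entries[i][1] < entries[i - 1][1] and entries[i][0] >= entries[i - 1][0]
    ]


def cumulative_cpu_seconds(entries):
    """Estimate total CPU seconds by adding back the time lost at each reset."""
    if not entries:
        return 0
    lost = sum(entries[i - 1][1] for i in find_resets(entries))
    return lost + entries[-1][1]


def credit_per_hour(credit, cpu_seconds):
    """Credit granted divided by CPU hours (hypothetical helper for the cr/h figure)."""
    return credit / (cpu_seconds / 3600.0)


if __name__ == "__main__":
    log_entries = parse_progress(SAMPLE)
    print("counter reset seen at entries:", find_resets(log_entries))
    print("estimated CPU hours: %.1f" % (cumulative_cpu_seconds(log_entries) / 3600.0))
```

To use it on a real log, read the file from the slot folder into a string and pass that to parse_progress in place of SAMPLE. A restart from an older checkpoint, where the timestep itself goes backwards, is deliberately not counted as a counter reset.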