Thread 'HADAM3P - Maximum elapsed time exceeded'

Author	Message
Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 42711 - Posted: 29 Jul 2011, 20:08:16 UTC Last modified: 29 Jul 2011, 20:25:46 UTC
This hadam3p_pnw_32s5_1985_1_007369346_0 failed with "Maximum elapsed time exceeded" at just over 231,000 seconds of run time. The estimated time to completion had been far too low ever since the task started: about ten hours, slowly increasing as the task ran. 100 or 120 hours would seem more likely on this machine, or on my other host where a couple of these are still running. Is this a case where changing <rsc_fpops_bound> in client_state.xml might help?
Thanks, Eric
Edit: Just got another hadam3p_pnw on the same machine, and its time to completion looks normal at 111 hrs.

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 42712 - Posted: 30 Jul 2011, 6:16:00 UTC Last modified: 30 Jul 2011, 6:28:27 UTC
Eric, the first batch of these tasks had run estimates of about 1/10 of reality. Your second PNW task is closer to the mark. Whether you tweak client_state.xml is up to you, depending on whether the erroneous value interferes with other BOINC projects. If it doesn't, the estimate will sort itself out as the task moves along.
Edit: I forgot, my bad. Please do make the tweak, or else the task is likely to turn belly up at about 80% completion.
Jim
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 42716 - Posted: 30 Jul 2011, 14:18:13 UTC Last modified: 30 Jul 2011, 14:20:44 UTC
Dear astroWX,
[Quote:] "I forgot, my bad. Please do make the tweak, else the task is likely to turn belly up about 80% completion."
Please explain the tweak mentioned above. I also have a hadam3p_pnw that is "waiting to start" with a "to completion" time of 18 hours. You say that it is likely to crash at about 80% unless it is modified. I think I have found the place in the client_state.xml file that needs to be modified:
<name>hadam3p_pnw_314p_1995_1_007369937</name>
<app_name>hadam3p_pnw</app_name>
<version_num>609</version_num>
<rsc_fpops_est>79683833333333.000000</rsc_fpops_est>
<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
<rsc_memory_bound>364000000.000000</rsc_memory_bound>
<rsc_disk_bound>2000000000.000000</rsc_disk_bound>
Is this right? Also, please explain which value I have to change and to what. I don't want to waste three and a half days of crunching on a defective WU only to have it crash when the problem is fixable.

Ingleside Joined: 5 Aug 04 Posts: 127 Credit: 24,498,085 RAC: 21,454	Message 42717 - Posted: 30 Jul 2011, 16:20:37 UTC - in response to Message 42716. Last modified: 30 Jul 2011, 16:22:36 UTC
[Quote:] "<rsc_fpops_bound> in client_state.xml <name>hadam3p_pnw_314p_1995_1_007369937</name> <app_name>hadam3p_pnw</app_name> <version_num>609</version_num> <rsc_fpops_est>79683833333333.000000</rsc_fpops_est> <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound> <rsc_memory_bound>364000000.000000</rsc_memory_bound> <rsc_disk_bound>2000000000.000000</rsc_disk_bound> Is this right? Also please explain which value I have to change and to what. I don't want to waste 3 and a half days of crunching on defective WU only to have it crash that is fixable."
You'll need to change <rsc_fpops_bound>. Make sure BOINC is stopped, open client_state.xml in Notepad (or something similar under Linux), add another digit to the <rsc_fpops_bound> value (before the decimal point), save the new client_state.xml and restart BOINC.
It doesn't matter if you also change <rsc_fpops_bound> for other tasks, so you can search and replace all occurrences of <rsc_fpops_bound> with <rsc_fpops_bound>9 or something similar (adding an extra 9 to all of them).
To avoid a very high duration correction factor, it's also a good idea to change <rsc_fpops_est> by adding an 8 or 9 at the start. This should only be done for the wrongly estimated task(s).

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 42718 - Posted: 30 Jul 2011, 17:07:41 UTC - in response to Message 42716.
Thanks for the info. Will try it tonight after I get home from work. Wish me luck (will make a backup first). Just to be sure that I understand, I should modify the entry from this:
<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
to look like this?
<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>9
Is this correct? Please respond.

Ingleside Joined: 5 Aug 04 Posts: 127 Credit: 24,498,085 RAC: 21,454	Message 42719 - Posted: 30 Jul 2011, 18:45:17 UTC - in response to Message 42718.
[Quote:] "Thanks for the info. Will try it tonight after I get home from work. Wish me luck (will make backup first). Just to be sure that I understand, I should modify the entry from this: <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound> To look like this? <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>9 Is this correct? Please respond."
No, it should be from:
<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
to:
<rsc_fpops_bound>9796838333333330.000000</rsc_fpops_bound>
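The manual search-and-replace described above can be sketched in Python. This is a minimal sketch, not an official BOINC tool: the function names `raise_fpops_bound` and `patch_client_state` are hypothetical, the regular expression assumes the tag is written exactly as in the snippets quoted in this thread, and the path to client_state.xml depends on your installation. As in the posts above, BOINC must be stopped before the file is touched.

```python
import re
import shutil
from pathlib import Path

# Matches the opening tag followed by the first digit of the value.
BOUND_TAG = re.compile(r"(<rsc_fpops_bound>)(\d)")


def raise_fpops_bound(xml_text: str, digit: str = "9") -> str:
    """Insert one extra digit in front of every <rsc_fpops_bound> value."""
    return BOUND_TAG.sub(lambda m: m.group(1) + digit + m.group(2), xml_text)


def patch_client_state(path: str, digit: str = "9") -> None:
    """Back up the file next to itself, then rewrite it with larger bounds.

    Assumption: BOINC is not running while this is called.
    """
    state = Path(path)
    shutil.copy2(state, state.with_name(state.name + ".bak"))
    state.write_text(raise_fpops_bound(state.read_text(), digit))
```

Only `<rsc_fpops_bound>` is changed, in every task, which matches the advice that touching the bound of other tasks does no harm. `<rsc_fpops_est>` is deliberately left alone because that value should only be changed for the wrongly estimated tasks.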

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 42720 - Posted: 30 Jul 2011, 18:58:20 UTC - in response to Message 42719.
Thanks for the quick reply. I would have had it backwards.

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 42733 - Posted: 2 Aug 2011, 6:45:39 UTC - in response to Message 42720.
This tweak worked OK for me. The newer WUs don't need it. Keep crunching.
Eric

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42735 - Posted: 2 Aug 2011, 11:26:33 UTC
News post about a new problem.

Darmok Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0	Message 42738 - Posted: 2 Aug 2011, 11:28:58 UTC - in response to Message 42733. Last modified: 2 Aug 2011, 11:45:03 UTC
Just a thought: wouldn't it be easier to send a new client_state file? Going by the post from Les in News & Announcements, I am experiencing all these problems with a slew of PNWs set at too low a completion time, and my first completions crashed at exactly 89:38:18. I do not want to abort the rest, as it took several hours of downloading to receive them and it would be a waste. As Les said, not everybody reads the boards, not everybody is computer savvy, and some have multiple hosts.

DJStarfox Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370	Message 42740 - Posted: 2 Aug 2011, 13:26:21 UTC - in response to Message 42738.
The BOINC infrastructure doesn't allow such changes. The client state file can also change every 5 seconds.

Darmok Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0	Message 42746 - Posted: 2 Aug 2011, 23:41:01 UTC - in response to Message 42740.
You are right. The file reverts after a restart.

Overtonesinger Joined: 30 Dec 05 Posts: 5 Credit: 986,440 RAC: 0	Message 42758 - Posted: 5 Aug 2011, 21:24:48 UTC
Huge disappointment
The BIG minus of having 8 logical CPUs is that NOT 1 or 2 but 8 WUs are DESTROYED at 68 percent complete (about 108 hours each) before I find out what's happening. Imagine how bad I feel. So much CPU time wasted. OK, just tell me here, please, when all new WUs are fixed, so their config is right. As a computer programmer I hate having to change it manually to fix someone else's mistake... I hate having to fear for and CARE about every work unit's LIFE. ;) The lives of my newborn twins are quite enough... Thanx.
P.S. Please try to create multithreaded (preferably 8-threaded) work units, because when one fails, only one WU fails, so I would find out about eight times sooner than after 8x 108 hours of CPU time. ;) I loved the huge AQUA work units: MT 8, and they lasted about 25 hours. Sadly, since some major crash that project has had only a few single-threaded TEST units lately (LIMITED to a maximum of two at a time per computer). So I switched to CPDN, hoping to put some of my 8 gigabytes of RAM to good use... what a PITY it crashed all 8 WUs. :( Please fix it. I love CPDN and I have been running it on several computers for 2 years now... Please fix it, as I have too much RAM and I cannot use it ALL. :)

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42759 - Posted: 5 Aug 2011, 21:39:10 UTC - in response to Message 42758. Last modified: 5 Aug 2011, 21:41:34 UTC
I found out about the problem within hours and notified the project people, who cancelled the remaining faulty WUs. All WUs currently running or being created are, as far as is known, fault free. CPDN isn't a set-and-forget project; keeping an eye on the message boards is needed if people don't want to get caught out by problems.
**************************
Multicore models were tested a couple of years back, but they were too unstable to even release to the beta testers. It's unlikely that the two project people will ever have the time to try again.

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 42767 - Posted: 11 Aug 2011, 4:15:47 UTC
Several days ago, I downloaded three Hadam3p WUs. The to-completion time was listed as 1548 hours! I know that these to-completion times are often significantly inflated, often 2 or 3 times the actual running time, but this is about 10 times what it will take to run the WUs. Are other people seeing this wild inflation, or does it have something to do with the tweak that I made to the client_state file to fix a problem with an earlier WU? That WU had a very low to-completion time of only 18 hours. The tweak is described above in this thread. Do I need to undo it?

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038	Message 42768 - Posted: 11 Aug 2011, 9:14:59 UTC
The HADCM3N time estimates were excessive, but the HADAM3P models haven't been reported as having systematic problems (other than 'fpops_bound'). I've never been quite clear how BOINC handles the transition between model types: it may be that if you've switched from mostly HADCM3N to HADAM3P then there would be some transient effect as BOINC adjusts, in which case the best thing to do is just wait. If others have similarly inflated HADAM3P estimates for the new models, then perhaps the HADCM3N values have been copied across ...

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 42769 - Posted: 11 Aug 2011, 10:12:24 UTC - in response to Message 42768.
I've just gone back to a HADAM3P and the time estimate seemed about right at the start, but I haven't done any manual editing of the client_state.xml file, so I can't comment on that.
Dave

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 42771 - Posted: 12 Aug 2011, 0:11:52 UTC
Here, one of my hosts got a new HADAM3P estimated at 1100+ hours, and the last few HADAM3Ps on this same host were estimated at 1500 at startup. Actual time is about 120 hrs. It all settles down after a while. The strange bit is that my other hosts never did this gross overestimate. Maybe it's something in client_state.xml, but I'm not going to worry about it. Too many BOINC options for me to sweat it. If it ain't broke --
Eric

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 42772 - Posted: 12 Aug 2011, 5:56:06 UTC
Dear Eric: I know what you mean. My other host just got a hadam3p_pnw WU and the initial to-completion time is only 265 hours. Hadam3p WUs on that machine take about 175 hours to complete. The only problem with these wildly inflated to-completion times is that it is hard to get new work. The BOINC Manager does not even ask for work when it thinks that you have a month and a half of crunching ahead of you, when in reality the WU will finish in only a few days. This was fine when there were 20,000 WUs waiting to be crunched, but these days, with the queue often empty (and the built-in back-off times), the machine may have to beg for days just to get something.

Ingleside Joined: 5 Aug 04 Posts: 127 Credit: 24,498,085 RAC: 21,454	Message 42775 - Posted: 12 Aug 2011, 11:40:46 UTC - in response to Message 42767. Last modified: 12 Aug 2011, 11:51:19 UTC
[Quote:] "Several days ago, I downloaded three Hadam3p WUs. The to completion time was listed as 1548 hours! I know that these to completion times are often significantly inflated, often 2 or 3 times the actual running time, but, this is about 10 times what it will take to run the WUs. Are other people are seeing this wild inflation or does it have something to do with the tweak that I made to the client_state file to fix a problem with an earlier WU? that WU had a very low to completion time of only 18 hours. That tweak is described above in this thread. Do I need to undo the tweak?"
If you finished one of the broken Hadam3p tasks by editing fpops_bound but didn't also increase fpops_est, your Duration Correction Factor (DCF for short) will have increased accordingly. So if, for example, the initial estimate for the broken model was 10 hours but the model actually took 100 hours to run, your DCF became 10 times what it was before. This new DCF influences all future estimates, so a new Hadam3p with a "correct" fpops_est will show 1000 hours instead of 100 hours.
The DCF will slowly decrease again as you finish tasks. If I remember correctly, it decreases by at most 10% per task, unless the client thinks the difference between the current DCF and the lower one is too large, in which case it decreases by only 1% per task... In any case, it should slowly decrease again.
The DCF, and therefore the estimates, will never be very good here at CPDN. For one thing, the HADCM3N estimate is too high, so after a string of these the DCF will drop to 0.5 or so, but a single Hadam3p will push the DCF back up to 1 again.
Also, for some of the models the speed depends significantly on what else the computer is running. If you run several instances of the same model they can slow each other down, so even this will give some variation between runs.
Edit: You can see your current DCF in BOINC Manager, as long as you're running v6.6.xx or later, by selecting the Projects tab, selecting a project, and hitting "Properties". DCF is the last item listed. The DCF is also displayed on the web page: if you look at the details of one of your own computers, you'll see it there.
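The arithmetic in the post above can be written out as a small Python sketch. It is a simplified model of what the post describes, not the BOINC client's actual code: the function names are hypothetical, the DCF is assumed to scale by actual/estimated run time after an overrun, and the recovery is modelled as a flat 10% (or 1%) step per finished task, as recalled in the post.

```python
def displayed_estimate(base_hours: float, dcf: float) -> float:
    """Estimate shown to the user: the uncorrected estimate times the DCF."""
    return base_hours * dcf


def dcf_after_overrun(dcf: float, estimated_hours: float, actual_hours: float) -> float:
    """Assumed behaviour: a task that overruns scales the DCF by actual/estimated."""
    return dcf * actual_hours / estimated_hours


def dcf_after_finished_tasks(dcf: float, target: float, tasks: int, step: float = 0.10) -> float:
    """Assumed recovery: each correctly estimated task lowers the DCF by `step`,
    never going below the value the host would settle at (`target`)."""
    for _ in range(tasks):
        dcf = max(target, dcf * (1.0 - step))
    return dcf


# Broken task: estimated at 10 hours, actually ran for 100 hours.
inflated = dcf_after_overrun(1.0, estimated_hours=10.0, actual_hours=100.0)

# A new, correctly specified 100-hour task is now shown as 1000 hours.
shown = displayed_estimate(100.0, inflated)
```

With the slower 1% step (`step=0.01`) the same model needs far more finished tasks to come back down, which is consistent with the inflated estimates reported above lingering on hosts that completed a broken task.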