Message boards : Number crunching : HADAM3P - Maximum elapsed time exceeded
Author | Message |
---|---|
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
This task, hadam3p_pnw_32s5_1985_1_007369346_0, failed with "Maximum elapsed time exceeded" after just over 231,000 seconds of run time. The estimated time to completion had been far too low ever since the task started -- about ten hours, slowly increasing as the task ran. 100 or 120 hours would seem more likely on this machine, or on the other host where a couple of these are still running. Is this a case where changing <rsc_fpops_bound> in client_state.xml might help? Thanks, Eric. Edit: Just got another hadam3p_pnw on the same machine, and its time to completion looks normal at 111 hrs. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Eric, The first batch of these tasks had run estimates about 1/10 of reality. Your second PNW task is closer to the mark. Whether you tweak client_state is up to you, depending on whether the erroneous value interferes with other BOINC projects. If it doesn't, the estimate will sort itself out as the task moves along. Edit: I forgot, my bad. Please do make the tweak, else the task is likely to turn belly up about 80% completion. Jim "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Dear astroWX [Quote:] I forgot, my bad. Please do make the tweak, else the task is likely to turn belly up about 80% completion. Please explain the tweak mentioned above. I also have a hadam3p_pnw that is "waiting to start" with a "to completion" time of 18 hours. You say that it is likely to crash at about 80% unless it is modified. I think I have found the place in the client_state.xml file that needs to be modified. <rsc_fpops_bound> in client_state.xml <name>hadam3p_pnw_314p_1995_1_007369937</name> <app_name>hadam3p_pnw</app_name> <version_num>609</version_num> <rsc_fpops_est>79683833333333.000000</rsc_fpops_est> <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound> <rsc_memory_bound>364000000.000000</rsc_memory_bound> <rsc_disk_bound>2000000000.000000</rsc_disk_bound> Is this right? Also, please explain which value I have to change and to what. I don't want to waste three and a half days of crunching on a defective but fixable WU only to have it crash. |
Send message Joined: 5 Aug 04 Posts: 127 Credit: 24,517,986 RAC: 17,587 |
<rsc_fpops_bound> in client_state.xml You'll need to change <rsc_fpops_bound>. Just make sure BOINC is stopped, open client_state.xml in Notepad (or something similar under Linux), add another digit to the value of <rsc_fpops_bound> (before the decimal point, in front of the existing digits), save the new client_state.xml and restart BOINC. It doesn't matter if you also change <rsc_fpops_bound> for other tasks, so you can search and replace all occurrences of <rsc_fpops_bound> with <rsc_fpops_bound>9 or similar (adding an extra 9 to all). To avoid a very high duration correction factor, it's also a good idea to change <rsc_fpops_est> by adding an 8 or 9 at the start. This should only be done for the wrongly estimated task(s). |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Thanks for the info. Will try it tonight after I get home from work. Wish me luck (will make backup first). Just to be sure that I understand, I should modify the entry from this: <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound> To look like this? <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>9 Is this correct? Please respond. |
Send message Joined: 5 Aug 04 Posts: 127 Credit: 24,517,986 RAC: 17,587 |
Thanks for the info. Will try it tonight after I get home from work. Wish me luck (will make backup first). No, it should be from: <rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound> to <rsc_fpops_bound>9796838333333330.000000</rsc_fpops_bound> |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Thanks for the quick reply. I would have had it backwards. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
This tweak worked OK for me. The newer WUs don't need it. Keep crunching, Eric |
Send message Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0 |
Just a thought: Wouldn't it be easier to send a new client_state file? Going by the post from Les in News & Announcements, I am experiencing all these problems with a slew of pnw's set at too low a completion time, and my first ones crashed at exactly 89:38:18. I do not want to abort them, as it took several hours of download to receive them and it would be a waste. As Les said, not everybody reads the boards, not everybody is computer savvy, and some have multiple hosts. |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
The BOINC infrastructure doesn't allow such changes. The client state file can also change every 5 seconds. |
Send message Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0 |
You are right. The file reverts back after a restart. |
Send message Joined: 30 Dec 05 Posts: 5 Credit: 986,440 RAC: 0 |
Huge disappointment. The BIG minus of having 8 logical CPUs is: NOT 1 or 2 ... but 8 WUs are DESTROYED at 68 percent complete (about 108 hours each) before I find out what's happening. Imagine how bad I feel. So much CPU time wasted. OK, just tell me here, please, when all new WUs are fixed, so their config is right. As a computer programmer I hate to change it manually to fix someone else's mistake... I hate having to fear and CARE about every work unit's LIFE. ;) The lives of my newborn twins (kids) are just enough... Thanx. *P.S.* Please try to create multithreaded (preferably 8-threaded) work units, because when one fails, only one WU fails, so I would find out much sooner than after 8x 108 hours of CPU time: about eight times sooner. ;) I loved the huge AQUA work units, MT 8, and they lasted about 25 hours. Sadly, that project has had only a few 1-threaded TEST units lately (with a LIMIT of two at the same time per computer) after some major crash. So I switched to CPDN hoping to put some 1 of 8 gigabytes of my RAM to some good use... what a PITY it crashed all 8 WUs. :( Please, fix it. I love CPDN and I have been running it on several computers for 2 years now... Please, fix it, as I have *too much* RAM and I cannot use it ALL. :) |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I found out about the problem within hours, and notified the project people, who cancelled the remaining faulty WUs. All WUs currently running or being created are, as far as is known, fault free. CPDN isn't a set-and-forget project; keeping an eye on the message boards is needed if people don't want to get caught out by problems. ************************** Multicore models were tested a couple of years back, but they were too unstable to even release to the beta testers. It's unlikely that the two project people will ever have the time to try again. Backups: Here |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Several days ago, I downloaded three Hadam3p WUs. The to-completion time was listed as 1548 hours! I know that these to-completion times are often significantly inflated, often 2 or 3 times the actual running time, but this is about 10 times what it will take to run the WUs. Are other people seeing this wild inflation, or does it have something to do with the tweak that I made to the client_state file to fix a problem with an earlier WU? That WU had a very low to-completion time of only 18 hours. The tweak is described above in this thread. Do I need to undo the tweak? |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,842,730 RAC: 5,006 |
The HADCM3N time estimates were excessive, but the HADAM3P models haven't been reported as having systematic problems (other than 'fpops_bound'). I've never been quite clear how BOINC handles the transition between model types: it may be that if you've switched from mostly HADCM3N to HADAM3P then there would be some transient effect as BOINC adjusts - in which case the best thing to do is just wait. If others have similarly inflated HADAM3P estimates for the new models then perhaps the HADCM3N values have been copied across ... |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Just gone back to a HADAM3P and time estimate seemed about right at the start but then I haven't done any manual editing of the client_state.xml file so can't comment on that. Dave |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Here -- one of my hosts got a new HADAM3P estimated at 1100+ hours, and the last few HADAM3P on this same host were estimated at 1500 at startup. Actual time is about 120 hrs. It all settles down after a while. The strange bit is that my other hosts never did this gross overestimate. Maybe it's something in client_state.xml, but I'm not going to worry about it. Too many BOINC options for me to sweat it. If it ain't broke -- Eric |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Dear Eric: I know what you mean. My other host just got a hadam3p_pnw WU and the initial to-completion time is only 265 hours. Hadam3p WUs on that machine take about 175 hours to complete. The only problem with these wildly inflated to-completion times is that it is hard to get new work. The BOINC Manager does not even ask for work when it thinks that you have a month and a half of crunching ahead of you when, in reality, the WU will finish in only a few days. This was fine when there were 20,000 WUs waiting to be crunched, but these days, with the queue often empty (and the built-in back-off times), the machine may have to beg for days just to get something. |
Send message Joined: 5 Aug 04 Posts: 127 Credit: 24,517,986 RAC: 17,587 |
Several days ago, I downloaded three Hadam3p WUs. The to-completion time was listed as 1548 hours! I know that these to-completion times are often significantly inflated, often 2 or 3 times the actual running time, but this is about 10 times what it will take to run the WUs. If you finished one of the broken Hadam3p tasks by editing fpops_bound but didn't also increase fpops_est, your Duration Correction Factor (DCF for short) will increase accordingly. So if, for example, the initial estimate for the broken model was 10 hours but the model actually took 100 hours to run, your DCF became about 10x what it was before. This new DCF will influence all future estimates, so a new Hadam3p with a "correct" fpops_est will show 1000 hours instead of 100 hours. The DCF will slowly decrease again as you finish tasks. If I remember correctly, it decreases by at most 10% for each task, unless the client thinks the difference between the current DCF and the lower one is too large, in which case it decreases by only 1% per task... But in any case, it should slowly decrease again. The DCF, and therefore the estimates, will never be very good here at CPDN. For one thing, the HADCM3N estimate is too high, so after a string of these the DCF will become 0.5 or something, but a single Hadam3p will increase the DCF back to 1 again. Also, for some of the models the speed depends significantly on what else the computer is running; if you run several of the same model they can slow each other down, so this too will give some variation between runs. Edit: You can see your current DCF in BOINC Manager, as long as you're running v6.6.xx or later, by selecting the Projects tab, selecting a project, and hitting "Properties". DCF is the last item listed. The DCF is also displayed on the web page: if you look at the details of one of your own computers, you'll see the DCF. |
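
The two bits of arithmetic discussed in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`prepend_digit`, `dcf_after_underestimate`, `shown_estimate_hours`) are made up for this example and are not part of BOINC, and the DCF model is the simplified one from the post above (the DCF scales by actual run time divided by estimated run time), not the client's real update rule. As noted earlier in the thread, stop BOINC and back up client_state.xml before editing the real file.

```python
def prepend_digit(xml_text, tag, digit="9"):
    """Put `digit` in front of the number inside every <tag>...</tag>.

    Mirrors the search-and-replace from the thread: each "<tag>" becomes
    "<tag>9", which makes the value more than ten times as large.
    Hypothetical helper, not a BOINC function.
    """
    return xml_text.replace(f"<{tag}>", f"<{tag}>{digit}")


def dcf_after_underestimate(old_dcf, estimated_hours, actual_hours):
    """Simplified model from the thread (an assumption, not BOINC's code):
    a task that runs N times longer than estimated raises the DCF N-fold."""
    return old_dcf * (actual_hours / estimated_hours)


def shown_estimate_hours(base_estimate_hours, dcf):
    """Estimate shown to the user: the task's own estimate scaled by the DCF."""
    return base_estimate_hours * dcf


# The <rsc_fpops_bound> line quoted in the thread, before and after the tweak.
line = "<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>"
print(prepend_digit(line, "rsc_fpops_bound"))

# Broken task: estimated at 10 hours, actually ran for 100 hours.
new_dcf = dcf_after_underestimate(old_dcf=1.0, estimated_hours=10, actual_hours=100)
print(new_dcf)

# A later task whose own estimate is a correct 100 hours.
print(shown_estimate_hours(100, new_dcf))
```

With the numbers from the post (10 hours estimated, 100 hours actual), the DCF goes from 1 to 10, and a later task with a correct 100-hour estimate is displayed as 1000 hours. Applying `prepend_digit` to `rsc_fpops_est` for the mis-estimated task as well is what keeps the DCF from jumping in the first place.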