climateprediction.net (CPDN) home page
Thread 'HADAM3P - Maximum elapsed time exceeded'

Message boards : Number crunching : HADAM3P - Maximum elapsed time exceeded

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 42711 - Posted: 29 Jul 2011, 20:08:16 UTC
Last modified: 29 Jul 2011, 20:25:46 UTC

This hadam3p_pnw_32s5_1985_1_007369346_0 failed with "Maximum elapsed time exceeded" after just over 231,000 seconds of run time.
The estimated time to completion had been far too low ever since the task started: about ten hours, slowly increasing as the task ran. 100 or 120 hours would seem more likely on this machine, or on the other host where a couple of these are still running.

Is this a case where changing <rsc_fpops_bound> in client_state.xml might help?

Thanks

Eric

<<edit>>

Just got another hadam3p_pnw on the same machine, and time-to-completion looks normal at 111 hrs.
ID: 42711
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 42712 - Posted: 30 Jul 2011, 6:16:00 UTC
Last modified: 30 Jul 2011, 6:28:27 UTC

Eric,

The first batch of these tasks had run estimates about 1/10 of reality. Your second PNW task is closer to the mark.

Whether you tweak client_state is up to you, depending on whether the erroneous value interferes with other BOINC projects. If it doesn't, it will sort itself out as the task moves along.

Edit:
I forgot, my bad. Please do make the tweak, else the task is likely to turn belly up about 80% completion.

Jim
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 42712
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42716 - Posted: 30 Jul 2011, 14:18:13 UTC
Last modified: 30 Jul 2011, 14:20:44 UTC

Dear astroWX

[Quote:] I forgot, my bad. Please do make the tweak, else the task is likely to turn belly up about 80% completion.

Please explain the tweak mentioned above. I also have a hadam3p_pnw that is "waiting to start" with a "to completion" time of 18 hours. You say that it is likely to crash at about 80% unless it is modified.

I think I have found the place in the client_state.xml file that needs to be modified.

<rsc_fpops_bound> in client_state.xml

<name>hadam3p_pnw_314p_1995_1_007369937</name>
<app_name>hadam3p_pnw</app_name>
<version_num>609</version_num>
<rsc_fpops_est>79683833333333.000000</rsc_fpops_est>
<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
<rsc_memory_bound>364000000.000000</rsc_memory_bound>
<rsc_disk_bound>2000000000.000000</rsc_disk_bound>

Is this right? Also, please explain which value I have to change and to what.

I don't want to waste three and a half days of crunching on a defective WU, only to have it crash over something that is fixable.
ID: 42716
Ingleside
Joined: 5 Aug 04
Posts: 127
Credit: 24,498,085
RAC: 21,454
Message 42717 - Posted: 30 Jul 2011, 16:20:37 UTC - in response to Message 42716.  
Last modified: 30 Jul 2011, 16:22:36 UTC

<rsc_fpops_bound> in client_state.xml

<name>hadam3p_pnw_314p_1995_1_007369937</name>
<app_name>hadam3p_pnw</app_name>
<version_num>609</version_num>
<rsc_fpops_est>79683833333333.000000</rsc_fpops_est>
<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
<rsc_memory_bound>364000000.000000</rsc_memory_bound>
<rsc_disk_bound>2000000000.000000</rsc_disk_bound>

Is this right? Also, please explain which value I have to change and to what.

I don't want to waste three and a half days of crunching on a defective WU, only to have it crash over something that is fixable.

You'll need to change <rsc_fpops_bound>.

Just make sure BOINC is stopped, open client_state.xml in Notepad (or something similar under Linux), add another digit to the front of the <rsc_fpops_bound> value (before the decimal point), save the new client_state.xml and restart BOINC.

It doesn't matter if you also change <rsc_fpops_bound> for other tasks, so it's possible to search and replace all occurrences of <rsc_fpops_bound> with <rsc_fpops_bound>9 or something similar (adding an extra 9 to all).

To avoid ending up with a very high duration correction factor, it's also a good idea to change <rsc_fpops_est> by adding an 8 or 9 at the start. This should only be done for the wrongly estimated task(s).
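
For anyone who would rather script the edit than do it by hand, here is a minimal sketch in Python of the search-and-replace described above. It works on text only; the function name raise_fpops_bound is made up for illustration and is not part of BOINC. Apply it to the contents of client_state.xml only with BOINC stopped and a backup made first.

```python
import re

# Matches the opening tag plus the integer part of every <rsc_fpops_bound> value.
BOUND_RE = re.compile(r"(<rsc_fpops_bound>)(\d+)")

def raise_fpops_bound(xml_text: str, digit: str = "9") -> str:
    """Prepend one digit to every <rsc_fpops_bound> value in the given text.

    Hypothetical helper for illustration; <rsc_fpops_est> is left untouched.
    """
    return BOUND_RE.sub(lambda m: m.group(1) + digit + m.group(2), xml_text)

sample = (
    "<rsc_fpops_est>79683833333333.000000</rsc_fpops_est>\n"
    "<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>\n"
)
patched = raise_fpops_bound(sample)
print(patched)
```

The digit goes in front of the existing number, inside the tag, so the value becomes roughly twelve times larger; nothing is added after the closing tag.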
ID: 42717
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42718 - Posted: 30 Jul 2011, 17:07:41 UTC - in response to Message 42716.  

Thanks for the info. Will try it tonight after I get home from work. Wish me luck (will make backup first).

Just to be sure that I understand, I should modify the entry from this:

<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>

To look like this?

<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>9

Is this correct? Please respond.

ID: 42718
Ingleside
Joined: 5 Aug 04
Posts: 127
Credit: 24,498,085
RAC: 21,454
Message 42719 - Posted: 30 Jul 2011, 18:45:17 UTC - in response to Message 42718.  

Thanks for the info. Will try it tonight after I get home from work. Wish me luck (will make backup first).

Just to be sure that I understand, I should modify the entry from this:

<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>

To look like this?

<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>9

Is this correct? Please respond.

No, it should be from:
<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
to
<rsc_fpops_bound>9796838333333330.000000</rsc_fpops_bound>
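
To see what the extra digit buys, here is a rough sketch. It assumes the client gives up on a task once its elapsed time exceeds about rsc_fpops_bound divided by the speed it credits the host with; the 3.45 GFLOPS figure is a hypothetical value chosen only for illustration, not one reported in this thread.

```python
OLD_BOUND = 796838333333330.0    # original <rsc_fpops_bound>
NEW_BOUND = 9796838333333330.0   # after putting a 9 in front
ASSUMED_FLOPS = 3.45e9           # hypothetical host speed (assumption)

def limit_hours(fpops_bound: float, flops: float) -> float:
    """Approximate elapsed-time limit in hours under the assumption above."""
    return fpops_bound / flops / 3600.0

old_limit = limit_hours(OLD_BOUND, ASSUMED_FLOPS)
new_limit = limit_hours(NEW_BOUND, ASSUMED_FLOPS)
ratio = NEW_BOUND / OLD_BOUND

print(f"old limit: {old_limit:.1f} h")
print(f"new limit: {new_limit:.1f} h")
print(f"bound raised by a factor of {ratio:.1f}")
```

On a host like that, the old bound would run out well short of a 100 to 120 hour model, while the new one leaves plenty of headroom.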


ID: 42719
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42720 - Posted: 30 Jul 2011, 18:58:20 UTC - in response to Message 42719.  

Thanks for the quick reply. I would have had it backwards.
ID: 42720
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 42733 - Posted: 2 Aug 2011, 6:45:39 UTC - in response to Message 42720.  

This tweak worked ok for me.
The newer WUs don't need it.

Keep crunching

Eric
ID: 42733
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42735 - Posted: 2 Aug 2011, 11:26:33 UTC

News post about a new problem.


Backups: Here
ID: 42735
Darmok
Joined: 29 Dec 09
Posts: 34
Credit: 18,395,130
RAC: 0
Message 42738 - Posted: 2 Aug 2011, 11:28:58 UTC - in response to Message 42733.  
Last modified: 2 Aug 2011, 11:45:03 UTC

Just a thought: wouldn't it be easier to send a new client_state file? As in the post from Les in News & Announcements, I am experiencing all these problems with a slew of pnw's set at a too-low completion time, and my first completions crashed at exactly 89:38:18. I do not want to abort them, as it took several hours of download to receive them and it would be a waste.

As Les said, not everybody reads the boards, not everybody is computer savvy and some have multiple hosts.
ID: 42738
DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 42740 - Posted: 2 Aug 2011, 13:26:21 UTC - in response to Message 42738.  

The BOINC infrastructure doesn't allow such changes. The client state file can also change every 5 seconds.
ID: 42740
Darmok
Joined: 29 Dec 09
Posts: 34
Credit: 18,395,130
RAC: 0
Message 42746 - Posted: 2 Aug 2011, 23:41:01 UTC - in response to Message 42740.  

You are right. The file reverts back after a restart.
ID: 42746
Overtonesinger
Joined: 30 Dec 05
Posts: 5
Credit: 986,440
RAC: 0
Message 42758 - Posted: 5 Aug 2011, 21:24:48 UTC

Huge disappointment

The BIG minus of having 8 logical CPUs is: NOT 1 or 2 ... but 8 WUs are DESTROYED at 68 percent complete (about 108 hours each) before I find out what's happening. Imagine how bad I feel. So much CPU time wasted.

OK, just tell me here, please, when all new WUs are fixed, so their config is right. As a computer programmer I hate to change it manually to fix someone else's mistake... I hate to fear for, and have to CARE about, every work unit's LIFE. ;) The lives of my newborn twins (kids) are just enough... Thanx.




*P.S.* Please try to create multithreaded (preferably 8-threaded) work units, because when one fails, only one WU fails, so I would find out much sooner than after 8x 108 hours of CPU time... about eight times sooner. ;)

I loved the huge AQUA work units: MT 8, and they lasted about 25 hours. Sadly, since some major crash that project lately has only a few single-threaded TEST units (LIMITED to a maximum of two at a time per computer). So I switched to CPDN hoping to put some of my 8 gigabytes of RAM to good use... what a PITY it crashed all 8 WUs. :(
Please, fix it. I love CPDN and I have been running it on several computers for 2 years now... Please, fix it, as I have *too much* RAM and I cannot use it ALL. :)
ID: 42758
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42759 - Posted: 5 Aug 2011, 21:39:10 UTC - in response to Message 42758.  
Last modified: 5 Aug 2011, 21:41:34 UTC

I found out about the problem within hours, and notified the project people, who cancelled the remaining faulty WUs.
All WU's currently running/being created are, as far as is known, fault free.

CPDN isn't a set-and-forget project; keeping an eye on the message boards is needed if people don't want to get caught out by problems.


**************************

Multicore models were tested a couple of years back, but they were too unstable to even release to the beta testers. It's unlikely that the two project people will ever have the time to try again.
Backups: Here
ID: 42759
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42767 - Posted: 11 Aug 2011, 4:15:47 UTC

Several days ago, I downloaded three Hadam3p WUs. The to-completion time was listed as 1548 hours! I know that these to-completion times are often significantly inflated, often 2 or 3 times the actual running time, but this is about 10 times what it will take to run the WUs.

Are other people seeing this wild inflation, or does it have something to do with the tweak that I made to the client_state file to fix a problem with an earlier WU? That WU had a very low to-completion time of only 18 hours. That tweak is described above in this thread. Do I need to undo the tweak?


ID: 42767
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,827,799
RAC: 5,038
Message 42768 - Posted: 11 Aug 2011, 9:14:59 UTC

The HADCM3N time estimates were excessive, but the HADAM3P models haven't been reported as having systematic problems (other than 'fpops_bound'). I've never been quite clear how BOINC handles the transition between model types: it may be that if you've switched from mostly HADCM3N to HADAM3P then there would be some transient effect as BOINC adjusts - in which case the best thing to do is just wait.

If others have similarly inflated HADAM3P estimates for the new models then perhaps the HADCM3N values have been copied across ...
ID: 42768
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 42769 - Posted: 11 Aug 2011, 10:12:24 UTC - in response to Message 42768.  

Just gone back to a HADAM3P and time estimate seemed about right at the start but then I haven't done any manual editing of the client_state.xml file so can't comment on that.

Dave
ID: 42769
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 42771 - Posted: 12 Aug 2011, 0:11:52 UTC

Here -- one of my hosts got a new HADAM3P est at 1100+ hours, and the last few HADAM3P on this same host at startup estimated at 1500. Actual time is about 120 hrs. It all settles down after a while.
The strange bit is, that my other hosts never did this gross overestimate.
Maybe it's something in client_state.xml but I'm not going to worry it.
Too many BOINC options for me to sweat it.
If it aint broke --

Eric
ID: 42771
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42772 - Posted: 12 Aug 2011, 5:56:06 UTC

Dear Eric:

I know what you mean. My other host just got a hadam3p_pnw WU and the initial to-completion time is only 265 hours. Hadam3p WUs on that machine take about 175 hours to complete.

The only problem with these wildly inflated to-completion times is that it is hard to get new work. The BOINC manager does not even ask for work when it thinks that you have a month and a half of crunching ahead of you when, in reality, the WU will finish in only a few days. This was fine when there were 20,000 WUs waiting to be crunched, but these days, with the queue often empty (and the built-in back-off times), the machine may have to beg for days just to get something.

ID: 42772
Ingleside
Joined: 5 Aug 04
Posts: 127
Credit: 24,498,085
RAC: 21,454
Message 42775 - Posted: 12 Aug 2011, 11:40:46 UTC - in response to Message 42767.  
Last modified: 12 Aug 2011, 11:51:19 UTC

Several days ago, I downloaded three Hadam3p WUs. The to-completion time was listed as 1548 hours! I know that these to-completion times are often significantly inflated, often 2 or 3 times the actual running time, but this is about 10 times what it will take to run the WUs.

Are other people seeing this wild inflation, or does it have something to do with the tweak that I made to the client_state file to fix a problem with an earlier WU? That WU had a very low to-completion time of only 18 hours. That tweak is described above in this thread. Do I need to undo the tweak?

If you finished one of the broken Hadam3p tasks by editing fpops_bound but didn't also increase fpops_est, your Duration Correction Factor (DCF for short) will have increased accordingly. So if, for example, the initial estimate for the broken model was 10 hours but the model actually took 100 hours to run, your DCF became 10x what it was before.

This new DCF will influence all future estimates, so a new Hadam3p with a "correct" fpops_est will show 1000 hours instead of 100 hours.
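
The arithmetic can be written out as a small sketch. It uses a deliberately simplified model in which the displayed estimate is just the raw estimate multiplied by the DCF, and the DCF jumps straight to the ratio of actual to estimated run time; the function and variable names are invented for illustration and are not BOINC's own.

```python
def displayed_estimate_hours(raw_estimate_hours: float, dcf: float) -> float:
    """Estimate shown to the user: the raw estimate scaled by the DCF (simplified)."""
    return raw_estimate_hours * dcf

dcf_before = 1.0
broken_raw_hours = 10.0      # what the faulty fpops_est implied
broken_actual_hours = 100.0  # what the model really took

# Simplified assumption: the DCF rises by the ratio actual / estimated.
dcf_after = dcf_before * (broken_actual_hours / broken_raw_hours)

# A later task whose raw estimate is a correct 100 hours now looks ten times longer.
next_task_hours = displayed_estimate_hours(100.0, dcf_after)
print(dcf_after, next_task_hours)
```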

The DCF will slowly decrease again as you finish tasks. If I remember correctly, it decreases by at most 10% per task, unless the client thinks the difference between the current DCF and the lower one is too large, in which case it decreases by only 1% per task. In any case, it should slowly decrease again.
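
As a rough feel for how slow that is, here is a sketch that counts finished tasks until an inflated DCF is back at 1. It assumes a plain multiplicative decay per task at the two rates recalled above, which may not match what the client really does; tasks_until is a hypothetical helper.

```python
def tasks_until(dcf: float, target: float, rate: float) -> int:
    """Count finished tasks until dcf has decayed to target or below.

    Assumes the DCF shrinks by a fixed fraction `rate` per finished task.
    """
    count = 0
    while dcf > target:
        dcf *= 1.0 - rate
        count += 1
    return count

fast = tasks_until(10.0, 1.0, 0.10)  # 10% per task
slow = tasks_until(10.0, 1.0, 0.01)  # 1% per task
print(fast, slow)
```

Under these assumptions the 10% rate needs a couple of dozen tasks and the 1% rate a couple of hundred, which at several days per model is a long wait either way.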


The DCF, and therefore the estimates, will never be very good here at CPDN. For one thing, the HADCM3N estimate is too high, so after a string of these the DCF will become 0.5 or so, but a single Hadam3p will push the DCF back up to 1 again. Also, for some of the models the speed depends significantly on what else the computer is running; if you run several of the same model they can slow each other down, so even this will give some variation between runs.


Edit:

You can see your current DCF in BOINC Manager, as long as you're running v6.6.xx or later, by selecting the Projects tab, selecting a project, and hitting "Properties". DCF is the last item listed.

The DCF is also displayed on the web page: if you look at the details of one of your own computers, you'll see the DCF.
ID: 42775