BOINC mis-estimates runtime for hadam3pm2_k00w
Author | Message |
---|---|
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
There has been a smallish batch of these (alias MOSES) that came out in the last few days. What BOINC reports about estimated completion time is obviously totally wrong. Either one day of full-time computing is 1%, meaning that completing the wu would take 100 days, or, if there are only 10 uploads for these old MOSES things, it would only be 10-20 days. The problem is that I got flooded with these wu's that, based on the initial estimate of a day or 3, now look like they might run for half a year. If they are like the long-ago MOSES models, they probably run for 10 uploads, which would give an estimate of only a week or two. What to think? I've choked down my acceptance factor from 6 days to 1, because I've no clue how long these things will run. Any ideas? |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
You can probably work it out by now based on percentage completed. :) |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Yes, bad estimate. This is still set for either 1 or 2 years, whatever was used in testing. So the total time is either 5 or 10 times the figure used. Got one on each of my machines. |
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Naah, the percent-completed in boincmgr is totally out of range. The 3 uploads in 3 days, yeah, if there are 10 uploads to completion, that gives a rough estimate. So what I figure is maybe 300-500 hours on my fastest machine, give or take a week or so. So that's okay. It seems that in reality, but NOT in the BOINC estimate, these wu's will complete, we hope, in a reasonable week or two on recent machines. Very good. So, anybody looking at this thread: please don't worry about the crazy BOINC percents and completion times. Both are wrong; expect a week or three. Don't Panic. And don't go killing these MOSES or hadam3pm2 wu's. There were problems with them a few months ago, but this batch seems OK, and we really need to test these models again. Keep on crunching e |
Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Seems the same as before, can't handle stop-restarts, so I aborted it: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=17516381 |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The one running solo on my Ivy Bridge has just created zip 8 at 96 hours. This means about 144 hours for the full run, if it goes that far. The one on the Haswell is a bit faster, so a few less hours. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Both have now finished, with a total of 10 zips. The last one is a bit over 95 Megs. The Ivy Bridge took 120 hours 37 minutes. The Haswell took 113 hours. |
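The zip counts reported in the posts above suggest a simple way to extrapolate total run time. The following is only a sketch, assuming the zips are produced at a roughly constant pace and that a full run produces 10 of them; the function name is made up for illustration and is not part of BOINC or the CPDN applications.

```python
def estimate_total_from_zips(elapsed_hours: float, zips_done: int, total_zips: int = 10) -> float:
    """Extrapolate total run time from the number of zip uploads so far.

    Assumes a constant time per zip; total_zips=10 is taken from the
    finished tasks reported in the thread, not from any official setting.
    """
    if zips_done <= 0:
        raise ValueError("need at least one completed zip")
    hours_per_zip = elapsed_hours / zips_done
    return hours_per_zip * total_zips


# Zip 8 created at 96 hours, extrapolated to a 10-zip run.
print(estimate_total_from_zips(96, 8))
# The same formula with 12 zips reproduces the earlier 144-hour guess.
print(estimate_total_from_zips(96, 8, total_zips=12))
```

With 10 zips the extrapolation gives 120 hours, which is close to the 120 hours 37 minutes the Ivy Bridge actually took.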
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I got overloaded with the MOSES global II things, mostly by mishap of bad timing combined with the initial misunderestimation of probable runtime. I've suspended all others. The fastest I get, running 5 tasks on 4 real cores + 1 HT on Ivy Bridge, is 180 hours. The slowest is 340+ hours on the oldest AMD (no HT). I think the difficult part of running these things is just the "never interrupt after starting" thing. The mis-estimation of time to end is mostly fixed, with a few bizarre remaining glitches: why does the next-to-last (but not the last) model think it needs 1700 hours to complete? Anyhow, the underestimation seems to be over, so I'll work through the couple dozen of this batch remaining here OK, probably. Thanks all. |
Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
I know (or think I know) that the MOSES models run for 10 years. I am referring to the hadam3pm2 (LINUX) models only. Does anyone know what the timestep is? I know ANZ is 5 minutes, and the PNW tasks don't show the time step on the graphic display. If I knew how many timesteps there are in 10 years, I could look at my trickles and estimate how long the task might run. |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I haven't had any MOSES models off this, the main site, yet, only from the beta site, where some of the estimates have been wildly out. Don't know if it helps, but on my i3 CPU on this box, the one I have completed, an East Asia model, took 779,535.90 s of CPU time. A second East Asia model is almost 99% complete at 347 hours, with about 4.5 hours to go. An Afr MOSES task is showing as 13.2% complete after 113 hours, with 194.5 hours showing as remaining time. The eas task checkpoints about once an hour, the afr about every 3 hours 20 minutes. I am going to wait till the afr completes before drawing any conclusions. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Mine too are mostly on the beta site, but there's this one from earlier in the year. Click on the trickles page for the full list. Keep in mind that the "MOSES" models are changing a bit over time, at least on beta. The ones I'm testing at the moment are MOSES II + Triffid. I don't know which ones will be used on the main site. |
Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Les - Thanks for the info. Assuming the current models are the same as the beta, the link you gave me shows about 309,000 time steps. My task that I am looking at has gone about 31,000 time steps. So, maybe it is about 10% done. Not the 0.832% done as BOINC shows. |
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I think that if you multiply the "progress" shown in BoincManager by 12 and apply that to the "elapsed", you will get a really good estimate of the actual progress. In other words, to get a good estimate of total run time (CPU), figure (elapsed)/((progress)*(12)), with elapsed and progress as shown in BoincMgr 7.42. For example, with elapsed = 127 hours and progress = 5.792%: 127/((0.05792)*(12)) = 183 (example from one of my running tasks). This has been true for all the few dozen of these hadam3pm2 global MOSES from the main site that my machines have completed, and the ones running now. Why, I dunno. This estimate is much, much better than whatever insane underestimate caused my boxes to load up so many of these models when they first came out, and better than some later ones, where the corrections made by Oxford staff worked but, combined with BOINC trying to correct for the earlier underestimate, sent some of BOINC's estimates up into the 1600 hour range :) I'm thinking the outlandish 4 or 5 values of <duration_correction_factor> in client_state.xml had something to do with that. Anyhow, like I posted elsewhere a while ago, and saw on some announcements thread, these models are priority, not quite ready for prime time (Windows hosts), and need all the help we can give. They are also fragile, in that they will lose an upload file any time they are suspended by the user or stopped and restarted for a reboot. I think the _k*** and _i*** series from the main site (not the beta site) are actually like a beta-2 series. |
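The rule of thumb in this post can be written as a small helper. This is only a sketch: the factor of 12 is the empirical correction described above, not something BOINC exposes, and the function name is hypothetical.

```python
def estimate_total_hours(elapsed_hours: float, progress_percent: float, factor: float = 12.0) -> float:
    """Estimate total run time from BOINC Manager's elapsed time and progress.

    factor=12 is the empirical correction from the post (an assumption,
    not a BOINC setting). progress_percent is the value as displayed,
    e.g. 5.792 for "5.792%".
    """
    true_fraction = (progress_percent / 100.0) * factor
    if true_fraction <= 0:
        raise ValueError("progress must be positive")
    return elapsed_hours / true_fraction


# Worked example from the post: 127 hours elapsed at 5.792% shown progress.
total = estimate_total_hours(127, 5.792)
print(f"estimated total: {total:.0f} h, remaining: {total - 127:.0f} h")
```

For the example task this gives roughly 183 hours in total, so about 56 hours still to run.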
Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Erik - I think you have something there. How do yours look if you just ignore the Progress %, and just look at the Elapsed and Remaining time (added together for total run time)? |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I have a couple of these running right now. Not _k00w. One claims to be 7.935% done, elapsed 280:54:41, remaining 333:32:32. One claims to be 7.836% done, elapsed 280:51:42, remaining 333:35:04. Do not pay attention to the seconds, as they were changing as I wrote them down. I am not worried about this, but it is sure confusing. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
|
Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Les - I think you gave me the piece of information I was looking for. The last trickle on the two links you sent shows 348,548 time steps. Therefore, on one of my tasks the last trickle is at 129,000 time steps, so I am thinking it is a little more than 1/3 done. The task shows 3.183 percent done. This coincides with Erik's observation (stated a different way) that the task will be done when the percentage reaches 8.333% |
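The two ways of judging progress mentioned in this post can be compared side by side. The sketch below assumes 348,548 time steps for a full run (the figure read off the last trickle above) and the factor of 12 from the earlier rule of thumb; both function names are made up for illustration.

```python
TOTAL_STEPS = 348_548   # last trickle of a completed run, per the post (assumed typical)
PROGRESS_FACTOR = 12.0  # empirical correction from the thread, not a BOINC setting


def fraction_from_trickles(steps_done: int, total_steps: int = TOTAL_STEPS) -> float:
    """Fraction of the run completed, judged by time steps in the latest trickle."""
    return steps_done / total_steps


def fraction_from_boinc(progress_percent: float, factor: float = PROGRESS_FACTOR) -> float:
    """Fraction of the run completed, judged by BOINC's displayed progress times the factor."""
    return progress_percent / 100.0 * factor


by_trickle = fraction_from_trickles(129_000)
by_boinc = fraction_from_boinc(3.183)
print(f"trickles: {by_trickle:.1%}  scaled BOINC progress: {by_boinc:.1%}")
# Displayed progress at which the task should actually finish.
print(f"finishes at about {100 / PROGRESS_FACTOR:.3f}% shown")
```

Both routes put the task at a little over one third done, and 100/12 reproduces the 8.333% finishing point.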
Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
Yeah, progress on these is calculated for a 120-year run. I complained about this months ago in beta, but nothing has happened. Well, well... |
Joined: 31 Mar 13 Posts: 44 Credit: 6,950,896 RAC: 0 |
Some of my hadam3pm2 wu's have had interruptions and I guess that's why they finish with an error of zip files being absent. Are these wu's still useful for something or could they just as well have been aborted? |
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I've, or rather my machines have, run a few dozen of these, and my experience has been that, after the models actually start running, any interruption will result in a lost upload and, eventually, in a failed wu. What that's worth for the science, I have no clue. Hope it helps. I've just let them run to the end, mostly. Luckily, there have been no power outages where I live the last few months, and I've learned to postpone upgrades that need a reboot until these MOSES jobs are cleared from my boxes. Like I said elsewhere, these are beta-2, and by what was posted, they are priority models. Right now, I run these things as a top priority, and will have another dozen completing near New Year's. But they are fragile and don't deal well with any kind of interruption (after they actually start). About my original complaint, the totally whacko underestimates of runtime that left me with more than a month's worth of models that "shoulda" finished in a week: heh heh, minor, minor worry. |
©2024 cpdn.org