climateprediction.net (CPDN) home page
Thread 'Multiple CP task management'

Questions and Answers : Unix/Linux : Multiple CP task management

fortran

Joined: 28 Jun 06
Posts: 20
Credit: 1,349,578
RAC: 0
Message 51405 - Posted: 13 Feb 2015, 17:42:54 UTC

Sorry if this has already been asked; nothing jumped out when I looked through the topics.

Not too long ago, I started seeing some tasks with estimated running times on the order of 1000 hours. Until then I had mostly been seeing tasks of about 1 day. My machine has 2 cores, so when BOINC is running I have at most 2 tasks running. If one CP task and one other task are running, the CP task that runs is currently this 1000-hour one, which means the shorter task (about half done) just sits there. I have manually suspended the big task to get the shorter one close to completion, and then I will release the hold in the hope the other one will manage to get some time "accidentally" (luck of the round-robin draw, so to speak). But is there something else I should do to keep this small task from being stalled by the long task?

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,025,554
RAC: 20,468
Message 51406 - Posted: 13 Feb 2015, 18:32:18 UTC - in response to Message 51405.  

You can go to the preferences on your account page for CP and for your other project and edit the resource share. If you set CP at 200 and the other project at 100, CP will get twice as much CPU time as the other one. There is more about this in the preferences section of the forum.
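To make the arithmetic concrete, here is a minimal sketch of how resource shares translate into CPU-time fractions. It assumes the share is a simple proportional, long-run target (BOINC does not guarantee it at any given moment), and the function name `cpu_fractions` is made up for illustration.

```python
def cpu_fractions(resource_shares):
    """Map each project to its long-run fraction of CPU time."""
    total = sum(resource_shares.values())
    return {name: share / total for name, share in resource_shares.items()}


# Values from the post above: CP at 200, the other project at 100.
shares = {"CP": 200, "other": 100}
fractions = cpu_fractions(shares)

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} of CPU time")
```

With these numbers CP ends up with two thirds of the CPU time and the other project with one third, which is the "twice as much" mentioned above.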
fortran

Joined: 28 Jun 06
Posts: 20
Credit: 1,349,578
RAC: 0
Message 51407 - Posted: 13 Feb 2015, 19:05:21 UTC - in response to Message 51406.  

I had juggled those settings in the past. I'll just go with the manual suspend to see how that works for now. Thanks.

Conan

Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 51410 - Posted: 13 Feb 2015, 22:35:19 UTC

Are you sure that your preferences are set to use 100% of CPU time and to use both of the available cores?

If it is not on 100% then BOINC sees the 1000 hour work unit as more important and throws most resources at that problem.

Conan
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51413 - Posted: 14 Feb 2015, 0:43:47 UTC

Also, does your computer run all day, or only for a few hours a day?

fortran

Joined: 28 Jun 06
Posts: 20
Credit: 1,349,578
RAC: 0
Message 51414 - Posted: 14 Feb 2015, 16:25:30 UTC - in response to Message 51413.  

Computer runs all the time, and BOINC can use all the CPU if there are no tasks running in the foreground, so to speak.

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51415 - Posted: 14 Feb 2015, 23:51:19 UTC

It's best to just let BOINC get on with working out how long each type of model takes on YOUR computer.

If your computer is thinking 1000 hours, then it's way out, and will slowly decrease that estimate as it runs the model.

On my Ivy Bridge, the Moses II models took about 31 hours each, the long hadam3p eu models are taking about 45 hours, and the short hadam3p eu models are taking about 11 hours.

The MOSES II models were mostly in beta testing which had many versions, and I'm not sure which ones made it to the main site.

I'll get another one, and see how it goes.

alanb1951

Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 51416 - Posted: 15 Feb 2015, 5:40:12 UTC

Following up from what Les said...

On my main machine, an i7-4770s (3.1GHz clock) running Ubuntu 14.04 and using hyperthreading, the longest jobs I have seen ran for under 300 hours - these were hadcm3n.

hadcm3s jobs typically took just under a day, the longer hadam3p_eu jobs about 67 hours, the short hadam3p_eu jobs about 17 hours. All quite quick...

The ones that really mess with BOINC Manager's Tasks display are the "original" Moses II jobs - hadam3pm2 - there seems to be a problem in the way they communicate status, and when they report 8.3% progress they are, in fact, nearly finished!!! (I think there's a "factor of 12" problem in there somewhere!)
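As a rough illustration of that "factor of 12" guess, the sketch below rescales the displayed progress. The exact factor is an assumption taken from the observation above (100 / 12 is about 8.3), not something confirmed by the project, and `corrected_progress` is a made-up helper name.

```python
# Assumption: hadam3pm2 reports progress scaled down by exactly 12,
# as the "8.3% when nearly finished" observation suggests.
ASSUMED_SCALE = 12


def corrected_progress(reported_percent, scale=ASSUMED_SCALE):
    """Estimate true progress (percent) from the displayed figure."""
    # Cap at 100 so a slightly-off factor never reports more than complete.
    return min(100.0, reported_percent * scale)


print(f"{corrected_progress(8.3):.1f}% done")
```

If the guess is right, a task showing a few percent in BOINC Manager may in fact be a good part of the way through.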

Note, however, that Moses+Triffid tasks (hadam3prm3pm2t_eu) seem to show an accurate progress rate.

In practice, on my main machine a Moses II (hadam3pm2) job will run for about 175 hours, whilst a Moses+Triffid job will run for about 180 hours.

As I don't know what hardware you have, I can't guess how long a job might take on your machine; a slower machine of mine (an i3-2100 at the same clock rate) typically takes about 50% longer to run. I've never run CP jobs on my laptop, so I've no idea how they'd go on a 2GHz clock machine...

Hope the above might reassure you somewhat. As Les says, you might as well just leave it to it!

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51417 - Posted: 15 Feb 2015, 5:48:21 UTC

Back again. I got 2 of the UK Met Office HadAM3P (global only) with MOSES II landsurface scheme v7.03, and the estimate to completion is 241 hours.

They're going to have to wait a few hours until some shorter models finish, but they may end up at about 127 hours, which is what something similar took last year on beta.


fortran

Joined: 28 Jun 06
Posts: 20
Credit: 1,349,578
RAC: 0
Message 51420 - Posted: 15 Feb 2015, 20:53:34 UTC - in response to Message 51417.  

This long job is a hadam3p. It is 192 hours in, 710 hours to go. So, inaccurate time estimation isn't the problem.

AMD 64 X2 4800+ dual core. Not a new CPU, but it beats the heck out of the VAX 11/785 I did my M.Eng. on in the mid 1980s.

I am going to put in a new machine that should be about 30% faster. With this job at 900 to 1000 hours on my old machine, that is still a long job.

MyLittleBoinc

Joined: 31 Mar 13
Posts: 44
Credit: 6,950,896
RAC: 0
Message 51421 - Posted: 15 Feb 2015, 21:53:15 UTC

You have done 106,628 timesteps and a completed task runs to 348,548 timesteps, so you have done just over thirty percent.

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/trickle.php?resultid=17770651
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/trickle.php?resultid=17488554
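For anyone who wants to redo that calculation for their own task, here is a small sketch using the timestep counts quoted above. The projection assumes the task keeps running at a constant speed and that the 192 hours reported earlier in the thread correspond to the timesteps counted here; both are assumptions, so treat the result as a ballpark only.

```python
timesteps_done = 106_628
timesteps_total = 348_548
hours_so_far = 192  # elapsed time reported earlier in the thread

fraction_done = timesteps_done / timesteps_total

# Assumption: constant speed, so total time scales with total timesteps.
projected_total_hours = hours_so_far / fraction_done
projected_remaining_hours = projected_total_hours - hours_so_far

print(f"Done: {fraction_done:.1%}")
print(f"Projected total: {projected_total_hours:.0f} h")
print(f"Projected remaining: {projected_remaining_hours:.0f} h")
```

Working from timesteps rather than the percentage shown in BOINC Manager sidesteps the progress-reporting problem discussed earlier in the thread.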
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51422 - Posted: 15 Feb 2015, 22:31:17 UTC

fortran

There was a model type on the loose with a wrong time-to-completion estimate.
It was beta tested as a 1-year model, and released without change to the estimate, but as a 10-year type.
So the estimate is one tenth of the correct value.

It'll be another few hours before I start the 2 that I have, and then I'll be able to see if that's the problem. And email the project if it is.

However, I've found a main site one on my Haswell here from last November. If you look at the trickle list at the bottom, you'll see that it was running at about 1.65 seconds per timestep, whereas yours, here, is running at about 5.25 seconds per timestep.

So yours is about 3 times slower, and will take about 3 times longer to finish.
According to my logs, mine took 113 hours to complete.
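The comparison above can be sketched as a quick calculation. It assumes both tasks run the same number of timesteps and that each machine holds its speed for the whole run; the variable names are just for illustration.

```python
reference_seconds_per_step = 1.65  # the Haswell machine, from its trickle list
reference_total_hours = 113        # its logged run time
own_seconds_per_step = 5.25        # the slower machine

# Assumption: same model length on both machines, so run time scales
# directly with seconds per timestep.
slowdown = own_seconds_per_step / reference_seconds_per_step
estimated_total_hours = reference_total_hours * slowdown

print(f"Slowdown: {slowdown:.2f}x")
print(f"Estimated total run time: {estimated_total_hours:.0f} h")
```

If the slower task is actually a longer model type than the reference one, the real run time will be correspondingly longer than this estimate.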

If I remember correctly, I noted that the percentage estimate was about 8 or 9 percent completed when it finished.

And there's a thread here in Number crunching about this problem, from back in November.

3 hours and 49 minutes before I finish the others, and then the 2 MOSES II models can have the computer all to themselves.

fortran

Joined: 28 Jun 06
Posts: 20
Credit: 1,349,578
RAC: 0
Message 51423 - Posted: 16 Feb 2015, 0:17:02 UTC - in response to Message 51422.  

I am not complaining about how long it takes. It takes what it takes. I have been doing numerical methods for a long time, and I don't have a problem with long run times.

I just noticed that this 1000 hour job was taking cycles in preference to a "normal" job which is around 1 or 2 days.

That normal job finished, and climate prediction hasn't downloaded any more jobs. So I still have this one long job running (along with jobs from 2 other projects). Which is fine.

I just thought that if this sort of situation is common, it might be better if the short job finished earlier. But from the early responses in this thread, that ability is not present in BOINC, or ClimatePrediction's use of BOINC. Which is fine, I can put a manual suspend on the long job to get the short job through.

I suppose some people run these models to see the pictures. I don't even know if pictures are available. I just run the models.

Occasionally I do some Monte Carlo stuff, and can chew up a few hours of CPU time with that, and BOINC waits, which is what it should do. But BOINC is the biggest consistent load my computer sees.

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51424 - Posted: 16 Feb 2015, 7:09:55 UTC

BOINC has a correction factor to allow it to learn how much processing time each project takes for its work. But there's only one per computer per project, so for cpdn, which has both very long and very short tasks, it has problems juggling this value.

A long time ago, version 5 I think, a rough rule of thumb was that it took BOINC about 10 completed tasks from a project to "learn" about that project. For cpdn, this meant/means a LONG time. And I don't know what applies now, in version 7.

The "correction factor" is the "Task duration correction factor", which can be found near the bottom of each computer's page under your account.
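To show why a single factor struggles with a mix of long and short tasks, here is a deliberately simplified toy model. It is not BOINC's actual code: the update rule (jump up at once, drift down slowly) and every number in it are assumptions chosen purely for illustration.

```python
def displayed_estimate(raw_estimate_hours, dcf):
    """Estimate shown to the user: the raw estimate scaled by the factor."""
    return raw_estimate_hours * dcf


def update_dcf(dcf, raw_estimate_hours, actual_hours):
    """Hypothetical update rule: jump up at once, drift down slowly."""
    ratio = actual_hours / raw_estimate_hours
    if ratio > dcf:
        return ratio
    return dcf + 0.1 * (ratio - dcf)


dcf = 1.0

# A long model overruns its raw estimate by a factor of 3 (made-up numbers).
dcf = update_dcf(dcf, raw_estimate_hours=100, actual_hours=300)
dcf_after_long = dcf

# A short model then finishes exactly on its raw estimate.
dcf = update_dcf(dcf, raw_estimate_hours=10, actual_hours=10)

# The next short model is still badly overestimated.
short_estimate = displayed_estimate(10, dcf)
print(f"DCF after long task: {dcf_after_long:.2f}")
print(f"DCF after short task: {dcf:.2f}")
print(f"Displayed estimate for a 10 h task: {short_estimate:.0f} h")
```

In this toy model one badly estimated long task skews the estimates for every short task from the same project until many short tasks have completed, which matches the "about 10 completed tasks" rule of thumb only loosely.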

The best way to deal with a mix of long and short models is to let BOINC get on with things without manual adjustments.

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,025,554
RAC: 20,468
Message 51425 - Posted: 16 Feb 2015, 8:17:35 UTC

This task, http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=17814513 (name hadam3pm2_k2us_1959_10_009464927_2), has so far produced 8 zips yet is only showing as 6.79% complete. Not too worried: it is on a laptop with a dodgy screen and a battery which needs replacing, as it only lasts about 15 minutes. The machine will be retired (freecycled) when the task finishes, but it illustrates the problem. I will be unable to check the machine for about a week but am hoping it might be close to finishing when I do.
