Thread 'task run time confusion'

old_user452817

Joined: 21 May 07
Posts: 15
Credit: 5,190
RAC: 0
Message 31793 - Posted: 20 Dec 2007, 12:05:02 UTC

I do not understand what has been happening with this project's tasks.

This computer is a Core 2 Duo machine, 2.5 gig CPU, 2 gigs DRAM.

A couple of days ago, I saw that the current task had 260 hours to run. It was about 50% finished. This computer runs four tasks, 2 each alternating on each core. The run time is set at 60 minutes. The computer runs 132 hours a week, so 33 hours on each task each week. That means the task should have needed just under 8 weeks to finish.
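
The estimate above can be reproduced with a short calculation. This is a minimal sketch in Python, not anything the project provides: the function name `weeks_to_finish` and its parameters are made up for illustration, and it assumes the tasks share the machine's running time equally. The post divides 132 hours by four tasks; if each of the two cores were instead shared by two tasks, each task would get 66 hours a week, so the `cores` argument shows that alternative reading as well.

```python
def weeks_to_finish(remaining_cpu_hours, weekly_uptime_hours, tasks, cores=1):
    """Weeks needed if `tasks` tasks share `cores` cores equally (hypothetical helper)."""
    hours_per_task_per_week = weekly_uptime_hours * cores / tasks
    return remaining_cpu_hours / hours_per_task_per_week

# Figures from the post: 260 hours remaining, 132 hours of uptime per week, four tasks.
as_posted = weeks_to_finish(260, 132, tasks=4)            # 33 hours per task per week
both_cores = weeks_to_finish(260, 132, tasks=4, cores=2)  # 66 hours per task per week
print(f"{as_posted:.1f} weeks as posted, {both_cores:.1f} weeks if both cores are counted")
```

Either way, the remaining time reported by the client is itself only an estimate, so the result is a rough guide rather than a deadline.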

Now, this morning, I see that I have a task, over 400 hours to go, only a couple of percent finished.

So, I do not understand what is happening, and I have detached from the project.

If someone can explain to me what is happening, then I might re-attach. I think that this is an important project and I would like to be on it. But I have to know what is going on.

Any help will be appreciated.

>>RSM

When i checked this morning,
>>RSM
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31796 - Posted: 20 Dec 2007, 16:09:41 UTC


Hi,

Which model was displaying the problem? It may be that it was a HadSM3 model which turned into a slow iceworld (in which case the best thing to do is to abort that particular model). Not all iceworlds are slow; ones which run at normal speed should be left to run.

Alternatively, it could have been a HadCM3 model which was doing a day/month/year retry, in which case it'll either fix itself and continue running normally, or it might automatically abort itself if it decides that its climate is unrecoverable.

Note that detaching from the project will have destroyed the model(s) you were working on.

I would recommend going through the 'README' files which are linked from my signature to get more background information.


I'm a volunteer and my views are my own.
News and Announcements and FAQ
old_user452817

Joined: 21 May 07
Posts: 15
Credit: 5,190
RAC: 0
Message 31798 - Posted: 20 Dec 2007, 16:20:14 UTC - in response to Message 31797.  


Unfortunately, I cannot tell you which model was running on this machine. However, I am running the project with a current task on another fast machine. The statistics on that task closely mirrored those of the task I killed; it was started at the same time. I will look at that task when I get home from work. If it is behaving properly, I will leave it alone. If it behaves like this one, I will come back to you.

Many thanks for your help.

>>RSM

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31799 - Posted: 20 Dec 2007, 16:38:46 UTC

Hi Mitrichr

One of your previous models crashed with a 107 exit code. In the README about crashes and other problems it would be worth looking at item #5 by Mike. Lots of ideas and suggestions there.
Cpdn news
old_user452817

Joined: 21 May 07
Posts: 15
Credit: 5,190
RAC: 0
Message 31800 - Posted: 20 Dec 2007, 16:41:29 UTC - in response to Message 31799.  

Hey, thanks, I will look at it.

>>RSM

old_user452817

Joined: 21 May 07
Posts: 15
Credit: 5,190
RAC: 0
Message 31803 - Posted: 20 Dec 2007, 20:41:24 UTC

I am not home yet to check my other computer, but here it is in a nutshell:

I was running this project on two Core 2 Duo machines set at 60 minutes task run time. Since the tasks are very long, that means that this project had one quarter of my crunch time for a long time.

I don't know if the project has problems, but they should not be visited on people volunteering computer time. There are other projects with no problems on which I could get something done.

So, you can be sure that if the other machine exhibits the same behavior, I will be gone.

Otherwise, you should be able to tell me that I finished a task very recently.

>>RSM


Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 31806 - Posted: 20 Dec 2007, 21:30:12 UTC
Last modified: 20 Dec 2007, 21:36:10 UTC

The relevant word there is volunteer.

ALL projects have problems from time to time.

It's entirely up to people to decide which projects they run.
AND to read the info on each project's web site to see what it's about.

This project is well known to stress a computer far more than any of the other projects, because of the complex formulae used in climate models. This runs the floating point maths unit in a CPU at a constant high rate.
Not all computers are suitable for this project, but there are lots of other simpler projects.

As for when/if your models finish, you can see this yourself. Just go to the Results page on your account and look.

And just in case you think that ALL of the climate models should run to the end without problems, they don't necessarily do this.
Part of the project is to find sets of starting values that will cause a model to fail at an early stage. There's only one way to do this, and that's to try these values and see what happens.

old_user452817

Joined: 21 May 07
Posts: 15
Credit: 5,190
RAC: 0
Message 31816 - Posted: 21 Dec 2007, 3:19:47 UTC - in response to Message 31806.  

Well, thanks, but no thanks.

First, I have no problems running CPDN. I have finished tasks. The two machines on which I was running the project are Core 2 Duos, 2.5 gig with 2 gigs of DDR2 DRAM. I left with 4000 plus points, whatever that means.

Second, it is my opinion that when an organization, CPDN, comes to a market, and we are the market, the organization should have done all of its testing of its product model. I am not interested in driving my new Hupmobile out of the show room only to have the wheels fall off.

At WCG the new Cancer project has had problems with fast processors, quads, Xeons, etc., and those guys just said, hey, let us know when you are ready for our power, meanwhile we will go elsewhere and get some work done for someone else.

There are just not that many people into this activity, maybe a million. SETI claims about 500,000, and they are 50% of BOINC's activity. WCG claims 300,000.
We are interested in getting work done.

So, see ya.

>>RSM

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31823 - Posted: 21 Dec 2007, 15:14:51 UTC

CPDN models have all been carefully beta tested by experienced crunchers. The precautions required to complete models are detailed in the post I recommended earlier. The same precautions in fact apply to the workunits of all BOINC projects; they're just even more important here because the longer a task runs on a computer (however good it is), the more likely it is that one of the problems will occur. That's why most CPDN crunchers who successfully complete their models take frequent backups of the entire BOINC folder. If a model crashes it can then in most cases be restored and continued.

I agree that because these models are so long, they require a special sort of dedication on the part of the cruncher.
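
The backup routine mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions, not a project-supplied tool: `backup_folder` is a made-up helper name, both folder paths are hypothetical placeholders, and it assumes BOINC has been shut down before copying so that files are not changing mid-copy.

```python
import shutil
import time
from pathlib import Path

def backup_folder(source: Path, backup_root: Path) -> Path:
    """Copy `source` into a new timestamped folder under `backup_root`."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{source.name}-{stamp}"
    shutil.copytree(source, target)  # creates missing parent folders as needed
    return target

if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever BOINC keeps its files on your machine.
    boinc_dir = Path("C:/Program Files/BOINC")
    backups = Path("D:/boinc-backups")
    if boinc_dir.is_dir():
        print("Backed up to", backup_folder(boinc_dir, backups))
```

Each run creates a separate timestamped copy, so an older snapshot is still available if the most recent one was taken after a model had already gone wrong. Restoring would presumably be the reverse copy, again with BOINC stopped.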
Cpdn news