climateprediction.net (CPDN) home page
Thread 'My model is stuck at 75.226%'

old_user201021
Joined: 30 Sep 06
Posts: 18
Credit: 93,623
RAC: 0
Message 31116 - Posted: 25 Oct 2007, 16:09:37 UTC
Last modified: 25 Oct 2007, 16:10:37 UTC

Should I just abort it? It also says it is at 2963 hours. I have let it run for days, so it should have gone past the 3000 mark, but it does not seem to be making any progress.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 31117 - Posted: 25 Oct 2007, 18:33:34 UTC

What happens in the Time Step count, in Tasks tab/Graphics? Does it progress smoothly but slowly?
Does it fall back a bit, then reprocess the same several TS over and over again?
Is it hung?
Is the globe blue?
Any clues in Messages?
What does it say in Tasks/Status? (The stopped clock suggests the Run is stopped.)
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

old_user201021
Joined: 30 Sep 06
Posts: 18
Credit: 93,623
RAC: 0
Message 31125 - Posted: 26 Oct 2007, 16:17:11 UTC - in response to Message 31117.  

The graphics are disabled on this installation. Since I last posted it shows the CPU time at 2964 hours when it should have way more than that. It is set to always run. It is currently at 75.232%.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 31128 - Posted: 26 Oct 2007, 20:58:15 UTC

Unfortunately, lack of information keeps us stabbing in the dark. (My intuition is weak.)

Clearly something keeps CPDN from getting time on the machine. It's possibly some change you made to Preferences. Did you limit the hours it can run? The conditions under which it runs?

If set to run as a Service, is it actually running?

Did you add another Project or reactivate another Project? If so, boinc could be throttling CPDN while collecting Long Term Debt.

Did you add some long-running non-boinc work at an above-low priority? If so, boinc, which uses only otherwise unused CPU cycles on your machine, wouldn't get a chance to run.

old_user201021
Joined: 30 Sep 06
Posts: 18
Credit: 93,623
RAC: 0
Message 31129 - Posted: 26 Oct 2007, 21:15:26 UTC - in response to Message 31128.  

Something must be wrong. It must be running in some sort of loop. I just checked it again and it says it has only 2963 hours of CPU time. It should never show less time. I think this workunit is screwed and I will have to abort it.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31133 - Posted: 27 Oct 2007, 1:30:02 UTC
Last modified: 27 Oct 2007, 1:47:35 UTC

Jeff, don't abort this model until we've worked out what's gone wrong. There may be nothing wrong with the model and the problem could be something simple to put right. Astro is quite right in wanting more information.

At first I thought your model was looping (getting stuck at some point and continually going back to the previous December to try the same model year again). But if that was the case, you'd see the CPU time increasing while the model wouldn't progress to its next trickle point.

It last trickled on 21 Sep at timestep 3110400 and at 3048 hours CPU time. I've converted the seconds shown at that trickle to CPU hours. So you're right in saying that the CPU time has gone back. Presumably in boinc manager it still says 2963 hours.

Could you please try to answer all the questions.

1 Are you getting any boinc manager messages about the model?

2 Is this computer running tasks from any other project?

3 In your task manager performance tab, what CPU usage do you see?

4 Have you restored this model from a backup? (What you are seeing could I think be consistent with you restoring a backup made in mid-September when the model was at about 2036, and the model failing to restart after the restore.)

5 In your boinc manager Activity menu, have you selected Run Always or Run according to Preferences?

6 In your boinc manager Tasks window, what does the Status column say for this model?


old_user201021
Joined: 30 Sep 06
Posts: 18
Credit: 93,623
RAC: 0
Message 31138 - Posted: 28 Oct 2007, 0:18:37 UTC - in response to Message 31133.  
Last modified: 28 Oct 2007, 0:19:17 UTC

Here are the answers:

1.) No messages about the model.
2.) This computer was dedicated to running only CPDN. This is a single CPU machine and it was running two CPDN models at one time. It completed the other one successfully.
3.) CPU usage is erratic but never 100% like it should be. It seems to be going between 30% - 50%.
4.) No backup on this one.
5.) I have it run according to preferences. My preference for all machines is to run after two minutes of no end-user activity.
6.) Normally it says it is running.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 31140 - Posted: 28 Oct 2007, 0:49:49 UTC


3. It will only be 100% if you use both of the processors in your HyperThreaded computer. Otherwise it will be 50%, unless you're using the latest version of BOINC, which has options for setting the maximum amount of processor time to use. This can be set on the server, and also on your computer. The default is, I think, to use less than the max available.

If the CPU is overheating (and there are a lot of reasons why it might be), then thermal throttling will slow down the processor.


mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31141 - Posted: 28 Oct 2007, 1:35:19 UTC
Last modified: 28 Oct 2007, 2:06:36 UTC

With the amount of memory the computer has, 512 MB, it's best just to let it run one of these HADCM models at a time. So as Les says, the max CPU activity you'll see from the model in task manager is 50%.

But while the model seems to be running for Jeff, why isn't it trickling? The funny thing is that the model that completed successfully trickled until 3 Oct, i.e. after this model had stopped trickling.

A In the Activity menu of boinc manager, is network activity allowed?

B When you open boinc manager does it say Connected to localhost at the bottom right of the screen?

C Have you made any changes to your firewall setup since mid-September?

D When you still had two models, before the other one completed in early October, did you let them both run at the same time or did you suspend one while the other ran?

E Jeff, could you please type out for us the exact figures you see in your boinc manager tasks window for CPU time, % completed and Time to completion. Then the same information again a few hours later.

(I'm sorry, I realise this is a bit of a pain. But with a model that's so well advanced I think it's worth digging to explore the problem.)

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 31146 - Posted: 28 Oct 2007, 5:09:35 UTC


It could be a slow-processing 'iceball'.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 31147 - Posted: 28 Oct 2007, 5:43:18 UTC


. . . but it's getting very little CPU time. Apparently.

old_user201021
Joined: 30 Sep 06
Posts: 18
Credit: 93,623
RAC: 0
Message 31188 - Posted: 30 Oct 2007, 17:43:06 UTC - in response to Message 31141.  

A. Network activity is always available
B. It is connected to localhost
C. No changes to firewall
D. Both models were running together for quite a long time then I suspended one of the models since I was unsure if I would have enough time to complete both models and this is a single core single cpu system.
E. CPU Time: 2963:18:32, Progress: 75.226%, To completion: 963:16:52

old_user201021
Joined: 30 Sep 06
Posts: 18
Credit: 93,623
RAC: 0
Message 31191 - Posted: 30 Oct 2007, 19:53:14 UTC

Update:

CPU Time: 2963:08:54, Progress: 75.22%, To Completion: 963:26:25
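
For anyone who wants to check the two readings arithmetically, here is a minimal Python sketch. The helper names `parse_hms` and `cpu_time_change` are made up for illustration and are not part of BOINC; the only inputs are the H:MM:SS strings copied from the two posts above.

```python
def parse_hms(text):
    """Convert an 'H:MM:SS' string (hours may exceed 24) to whole seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_time_change(earlier, later):
    """Return later minus earlier in seconds; negative means the clock went back."""
    return parse_hms(later) - parse_hms(earlier)


# CPU time readings as typed out in the two posts (hypothetical variable names).
first_reading = "2963:18:32"
second_reading = "2963:08:54"

delta = cpu_time_change(first_reading, second_reading)
if delta < 0:
    print(f"CPU time went backwards by {-delta} seconds")
else:
    print(f"CPU time advanced by {delta} seconds")
```

On these two readings the difference is negative (578 seconds, a little under ten minutes), which fits a model that keeps falling back to an earlier point rather than one that is simply idle.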

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 31192 - Posted: 30 Oct 2007, 21:24:09 UTC


Well, I think that it's in a very slow loop, and should be aborted.
The model has sent back the third 40-year zip, so most of the model has been evaluated. There are millions more combinations to try.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31193 - Posted: 30 Oct 2007, 21:52:21 UTC
Last modified: 30 Oct 2007, 21:59:06 UTC

I've never seen anything like what your model's doing, Jeff.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6458650
I've recalculated the CPU time for the last few trickles it produced in September to convert from seconds to hours in case I'd made a mistake, but I hadn't.

The 21 Sep trickle at 2040 was at 3048 hours CPU time elapsed.
2nd-last trickle, 3023 hours.
18 Sep trickle at 2037, 2972h CPU time.
17 Sep trickle 2036, 2946h CPU time.

So it's as if the model has gone back to somewhere between 2036 and 2037. How it can have done this without you restoring a backup is beyond my understanding. I imagine it's been looping ever since you restarted it when the other model finished on 3 Oct.
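
That bracketing can be reproduced with a short sketch. The assumptions: the function names are hypothetical, only the three trickles whose model year is stated above are used (the 3023-hour trickle is left out because its year isn't given), and 2963 hours is the figure reported from boinc manager.

```python
from bisect import bisect_left

# (model year, CPU hours at that trickle), taken from the figures listed above.
TRICKLES = [(2036, 2946), (2037, 2972), (2040, 3048)]


def seconds_to_hours(seconds):
    """Convert a trickle's CPU time in seconds to hours."""
    return seconds / 3600.0


def bracket_model_year(cpu_hours, trickles=TRICKLES):
    """Return the pair of model years whose trickle CPU times bracket cpu_hours.

    Returns None when the reading is at or below the first recorded trickle,
    or above the last one.
    """
    hours = [h for _, h in trickles]
    i = bisect_left(hours, cpu_hours)
    if i == 0 or i == len(trickles):
        return None
    return trickles[i - 1][0], trickles[i][0]


print(bracket_model_year(2963))
```

With 2963 hours the result is the pair (2036, 2037), matching the estimate made by hand.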

I would have expected any loop to develop AFTER your last 21 Sep trickle. I've never before seen a model go back about 4 years and then loop continuously.

There's something seriously wrong with this model and you're going to have to abort it. The CPU time certainly shouldn't be going backwards. All the data the model produced up to 2040 including the 2040 zip file will be good and normal and will be used by the researchers.

But could you please wait 24 hours from now before doing this; it's such an unusual occurrence that I'm going to report your model to the moderators' section. I'd like to give the other mods a chance to ask you more questions about it before you abort.

I wonder whether there's any significance in the number 963 appearing in both CPU time and time to completion?


old_user201021
Joined: 30 Sep 06
Posts: 18
Credit: 93,623
RAC: 0
Message 31196 - Posted: 30 Oct 2007, 22:32:40 UTC

I'll wait 24 hours before aborting it.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31201 - Posted: 31 Oct 2007, 5:01:12 UTC
Last modified: 31 Oct 2007, 5:04:25 UTC

Just one last thought here. Is this computer overclocked? Sometimes this can cause abnormal model behaviour; in such a case, if the computer is brought back to its standard speed, the model usually then behaves normally.

old_user201021
Joined: 30 Sep 06
Posts: 18
Credit: 93,623
RAC: 0
Message 31211 - Posted: 31 Oct 2007, 17:20:25 UTC

No overclocking. This computer is a stock Dell, purchased about a year ago.

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31222 - Posted: 31 Oct 2007, 20:25:55 UTC
Last modified: 31 Oct 2007, 20:28:33 UTC

I had to install an extra fan into my Dell because it was overheating. In theory overheating can also cause this problem, although in my case it was because it unexpectedly slowed down by 10% that I discovered the problem.

In any case, running a 'stress-test' such as Prime95's torture test would identify if there is a hardware problem. You need to run one copy per processor-core simultaneously for about 24 hours.
I'm a volunteer and my views are my own.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31229 - Posted: 31 Oct 2007, 23:55:53 UTC

Moderator Thyme Lawn said this about your model this morning in the moderators' forum, but I've only found time now to copy what he said across to here.

'The date of the last checkpoint is in the model XML file in projects/climateprediction.net (along with other useful info such as the number of restarts).

The stdout_um*.txt files in the model directory might also have something relevant (stdout_um4.txt in particular).

You can view the graphics in a service install if you change the BOINC service logon to local system account and allow it to interact with the desktop. That's how all my systems are set up.'

Nobody thinks you should try to keep this model, but even after you've aborted it you can if you wish look at those files in the boinc folder as long as you don't reset the project, which would delete them all. Exit from boinc before looking into the files, just in case. If you do investigate and find any interesting clues, please let us know.
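
If hunting through the folders by hand is tedious, something like the following Python sketch could list the files Thyme mentions. It is only a sketch under assumptions: the function name is made up, the location of the BOINC data folder varies by install and has to be supplied by you, and it assumes the model directory sits somewhere below projects/climateprediction.net. It only reads file names and changes nothing.

```python
from pathlib import Path


def find_model_diagnostics(boinc_data_dir):
    """List candidate model XML files and stdout_um*.txt logs.

    boinc_data_dir is the BOINC data folder for this install (an assumption
    the caller must supply; it is not discovered automatically).
    """
    project_dir = Path(boinc_data_dir) / "projects" / "climateprediction.net"
    # XML files directly in the project folder (the model XML should be one of them).
    xml_files = sorted(project_dir.glob("*.xml"))
    # stdout_um*.txt logs anywhere below the project folder.
    stdout_files = sorted(project_dir.rglob("stdout_um*.txt"))
    return xml_files, stdout_files


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        xmls, logs = find_model_diagnostics(sys.argv[1])
        for path in xmls + logs:
            print(path)
```

Run it with the BOINC data folder as the only argument, after exiting from boinc as advised above.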

If you need help with getting your service install to let you see the graphics, I'm sure Thyme will give you more details.
