climateprediction.net (CPDN) home page
Thread 'Computer model stops'

NewtonianRefractor

Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 39266 - Posted: 19 Mar 2010, 17:47:37 UTC
Last modified: 19 Mar 2010, 17:48:13 UTC

I have a computer model that has stopped doing calculations. In the BOINC Manager, the compute time keeps going up, but the CPU usage is 0.

The model is hadsm3fub_jre9_006446027_2.


Is there any way to correct this? I do not have any backups.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 39267 - Posted: 19 Mar 2010, 18:11:47 UTC
Last modified: 19 Mar 2010, 18:23:53 UTC

Suspend the model.
Exit from BOINC.
Restart BOINC.

Optionally, shut down computer.
Re-start computer.
THEN restart BOINC.

Fixed?
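
For anyone who prefers the command line, the first two steps above can be sketched with boinccmd, the command-line tool that ships with BOINC. This is only a rough sketch: the project URL below is an assumption (use whatever URL your client shows for the project), the task name is the one from the first post, and restarting BOINC afterwards is left out because it depends on the platform. By default it only prints the commands rather than running them.

```python
import subprocess

# Assumption: the project URL as it is attached in your BOINC client.
# `boinccmd --get_project_status` lists the attached projects and their URLs.
PROJECT_URL = "http://climateprediction.net/"
# Task name taken from the first post in this thread.
TASK_NAME = "hadsm3fub_jre9_006446027_2"


def build_commands(project_url, task_name):
    """Return the boinccmd calls for 'suspend the model' and 'exit from BOINC'."""
    return [
        ["boinccmd", "--task", project_url, task_name, "suspend"],
        ["boinccmd", "--quit"],
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing is sent to the client until you opt in.
    run_commands(build_commands(PROJECT_URL, TASK_NAME))
```

After BOINC is running again, the same `--task` call with `resume` in place of `suspend` should let the model continue.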

PS
If you hadn't said that there was no CPU time being used, I'd have suggested that the model may have been interrupted while post-processing the data at the end of phase one to create a zip file.
The last trickle posted is the second-to-last before end-of-phase, and you're running several projects. The normal BOINC switching between projects at the critical moment is enough to cause the problem, especially if you don't keep models in memory.
In that case, the model would have restarted from the beginning, and no new trickles would be registered (or credited) until the model got back to where it was.
This has been posted many times, including, I think, in the News section, but it's probably worth repeating.
NewtonianRefractor

Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 39268 - Posted: 19 Mar 2010, 18:33:03 UTC - in response to Message 39267.  

I restarted the computer, the model started computing again, but got into the same state within 15 minutes.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 39269 - Posted: 19 Mar 2010, 20:31:19 UTC

Now that I have 2 quad cores that are very stable, I don't mess around with models that fail; I just abort them.
If yours has failed twice, then it's likely to be unstable, so I'd abort it and get another one. You've given it 2 chances, and returning it will give the project people information about why it failed.


mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 39270 - Posted: 19 Mar 2010, 20:39:34 UTC

I see you've aborted the model now. What a strange thing. Did you look at the graphics?
NewtonianRefractor

Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 39274 - Posted: 20 Mar 2010, 4:37:15 UTC - in response to Message 39270.  

The graphics worked, I could rotate the earth, but the date of the model was not increasing. The CPU usage went to 0% as well.

This is probably a funny coincidence, but the % done was 33.33.

I tried restarting the computer another time and then aborted the model.
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 39276 - Posted: 20 Mar 2010, 8:40:25 UTC

It was a HadSM model, so at 33.33% it would have been post-processing at the end of phase 1. As far as I know, the model stops advancing during post-processing, i.e. the date stays the same. Post-processing for HadSM and HadSM MH models lasts quite a long time while the file is created. But I've never looked at the CPU usage in Task Manager during post-processing; I can't imagine it would sit at 100% or 0%. More probably it would spike.

If this happens again, leave the model for half an hour or more without shutting anything down, to see whether it gets through into the next phase. These HadSM models are notorious for hating to be disturbed during post-processing.
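
To make the 33.33% point concrete, here is a small sketch of the arithmetic. It assumes the run is split into three phases of equal length, which is an inference from 33.33% marking the end of phase 1 and not something taken from project documentation; the function name and the tolerance are made up for illustration.

```python
def at_phase_boundary(percent_done, phases=3, tolerance=0.01):
    """Return the number of the phase just finished if percent_done sits on
    an internal phase boundary, otherwise None.

    Assumption: the run is divided into `phases` equal parts.
    """
    for phase in range(1, phases):
        boundary = 100.0 * phase / phases
        if abs(percent_done - boundary) <= tolerance:
            return phase
    return None


if __name__ == "__main__":
    for pct in (33.33, 50.0, 66.67):
        print(pct, at_phase_boundary(pct))
```

A model whose progress sits exactly on one of these boundaries is probably in post-processing, so that is the moment to leave it alone.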
NewtonianRefractor

Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 39278 - Posted: 20 Mar 2010, 20:40:36 UTC - in response to Message 39276.  

When I first noticed that the model was stuck, I looked in Task Manager, and the CPU usage stayed at 0 for about 6 to 7 hours. So I think something did go terribly wrong with this particular model.
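
The symptom described in this thread (elapsed time going up while CPU usage stays at 0) can be written down as a simple check. This is a hypothetical sketch, not part of BOINC: the function name, the sample format, and the thresholds are all assumptions. The 30-minute default comes from the "half an hour or more" advice earlier in the thread, so that normal post-processing is not mistaken for a hang.

```python
def is_stalled(samples, min_wall_seconds=1800.0, max_cpu_seconds=1.0):
    """Decide whether a task looks stalled.

    `samples` is a list of (wall_clock_seconds, cpu_seconds) readings for one
    task, oldest first. The task counts as stalled when at least
    `min_wall_seconds` of wall-clock time passed between the first and last
    reading while the task gained no more than `max_cpu_seconds` of CPU time.
    """
    if len(samples) < 2:
        return False  # not enough readings to compare
    (wall_start, cpu_start), (wall_end, cpu_end) = samples[0], samples[-1]
    wall_elapsed = wall_end - wall_start
    cpu_used = cpu_end - cpu_start
    return wall_elapsed >= min_wall_seconds and cpu_used <= max_cpu_seconds


if __name__ == "__main__":
    # Six hours of wall-clock time with no CPU time gained, as reported above.
    print(is_stalled([(0.0, 100.0), (6 * 3600.0, 100.0)]))
```

Six or seven hours at zero CPU is far past that threshold, which fits the conclusion that this model was not just post-processing.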
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 39279 - Posted: 21 Mar 2010, 1:25:24 UTC

Just as well that you noticed the problem and aborted it.


©2024 cpdn.org