Message boards : Number crunching : Computer model stops
Author | Message |
---|---|
Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
I have a computer model that has stopped doing calculations. In the BOINC Manager the compute time goes up, but the CPU usage is 0%. The model is hadsm3fub_jre9_006446027_2. Is there any way to correct this? I do not have any backups. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Suspend the model. Exit from BOINC. Restart BOINC. Optionally, shut down the computer, restart it, and THEN restart BOINC. Fixed? PS: If you hadn't said that no CPU time was being used, I'd have suggested that the model got interrupted while post-processing the data at the end of phase one to create a zip file. The last trickle posted is the second-to-last before end-of-phase, and you're running several projects. The normal BOINC switching between projects at the critical moment is enough to cause the problem, especially if you don't keep models in memory. In that case, the model would have restarted from the beginning, and no new trickles would be registered (or credited) until the model is back to where it was. This has been posted many times, including, I think, in the News section, but it's probably worth repeating. |
Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
"Suspend the model." I restarted the computer and the model started computing again, but it got into the same state within 15 minutes. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Now that I have 2 quad cores that are very stable, I don't mess around with models that fail; I just abort them. If yours has failed twice, then it's likely to be unstable, so I'd abort it and get another one. You've given it 2 chances, and returning it will give the project people information about why it failed. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I see you've aborted the model now. What a strange thing. Did you look at the graphics? |
Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
"I see you've aborted the model now. What a strange thing. Did you look at the graphics?" The graphics worked and I could rotate the earth, but the model date was not increasing. The CPU usage went to 0% as well. This is probably a funny coincidence, but the percentage done was 33.33%. I tried restarting the computer one more time and then aborted the model. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
It was a HadSM model, so at 33.33% it would have been post-processing at the end of phase 1. As far as I know, the model stops advancing during post-processing, i.e. the date stays the same. The post-processing for HadSM and HadSM MH models lasts quite a long time while the file is created. But I've never looked at the CPU usage in Task Manager during post-processing; I can't imagine it would be a steady 100% or 0%, more probably it would spike. If this happens again, leave the model for half an hour or more without shutting anything down, to see whether it gets through into the next phase. These HadSM models are notorious for hating to be disturbed during post-processing. |
Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
"It was a HadSM model so at 33.33% it would have been post-processing at the end of phase 1. As far as I know, during post-processing the model stops advancing, i.e. the date stays the same. The post-processing for HadSM and HadSM MH models lasts quite a long time while the file is created. But I've never looked at the CPU usage in Task Manager during post-processing; I can't imagine it would be 100% or 0% - more probably it would spike." When I first noticed that the model was stuck, I looked in Task Manager, and the CPU usage had been 0% for about 6 to 7 hours. So I think that something did go terribly wrong with this particular model. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Just as well that you noticed the problem and aborted it. |
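
The symptom discussed in this thread (compute time keeps rising while CPU usage stays at 0) can be checked mechanically. What follows is only a minimal sketch in Python, assuming you have already collected paired readings of wall-clock time and cumulative CPU time for the model's process by whatever means you prefer. The function name `is_stalled` and both thresholds are hypothetical and are not part of BOINC or CPDN; the 30-minute default simply follows the advice in the thread to leave a model alone for half an hour or more before concluding that it is stuck.

```python
from typing import Sequence, Tuple

# One reading: (wall-clock seconds, cumulative CPU seconds used by the task).
Sample = Tuple[float, float]


def is_stalled(
    samples: Sequence[Sample],
    min_wall_seconds: float = 1800.0,  # assumption: "half an hour or more"
    min_cpu_delta: float = 1.0,        # assumption: less than 1 CPU-second counts as idle
) -> bool:
    """Return True if wall-clock time advanced but CPU time barely did."""
    if len(samples) < 2:
        # Not enough readings to compare anything.
        return False
    first_wall, first_cpu = samples[0]
    last_wall, last_cpu = samples[-1]
    wall_elapsed = last_wall - first_wall
    cpu_used = last_cpu - first_cpu
    # Stalled only if we watched long enough AND almost no CPU time was consumed.
    return wall_elapsed >= min_wall_seconds and cpu_used < min_cpu_delta


if __name__ == "__main__":
    # Hypothetical readings: 30 minutes pass, CPU time moves by 0.2 seconds.
    readings = [(0.0, 100.0), (900.0, 100.1), (1800.0, 100.2)]
    print(is_stalled(readings))
```

A check like this cannot tell a hung model from one that is merely waiting to be scheduled, so it is a prompt to look at the task rather than a reason to abort it automatically.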
©2024 cpdn.org