climateprediction.net (CPDN) home page
Thread 'Model restarted after part completion'

Questions and Answers : Windows : Model restarted after part completion

old_user118955

Joined: 27 Nov 05
Posts: 6
Credit: 18,962
RAC: 0
Message 18122 - Posted: 13 Dec 2005, 3:16:52 UTC

My climate model (sulphur) had completed 3% in 90 hours of CPU time between 25/11/05 and 12/12/05, and then between 8 and 10pm last night (12/12/05) it restarted at 0%. The computer was left running unattended. I've checked all the files in the BOINC directory and there is no record of the previous progress except for the granted credit from 3 trickles. My account on the server also reads 0% (I updated it when I discovered the problem, because I thought the server information would update my machine if it was a CPU or update error). It is the same model with the same deadline, and no errors are reported on the server or in the log on my PC.

This is bad because there is little chance of the model getting finished in time if it updates to 0%. Also, I did not have anything else running that could have maxed out the CPU. Windows Media Player also crashed during this time. My PC specs are as follows:

AMD Athlon 1.26GHz (Barton core) with 1GB RAM running BOINC 5.2.11 on XP Service Pack 2 with ZoneAlarm firewall and several antivirus/antispyware programs.

How can the project reset itself? Is my CPU struggling with the floating point calculations, or is there a memory leak? I have also noticed that the estimated time of 1900 hours has gone up to 2200 hours (and looks likely to reach 2700 hours). Also, would running several other projects in BOINC increase the chance of this problem, with the project restarting when BOINC stops it and switches between projects?
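As a rough check on those estimates, the progress figures quoted in this post (3% in 90 CPU hours) can be extrapolated linearly. This is only a back-of-the-envelope sketch that assumes the model runs at a constant pace; it is not how BOINC itself computes its estimate, and the function name is made up for illustration.

```python
def projected_total_hours(cpu_hours_so_far: float, fraction_done: float) -> float:
    """Linear extrapolation: total CPU hours if the pace so far continues."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return cpu_hours_so_far / fraction_done


# Figures from this post: 90 CPU hours for 3% of the model.
total = projected_total_hours(90.0, 0.03)
remaining = total - 90.0
print(f"projected total: {total:.0f} h, remaining: {remaining:.0f} h")
# prints "projected total: 3000 h, remaining: 2910 h"
```

On that assumption the whole model would need about 3000 CPU hours on this machine, which is in the same range as the rising estimates mentioned above.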

What can you advise - for example is it better for me to concentrate on other projects until I have more computing power?

- Tim.
ID: 18122
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 18125 - Posted: 13 Dec 2005, 3:50:47 UTC

Did the computer get accidentally shut down or rebooted?

This is about the only thing that will occasionally cause this problem. It's happened twice to me over the last 15 months (I have several computers).

The estimated time to completion will now be unreliably inflated. What happens is that, for whatever reason, after an accidental reboot/shutdown the timesteps get set back to timestep 1 of phase 1, while the CPU time is retained. This results in an unrealistically large sec/TS and an unrealistically large estimated time to completion. You won't see any more trickles or credit until you pass the point you had reached before the problem happened.

It sucks.
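
To illustrate why the estimate inflates, here is a simplified sketch. It assumes the estimate is just sec/TS (CPU seconds divided by timesteps done) multiplied by the timesteps still to run; the real client and model may calculate it differently. The total timestep count and the post-reset timestep count are hypothetical numbers chosen for the example; only the 90 CPU hours and 3% come from this thread.

```python
def estimated_remaining_hours(cpu_seconds: float, timesteps_done: int,
                              total_timesteps: int) -> float:
    """Remaining hours = (CPU seconds per timestep) * (timesteps still to run)."""
    sec_per_ts = cpu_seconds / timesteps_done
    return sec_per_ts * (total_timesteps - timesteps_done) / 3600.0


TOTAL_TS = 1_000_000        # hypothetical model length in timesteps
cpu_before = 90 * 3600      # 90 CPU hours, as reported in this thread
ts_before = 30_000          # 3% of the hypothetical total

before = estimated_remaining_hours(cpu_before, ts_before, TOTAL_TS)

# After the reset the timestep counter starts again, but the CPU time is kept.
ts_after = 1_000            # hypothetical: shortly after the restart
cpu_after = cpu_before + ts_after * (cpu_before / ts_before)

after = estimated_remaining_hours(cpu_after, ts_after, TOTAL_TS)
print(f"before reset: {before:.0f} h remaining")
print(f"after reset:  {after:.0f} h remaining")
```

In this sketch the old CPU time is spread over very few timesteps, so sec/TS and the estimate jump; both come back down as the timestep count climbs towards where it was before the reset.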
ID: 18125
old_user118955

Joined: 27 Nov 05
Posts: 6
Credit: 18,962
RAC: 0
Message 18127 - Posted: 13 Dec 2005, 4:39:57 UTC - in response to Message 18125.  

Did the computer get accidentally shut down or rebooted?

This is about the only thing that will occasionally cause this problem. It's happened twice to me over the last 15 months (I have several computers).

> No, it had been on for over a week, but my computer did restart after an automatic security update before then and the work was not lost - but this could have been on another BOINC project so I may not have spotted it.

The estimated time to completion will now be unreliably inflated. What happens is that, for whatever reason, after an accidental reboot/shutdown the timesteps get set back to timestep 1 of phase 1, while the CPU time is retained. This results in an unrealistically large sec/TS and an unrealistically large estimated time to completion. You won't see any more trickles or credit until you pass the point you had reached before the problem happened.

It sucks.


> It is not good at all but I am lucky that it was 2 weeks of CPU time and not 6 months worth.

Also, if I had not updated the server, would I have been able to retrieve the lost results from my PC? If so, then I will know better next time. If not, then ClimatePrediction and BOINC should get their act together and sort out an automatic backup of progress on the client PC (or at least give a step-by-step process for regular manual backup). This is in the interest of BOINC, the clients and the scientific researchers.

Also, what I don't understand is why the project has to start from the beginning if there is a problem, when the trickles are on the server and the data analysis point could be retrieved in the model from either the server or a backup on the client PC. Is this because all parts of the model are interdependent and the percentage completed cannot be used to find the point in the model where the previous analysis stopped? As I understand it, the only way the model continues in BOINC is from the file on the client PC that updates the server, and which is set back to 0% if an error or interruption occurs. This is crazy and analogous to losing all the data on your computer without a recent backup. Now that is what really sucks!

- Tim.
ID: 18127
old_user118955

Joined: 27 Nov 05
Posts: 6
Credit: 18,962
RAC: 0
Message 18128 - Posted: 13 Dec 2005, 5:53:11 UTC

An unrecoverable error has appeared (file transfer error) - I cannot think why this would have occurred except for the problem outlined in this post. On the server it says 'client Error' but as far as I can tell I have done nothing wrong.

I have just received a new model, so it will be interesting to see whether I get credited for trickles before 3% is done. I wonder how long it will be before the same thing happens again?!

- Tim.
ID: 18128
old_user94880

Joined: 27 Aug 05
Posts: 156
Credit: 112,423
RAC: 0
Message 18129 - Posted: 13 Dec 2005, 6:55:54 UTC

Why was Windows Media Player running? "Windows Media Player also crashed during this time."
BOINC Wiki
ID: 18129
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 18130 - Posted: 13 Dec 2005, 7:18:28 UTC

Tim
You seem to have made up your mind that it's all BOINC's fault (and to be confusing BOINC with the science program), but if you would like to calm down and ask for help, I'll tell you where to find the procedure for backup and restore, which was devised a few months ago for multi-project people.
Automated backup for single project people was worked out ages ago.

ID: 18130
old_user118955

Joined: 27 Nov 05
Posts: 6
Credit: 18,962
RAC: 0
Message 18156 - Posted: 13 Dec 2005, 17:48:29 UTC

It appears that another Sulphur model has just aborted with the same unrecoverable error (file transfer error) - this time without doing any work on the model. I am worried that something has gone wrong with my PC (but the other Boinc projects work fine).

Also, it would be useful to know about the automated backup facility...

- Tim
ID: 18156
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 18169 - Posted: 13 Dec 2005, 23:34:02 UTC

Couldn't find it for a while. Great!
As it turns out, all of what I mentioned is in the one thread, here: http://www.climateprediction.net/board/viewtopic.php?t=2377

ID: 18169
old_user118955

Joined: 27 Nov 05
Posts: 6
Credit: 18,962
RAC: 0
Message 18171 - Posted: 14 Dec 2005, 2:07:04 UTC - in response to Message 18169.  

Couldn't find it for a while. Great!
As it turns out, all of what I mentioned is in the one thread, here: http://www.climateprediction.net/board/viewtopic.php?t=2377



Thanks Les. I think the easiest way to create a backup in the latest BOINC is:

1) stop new work for the other projects and leave them running
2) update other projects with results (not CPDN)
3) stop Boinc and copy directory to new backup
4) restart Boinc and get work for other projects
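
The copy in step 3 could be scripted along these lines. This is only a sketch: the function name and the backup folder naming are my own, the paths in the usage comment are examples rather than a known install location, and it must only be run while BOINC is stopped.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_boinc_dir(boinc_dir: Path, backup_root: Path) -> Path:
    """Copy the whole BOINC directory into a new timestamped folder.

    Run this only while BOINC is stopped, so no file changes mid-copy.
    """
    if not boinc_dir.is_dir():
        raise FileNotFoundError(f"BOINC directory not found: {boinc_dir}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"
    # copytree creates missing parent folders and refuses to overwrite
    # an existing target, so an older backup is never clobbered.
    shutil.copytree(boinc_dir, target)
    return target


# Example call (both paths are hypothetical, adjust to your own machine):
# backup_boinc_dir(Path(r"C:\Program Files\BOINC"), Path(r"D:\boinc-backups"))
```

Keeping each backup in its own dated folder means a bad backup does not replace a good one.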

If CPDN fails then to restore:

1) stop new work for all projects and leave them running
2) update other projects with results (not CPDN)
3) stop Boinc and copy directory from latest backup
4) restart Boinc and get work for other projects
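
The restore copy in step 3 could be sketched in the same way. Again this is an assumption-laden example, not an official procedure: the function name is made up, and it moves the current (failed) directory aside rather than deleting it, so nothing is lost if the restore turns out to be a mistake. Run it only while BOINC is stopped.

```python
import shutil
from datetime import datetime
from pathlib import Path


def restore_boinc_dir(backup_dir: Path, boinc_dir: Path) -> Path:
    """Replace the live BOINC directory with a copy of a backup.

    The existing directory is renamed, not deleted. Returns the path the
    old directory was moved to (it only exists if there was one to move).
    """
    if not backup_dir.is_dir():
        raise FileNotFoundError(f"Backup not found: {backup_dir}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    set_aside = boinc_dir.with_name(f"{boinc_dir.name}-before-restore-{stamp}")
    if boinc_dir.exists():
        boinc_dir.rename(set_aside)
    # Copy rather than move, so the backup itself stays intact for reuse.
    shutil.copytree(backup_dir, boinc_dir)
    return set_aside
```

Because the backup is copied and not moved, the same backup can be restored again if the model fails a second time.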

Just one question - would the older XML script cause the other projects to retrieve or process old work units even if results were previously updated to the server?

- Tim.
ID: 18171
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 18175 - Posted: 14 Dec 2005, 4:23:58 UTC

That last bit is beyond what I know. I'm a 'DOS' person, with a small bit of Unix.
Once Windows went DOS-less, I gave up. The only Windows stuff I've gotten the hang of is some HTML.
And BOINC is an art form to itself.

Not to worry, there are lots of helpful people prowling the boards.
I look forward to seeing what they say.

ID: 18175


©2025 cpdn.org