Thread 'Is this worth restarting from backup?'

mray

Joined: 30 Apr 06
Posts: 8
Credit: 10,879,641
RAC: 2,343
Message 25144 - Posted: 16 Nov 2006, 21:43:49 UTC

After going to 95% without problems I've now had two crashes. The first was related to a network write error, so I restarted from my backup and lost 5 days of CPU time. Now, just after I passed the previous crash point and gained a few days, it has crashed again. I don't know what caused it this time. Here is a pic of the message log:



The tasks panel says "computation error". Is this related to a write error on my side or CPDN's? Should I go back to the backup? It apparently hasn't reported back yet, and I stopped the client just in case. I'd hate to lose these many months of CPU time.


It's the same backup I went to last time!

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 25145 - Posted: 16 Nov 2006, 22:04:42 UTC

There are many reasons why models fail: sometimes it's a hardware problem, perhaps momentary; sometimes it's a conflict with another program (which can also involve Windows); sometimes it's an 'operator error'; and sometimes it's just the combination of values for the many parameters used to start the model.

This last one is part of the research: to find starting values that will create a model that 'runs for ever'.
Sometimes the values are just right, and the model continues for the whole 160 years.
Sometimes the model will crash soon after starting, or halfway through.
But it's all valuable information for the researchers.

And even if a model does last the full 160 years, that doesn't mean it would last for 200, or 500, etc., which I suspect is something the researchers would also like to know. But that will have to wait for another generation or two of computer hardware. :)

You could try again, and this time make another backup just before the point where it crashed each time, in case you want one last go.
Or you could cut your losses and just let it go.
In the end, it's up to you.
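For anyone wanting to script the backup step, here is a minimal sketch rather than an official CPDN or BOINC tool. It simply copies a folder into a new timestamped folder, so older backups are never overwritten. The function name `backup_boinc_dir`, the `boinc_backup_` folder prefix, and both paths are hypothetical; the location of the BOINC data folder differs between machines, and the sketch assumes the BOINC client has been stopped first so that no file is copied mid-write.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_boinc_dir(data_dir, backup_root, now=None):
    """Copy data_dir into a new timestamped folder under backup_root.

    Hypothetical helper: data_dir is wherever BOINC keeps its files on
    your machine (an assumption, not a fixed path). Stop the client first.
    """
    data_dir = Path(data_dir)
    backup_root = Path(backup_root)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"no such folder: {data_dir}")
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H%M%S")
    target = backup_root / f"boinc_backup_{stamp}"
    # copytree creates missing parent folders and raises an error if the
    # target already exists, so an earlier backup is never overwritten.
    shutil.copytree(data_dir, target)
    return target


if __name__ == "__main__":
    # Demonstration on throwaway folders instead of a real BOINC install.
    with tempfile.TemporaryDirectory() as tmp:
        fake_boinc = Path(tmp) / "BOINC"
        fake_boinc.mkdir()
        (fake_boinc / "client_state.xml").write_text("<client_state/>")
        print(backup_boinc_dir(fake_boinc, Path(tmp) / "backups"))
```

Keeping each backup in its own dated folder means a restore can go back further than the most recent copy if that one turns out to contain the problem.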

ID: 25145 · Report as offensive     Reply Quote
mray

Joined: 30 Apr 06
Posts: 8
Credit: 10,879,641
RAC: 2,343
Message 25146 - Posted: 16 Nov 2006, 22:16:52 UTC - in response to Message 25145.  

I'm giving it one more shot, and I will back up daily this time. I've got way too many hours on this WU. I'd still like to know exactly what the error means, though. Was it a write failure to the HDD or to the Internet? Is it my system's fault, CPDN's, or what?



Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 25147 - Posted: 16 Nov 2006, 22:32:18 UTC

Sorry, but it's not possible to say what happened until we see the error data written to the Oxford server.

It may have been a calculation error, rather than a write error.
Some of the maths results, when tested by part of the program, may have indicated impossible values, so the program aborted.

The error data would be in one of the many files in the BOINC set of folders, but I don't know where. It would, though, be a problem on your computer, as I said before, and not on the net or the Oxford servers.

All of the data created so far as part of the model gets uploaded to the server at regular intervals, so that contribution isn\'t lost.
Data is sent back yearly, with a larger chunk every 10 years, and a big restart dump every 40 years.
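
To put rough numbers on that schedule, here is a small sketch, not taken from the CPDN client itself. It assumes that each completed model year produces exactly one upload of the largest kind that applies (the real client may send the bigger files in addition to the yearly one), and that 95% of a 160-year model corresponds to 152 completed model years. The function names are made up for illustration.

```python
from collections import Counter


def upload_kind(model_year):
    """Classify the upload at the end of a model year (assumed schedule)."""
    if model_year % 40 == 0:
        return "restart dump"   # big restart dump every 40 years
    if model_year % 10 == 0:
        return "decadal"        # larger chunk every 10 years
    return "yearly"             # ordinary yearly upload


def uploads_so_far(completed_years):
    """Count uploads of each kind after the given number of model years."""
    return Counter(upload_kind(y) for y in range(1, completed_years + 1))


if __name__ == "__main__":
    # 95% of a 160-year model is 152 completed model years.
    print(uploads_so_far(152))
```

Under those assumptions, a model that stops at 152 years has already returned 137 yearly uploads, 12 ten-year chunks and 3 restart dumps.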

ID: 25147 · Report as offensive     Reply Quote
