Message boards : Number crunching : Is this worth restarting from backup?
Author | Message |
---|---|
Joined: 30 Apr 06 Posts: 8 Credit: 10,883,804 RAC: 1,674 |
After going to 95% without problems I've now had two crashes. The first one was related to a network write error, and I restarted from my backup. Lost 5 days of CPU time. Now, just as I passed the previous point and gained a few days, it crashed again. I don't know what caused it this time. Here is a pic of the message log: The tasks panel says "computation error". Is this related to a write error on my side or CPDN's? Should I go back to the backup? It hasn't reported back yet apparently, and I stopped the client just in case. I'd hate to lose these many months of CPU time. It's the same backup I went to last time! |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There are many reasons why models fail; sometimes it's a hardware problem, perhaps momentary, sometimes it's a conflict with another program (which can also involve Windows), sometimes it's an 'operator error', and sometimes it's just the combination of values for the many parameters used to start the model. This last is part of the research: to find starting values that will create a model that 'runs for ever'. Sometimes the values are just right, and the model continues for the whole 160 years. Sometimes the model will crash soon after starting, or halfway through. But it's all valuable information for the researchers. And even if a model does last the full 160 years, it doesn't mean that it would last for 200, or 500, etc. Which I suspect is something that the researchers would also like to know. But that will have to wait for another generation or two of computer hardware. :) You could try again, and this time make another backup just before where it crashed each time, in case you want one last go. Or you could cut your losses, and just let it go. In the end, it's up to you. |
Joined: 30 Apr 06 Posts: 8 Credit: 10,883,804 RAC: 1,674 |
I'm giving it one more shot and I will back up daily this time. I've got way too many hours on this WU. I'd still like to know what exactly the error means though. Was it a write failure to the HDD or the Internet? My system's fault or CPDN's or what? |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Sorry, but it's not possible to say what happened until we see the error data written to the Oxford server. It may have been a calculation error, rather than a write error. Some of the maths results, when tested by part of the program, may have indicated impossible values, so the program aborted. The error data would be in one of the many files in the BOINC set of folders, but I don't know where. It would, though, be a problem on your computer, as I said before, and not on the net or the Oxford servers. All of the data created so far as part of the model gets uploaded to the server at regular intervals, so that contribution isn't lost. Data is sent back yearly, with a larger chunk every 10 years, and a big restart dump every 40 years. |
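
For anyone wanting to automate the daily backup mentioned in this thread, here is a minimal sketch in Python. It is not an official CPDN or BOINC tool. The function name `make_backup`, the `backup-` folder prefix, and the `keep` limit are all made up for illustration, and the sketch assumes the BOINC client has been stopped first so that no files change while they are being copied. The location of the BOINC data folder varies between installations, so it is passed in rather than guessed.

```python
import shutil
from datetime import datetime
from pathlib import Path


def make_backup(data_dir, backup_root, keep=7, now=None):
    """Copy data_dir into a new timestamped folder under backup_root.

    Hypothetical helper: keeps only the `keep` most recent backups
    and returns the path of the backup it just created.
    Assumes nothing is writing to data_dir during the copy.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    data_dir = Path(data_dir)
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)

    # Name each backup after the time it was taken.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"backup-{stamp}"
    shutil.copytree(data_dir, target)

    # Timestamped names sort chronologically, so the oldest come first.
    backups = sorted(p for p in backup_root.glob("backup-*") if p.is_dir())
    for old in backups[:-keep]:
        shutil.rmtree(old)
    return target
```

Keeping several dated copies rather than overwriting a single one means that, after a crash, it is possible to go back to a point shortly before the failure instead of losing days of CPU time to an older backup.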
©2024 cpdn.org