Questions and Answers : Unix/Linux : What is the meaning of this?
Author | Message |
---|---|
Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0 |
It would seem my CPDN model just lost a month's worth of results. Why is it doing this? You can see the lines before and after the rewinding message; nothing extraordinary happened. 4843_200297411 - PH 1 TS 0004027 A - 24/02/1811 21:30 - H:M:S=0004:41:59 AVG= 4.20 DLT= 2.85 Preparing for restart... Rewinding a model-month... Copying restart files for model retry... Starting model ID 4843_200297411 Phase 1 Waiting for model startup, this may take a minute... 4843_200297411 - PH 1 TS 0002881 A - 01/02/1811 00:30 - H:M:S=0004:42:12 AVG= 5.88 DLT= 0.00 |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
In certain types of model instability, the model will go back to a hopefully known good point, in this case at the start of the last month, and start from there again. This gives it a chance to continue on after an error. If the error was just some odd hardware glitch that doesn't reoccur, then it will continue on OK. If the model is unstable, or the computer is unstable again, it will give up and download a new model. Usually it rewinds a day, then a month, then a year. You may not have noticed the rewind a day messages. |
Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0 |
"In certain types of model instability, the model will go back to a hopefully known good point, in this case at the start of the last month, and start from there again. This gives it a chance to continue on after an error. If the error was just some odd hardware glitch that doesn't reoccur, then it will continue on OK. If the model is unstable, or the computer is unstable again, it will give up and download a new model. Usually it rewinds a day, then a month, then a year. You may not have noticed the rewind a day messages." Oh yes, I noticed the rewind-a-day message too; that just happened some time before. And yes, here we go again: 'Preparing for restart... Rewinding a model-year... Error: Restart files for dataout/restart.year not found Giving up, this result exceeded crash count for available restart files.' The EXACT same thing happened the first time I tried CPDN, so many months ago (March 2005). Apparently, it's STILL not fixed. :-\ First it rewinds a day, then it rewinds a month, then it tries to rewind a year, but can't because it hasn't gotten that far yet, and then gives up. It has even happened so fast that it couldn't even rewind a MONTH! I can understand that S@H is not comparable to this, because a S@H WU can be finished in one day, but an E@H WU takes several days too, so why can those WUs make it through that time without crashing when a CPDN WU cannot? I am very careful to suspend BOINC before shutting down, so everything can safely restart the next time, but it would seem that still isn't enough. I can't track the evolution of the SC application, but the hadsm application seems to have evolved 2 versions since that time. I'll abort SC, since it's either that or restart and probably face it crashing again. I'll see if the new hadsm3 application fares better than its predecessor, although it looks doubtful. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I see that you don't have any trickles recorded, even after a month. Are they all still in your climateprediction.net folder, or have you done a computer merge and had them allocated to another ID? Have you read / tried the maintenance / stress testing written by UK_Nick? |
Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0 |
"I see that you don't have any trickles recorded, even after a month." A small situation report: My machine has 2 OSes, Window$ XP Home and Linux (actually there are two Linux OSes, but we'll treat them as one). BOINC is installed on both OSes. Window$ has become rather unstable (as all Window$ do), but I only need it to play certain games and it can still manage that, as long as I save often enough. When I try to run that BOINC, the OS crashes. Since I managed to make one of my Linux OSes a DVD player, I don't start it up that often anyway. On the Window$ BOINC there are some S@H WUs (behind deadline) and CPDN WUs (before deadline) present, but I can't finish them for the reasons I outlined above. On the Linux BOINC I first had S@H, then CPDN, abandoned CPDN (full detach), then E@H, and now re-attached CPDN. The reason why I don't have any trickles recorded is rather obvious: none of the WUs I (try to) process can run long enough to produce any trickles! In Linux the WU crashes (as you've seen) and in Window$ the OS crashes. I have done no computer merge on my CPDN account, only on the S@H and the E@H accounts a short while ago. Strange, the number of computers seems to match the number of CPDN WUs on each OS, but I guess that's a coincidence. I've got a great deal of client errors on my result page. Some of the (non-errored) WUs are from before I detached (and thus I don't have them anymore), some are on my Window$ partition, and two are currently in my Linux BOINC CPDN folder. Currently BOINC is in EDF mode, and since there are two E@H WUs there, which obviously have shorter deadlines than the CPDN WU, they are being crunched first. Maintenance / stress test? |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
<a href="http://www.climateprediction.net/board/viewtopic.php?t=2124">Maintenance</a> <a href="http://www.climateprediction.net/board/viewtopic.php?t=2126">Tests</a> Edit: Sorry about the long strings. Carl has updated the server software. |
Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0 |
"<a href="http://www.climateprediction.net/board/viewtopic.php?t=2124">Maintenance</a>" You DO know this is a Linux forum and those links are to exe files, right? At least one of them seems rather un-wine-able. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"You DO know this is a Linux-forum and those links are to exe-files, right? At least one of them seems rather un-wine-able." And you said you have both Windows and Linux on those PCs. Testing hardware to see if it's reliable in Windows should suffice for determining hardware stability in Linux for CPDN. And, while the Prime95 Windows executable might be linked from that thread, there is a Linux binary for Prime95 as well. |
Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0 |
"And you said you have both Windows and Linux on those PCs. Testing hardware to see if it's reliable in Windows should suffice for determining hardware stability in Linux for CPDN. And, while the Prime95 Windows executable might be linked from that thread, there is a Linux binary for Prime95 as well." I believe I also said Windows XP is only on here to play certain games, and since I've still got lots of DVDs I need to watch on this PC (I've only got ONE PC, despite what it may say in my profile, and no stand-alone DVD player) I rarely start it up anymore. I've got something monitoring my hardware in Windows (MBM version 5), left over from when I was trying to determine why I was unable to (re-)install one of my two Linux OSes; it seems some bad sectors on the XP partition were the cause of that. However, although I tried, I was unable to set up logging, so I can't determine what goes wrong when that OS (XP) crashes. As long as it didn't crash, all parameters were within safe range. Also, yesterday the link to the Prime95 tests was unavailable, and I searched but could not find the Prime95 Linux version whilst googling. Also, if something goes wrong: a) shouldn't S@H and E@H suffer too, and b) can't it (CPDN) tell me WHAT's wrong in the logfile (or at least nudge me in the right direction) BEFORE it dumps the restart/rewind message? |
Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0 |
I would also like to add that Prime95 DID find discrepancies whilst running its torture test, but a) they were found very quickly, much more quickly than CPDN gives even a rewind message, and b) alongside the torture tests an E@H WU was crunching and it did not give even the smallest hint of something going wrong! |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"I would also like to add that Prime95 DID find discrepancies whilst running its torture test but a) they were found very quick, much quicker than even CPDN gives a rewind message and b) alongside the torture tests an E@H WU was crunching and it did not give even the smallest hint of something going wrong!" CPDN stresses both processor and memory. In particular, the memory is stressed much more in CPDN than in any other distributed computing project. If you have errors in Prime95, I have no doubt you will also eventually error out in CPDN. |
Joined: 24 Feb 05 Posts: 28 Credit: 121,749 RAC: 0 |
"In particular, the memory is stressed much more in CPDN than in any other distributed computing project. If you have errors in Prime95, I have no doubt you will also eventually error out in CPDN." In that case, there's probably something wrong with my memory, which doesn't affect S@H or E@H because they don't take up as much memory as CPDN, but in CPDN eventually it does, and always at about the same time. Guess I'll have to suspend CPDN then, until I can get my memory modules checked and (probably) replaced. |
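
The rewind behaviour described in this thread (rewind a day, then a month, then a year, and give up when the needed restart file does not exist) can be sketched roughly as follows. This is only an illustration of the escalation as the posts describe it, not the actual CPDN client code; the function name `next_action`, the `REWIND_LEVELS` list, and the idea of passing in a set of available restart levels are all hypothetical.

```python
from typing import Optional, Set

# Hypothetical rewind levels, in the order the thread describes:
# a day first, then a month, then a year.
REWIND_LEVELS = ["day", "month", "year"]


def next_action(crash_count: int, available_restarts: Set[str]) -> Optional[str]:
    """Return the restart level to rewind to, or None to give up.

    crash_count is the number of crashes seen so far (1 for the first).
    available_restarts holds the levels for which a restart file exists.
    """
    if crash_count < 1:
        raise ValueError("crash_count must be at least 1")
    # More crashes than rewind levels: nothing left to try.
    if crash_count > len(REWIND_LEVELS):
        return None
    level = REWIND_LEVELS[crash_count - 1]
    # A young model has no yearly restart file yet, so a rewind to
    # "year" cannot happen and the result is abandoned.
    if level not in available_restarts:
        return None
    return level
```

For the case in the thread, the third crash asks for a yearly restart that has not been written yet, so `next_action(3, {"day", "month"})` returns `None`, which corresponds to the "Giving up" message.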