Message boards : Number crunching : New work Discussion
Author | Message |
---|---|
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Each model has a lot of files open that need to be saved. If a model is in the process of checkpointing at the time of a crash, then what's on the disk will be part old save and part new. When it tries to start up again, the files don't match, so the model can't start. The more models that get crammed into the newer computers with their large number of processors, the more likely it is that some model will be checkpointing at any given moment. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Each model has a lot of files open that need to be saved." I can see how that would fail, but that's because it's a very bad design. The next checkpoint should be saved to a second file, then the old one deleted, then the new one renamed if necessary. Almost every program does this, including, say, a word processor. When you click save, you can see a temporary file appearing for a fraction of a second. If the power is cut off while that is happening, the original file is not destroyed. |
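The save-to-a-second-file idea in the post above can be sketched in a few lines. This is only an illustration, not how CPDN or BOINC actually write checkpoints: the function name `save_checkpoint_atomically` and the `.ckpt-` prefix are made up for the example, and instead of a separate delete-then-rename it uses Python's `os.replace`, which swaps the new file over the old one in a single step. It also handles one file only, whereas the models discussed here keep many files open.

```python
import os
import tempfile


def save_checkpoint_atomically(path: str, data: bytes) -> None:
    """Write data to a temporary file, then swap it into place (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory as the target so the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the bytes to disk
        # Replace the old checkpoint with the new one in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the partial temporary file; the old
        # checkpoint at `path` has not been modified.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the machine dies before `os.replace` runs, the old checkpoint is still the one on disk and only a stray temporary file is left behind.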
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Surely Windows waits for Boinc to close all files first? In the case of LHC, I get an impatient warning from Windows saying Virtualbox has "active connections". I can click shut down anyway, or cancel. I click cancel, then watch in the task manager until I can see no processing, disk activity, or network activity is happening, then shut down again. Or sometimes I remember to close Boinc first, wait until zero activity, then shut down Windows. Windows still claims Virtualbox has "active connections", which I just ignore. I'm not going to go into Virtualbox itself and mess around to stop stuff. LHC tasks seem to be quite robust though; I've only ever seen them go wrong if the system crashes. And once when a hard disk was failing: although it wasn't producing errors, it was going very slowly, so the LHC tasks were giving up waiting for disk access and saying "computation error" in Boinc. It took someone in LHC to tell me they'd seen a certain error in the log file so I knew the disk was too old and tired! "Same happened with two on a working machine, which I rebooted cleanly. Should Boinc not gracefully shut down running CPDN tasks itself?" "You are right, BOINC should restart the task from the last checkpoint reached. In the past, my memory is of this being a bigger problem with Linux tasks but I haven't had a problem with it recently, even when I have updated the Linux kernel which requires a reboot. My experience a few years ago was that a kernel change combined with a reboot greatly increased the chances of tasks crashing." I can't have power problems on my main machine as it's protected by a UPS, but I haven't gone to the expense of a larger one for the 6 Boinc-only machines. If the UPS goes onto battery mode, Boinc immediately suspends to make the battery last longer, and the monitors are turned off. If the battery is almost empty, the PC hibernates, so any tasks should resume where they left off.
Although I rarely get proper power cuts, I do get the odd fraction-of-a-second power cut. And the voltage varies from 241 to 256V (it should be 230V), so the transformer in the UPS levels that off. That's more to protect the house lighting actually, since I had a lot of LED lights fail with bad voltages, and because a 0.5 second power cut once caused a bad corruption of the system disk and I lost a few documents. The cheapest thing to do is buy a second-hand UPS with a busted battery, and replace the battery by connecting the UPS to two (if it needs 24V) or more large car batteries. Much cheaper than the sealed ones that come with it, and you get a lot of run time! |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Whatever the fine details are, it works. I tested this a few days ago with a new test model, and it worked perfectly. So it's your computers and the way that you use them. And anyone else silently having the same problem. Just something that will have to be lived with, I guess. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Whatever the fine details are, it works." I can't think of anything unusual that would cause this. It's either a computer crash or power failure while CPDN is running, or me simply clicking restart in Windows without closing Boinc first. I would imagine most folk experience these things. There's a small chance it will happen, and usually only 1 or 2 of the running tasks break, but the chance is larger with a hard crash or power off than with a plain reboot. If you deliberately power off a Windows machine to simulate a power cut, for example, I'm sure after a few times, some of the tasks would cause a computation error. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
"If you deliberately power off a Windows machine to simulate a powercut for example, I'm sure after a few times, some of the tasks would cause a computation error." I have seen hard resets, i.e. a power outage or hitting the power switch, crash tasks on more projects than just CPDN. However, while not having coded for decades and never anything as complex as the climate models or BOINC itself, I still say that it should be possible to write the code so that even then it can resume after a checkpoint. I accept that that might mean more disk usage during computation. The problem is that most of the code used in the executable files for CPDN is over a million lines of Fortran (before compiling) that comes from the Met Office and is used under a license that does not allow the sort of playing with the code that might be needed to resolve the problem. It will be interesting to see what happens with OpenIFS eventually, as that code is open source. Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer. |
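One way the "more disk usage" trade-off mentioned above could look in practice is to keep two checkpoint slots and a checksum, so that a half-written slot is detected and the older one is used instead. The following is a rough sketch under those assumptions and has nothing to do with the actual Met Office or BOINC code; `save`, `load_latest`, the slot naming and the SHA-256 prefix are all invented for the illustration, and it covers a single checkpoint file rather than the many files a real model has open.

```python
import hashlib
import os


def write_generation(path: str, payload: bytes) -> None:
    """Store payload prefixed with its SHA-256 digest (64 hex characters)."""
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    with open(path, "wb") as f:
        f.write(digest + payload)
        f.flush()
        os.fsync(f.fileno())


def read_generation(path: str):
    """Return the payload if the file exists and its digest matches, else None."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except FileNotFoundError:
        return None
    digest, payload = blob[:64], blob[64:]
    if hashlib.sha256(payload).hexdigest().encode("ascii") != digest:
        return None  # truncated or corrupted slot
    return payload


def save(base: str, step: int, payload: bytes) -> None:
    # Alternate between two slots so the previous good checkpoint is
    # never the file being overwritten.
    write_generation(f"{base}.{step % 2}", step.to_bytes(8, "big") + payload)


def load_latest(base: str):
    """Return (step, payload) from the newest valid slot, or None if neither is valid."""
    best = None
    for slot in (0, 1):
        data = read_generation(f"{base}.{slot}")
        if data is None or len(data) < 8:
            continue
        step = int.from_bytes(data[:8], "big")
        if best is None or step > best[0]:
            best = (step, data[8:])
    return best
```

The cost is roughly double the checkpoint disk space, in exchange for always having one slot that was not being written when the power went.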
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
A million! Ok, we should be happy it works at all! "If you deliberately power off a Windows machine to simulate a powercut for example, I'm sure after a few times, some of the tasks would cause a computation error." |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
"A million! Ok, we should be happy it works at all!" Yes, when thinking about what we would like the code to be like, I am reminded of the story of someone in a remote rural village asking for directions. The local thinks for a minute before replying, "Well if I wanted to go there, I wouldn't want to be starting from here." |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
So in that story, I'm the innocent person asking for directions, and the programmers are the ones giving deliberately obtuse answers to evade the problem? :-P "A million! Ok, we should be happy it works at all!" And is that from a Monty Python sketch? I've heard it before somewhere. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The programmers are in the UK Met Office, who aren't involved in this project. So you're just making things up. And no, the quote is Not Monty Python. It's much older than that. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
"And no, the quote is Not Monty Python. It's much older than that." The earliest reference I found to it being in print with a cursory web search was 1924, but it may be quite a bit older even than that. |
Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
Patience, I suppose. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"The programmers are in the UK Met Office, who aren't involved in this project." I'm not making anything up; I didn't say who I was having a go at. But I have no idea why you used the quote. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
I probably heard someone else repeating it, a stand-up comedian or something. "And no, the quote is Not Monty Python. It's much older than that." "Earliest reference I found to it being in print with a cursory web search was 1924 but it may be quite a bit older even than that." |
Send message Joined: 31 Aug 04 Posts: 7 Credit: 56,560,127 RAC: 9,316 |
Thank you guys who overclock or have failed tasks. I am getting tasks with 2 or 3 at the end and I run them okay. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Thank you guys who overclock or have failed tasks. I am getting tasks with 2 or 3 at the end and i run them okay." Some might be mine. In which case you owe me a pint. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
For Linux users, there may be another new model type on its way: HadSM4. These are similar to HadAM4 but use a slab model of the ocean rather than surface temperatures. The first six-month runs of these seem not to have problems in testing. (Five have completed. My five have about four hours to go.) I would guess there will be an official notice about these closer to them appearing on the main site. The time scale for these is at present anyone's guess but, as Les said in another thread, "Don't forget to keep breathing." |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
There are more UK Met Office HadCM3 shorts available. I will leave them for the Windows users. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"There are more UK Met Office HadCM3 shorts available." Well that didn't last. I just told 6 machines to grab some, and none are left. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
"There are more UK Met Office HadCM3 shorts available." "Well that didn't last, I just told 6 machines to grab some, and none are left." ... I'm not sure there were any, as there haven't been any additions to the work unit list. |
©2024 cpdn.org