Questions and Answers : Windows : Comments for 'Generic solutions to models' sticky
Author | Message |
---|---|
Joined: 1 Sep 06 Posts: 11 Credit: 4,627 RAC: 0 |
Hi, I have since downloaded the Seasonal Attribution as well and at one stage both were running fine. I also started using Prime95 and did a torture test before starting to search. No problems running it. Yesterday I downloaded and installed new drivers. Now all my projects are running fine. I noticed the first improvement when I changed my target CPU temperature settings to 40C (AMD), so it does not reach 50C (before using Prime95). Then the projects detached. Happened a few times with just Climate Prediction. Not sure why. I also made some other changes to the registry, but think the temperature setting fixed it as other projects also caused it to hang at one stage. It is early days, but things look fine now. Thanks |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Tobie: Excellent, hope it continues to run :-) Rick: Have you tried Prime95's torture test for a day or so? This will let you confirm that the hardware is OK under stress. The model is pretty much the most stressful software the PC will ever run. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
. . . in addition to which, some of us have been running this "dicey" scientific software on a variety of hardware, on different flavors of MAC/Windoze/Linux, with few problems. Those are the tens of thousands of participants you don't hear from on these Boards. I've personally had software troubles -- to be sure -- but they were in CPDN Alpha and Beta tests. The rest? Power problems, hardware issues, self-inflicted wounds. (I'm old, so I don't mind admitting personal imperfections.) "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Joined: 12 Dec 05 Posts: 4 Credit: 93,834 RAC: 0 |
"Tobie: Excellent, hope it continues to run :-)" Hi Mike, Thanks for the suggestion. I waited for CPDN to, uh, disappoint again (got to 6% this time) then downloaded & ran Prime95. 53 hours, no problems. Whatever is going on here, it just passed my time budget for tending to a screensaver. Good luck with the climate prediction for those who remain. I'm off to other BOINC projects. -Rick |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Rick, on looking at your results I find that 3 of your past/crashed models show -161 errors. These probably happened because of some incident/event on your computer. Have you tried to identify from Mike's top post in this thread what the problem(s) may have been? 4 of your past models, including the one that's just gone down, crashed with -107 codes. This usually indicates a graphics problem, momentary or longstanding, on your computer. Have you gone to the link Mike provides in his second post up above? Have you, as Thyme Lawn explains in the link, updated your graphics card driver (free download from the web, a bit like Windows updates)? We advise everyone to back up their climate models. The models take upwards of 4 months to complete (up to a year on my computer), running 24/7. It is fairly likely that something will go wrong on the computer during such a long period. So these long workunits/simulations are only suitable for committed members willing to make regular backups. Cpdn news |
Joined: 12 Dec 05 Posts: 4 Credit: 93,834 RAC: 0 |
Hi Mo, Thanks for the research, but I never doubted that the problems involved my computer. I'm sorry if people are taking my comments personally. From the early days of SETI@home, my understanding was that these were programs that could use my computers' idle periods, with negligible human input. I am definitely not a "suitable committed member" willing to manually do checkpoints & restarts and shield the program from other tasks. As I said, I have retreated to the other BOINC projects with their Load/Launch/Leave paradigm. Best wishes, Rick "Rick, on looking at your results I find that 3 of your past/crashed models show -161 errors. These probably happened because of some incident/event on your computer. Have you tried to identify from Mike's top post in this thread what the problem(s) may have been?" |
Joined: 23 Sep 04 Posts: 1 Credit: 91,424 RAC: 0 |
Hmmmmf. Hi, I would suggest switching off the screen saver, as it is a waste of your computer's energy. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Rick and others unhappy with this program: This is NOT a simple little toy that's been designed for the amusement of the public. It's a huge (1 million lines plus) Fortran program, written over many decades by many software engineers, and normally runs on the Met Office's supercomputers to predict both weather and climate. Getting it to work on a desktop computer took a couple of years, and it does NOT like working with Windows and the thousands of different programs that people use at the same time. Of course it's "fragile"! Which is why any computer that attempts to run it needs to be VERY stable, where the definition of "stable" is the ability to run these climate models without crashing. And there are a lot of us WITH "stable" computers. If your computer hardware/software isn't capable of running the models, then there are some simpler, easier projects listed here. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I'm not even certain that the climate model workunits are more fragile than workunits for any other boinc-based project. If you run a workunit for some other project and it needs 24 hours of computer time to complete, the likelihood of it crashing will be X. A climate model running on the same computer might take 5 months to complete, i.e. 150+ days. The likelihood of a model crash during this time will be roughly X multiplied by 150 (for small X). This is obviously quite a high-risk venture, so taking the precautions Mike describes above is well worthwhile. Backing up is also essential to avoid disappointment. Cpdn news |
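A quick way to check the arithmetic in the post above, as a minimal sketch: assuming each day of crunching carries the same, independent crash probability X (an assumption; real failures are not this tidy), the chance of at least one crash in N days is 1 - (1 - X)^N, which stays close to N times X only while X is small. The function name and the sample values of X are illustrative, not taken from the project.

```python
def crash_probability(p_per_day: float, days: int) -> float:
    """Chance of at least one crash over `days` days.

    Assumes every day has the same, independent crash probability.
    """
    return 1.0 - (1.0 - p_per_day) ** days


if __name__ == "__main__":
    # Hypothetical per-day crash probabilities, for illustration only.
    for p in (0.0001, 0.001, 0.01):
        exact = crash_probability(p, 150)
        linear = p * 150  # the "X multiplied by 150" estimate
        print(f"X={p}: exact={exact:.4f}, X*150={linear:.4f}")
```

For very reliable machines the two figures nearly coincide; for a shakier machine the simple multiplication overstates the risk (it can even exceed 1), but either way the risk over five months is far larger than for a one-day workunit.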
Joined: 7 Dec 06 Posts: 3 Credit: 544 RAC: 0 |
I'm afraid that I have to agree with the above post. Being a seasoned software engineer for 20 years (5 years of which were at the Met Office in Bracknell) I know that you cannot write code perfectly - if you are talking about 1 million lines of Fortran code (god I hated using Fortran when I worked there) then there are bound to be errors in the code - it is just infeasible that the code created is 100% perfect, so blaming people's PCs for the problem is a little harsh; it is more likely that a certain combination of data is forcing the code down an unexpected path and causing a crash. The work packages are enormous - I used to run the project but gave up after having it restart itself back at 1920 3 times. I appreciate that it may be difficult to do, but reorganising the code to allow for smaller chunks of work - or at least to fix a "last known good state" at regular intervals - would greatly increase the number of results you are receiving. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
"smaller chunks of work" will hopefully become possible sometime next year, once some code is written and tested. This is what the "restart dumps" every 40 years are for. But currently, the core team have higher priority matters to deal with. One of which is to solve the problem of the 'climateapps2' server falling over because of the huge work load. Then there's the programs that will allow the global researchers to access the data for their own analysis without causing problems for the storage servers. These people are some of those 'paying the piper', so they get a bit of a say in the priorities. As for blaming people's computers, if you read back through the posts of the last couple of years, and the replies to them, you'll see that a lot of people have been trying to run the program on seriously under-powered machines, which they think will work 'because SETI works on them'. Such as only having 128 Megs of ram, a cpu speed of 98 MHz, an OS of 98ME, etc. And you'll also find that a lot of people with "faulty computers" have been helped to get the program going with just a bit of advice. One such from way back was a person who 'laid down the law' about how it was the program, and not his computer. It turned out that he'd had a psu failure, opened the case to replace it, then hadn't replaced the cover. After some advice from 'yours truly', as they say, when he cleared away a few things so that he could see into the case, he found that the cpu heatsink was covered with a thick layer of dust, from a 'year of neglect'. Or so. Last I heard, his computer was crunching away successfully, and he was a 'happy little vegemite'. (Local saying.) So, how may we help you? (Starting with some advice that you can obtain a 'better' name for yourself by changing it in the preferences on your Account page.) PS "last known good state" IS written "at regular intervals". 
But making sure that the tyres on your car have a good tread, and are inflated to the correct pressure, is no use if you get hit by a loaded, out-of-control truck. And similar things can happen to the model/program which make the checkpoint useless. Which is where backups come into the picture. As an IT person, you'll know about the value of having everything safely backed up every night. Backups: Here |
Joined: 7 Dec 06 Posts: 3 Credit: 544 RAC: 0 |
"'smaller chunks of work' will hopefully become possible sometime next year, once some code is written and tested. This is what the 'restart dumps' every 40 years are for." Of course - this is understood - business as usual tends to take precedence over this sort of thing.
agreed - but not everybody - and as was posted here - many people who have had problems have simply not bothered to post. I myself have crashed, as I say, and my PC is pretty stable: 3.4 GHz dual CPU with 1 GB RAM + 1 terabyte of disk storage. It may not have been your intention but the responses came across as "well it can't be us it must be your fault", which is going to get people's backs up a bit.
the joy of users
I will get around to this at some point soon
mmmm - well - my definition of regular would be at least once per day. If the last known checkpoint can get destroyed then it is not an effective checkpoint and how it is done needs to be rethought. I work in the telecoms industry now where we process realtime telephone data (tens of millions of items of data every day) - every time a process crashes it HAS to know how to recover, otherwise people lose revenue (lots of revenue), so I know these procedures are difficult, but not impossible to accomplish.
I understand this - but you have to realise human nature - probably 95% of the people who attach to this project did so thinking - "ooohhh, that's a pretty screensaver and I'm doing a bit for the environment as well". It is also a screensaver, which by definition is hands off; you are simply not there when it is running. People are not going to remember to do backups, they just aren't going to remember. By insisting on this policy (and with it being as unstable as it is, it does become mandatory) you are restricting yourself to the IT literate amongst us. I am not having a go at the project - I appreciate what it is trying to achieve, but you have to accept that most people are just running this for fun with the hope that it helps understanding of climate change. If it becomes "difficult" they are just gonna quit. This is climateprediction.net's loss as one more person detaches from the project to run one that is more stable. I will try and stick with the project and try and remember to do regular backups but it is the most frustrating screensaver I have ever used :-) |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Nazarine Just to clear up one point first. This program is NOT a screen saver. That concept went out the window when BOINC was created back in 2004. However all of the projects still provide screensavers for people who want to look at the pretty picture for some reason. (Perhaps to get the attention of visitors, who may then get interested in running it themselves.) What it is now, is a low priority background process. The core team, as well as us 'old timers', do realise that lots of people run it just for fun, just as lots of people join a marathon 'run' just for fun. I don't know about the UK, but here in Sydney, we have a City to Surf (14km), in August. This has a category for fun runners; people with babies in strollers, army groups in full uniform, people in gorilla suits, etc. I've considered it myself a few times, (in the walking 'group'), but only as far as the bottom of Heartbreak Hill, which is a real killer. Literally in a few cases. If people want to run these climate models for fun, that's fine. A lot of people have started out that way on the BBC part, and then given up. Then decided to complain that the BBC said that anyone could run it. After a few questions and answers, they suddenly realised what they were doing wrong, and have continued. Some have become so good at it, that they are now giving advice to other newcomers. Back in about 2003, the researchers had hoped that perhaps they might get as many as a thousand people running the models, and hoped that they might get a couple of thousand of the original slab models, (simple static ocean). They eventually got back over 170,000 of them. Currently, they are hoping for 5,000 of the TCMs, and at the moment have 3526, with, apparently, more than 30 coming in each day. "mmmm - well - my definition of regular would be at least once per day. If the last known checkpoint can get destroyed then it is not an effective checkpoint and how it is done needs to be rethought. 
I work in the telecoms industry now where we process realtime telephone data (tens of millions of items of data every day) - every time a process crashes it HAS to know how to recover, otherwise people lose revenue (lots of revenue), so I know these procedures are difficult, but not impossible to accomplish." The checkpoints are written once every 6 model days, which is about every half hour on my 3.2 GHz P4. It was once per model day in the previous simpler models, but people complained about the too frequent hd access. I had hoped not to have to go through this typing again, which was why I simplified it. However. The model data is saved once every 6 days, and various other data are also saved, such as atmosphere restart, ocean restart, day restart, month restart, and year restart. BOINC also saves its work log regularly, in client_state.xml. When the program is shut down, the model starts again from the last checkpoint. If a problem with unrealistic values is encountered during running, the program rewinds one model day, and tries again. If the problem is encountered again, it rewinds a model month and tries again. Finally, it will rewind a model year and try again from there. If the problem was a momentary computer problem, the model should, during one of these retries, get past the failure point. If it doesn't, the model will be aborted, BOINC will upload some info to the servers, and then request a new model. However, BOINC and Windows don't get along too well. Some Windows programs, for instance, take over the port used by boinc.exe (the "hidden" worker part) to 'talk' to boincmgr.exe, (the visible part, or gui, which displays what the worker part is 'up to'). Sometimes when this happens, the model just "disappears". After all, if the gui can't 'see' the worker, it can't display anything. (This can usually be cured by menu/Exit from the gui, then re-booting the computer.) 
Sometimes, however, for various reasons, usually hard to pinpoint, BOINC loses track of the 'child process', (the climate program itself), so it thinks that this has stopped. It then assumes (I know, bad word, but it's the best there is), that the model has completed, sets the percentage-to-completion to 100%, ticks the model off its list of 'work units in the queue', and tries to upload the final zip files. Which don't exist, so it returns a 161 error; file not found. As the model is no longer listed in its work queue, BOINC has now forgotten about the existence of the climate model, which is why the only way to get it back is from a previous backup. All suggestions for improvements to BOINC should be directed to the BOINC site, at the University of California, Berkeley, here. If I still haven't explained something, please post again. |
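The rewind behaviour described in the post above can be sketched roughly as follows. This is an illustration of the day/month/year retry idea only, not the project's actual code: `run_with_rewind`, `run_from`, and the 30-day month and 360-day year lengths are all assumptions made for the example.

```python
from typing import Callable, Optional

# Rewind distances in model days. The 30-day month and 360-day year
# are assumptions for this sketch, not values confirmed by the thread.
REWIND_STEPS = (("day", 1), ("month", 30), ("year", 360))


def run_with_rewind(run_from: Callable[[int], bool], failed_day: int) -> Optional[str]:
    """Retry a failed model from progressively earlier restart points.

    `run_from(start_day)` is a hypothetical callback that restarts the
    model at `start_day` and returns True if it gets past the failure.
    Returns the name of the rewind step that worked, or None if every
    retry failed (the case where the model would be aborted).
    """
    for name, span in REWIND_STEPS:
        start_day = max(0, failed_day - span)  # never rewind before day 0
        if run_from(start_day):
            return name
    return None
```

A momentary glitch would typically clear on the first retry, so the function would return "day"; a result of None corresponds to the abort-and-fetch-a-new-model case, which is where only a backup can recover the work.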
Joined: 7 Dec 06 Posts: 3 Credit: 544 RAC: 0 |
Sorry about this - my final post of the night I promise :-) Does the checkpoint handle completely unexpected results, i.e. if I suddenly had a power cut would it recover from the last checkpoint (my PC is similar spec to yours so from about the previous 30 minutes' worth of processing)? This is as opposed to an out of bounds value turning up in the data. If it does this why would it ever restart back at 1920? If it does this because it has reached a data dead end (i.e. the model has realised it is spiralling unrealistically out of control and decides to abort) is the data produced up to that point useful? If the problems are being caused by PC hardware failure you would expect the checkpoint system to be able to rewind to the recent checkpoint position and just carry on regardless - hardware problems are extremely unlikely to occur in the same processing place and so you are always going to steadily increase your progress (albeit with the odd slight rewind). Thanks for your patience in answering my questions - I am running the Prime95 Torture Test at the moment to ensure I have no hardware issues to explain my resets and as I say I will be attempting to do backups at regular intervals. It is way past my bedtime (1:45am) and I need sleep now. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Nazarine When you refer to your 'resets' what do you mean? Have you had climate model crashes? On your project pages I can only find one model that you've had. If you've reregistered under a different name but had model crashes previously, we can point you to advice about how to avoid the same problems again. The model never goes back to 1920. If the model crashes, its data get uploaded to the server and a new model downloads. The way to continue processing the crashed model is to restore a backup, which is almost always successful. At the moment there can't be a backup/restore button/facility within the boinc manager because to use it you'd have to be using the open boinc GUI, but backups only work if you back up the whole boinc folder - which means that before backing up you have to exit from boinc itself. So the person has to make regular backups, for which our README compilations describe several methods. Auto backup methods also work, as long as you remember to exit from boinc beforehand. Carl, who's the chief programmer for the project in Oxford, considers the models to be extremely stable. Almost the only instability problem emanating from the models themselves is that some of them turn out to 'loop', i.e. at a certain point refuse to crunch forward in time. Fortunately, the data created up to that point is useful to the researchers. As Les has said, most model failures are due to: * hardware that doesn't meet the minimum specs, particularly as regards RAM * members who are unaware of the basic precautions required during normal computer use to protect the model * catastrophic or unexpected hardware failure, external events, or software events. Regular backups are the ultimate line of defence against all of these. This is where we are now. Cpdn news |
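The manual backup described in the post above boils down to: exit boinc completely, then copy the whole boinc folder somewhere safe. A minimal sketch, assuming the folder locations are supplied by the caller; the function name and the `boinc-backup-` folder prefix are made up for illustration, and the script does not check whether boinc is still running, so exiting it first remains the user's job.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_folder(boinc_dir: Path, backup_root: Path) -> Path:
    """Copy the entire BOINC folder into a new timestamped backup folder.

    Exit BOINC (manager and client) before calling this; the function
    itself does not verify that. Returns the path of the new backup.
    """
    if not boinc_dir.is_dir():
        raise FileNotFoundError(f"BOINC folder not found: {boinc_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"
    # copytree creates `target` (and any missing parents) and fails if
    # `target` already exists, so an older backup is never overwritten.
    shutil.copytree(boinc_dir, target)
    return target
```

Restoring would be the reverse: with boinc exited, replace the live folder with the chosen backup copy. Keeping several timestamped copies rather than one means a backup taken after a problem started does not wipe out the last good one.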
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
"Does the checkpoint handle completely unexpected results, i.e. if i suddenly had a powercut would it recover from the last checkpoint?" Yes. At least for me. The half dozen that I've had range from a brownout that lasted a bit too long, to a 2 1/4 hour power failure. (The latter was in summer. And in the 5 minutes that it took me to get a torch, look up a number, find the old rotary dial phone, then an adaptor, and get on to the power company, there was already a recording saying that it would last about 3 hours. I think that someone made a boo-boo at a nearby substation.) I was using the computer at the time, when everything went black. It took a few seconds to get over the shock and work out where the nearest torch was. I even had one while I was out shopping. I knew this because the F Lock LED on the base unit for the wireless mouse/keyboard was on, so the computer had rebooted. A check of the times in the messages of the gui told me when it had happened. On other occasions of a short failure, (nearby electrical storm), while I've been using the computer, I've been able to see/hear what happens. First the click as the power supply turns off, then the display going black, and the hds winding down. After a few seconds the computer powers up, followed by Windows, then BOINC, and finally the model. But I DO have a larger than normal, 400W psu. When I built the computer, I went for overkill, (at a reasonable price), so that in 10 years time it would still have a chance of coping with the programs available then. But now we have Conroe, and a different philosophy for designing the cpus. Still, my machine is rock stable. (Helped by not being near an earthquake zone. :) ) PS I should have mentioned the 4 README files here, which are full of hints and tips, or links to them. |
Joined: 29 Jun 06 Posts: 4 Credit: 1,883,202 RAC: 0 |
>>... and the second is a -161 error in the log. I've had this error more than 4 times on one machine. It's an AMD 64 under XP Pro. >>* If you use Norton or Sophos antivirus, No. >>* Before playing games or other heavy duty applications ... That's possible, sometimes I run games. >>* Run a stability test on your machine... That's all OK on my computer. >>* Overheating can cause instability... Overheating wasn't the problem. >>* People who have overclocked their machines... Nope. >>* Make backups about once per week I make them daily, but when I replay one ... Groundhog Day ... >>* Watch out for firewall messages ... Firewall wasn't the problem. >>* Windows 'time sync' messages ... Hardly probable. >>* The benchmark boinc runs every 5 days ... Not the problem. >>* The Memory requirement for XP machines is now 512MB 1 GB The error is only on an AMD Athlon 64 under XP Pro. The other machines are Sempron 3000, Sempron 3200, Athlon 3000, T2400 (all XP) and one AMD 64 but under Win 2003 Server. They are not affected. Oh, by the way. On this machine CPDN keeps running when stopped, using 50% of the remaining capacity. Normally I see 45% in Task Manager when stopped. CPDN can't be stopped by user or other tasks, it runs all the time. That's not normal. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Regarding the problem with the task still running when Boinc is shut down, this can happen if there are problems with the local network stack (you might notice that your web browser sometimes seems to 'stick' on that PC). If there is any unnecessary network software running, try to get rid of it (I uninstalled two different VPNs on my box, and that solved a similar problem). Could you provide a link to the PC you're having trouble with? You have several in your list. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 1 Sep 06 Posts: 11 Credit: 4,627 RAC: 0 |
I've now run CPDN on and off, sometimes for a few days and other times for half an hour or maybe a few hours before it freezes up. It always starts up again after reset, so no problem losing WUs when rebooting. I'm still trying to figure out why I get the problem though. Maybe it has to do with dual core running two projects all the time or my motherboard/RAM combination. Sometimes my other projects also freeze the computer, but not as easily as CPDN. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
By freezing up, do you mean that your PC locks solid and you have to power-cycle it to get control back? If so, there are several possible causes: * CPU overheating (usually slows down a bit first and then powers itself off) * Memory faults causing the PC to fail (usually blue-screens) * Insufficient memory (everything slows down as if the PC was in treacle, constant traffic on the hard disk, and switching between applications becomes incredibly slow). I think the third is most likely on your machine. The recommended memory to run the model is 512MB, and you have less than half of that. I would recommend disabling the CPDN model until you have more memory or maybe a new PC in the future - there is a risk of hard disk damage from the continual use of the paging file. You could also try trimming down your machine a bit to remove anything which is unused from memory (i.e., go through the system tray and uninstall or disable anything you don't actually need). This will free up a little space, but it would be a very tight fit even if you weren't running any other projects. I'm a volunteer and my views are my own. News and Announcements and FAQ |
©2025 cpdn.org