Message boards : Number crunching : WORTH THE TROUBLE????
Author | Message |
---|---|
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I am starting to wonder if it is worth the effort to run the hadcm3n WUs. In the last week I have had 3 of the 4 hadcm3n models crash at the 75% decadal upload. Only 1 made it through and is now at 81%. In each case the machine was running Windows 7 unattended and not doing anything else except running the model. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I've just had one of these myself though the majority of my Hadcm models behave well. I haven't understood why crashes at 25, 50 and 75% happen to a proportion of models of this type and am going to try to find out. Cpdn news |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
JIM - Someone can correct me if I am wrong, but I believe you get credit up to your last trickle and the results can be used up to that point. I also believe a new task will be created to "finish" the work you started. So, even if you have an error, all is not lost. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Credit up to the last trickle: yes, zero doubt about that. Results up to the last trickle being useful: not very much for the big, long rapid-rapit WUs. As posted recently elsewhere, the quarter-run uploads are the more important part, and the whole run is more valuable than the quarters. As for the 25% problem, I've observed several dozen WUs that ran on my machines, plus their wingpersons and restarts. The failures at the quarter points (including a few I've seen at 100%, where the last upload worked and the WU was still reported as a failure) are not actually happening at the quarter points; that is just where the computation errors are being detected. Whatever causes these errors happens much earlier. I have done about 8 tests restarting from backups, and the only reruns that succeed after a "quarter-point" failure are those that go back at least to before the previous quarter-point backup. Sorry, I didn't think this might be a problem for others and thought it was my machines, so I have no good documentation. Hope this helps; I'm not sure I've made clear what seems to be the case for me. PS: my machines have successfully completed about 90 WUs, mostly hadcm3n, since the start of the year, and about the same number have failed. Most of those were failed downloads and broken WUs; maybe only about 10-20 out of 200 actually failed at quarter points. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Interesting. I hadn't noticed that. Thank you, Eirik. Cpdn news |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
"... The quarter-point failures are not happening at the quarter points -- that's where the computations errors are being detected" That's the same experience as I've had. Decade failures only began appearing on my machines after backups started to be made; the backups were useless, as the restored backup would simply repeat the same error. I talked to Neil Massey at the recent Guardian University Awards event and he confirmed that the error occurs at the decade point because some data is missing at that point, but they have not yet been able to find the cause of the missing data, beyond concluding that it is some kind of timing problem. I hadn't narrowed down the survivability criterion to "before the previous decade point", but had discovered that a backup from the original download (before unzipping) would succeed if it was left alone - i.e. "before any decade point". |
Send message Joined: 29 Oct 06 Posts: 14 Credit: 99,628 RAC: 0 |
My model just stopped @ appr. 25% :( 14.04.2013 12:43:33 | climateprediction.net | Restarting task hadcm3n_zeoq_1920_40_008319747_1 using hadcm3n version 607 in slot 1 My NEW BOINC-Site Why people joined BOINC Synergy... |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Steffen Here is your crashed Hadcm. I'm not 100% convinced that your model crashed at the 25% moment. It didn't produce file _1 and it only shows 9 trickles. Also, your error code 25 is different from models that crash at the moment of the decadal file. I have seen a lot of decadal crashes with error code 22 and messages that include STWORK six times. The model tries to recover five times but the sixth occurrence is fatal. I don't know whether this is always or necessarily the case, but I have seen error 25 after the computer has not shut down properly or the user did not exit from BOINC before shutting down the computer. When you restart the computer and BOINC restarts, the Event Manager crash timestamp shows the moment when BOINC tries to restart the model, not when the model crashed. Did you turn off the computer without exiting from BOINC first? CPDN models don't like this and it can cause a small proportion of them to crash. I believe the cause is that Windows shuts down too quickly before all the model files have stopped. If you didn't turn off your computer before those Event Manager messages I am mistaken. We can see some typical decadal crash messages in stderr for the models in one of Jim's workunits. Cpdn news |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Also, you should move to BOINC version 7.0.60, which is a version with a fix for an MD5 checksum error suffered by this project. New work from the project should download OK, but re-issues of old work will fail to download without this upgrade. There's a thread here that explains what was happening, and the debugging to find the cause, if you're interested. There's a bit of "chatter" towards the end, so you need to go back a bit. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I would just add to Mo's post that the same problem can occur with Linux if models are not suspended and BOINC exited using the file exit menu before shutting down the computer. Again, not every model and not every time, but the hadcm3n models seem to me more susceptible, in that there is a greater chance that any particular shutdown will crash them. And because they run longer, unless the computer is one of those left on the whole time, there are many more shutdowns that could crash them. Les, I will upgrade, but I am running this machine 24/7 at the moment, and if I don't have to shut down or restart I will wait till the hadcm3n unit I am running is finished. |
Send message Joined: 29 Oct 06 Posts: 14 Credit: 99,628 RAC: 0 |
PC restarted automatically during the night after a Windows update. I changed that; updates will only be installed when I tell Windows to do so... BOINC is installed as a service, so I would have to stop BOINC with "net stop boinc" before restarting the computer, I guess? My NEW BOINC-Site Why people joined BOINC Synergy... |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I've never used BOINC installed as a service, but I believe this is indeed the way to exit from it. I also want to be notified about Windows updates but my Windows preferences say only to download them when I choose. I also do the usual restart when it suits me after exiting from BOINC and never let the restart happen automatically. Same with defrag - I exit from BOINC first as this is the way to get the BOINC folders defragged as well as everything else. I also think it's safer to carry out scans after exiting from BOINC, though I must say the security companies treat BOINC more kindly than a few years ago. Norton used to be a nightmare but nobody complains about it now. I know these precautions are a bit of a nuisance but it's disappointing to lose a model you'd been incubating for months. Cpdn news |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I don't know about you, but the first thing I did when I installed Norton was exclude both the BOINC folders (the one in Programs and the one in ProgramData) from all Norton scans. Better safe than sorry. |
Send message Joined: 9 Apr 12 Posts: 10 Credit: 2,700,404 RAC: 0 |
I was looking at my most recent model crashes. Each had failed on 4-6 runs by various computers. Maybe those should be taken out of the resends? I'm not sure why any of those failed really, but at some point can we just say those won't run to completion? |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Number of 'resends' can be controlled and is a current subject of discussion. The change is said to be simple enough, but potential ramifications are not. Each option to get around the problem carries the possibility of causing trouble for valid Tasks. Decisions, decisions ... "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Chris, I've looked at the workunits of your two most recent crashes. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8463525 Of the five computers yours is #4. #1, Linux, got INVALID THETA. That means the model ran into impossible climate and crashed. This isn't necessarily a reason for withdrawing the WU because the results produced by Windows, Linux and Darwin are usually slightly different. Even two machines with the same OS won't necessarily produce the same error of this type. #2, Windows, got No heartbeat. This can indicate a temporary instability on the computer. I think it's highly unlikely that this was the fault of the model. #3, Darwin, is crashing all its models. This computer's daily model allowance should be stopped. I find it almost unbelievable that a person can crash over a thousand models and not get onto the forum to ask why. #4, you, Windows. I see no clear reason why the model crashed. #5 has just started the model. I can't see any obvious reason why this model should be discarded by CPDN. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8456636 #1, Windows, got INVALID THETA. The model generated impossible climate. #2 is another person who should get onto the forum. #3 I think this computer has stopped running the model. #4, you, Windows, got INVALID THETA but not at the same crunching point as computer #1. #5, Windows, got INVALID THETA at a third crunching point. IMO this model is a candidate for withdrawal but they may want to see whether it makes it through the 40 years on Linux or Mac. I hope that's clarified just a little bit what was going on with these workunits and why it isn't always clear-cut whether a WU should be withdrawn or not. Yesterday I was reading the web page written by the scientists researching these model results. They say very clearly that a proportion of models will fail due to impossible results (e.g. INVALID THETA) and that they want to know which parameter values fail in this way. So your computing time has not been wasted. Cpdn news |
Send message Joined: 8 Aug 05 Posts: 12 Credit: 24,554,040 RAC: 2,537 |
I am just following up on a moderator's comment that caught my attention: "...#3,...I find it almost unbelievable that a person can crash over a thousand models and not get onto the forum to ask why..." I'm one of those types of individuals... The "lesson-learned" for me is: I am a hardware guy. I tweak my hardware (via software, of course) to keep things running cool and efficiently; and keep peace in the family. Still, all my "tweaks" just made my statistics drop like a rock. If only I had spent some time perusing the message board weeks ago...to which I want to say - Oh well! - but the reality is, I almost "tweaked" my team off the Top Twenty the way I was dealing with my performance issues. My questions are: 1. What is the "best" method to quickly discern the cause of any CPDN performance issue? I never see any change in my benchmarks after tweaking configurations; and I don't really know how to interpret the messages in the event log. 2. Is there a tool to help estimate how configuration changes will affect performance - if one can assume some typical usage pattern for a specific machine? Such a tool or utility would have likely helped me see that nothing I was doing was improving performance; and thus, maybe, have led me to make inquiries much sooner - like, weeks ago?! 3. Is there a tool/utility for monitoring performance of individual tasks? I currently just scan the "globe," etc. Still, I never tied this to failed tasks; and, it never came to my mind to restart any failed task from a backup if it failed. I assumed once a task fails, that was the end! Apparently not, eh? Best regards, Michael O. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There are no tools that I'm aware of, only one's brain. :) There are a few 'ways' to keep a check: 1) Regularly look at the models running in the Tasks tab. ALL of them take at least 2-3 days on the very fastest computer, so any that don't stick around for several days are suspect. For the longer Coupled Ocean models, it's 2-3 weeks. 2) Better - look at the Tasks list on the server for each computer. There's a column that labels the end result. If there's a large build up of Error while computing, then something's wrong. 3) This is an "also" - keep a check on the News and Announcement thread at the top of Number crunching on this board. And the best way to do that is to subscribe to it, which will get you an email when there's a new post. Working out what has gone wrong can be done by looking at the Stderr log on the model's server page. Click the + to open it. If necessary, ask about it on this board, providing a link to that page. Or at least the computer number and model number. It's time consuming to have to search through lots of computers with lots of models to try and find the one that matches a vague description. :) |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Geophi who has experience of both Windows and Linux has said that models tend to run a little faster on Windows. But I don't think it's usually worth spending time overclocking to try to speed things up unless one's also prepared to spend plenty of time checking the machine's stability, because unstably O/C'd machines are more likely to crash their models. Cpdn news |
Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
"2. Is there a tool to help estimate how configuration changes will affect performance - if one can assume some typical usage pattern for a specific machine?" Boinc's built-in benchmarking tool has given me some very strange results. It's best treated as a rough indication. The best method requires a calculator: Go to the web page for a task that has trickled twice since the tweak. Calculate (CPU time of last trickle - CPU time of previous trickle) divided by (timestep of last trickle - timestep of previous trickle). Compare the result to the same calculation for another pair of consecutive trickles, both of which were before the tweak. Do this for several tasks on the same computer. CPDN tasks can speed up and slow down at different stages in their "lives" -- but not always at the same stage. Of course this method is not instant. My i7-2600 running at stock takes about 12 hours between trickles for a HadCM3N, so doing this measurement typically requires a wait of 36 hours. (The event log will tell you when a trickle for one or more tasks has been sent; unfortunately it won't tell you which tasks. You have to check on their web pages.) The best way to maximise your credits is to focus on stability first and foremost. Run 24x7. Run at stock. Don't run flaky software on the same box, and uninstall everything not essential. Don't allow Boinc to suspend tasks automatically -- do it manually when required. Use a good power supply. (And train visiting toddlers not to touch the oh-so-tempting big red switch on the mains supply box outside your door... :) ) As for backups: in the days when computers had one core, backups were a good idea. Nowadays it would be very difficult to restart a crashed task without affecting other tasks on the same box. I say "affecting", I mean "completely trashing". I recommend not backing up the Boinc data folder. |
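
The calculator method in the last post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Trickle` record and both function names are hypothetical, the timestep and CPU-time values would have to be copied by hand from each task's web page, and the numbers in the example are made up.

```python
from dataclasses import dataclass


@dataclass
class Trickle:
    """One trickle as read by hand from a task's web page (hypothetical layout)."""
    timestep: int        # model timestep reported with the trickle
    cpu_seconds: float   # cumulative CPU time reported with the trickle


def seconds_per_timestep(previous: Trickle, last: Trickle) -> float:
    """CPU seconds spent per model timestep between two consecutive trickles."""
    steps = last.timestep - previous.timestep
    if steps <= 0:
        raise ValueError("trickles must be in increasing timestep order")
    return (last.cpu_seconds - previous.cpu_seconds) / steps


def percent_change(before: float, after: float) -> float:
    """Relative change in seconds per timestep; negative means the task got faster."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # Made-up numbers for illustration only, not taken from any real task.
    before = seconds_per_timestep(Trickle(10_000, 40_000.0), Trickle(20_000, 80_000.0))
    after = seconds_per_timestep(Trickle(30_000, 118_000.0), Trickle(40_000, 156_000.0))
    print(f"before: {before:.2f} s/step, after: {after:.2f} s/step, "
          f"change: {percent_change(before, after):+.1f}%")
```

As the post points out, tasks speed up and slow down at different stages, so a single pair of trickles can mislead; it is probably worth repeating the comparison for several tasks on the same computer before drawing any conclusion about a tweak.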