Computation error Weather At Home 2 (wah2) 8.24 on WIN7 64-bit
Author | Message |
---|---|
Joined: 17 Nov 08 Posts: 5 Credit: 1,405,081 RAC: 57,350 |
I've had them too (using Linux here), and my suspicion is that when a computation is interrupted (either by BOINC Manager cycling tasks to another project, or by system shutdown) the chance of a computation error goes from near zero to very likely. I wish my BOINC Manager had a way to mark tasks as priority. That way I could update and get more tasks on another project without the likelihood that it will suspend and error out one (or multiple) of my weather model tasks. |
Joined: 16 Mar 16 Posts: 6 Credit: 858,545 RAC: 0 |
... here we go again ... I'm adding my five cents of experience ... Been running long WUs on three WIN7 PCs for, let's say, six to eight hours elapsed time. These WUs have an estimated run-time of approx. 4 or 9 DAYS! I am also running short WUs (max. 2 DAYS). After SUSPENDing the long TASKs due to CPU overheating problems, they tend to error out after RESUMING. But on the other hand (correct English?), after having to SUSPEND the whole PROJECT because of (don't laugh) weather problems (lightning etc.) here at my site, the short WUs run OK. The short and long WUs are all the same application (see title). I wonder if there is a difference in behavior after RESTART between SUSPEND TASK and SUSPEND PROJECT? Kind of kills my drive to crunch ... At present, I'm watching two more long WUs that are STILL running, despite SUSPENDING several times. I hate to see my power bill ... Have a nice day - wherever you are! |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,884,997 RAC: 4,577 |
[San-Fernando-Valley wrote:]... here we go again ... I have never understood why the models have so many problems with restarts. The model writes a "checkpoint" at regular intervals and restarts from the checkpoint file. From time to time there will be an error writing the checkpoint file or reading it back, so the model ought to invalidate the previous checkpoint file only when it knows it has successfully written the new one. Anyway, we are where we are. So to prevent restart errors: 1. In BOINC Manager, select the "leave non-GPU tasks in memory while suspended" option in the "disk and memory" tab of the "computing preferences" dialog (English language version). That reduces the number of regular suspend/restart events. 2. When stopping manually, I wait for each task to checkpoint in the event log and then suspend the task. Select "checkpoint_debug" in the "event log options" dialog to have checkpoints reported in the event log. If you are having CPU over-heating problems then I would strongly advise reducing the number of CPUs allocated to BOINC, using the "use at most x% of the CPUs" option in the "computing" tab of the "computing preferences" dialog. Having lost a number of power supplies over the years (not necessarily to BOINC), I now never exceed 50% of real CPUs or 25% of hyperthreaded CPUs - but I'm risk averse; you may not be. Best of luck. |
Joined: 16 Mar 16 Posts: 6 Credit: 858,545 RAC: 0 |
Thanks for the reply. Regarding your point 1: Always selected ever since I started crunching (because I have plenty of memory - 64GB). WUs are running as sole projects - no need to swap/restart/suspend whatever. Haven't noticed any such behaviour - so this point can be dropped, I guess. Point 2: In the rare cases I have to suspend WUs (weather problems) I suspend and usually wait at LEAST 5 (five) minutes before further actions - should be enough time for checkpointing/cleaning up/whatever. CPU overheating is not an issue. Options are set so that max temps are never over 70 to 75 degrees C. Average is way lower. I have very good power supplies (platinum 80 750 W) - never had one fade out on me. The CP project is the only one where I am having these issues. Right now I'm wondering why my one PC, which is somewhat picky, is still nicely running the last two long-run WUs I have at the moment. These were suspended just as often etc. as on the other PCs - taking my time. So far never had problems running all cores - I don't overclock or fiddle around. Everything is watercooled. Other projects were running successfully at 100% load with extreme pressure on CPU AND especially GPU. Tomorrow we are expecting more thunderbolts etc. - so I have a chance to try my luck again ... Regards. |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,884,997 RAC: 4,577 |
[San-Fernando-Valley wrote:]... Respective to your point 1: My comment was based on this model, which has multiple "Suspended CPDN Monitor - Suspend request from BOINC..." entries in the "stderr out" report. Failures resulting from those suspensions are usually the result of that option not being selected. However, there is another option that causes those multiple suspend/restarts - probably "suspend when computer is in use". |
Joined: 16 Mar 16 Posts: 6 Credit: 858,545 RAC: 0 |
Oh, I see what you mean. But strangely this is the PC where there still are two WUs running. They were all started at the same time. Under options COMPUTING (same on all PCs) I have set, under "when to suspend": all first three boxes blank, last one checked ... when non-BOINC usage above 95%. Only BOINC is running on these PCs! Maybe you have a point there about only using, let's say, 50% of cores. Maybe running eight WUs on 4 cores causes problems - strange. One WU thinks it has to swap (suspend) out in order to let the other WUs keep working??? I am no expert on these things. I appreciate your time and help/info. Hate to repeat myself: but the other projects don't have these problems. I know, one should not compare tomatoes with potatoes. Have a nice day ... |
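
The checkpoint handling suggested in the thread (keep the previous checkpoint until the new one is known to be written successfully) can be sketched roughly as below. This is only an illustration of the general technique, not the climate model's actual code: the function names `write_checkpoint` and `read_checkpoint`, the JSON format, and the file names are all assumptions made for the example.

```python
import json
import os
import tempfile


def write_checkpoint(state: dict, path: str) -> None:
    """Write `state` to `path` without ever leaving a half-written checkpoint.

    Hypothetical helper: the new checkpoint goes to a temporary file first,
    and the previous checkpoint is replaced only after that write succeeded.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to disk
        # Only now is the old checkpoint replaced (an atomic rename on POSIX systems).
        os.replace(tmp_path, path)
    except BaseException:
        # The previous checkpoint is untouched; discard the failed attempt.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or `default` if no readable checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
```

With this pattern, an interruption in the middle of a write (suspend, shutdown, power cut) should leave the last good checkpoint in place, so a restart falls back to it instead of failing. Whether the wah2 application does anything like this is not known from the thread.
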