Message boards : Number crunching : New work discussion - 2
Author | Message |
---|---|
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
don't bother, just ignore |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"If it was something daft it would have been fixed ages ago. It's more subtle than that. You see no reason because you don't understand the problem or the way the models work." Why would it matter what's happening in the model? Presumably a checkpoint is written every so often, perhaps at the same time a trickle-up is done, every 4%? If anything goes wrong, on restart the checkpoint file would be loaded and things would continue from that point, losing only the work done after the checkpoint. The only way I can see this going wrong is if a checkpoint is in the process of being written when the crash occurs. But in that case, the preceding checkpoint should not be deleted until the new one is written. It's just like backing up your computer: you never delete all your old backups before making a new one, because if something happens during the backup process, you've lost everything. "these models are not designed to run on systems that can be shutdown instantly" But that's 99% of computers. The model in question is a Windows model, and we all know Windows reboots for updates without warning [1], assuming that because the user isn't there it's OK to do so. For whatever reason (possibly a bad Boinc design), Windows does not allow CPDN to shut down completely. The same happens with LHC, although that has the added complication of running in a virtual machine. I would have expected Windows to wait for Boinc to say "finished closing down", and in turn Boinc to wait for CPDN to say "finished closing down". [1] No matter how many different ways I use to stop it doing so, it keeps thwarting me. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"these models are not designed to run on systems that can be shutdown instantly" A lot of the tasks I get have already failed twice, so clearly most systems are doing so. Since it's set to allow only three attempts, there must be a lot of tasks which never get done. Perhaps you've set it this way because of tasks which are actually faulty, but there are far more users than work in this project, so perhaps trying each task more than three times would be useful? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
The tasks which fail on all three attempts are almost certainly suffering from an issue that generates a value outside of that allowed by the program so will never complete. I believe work is carrying on to try and get to the bottom of the problem but as Glen says, it is proving elusive. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"The tasks which fail on all three attempts are almost certainly suffering from an issue that generates a value outside of that allowed by the program so will never complete. I believe work is carrying on to try and get to the bottom of the problem but as Glen says, it is proving elusive." A lot of my resends have a failure right at the start, then the next guy failed it part way through. At what point does the problem you mention occur? Two examples: https://www.cpdn.org/workunit.php?wuid=12227946 https://www.cpdn.org/workunit.php?wuid=12228012 If those two fail on my computer due to a restart, that will be 3 failures (possibly on a decent task), but presumably since the first two failed at different points, they failed for different reasons. |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
don't feed the trolls |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"don't feed the trolls" Grow up. I'm making suggestions and asking questions; your posts, however, are an utter waste of space. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,003,233 RAC: 14,405 |
"[1]No matter how may different ways I use to stop it doing so, it keeps thwarting me." The answer lies in settings for group permissions in the registry. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"[1]No matter how may different ways I use to stop it doing so, it keeps thwarting me."That's one of the things I've changed. But Windows randomly overrides it, like they're treating us as criminals for daring to not take their updates. Something sinister is going on and I don't understand why they're legally allowed to do so. I'll reset all 10 computers again, but I doubt it'll last. It may also be the setting is ignored for security updates (which is a lot of them!) To be clear we're talking about the same thing, I do this: Windows Registry Editor Version 5.00 [HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU] "AUOptions"=dword:00000003 Which alledgedly downloads automatically as normal, but prompts the user before installation. Details here: https://www.ubackup.com/windows-10/disable-windows-10-update-registry-8523.html I know it's for Windows 10, but I assume the same setting applies for 11. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
"A lot of my resends have a failure right at the start, then the next guy failed it part way through. At what point does the problem you mention occur?" Quite a lot are having three fails at the start. Some others are getting three fails at the same point. I get that some are failing at different places during computation but the BOINC server code isn't sophisticated enough to pick up the differences. My personal view is that to get enough data back, an increase in the number of tasks going out would be more productive than more resends on these tasks. This particular region is covering a larger area and also because of the Himalayas, a more complex one which is what the scientists believe is behind the higher failure rate for this lot after the startup fails where the task switches from the global to the regional model. I don't know enough about the programming involved to say more than that so will leave it there. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
As I have said before, block them with your router, then re-enable the M$ domain when you want to install the updates. I will say that this is an area where Linux policy wins hands down. You always get the choice to restart now/restart later. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"My personal view is that to get enough data back, an increase in the number of tasks going out would be more productive than more resends on these tasks. This particular region is covering a larger area and also because of the Himalayas, a more complex one which is what the scientists believe is behind the higher failure rate for this lot after the startup fails where the task switches from the global to the regional model." I assumed they were sending out the whole lot, or are too busy to create more. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"As I have said before, block them with your router, then re-enable the M$ domain when you want to install the updates." I don't wish to do it manually. I'd never bother. "I will say that this is an area where Linux policy wins hands down. You always get the choice to restart now/restart later." Very few things make me like Linux; that is one, and the other is having OK and Cancel the correct way round. Despite using Windows 99.9% of the time, I very often find myself clicking the wrong button, as I assume affirmative is on the right, like most things in life, e.g. the car accelerator. |
Send message Joined: 5 Aug 04 Posts: 127 Credit: 24,459,982 RAC: 22,510 |
"For whatever reason (possibly a bad Boinc design), Windows does not allow CPDN to shut down completely." The problem here is, even with exiting BOINC beforehand some of the models still crapped-out on re-start. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,438,314 RAC: 14,039 |
"The problem here is, even with exiting BOINC beforehand some of the models still crapped-out on re-start." It makes absolutely no difference to the chance of a fail if you suspend/exit BOINC/etc/etc before shutting down. I know this because I've been looking for a pattern in the way the model fails. WaH is two models; a global one which runs first for 24 hrs and creates the boundary & initial conditions for a regional model (the 25km grid) which then takes those files and runs itself for 24hrs; then it cycles around. It doesn't matter if the task is suspended/shutdown during the global model part or the regional model part, when it restarts it will always redo the global model 24hrs again. The error *always* comes when the regional model starts up again from the rerun global 24hrs. We have some ideas what's causing it but I've not yet been able to reproduce it standalone. Unfortunately the model doesn't produce any traceback diagnostics so it's tedious finding out exactly which part of the code is causing the problem, but I'll get there. |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073 |
A question - Are the "24 hours" you refer to in your post 24 hours as measured by the clock on my wall, or the time the simulation represents? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,028,039 RAC: 20,189 |
"A question - Are the "24 hours" you refer to in your post 24 hours as measured by the clock on my wall, or the time the simulation represents?" Time the simulation represents. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"WaH is two models; a global one which runs first for 24 hrs and creates the boundary & initial conditions for a regional model (the 25km grid) which then takes those files and runs itself for 24hrs; then it cycles around. It doesn't matter if the task is suspended/shutdown during the global model part or the regional model part, when it restarts it will always redo the global model 24hrs again. The error *always* comes when the regional model starts up again from the rerun global 24hrs. We have some ideas what's causing it but I've not yet been able to reproduce it standalone. Unfortunately the model doesn't produce any traceback diagnostics so it's tedious finding out exactly which part of the code is causing the problem, but I'll get there." What is the reason behind redoing the global part? Why can the original files not be used from the first time it did it? |
Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,795,079 RAC: 44,452 |
"But I want updates, just no reboots until I say so." That isn't too complicated: Set up local WSUS-Server and direct your clients to use it. This works really well for me. The WSUS server fetches the new patches from the Microsoft update servers. The patches are only released to the clients when I approve them in WSUS, so I can deliver the patches to my clients when I want. Normally I hold them back for several days after Patch Day until I hear (or don't hear) whether there are bigger problems. Supporting BOINC, a great concept! |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"But I want updates, just no reboots until I say so." "That isn't too complicated: Set up local WSUS-Server and direct your clients to use it." Far too much hassle. I shouldn't have to go through all this. I never want to do updates manually. I want them to do it themselves, but wait until I say go!! I'm also not prepared to mess around setting up servers. This reminds me of the mess LHC is in: they send out the same data to each individual task running, which can be one per core, and don't cache it, then expect us to run Squid, some horrid Linux thing ported badly to Windows so it keeps failing, to cache locally. Forget it. I've set the registry entry mentioned earlier and will see if it works. Some machines do wait; I guess some forgot it for whatever reason. And Microsoft should be in legal trouble for rebooting someone's property without their permission. We have billions of insane laws, but nothing useful. |
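
The checkpoint scheme suggested earlier in the thread (never delete the previous checkpoint until the new one is completely written) can be illustrated with a minimal Python sketch. This is not how the CPDN models or BOINC actually checkpoint; the function names `write_checkpoint` and `read_checkpoint`, the JSON format, and the file name in the usage note are all assumptions made for illustration. The idea is to write the new checkpoint to a temporary file in the same directory and only then swap it into place, so a crash mid-write leaves the old checkpoint untouched.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Replace the checkpoint at `path` only after the new one is fully written.

    Hypothetical helper for illustration; `state` must be JSON-serialisable.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to disk
        # Swap the new file into place; the old checkpoint exists until here.
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed before the swap: discard the partial file and
        # leave the previous checkpoint as it was.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path, default=None):
    """Load the last complete checkpoint, or `default` if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A task would call something like `write_checkpoint("model.ckpt", {"step": 42})` at each checkpoint interval and `read_checkpoint("model.ckpt")` on restart. As noted in the thread, the reported failures occur when the regional model restarts from the rerun global model, so a sketch like this only addresses the interrupted-write case and not necessarily the actual bug.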
©2024 cpdn.org