Questions and Answers : Windows : "Calculation failure" after whenever i reboot the PC
Author | Message |
---|---|
Joined: 31 May 20 Posts: 1 Credit: 27,953 RAC: 154 |
I've had this problem for a couple of months, and since every task takes around 10 days and nights, no task has completed before failing. Is this a common issue? And is it possible to get shorter tasks, but more of them? |
Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Always suspending tasks and waiting until you are 100% sure all disk writes have completed before closing BOINC reduces the failure rate, but if you shut down every night, few tasks are going to make it. Glenn will be running some tests soon, which we hope will greatly reduce failures of these tasks; however, this is unlikely to be ready for the NZ batch, which is due soon. The newly compiled code will be able to use some of the more advanced optimisations of recent processors, so tasks should run faster, reducing the number of restarts. |
Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
Yes, it's a known problem with this version of the code. If you are unable to run tasks successfully, I suggest setting 'No new tasks' for the CPDN project until we roll out a new version of the code, sometime in the next month (I've done this on my machine). The only thing you can possibly do is run only 1 BOINC task at a time and shut down any apps to make the machine as quiet as possible, as it's a memory issue that causes the problem. Cheers, Glenn --- CPDN Visiting Scientist |
Joined: 16 Dec 05 Posts: 27 Credit: 242,905 RAC: 1,153 |
I finally got to complete 4 projects! Every time before, my computer went to sleep or had to be restarted, and the model always ended up failing because it was never able to restore properly from a previous backup. This time I was able to run 4 models at the same time (I probably could have done 7 or 8) without problems, and they ran for 4 days straight with my Intel i9-13900HX in the high 4 GHz range. I'm not sure if this is new to more recent work units, but 4-day work units I can do regularly. Windows just has so many issues, but it would be great to have more people who can run work units correctly without having to resort to backing up and restoring from a previous state. |
Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
It's because the current batches are using a new version of the wah2 app, version 8.29. This fixes the problems with tasks crashing on restart. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,819,403 RAC: 4,657 |
I got caught out by failing to follow my own advice! My Windows 11 laptop was running a batch 1024 resend with wah2 (region independent) v8.32 when Microsoft Update Tuesday struck: it was actually applied to this machine on 14 November (Thursday). The task (22528569) failed, so I looked into the log files. The machine rotated the stdoutdae files on this reboot, so I found the following.
In the 'old' file, the last few lines are:
14-Nov-2024 23:32:05 [climateprediction.net] [sched_op] Deferring communication for 00:01:49
14-Nov-2024 23:32:05 [climateprediction.net] [sched_op] Reason: Unrecoverable error for task wah2_eas25_a17m_201212_24_1024_012323825_1
14-Nov-2024 23:32:10 [climateprediction.net] Computation for task wah2_eas25_a17m_201212_24_1024_012323825_1 finished
{list of absent upload files}
14-Nov-2024 23:32:10 [---] Exiting
StartServiceCtrlDispatcher being called. This may take several seconds. Please wait.
Then, the new file starts with:
14-Nov-2024 23:32:54 [---] Starting BOINC client version 8.0.3 for windows_x86_64
{normal startup messages}
14-Nov-2024 23:32:58 [climateprediction.net] [sched_op] Starting scheduler request
14-Nov-2024 23:32:58 [climateprediction.net] Sending scheduler request: To report completed tasks.
and so on.
So it appears that the damage is actually done during closedown, not at restart as I had assumed. Windows applies the updates and triggers the restart "outside working hours", so it will be assuming that the user isn't around to respond to the messages that are sometimes generated during a 'soft' close ('save edited file?', and so on). So, is this version of the app responding badly to 'hard' shutdowns? |
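The check described in this post (did the error get logged before the client exited, or only after the restart?) can be sketched in a few lines of Python. This is only an illustration: the function names `parse` and `failure_before_exit` are hypothetical, the line format is assumed from the log excerpts quoted above, and month names are assumed to be in English.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpts in the post:
# "14-Nov-2024 23:32:10 [project] message"
LINE = re.compile(r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")


def parse(lines):
    """Yield (timestamp, project, message) for lines matching the assumed format."""
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S")
            yield stamp, match.group(2), match.group(3)


def failure_before_exit(old_log_lines):
    """True if an 'Unrecoverable error' line precedes the client's 'Exiting' line."""
    error_time = None
    exit_time = None
    for stamp, _project, message in parse(old_log_lines):
        if error_time is None and "Unrecoverable error for task" in message:
            error_time = stamp
        if message.strip() == "Exiting":
            exit_time = stamp
    if error_time is None or exit_time is None:
        return False
    return error_time <= exit_time
```

Fed the last lines of the 'old' file, this returns True, which matches the reading that the task was already marked as failed during closedown. Lines without a timestamp, such as the upload-file list, are simply skipped.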
Joined: 15 May 09 Posts: 4542 Credit: 19,039,635 RAC: 18,944 |
Quoted from the previous post: "Windows applies the updates and triggers the restart 'outside working hours', so it will be assuming that the user isn't around to respond to the messages that are sometimes generated during a 'soft' close ('save edited file?', and so on). So, is this version of the app responding badly to 'hard' shutdowns?" Shouldn't MS be doing a soft shutdown rather than a hard shutdown? Could this be a separate issue from a task failing on restart after a power cut? |
Joined: 12 Apr 21 Posts: 318 Credit: 15,031,602 RAC: 4,207 |
There's a whole day's difference between the task's CPU time and run time. Could it be that a problem ("The storage control block address is invalid.") occurred earlier, but the task only failed at restart? |
Joined: 29 Oct 17 Posts: 1052 Credit: 16,817,940 RAC: 12,877 |
'storage control block address is invalid' has nothing to do with the way the tasks (or client) are shut down. It comes from the way files are handled in the code, which can cause problems with Windows Update. There are two standard ways of accessing a file in C: using the filename (the 'fopen' function), or using a file descriptor associated with a file ('fdopen'). Both work fine on Linux and can be used on Windows. But Windows uses a different mechanism, involving file handles, which are more complex and tightly integrated with Windows' security and memory management systems. Using fopen('filename') directly bypasses the intermediate step of dealing with file descriptors, and it is that step which can lead to issues like invalid storage control blocks. When fdopen(file-descriptor) is used, there's a higher chance of mismatched or corrupted file descriptors, especially during complex operations like Windows Update, which can alter the system's state and memory. There are 2 places where fdopen appears in the code, and both are in libraries, but only the one in the ZLIB library is used. This library version is very old; it has been there since the early days of CPDN. It's on my list to update it, but it's not the most urgent item. --- CPDN Visiting Scientist |
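To make the two routes described above concrete, here is a rough Python analogue: `open(path)` plays the role of C's `fopen`, while `os.open` followed by `os.fdopen` plays the role of obtaining a descriptor and wrapping it with `fdopen`. This is only a sketch to show the difference in shape; it is not the CPDN or ZLIB code, and the helper names (`read_by_name`, `read_by_descriptor`, `demo`) are made up for illustration.

```python
import os
import tempfile


def read_by_name(path):
    """Open by filename: the analogue of C's fopen(path, "r")."""
    with open(path, "r") as stream:
        return stream.read()


def read_by_descriptor(path):
    """Get a raw descriptor first, then wrap it: the analogue of fdopen(fd, "r")."""
    fd = os.open(path, os.O_RDONLY)
    # os.fdopen takes ownership of fd, so closing the stream closes the descriptor.
    with os.fdopen(fd, "r") as stream:
        return stream.read()


def demo():
    """Write a small temporary file and read it back through both routes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "w") as out:
            out.write("restart data\n")
        return read_by_name(path), read_by_descriptor(path)
```

Both routes return the same contents when nothing interferes. The difference is that the second route carries an extra integer (the descriptor) between the two calls, and that intermediate value is what the post identifies as the weak point.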
Joined: 1 Jan 07 Posts: 1061 Credit: 36,819,403 RAC: 4,657 |
And I've filleted out the CPDN messages from the 'old' log file, and there's nothing unexpected. My models run 24/7 without suspension or swap-out (except when the unexpected happens ...), so it's easy to check that the 'trickle' pattern is uniform with a flicker test. But I do tend to run the machines full-bore when work is available. There was a second CPDN task running when this one started, which may account for some extra overload, and the other cores were working on simpler, integer-only tasks from another project throughout. My suspicion is that all the extra bells and whistles added to Windows 11, constantly checking for outside news to tell me about, are the main cause of the lost time. |
Joined: 12 Apr 21 Posts: 318 Credit: 15,031,602 RAC: 4,207 |
Quoted from the previous post: "And I've filleted out the CPDN messages from the 'old' log file, and there's nothing unexpected. My models run 24/7 without suspension or swap-out (except when the unexpected happens ...), so it's easy to check that the 'trickle' pattern is uniform with a flicker test." I mentioned it because losing a day out of 7 seems significant and, from cursory observation of failed tasks, it's not uncommon to have significant differences between Run and CPU times. However, even your completed tasks have a day's difference, so it does seem like your system is a bit overloaded. I don't know that Windows on its own would take up that much computing time. |
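For anyone wanting to put a number on the gap between CPU time and run time, a small sketch follows. The figures are illustrative only, built from the "a day out of 7" remark above rather than from any actual task record, and `cpu_fraction` is a made-up helper name.

```python
DAY = 86400  # seconds in a day


def cpu_fraction(cpu_seconds, run_seconds):
    """Fraction of the wall-clock run time during which the task had the CPU."""
    if run_seconds <= 0:
        raise ValueError("run time must be positive")
    return cpu_seconds / run_seconds


# Illustrative figures: 7 days of run time with 1 day lost, so 6 days of CPU time.
print(f"{cpu_fraction(6 * DAY, 7 * DAY):.0%}")  # prints 86%
```

A value well below 100% on a machine that runs 24/7 without suspension points to the task waiting for the CPU, which fits the suggestion that the system is somewhat overloaded.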
©2024 cpdn.org