Thread '"Calculation failure" after whenever i reboot the PC'

Author	Message
32iMdZyPN8S4LDMMH73VCvUunc6U Joined: 31 May 20 Posts: 1 Credit: 27,953 RAC: 154	Message 70182 - Posted: 22 Jan 2024, 19:10:58 UTC
I've had this problem for a couple of months, and since every task takes around 10 days and nights, no task has completed before failing. Is this a common issue? And is it possible to get shorter but more numerous tasks?
ID: 70182

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4558 Credit: 19,039,635 RAC: 18,944	Message 70185 - Posted: 22 Jan 2024, 22:17:21 UTC - in response to Message 70182.
Always suspending tasks and waiting until you are 100% sure all disk writes have completed before closing BOINC reduces the failure rate, but if you shut down every night, few tasks are going to make it. Glenn will be running some tests soon which we hope will greatly reduce failures of these tasks; however, this is unlikely to be ready for the NZ batch, which is due soon. The newly compiled code will be able to use some of the more advanced optimisations of recent processors, so tasks should run faster, reducing the number of restarts.
ID: 70185

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 70189 - Posted: 23 Jan 2024, 16:04:38 UTC - in response to Message 70182.
Quote (Message 70182): I've had this problem for a couple of months, and since every task takes around 10 days and nights, no task has completed before failing. Is this a common issue? And is it possible to get shorter but more numerous tasks?

Yes, it's a known problem with this version of the code. If you are unable to run tasks successfully, I suggest setting 'No new tasks' for the CPDN project until we roll out a new version of the code, sometime in the next month (I've done this on my machine). The only thing you can possibly do is run only 1 BOINC task at a time and shut down any other apps to make the machine as quiet as possible, as it's a memory issue that causes the problem.
Cheers, Glenn
--- CPDN Visiting Scientist
ID: 70189

Curtis Joined: 16 Dec 05 Posts: 27 Credit: 247,047 RAC: 341	Message 70536 - Posted: 23 Feb 2024, 13:22:02 UTC - in response to Message 70182.
I finally got to complete 4 models! Every time before, my computer went to sleep or had to be restarted, and the model always ended up failing because it was never able to restore properly from a previous backup. This time I was able to run 4 models at the same time (probably could have done 7 or 8) without a problem, and they ran for 4 days straight with my Intel i9-13900HX in the high 4 GHz range. Not sure if this is new to more recent work units, but 4-day work units I can do regularly. Windows just has so many issues, but it would be great to have more people who can run work units correctly without having to resort to backing up and restoring from a previous state.
ID: 70536

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 70537 - Posted: 23 Feb 2024, 13:48:54 UTC - in response to Message 70536.
It's because the current batches are using a new version of the wah2 app, version 8.29. This fixes the problems with tasks crashing on restart.
ID: 70537

Richard Haselgrove Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533	Message 71863 - Posted: 16 Nov 2024, 12:18:25 UTC
I got caught out by failing to follow my own advice! My Windows 11 laptop was running a batch 1024 resend with wah2 (region independent) v8.32 when Microsoft Update Tuesday struck: it was actually applied to this machine on 14 November (Thursday).

The task (22528569) failed, so I looked into the log files. The machine rotated the stdoutdae files on this reboot, so I found the following.

In the 'old' file, the last few lines are:

14-Nov-2024 23:32:05 [climateprediction.net] [sched_op] Deferring communication for 00:01:49
14-Nov-2024 23:32:05 [climateprediction.net] [sched_op] Reason: Unrecoverable error for task wah2_eas25_a17m_201212_24_1024_012323825_1
14-Nov-2024 23:32:10 [climateprediction.net] Computation for task wah2_eas25_a17m_201212_24_1024_012323825_1 finished
{list of absent upload files}
14-Nov-2024 23:32:10 [---] Exiting StartServiceCtrlDispatcher being called. This may take several seconds. Please wait.

Then, the new file starts with:

14-Nov-2024 23:32:54 [---] Starting BOINC client version 8.0.3 for windows_x86_64
{normal startup messages}
14-Nov-2024 23:32:58 [climateprediction.net] [sched_op] Starting scheduler request
14-Nov-2024 23:32:58 [climateprediction.net] Sending scheduler request: To report completed tasks.

and so on.

So it appears that the damage is actually done during closedown, not, as I had assumed, at restart. Windows applies the updates and triggers the restart "outside working hours", so it will be assuming that the user isn't around to respond to the messages that are sometimes generated during a 'soft' close ('save edited file?', and so on). So, is this version of the app responding badly to 'hard' shutdowns?
ID: 71863

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4558 Credit: 19,039,635 RAC: 18,944	Message 71864 - Posted: 16 Nov 2024, 12:33:09 UTC
Quote (Message 71863): Windows applies the updates and triggers the restart "outside working hours", so it will be assuming that the user isn't around to respond to the messages that are sometimes generated during a 'soft' close ('save edited file?', and so on). So, is this version of the app responding badly to 'hard' shutdowns?

Shouldn't MS be doing a soft shutdown rather than a hard shutdown? Could this be a separate issue from when a task fails following restart after a power cut?
ID: 71864

AndreyOR Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207	Message 71865 - Posted: 17 Nov 2024, 2:13:50 UTC - in response to Message 71863. Last modified: 17 Nov 2024, 2:15:14 UTC
There's a whole day's difference between the task's CPU and Run times. Could it be that a problem ("The storage control block address is invalid.") occurred earlier but the task only failed at restart?
ID: 71865

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 71866 - Posted: 17 Nov 2024, 10:52:19 UTC - in response to Message 71865.
'storage control block address is invalid' has nothing to do with the way the tasks (or client) are shut down. It comes from the way files are handled in the code and can cause problems with Windows Update.

There are two standard ways of accessing a file in C: using the filename (the 'fopen' function), or using a file descriptor associated with a file ('fdopen'). Both work fine on Linux and can be used on Windows. But Windows uses a different mechanism, involving file handles, which are more complex and tightly integrated with Windows' security and memory management systems. Using fopen('filename') directly bypasses the intermediate step of dealing with file descriptors, which is where issues like invalid storage control blocks can arise. When fdopen(file-descriptor) is used, there's a higher chance of mismatched or corrupted file descriptors, especially during complex operations like Windows Update, which can alter the system's state and memory.

There are 2 places where fdopen appears in the code and both are in libraries, but only the one in the ZLIB library is used. This library version is very old; it has been there since the early days of CPDN. It's on my list to update it, but it's not the most urgent item.
--- CPDN Visiting Scientist
ID: 71866

Richard Haselgrove Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533	Message 71867 - Posted: 17 Nov 2024, 12:23:18 UTC - in response to Message 71865.
And I've filleted out the CPDN messages from the 'old' log file, and there's nothing unexpected. My models run 24/7 without suspension or swap-out (except when the unexpected happens ...), so it's easy to check that the 'trickle' pattern is uniform with a flicker test.

But I do tend to run the machines full-bore when work is available. There was a second CPDN task running when this one started, which may account for some extra overload, and the other cores were working on simpler, integer-only tasks from another project throughout. My suspicion is that all the extra bells and whistles added to Windows 11, constantly checking for outside news to tell me about, are the main cause of the lost time.
ID: 71867

AndreyOR Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207	Message 71868 - Posted: 18 Nov 2024, 3:44:16 UTC - in response to Message 71867.
Quote (Message 71867): And I've filleted out the CPDN messages from the 'old' log file, and there's nothing unexpected. My models run 24/7 without suspension or swap-out (except when the unexpected happens ...), so it's easy to check that the 'trickle' pattern is uniform with a flicker test. But I do tend to run the machines full-bore when work is available. There was a second CPDN task running when this one started, which may account for some extra overload: and the other cores were working on simpler, integer-only, tasks from another project throughout. My suspicion is that all the extra bells and whistles added to Windows 11 and constantly checking for outside news to tell me about, are the main cause of the lost time.

I mentioned it because losing a day out of 7 seems significant and, from cursory observation of failed tasks, it's not uncommon to have significant differences between Run and CPU times. However, even your completed tasks show a day's difference, so it does seem like your system is a bit overloaded. I don't know that Windows on its own would take up that much computing time.
ID: 71868