climateprediction.net (CPDN) home page
Thread '"Calculation failure" after whenever i reboot the PC'

32iMdZyPN8S4LDMMH73VCvUunc6U

Joined: 31 May 20
Posts: 1
Credit: 27,953
RAC: 154
Message 70182 - Posted: 22 Jan 2024, 19:10:58 UTC

I've had this problem for a couple of months. Since every task takes around 10 days and nights, no task has completed before failing.
Is this a common issue? And is it possible to get shorter tasks, but more of them?
ID: 70182
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 70185 - Posted: 22 Jan 2024, 22:17:21 UTC - in response to Message 70182.  

Always suspending tasks and waiting until you are 100% sure all disk writes have completed before closing BOINC reduces the failure rate, but if you shut down every night, few tasks are going to make it. Glenn will be running some tests soon which we hope will greatly reduce failures of these tasks; however, this is unlikely to be ready for the NZ batch, which is due soon. The newly compiled code will be able to use some of the more advanced optimisations of recent processors, so tasks should run faster, reducing the number of restarts.
ID: 70185
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 70189 - Posted: 23 Jan 2024, 16:04:38 UTC - in response to Message 70182.  

Yes, it's a known problem with this version of the code. If you are unable to run tasks successfully, I suggest setting 'No new tasks' for the CPDN project until we roll out a new version of the code, sometime in the next month (I've done this on my machine).

The only thing you can possibly do is run only 1 BOINC task at a time and shut down any other apps to make the machine as quiet as possible, as it's a memory issue that causes the problem.

Cheers, Glenn

---
CPDN Visiting Scientist
ID: 70189
Curtis

Joined: 16 Dec 05
Posts: 27
Credit: 242,905
RAC: 1,153
Message 70536 - Posted: 23 Feb 2024, 13:22:02 UTC - in response to Message 70182.  

I finally got to complete 4 projects! Every time before, my computer went to sleep or had to be restarted, and the model always ended up failing because it was never able to restore properly from a previous backup.

This time I was able to run 4 models at the same time (probably could have done 7 or 8) without a problem, and they ran for 4 days straight with my Intel i9-13900HX in the high 4 GHz range. Not sure if this is new to more recent work units, but 4-day work units I can do regularly. Windows just has so many issues, but it would be great to have more people who can run work units correctly without having to resort to backing up and restoring from a previous state.
ID: 70536
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 70537 - Posted: 23 Feb 2024, 13:48:54 UTC - in response to Message 70536.  

It's because the current batches are using a new version of the wah2 app, version 8.29. This fixes the problems with tasks crashing on restart.
ID: 70537
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,817,746
RAC: 4,590
Message 71863 - Posted: 16 Nov 2024, 12:18:25 UTC

I got caught out by failing to follow my own advice! My Windows 11 laptop was running a batch 1024 resend with wah2 (region independent) v8.32 when Microsoft Update Tuesday struck: it was actually applied to this machine on 14 November (Thursday).

The task (22528569) failed, so I looked into the log files. The machine rotated the stdoutdae files on this reboot, so I found:

In the 'old' file, the last few lines are:
14-Nov-2024 23:32:05 [climateprediction.net] [sched_op] Deferring communication for 00:01:49
14-Nov-2024 23:32:05 [climateprediction.net] [sched_op] Reason: Unrecoverable error for task wah2_eas25_a17m_201212_24_1024_012323825_1
14-Nov-2024 23:32:10 [climateprediction.net] Computation for task wah2_eas25_a17m_201212_24_1024_012323825_1 finished
{list of absent upload files}
14-Nov-2024 23:32:10 [---] Exiting

StartServiceCtrlDispatcher being called.
This may take several seconds.  Please wait.
Then, the new file starts with:
14-Nov-2024 23:32:54 [---] Starting BOINC client version 8.0.3 for windows_x86_64
{normal startup messages}
14-Nov-2024 23:32:58 [climateprediction.net] [sched_op] Starting scheduler request
14-Nov-2024 23:32:58 [climateprediction.net] Sending scheduler request: To report completed tasks.
and so on.

So it appears that the damage is actually done during closedown - not, as I had assumed, at restart.

Windows applies the updates and triggers the restart "outside working hours", so it will be assuming that the user isn't around to respond to the messages that are sometimes generated during a 'soft' close ('save edited file?', and so on). So, is this version of the app responding badly to 'hard' shutdowns?
ID: 71863
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4542
Credit: 19,039,635
RAC: 18,944
Message 71864 - Posted: 16 Nov 2024, 12:33:09 UTC

Windows applies the updates and triggers the restart "outside working hours", so it will be assuming that the user isn't around to respond to the messages that are sometimes generated during a 'soft' close ('save edited file?', and so on). So, is this version of the app responding badly to 'hard' shutdowns?


Shouldn't MS be doing a soft shutdown rather than a hard shutdown? Could this be a separate issue from when a task fails following a restart after a power cut?
ID: 71864
AndreyOR

Joined: 12 Apr 21
Posts: 318
Credit: 15,030,773
RAC: 4,296
Message 71865 - Posted: 17 Nov 2024, 2:13:50 UTC - in response to Message 71863.  
Last modified: 17 Nov 2024, 2:15:14 UTC

There's a whole day's difference between the task's CPU and Run times. Could it be that a problem ("The storage control block address is invalid.") occurred earlier but the task only failed at restart?
ID: 71865
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,817,940
RAC: 12,877
Message 71866 - Posted: 17 Nov 2024, 10:52:19 UTC - in response to Message 71865.  

'storage control block address is invalid' has nothing to do with the way the tasks (or client) are shut down. It comes from the way files are handled in the code, which can cause problems during Windows Update. There are two common ways of accessing a file in C: using the filename (the 'fopen' function), or using a file descriptor associated with a file ('fdopen'). Both work fine on Linux and can be used on Windows. But Windows uses a different mechanism, involving file handles, which are more complex and tightly integrated with Windows' security and memory management systems.

Using fopen('filename') directly avoids the intermediate step of dealing with file descriptors, and it is that step which can lead to issues like invalid storage control blocks. When fdopen(file-descriptor) is used, there's a higher chance of mismatched or corrupted file descriptors, especially during complex operations like Windows Update, which can alter the system's state and memory.

There are 2 places where fdopen appears in the code, both in libraries, but only the one in the ZLIB library is used. This library version is very old; it has been there since the early days of CPDN. Updating it is on my list, but it's not the most urgent item.
---
CPDN Visiting Scientist
ID: 71866
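
Editorial note: for readers unfamiliar with the two routes described in the post above, here is a minimal sketch in C of opening a file by name with fopen versus wrapping an already-open descriptor with fdopen. It is an illustration only and is not taken from the CPDN or ZLIB sources. The function names write_by_name and write_by_descriptor are hypothetical, and the code assumes a POSIX system (open, fdopen); on Windows the C runtime offers _open and _fdopen as the corresponding calls.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fdopen() under strict -std= modes */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Route 1 (hypothetical helper): open by name with fopen.
 * The C library creates and owns the underlying descriptor. */
int write_by_name(const char *path, const char *text)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    int ok = (fputs(text, fp) != EOF);
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Route 2 (hypothetical helper): obtain a descriptor first, then wrap it
 * in a stream with fdopen. The caller must keep the descriptor valid
 * until the stream is closed, which is the extra step that can go wrong. */
int write_by_descriptor(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    FILE *fp = fdopen(fd, "w");
    if (fp == NULL) {
        close(fd);          /* fdopen failed, so the descriptor is still ours */
        return -1;
    }
    int ok = (fputs(text, fp) != EOF);
    if (fclose(fp) != 0)    /* fclose also closes fd; do not close(fd) again */
        ok = 0;
    return ok ? 0 : -1;
}
```

Both helpers return 0 on success and -1 on failure, so a caller could try write_by_descriptor("demo.txt", "hello\n") and compare the result with write_by_name on the same text. The second route has one more resource to track: if the descriptor is closed or reused elsewhere while the stream is still open, later reads and writes on the stream fail.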
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,817,746
RAC: 4,590
Message 71867 - Posted: 17 Nov 2024, 12:23:18 UTC - in response to Message 71865.  

And I've filleted out the CPDN messages from the 'old' log file, and there's nothing unexpected. My models run 24/7 without suspension or swap-out (except when the unexpected happens ...), so it's easy to check that the 'trickle' pattern is uniform with a flicker test.

But I do tend to run the machines full-bore when work is available. There was a second CPDN task running when this one started, which may account for some extra overload: and the other cores were working on simpler, integer-only, tasks from another project throughout.

My suspicion is that all the extra bells and whistles added to Windows 11, constantly checking for outside news to tell me about, are the main cause of the lost time.
ID: 71867
AndreyOR

Joined: 12 Apr 21
Posts: 318
Credit: 15,030,773
RAC: 4,296
Message 71868 - Posted: 18 Nov 2024, 3:44:16 UTC - in response to Message 71867.  

And I've filleted out the CPDN messages from the 'old' log file, and there's nothing unexpected. My models run 24/7 without suspension or swap-out (except when the unexpected happens ...), so it's easy to check that the 'trickle' pattern is uniform with a flicker test.

But I do tend to run the machines full-bore when work is available. There was a second CPDN task running when this one started, which may account for some extra overload: and the other cores were working on simpler, integer-only, tasks from another project throughout.

My suspicion is that all the extra bells and whistles added to Windows 11 and constantly checking for outside news to tell me about, are the main cause of the lost time.

I mentioned it because losing a day out of 7 seems significant and, from cursory observation of failed tasks, it's not uncommon to have significant differences between Run and CPU times. However, even your completed tasks have a day's difference, so it does seem like your system is a bit overloaded. I don't know that Windows on its own would take up that much computing time.
ID: 71868


©2024 cpdn.org