climateprediction.net (CPDN) home page
Thread 'finish file present too long'

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 71122 - Posted: 26 Jul 2024, 10:06:25 UTC

The other common factor - for the tasks that have a full MS debugger log - is that the final cause of failure is something like

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0048C569 write attempt to address 0x025F912C
It's not easy to decode those addresses back to the originating module in the multiple sources, and I'm certainly no expert. But it might be another line of attack.
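As a rough illustration of that line of attack, here is a minimal Python sketch that parses the exception line quoted above and looks up which module an address falls in. The module table (names, base addresses, sizes) is entirely hypothetical. Real values would have to come from the module list in the debugger log, if it has one, or from a linker map.

```python
import re
from typing import Iterable, NamedTuple, Optional, Tuple


class Fault(NamedTuple):
    code: int            # exception code, e.g. 0xC0000005
    fault_address: int   # address of the faulting instruction
    access: str          # "read" or "write"
    target_address: int  # address the instruction tried to access


_FAULT_RE = re.compile(
    r"Reason:\s+Access Violation\s+\((0x[0-9a-fA-F]+)\)\s+at address\s+"
    r"(0x[0-9a-fA-F]+)\s+(read|write)\s+attempt to address\s+(0x[0-9a-fA-F]+)"
)


def parse_fault(line: str) -> Optional[Fault]:
    """Extract the numeric fields from an 'Access Violation' log line."""
    m = _FAULT_RE.search(line)
    if m is None:
        return None
    code, fault, access, target = m.groups()
    return Fault(int(code, 16), int(fault, 16), access, int(target, 16))


def find_module(
    address: int, modules: Iterable[Tuple[str, int, int]]
) -> Optional[Tuple[str, int]]:
    """Return (module name, offset into module) for an address, or None."""
    for name, base, size in modules:
        if base <= address < base + size:
            return name, address - base
    return None


# Hypothetical module table: (name, base address, size in bytes).
# These values are placeholders, not taken from any real log.
MODULES = [
    ("example_app.exe", 0x00400000, 0x00200000),
    ("example_lib.dll", 0x10000000, 0x00100000),
]

LINE = ("Reason: Access Violation (0xc0000005) at address 0x0048C569 "
        "write attempt to address 0x025F912C")
fault = parse_fault(LINE)
located = find_module(fault.fault_address, MODULES)
```

With a real module table, the offset returned by `find_module` is what you would then look up in a map file or symbol listing for that module.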
ID: 71122
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71123 - Posted: 26 Jul 2024, 10:59:47 UTC - in response to Message 71122.  

Dave, it's because the timeout on the finish_file was too short in those earlier BOINC versions. I think Richard mentioned in an earlier post that it has since been raised to 10 minutes, which seems to solve it for busy systems. And it's the busy systems, the ones constantly suspending and resuming, that seem to have the problem. We were wondering whether to impose some kind of limit on the BOINC version in the task XML, but personally I am reluctant to make the system any more complicated than it already is.
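To make the timeout idea concrete, here is a minimal Python sketch of the kind of check involved. It is not the BOINC client's code. The function name and constant are made up for illustration, and the 10-minute value is simply the figure quoted in this thread.

```python
import time
from typing import Optional

# Ten minutes, as quoted in this thread. Illustrative constant only,
# not copied from the BOINC source.
FINISH_FILE_TIMEOUT_SECONDS = 10 * 60


def finish_file_overdue(
    first_seen: float,
    now: Optional[float] = None,
    timeout: float = FINISH_FILE_TIMEOUT_SECONDS,
) -> bool:
    """Return True if the finish file has been present longer than `timeout`.

    `first_seen` is the time (in seconds) at which the finish file was first
    noticed while the application process was still running.
    """
    if now is None:
        now = time.monotonic()
    return (now - first_seen) > timeout
```

On a busy machine that keeps suspending and resuming work, the application can take a long time to exit after writing its finish file, so a short `timeout` makes this check trip even though nothing is actually wrong.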

Richard, I had a look at the debug output and I know what's going on. In this scenario, it appears the client is asking the monitor code to do some tidying up that it has already done, and the monitor then fails somewhere. I've created an issue to look into it, but it's not the highest priority.
---
CPDN Visiting Scientist
ID: 71123
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 71125 - Posted: 26 Jul 2024, 11:46:03 UTC - in response to Message 71123.  

Yes, PR 3019 on 12 Feb 2019. That falls between v7.14 and v7.16 in the release timeline.
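If a version limit were ever wanted, the comparison itself is simple. The Python sketch below assumes that v7.16.0 is the first release carrying the longer timeout. That threshold is only an inference from the timeline above, not something confirmed here, and the function names are made up for illustration.

```python
from typing import Tuple

# Assumption: first release containing the relaxed finish-file timeout.
# Inferred from "between v7.14 and v7.16"; not verified against release notes.
MIN_VERSION_WITH_FIX = (7, 16, 0)


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn '7.14.2' or 'v7.16' into a (major, minor, release) tuple."""
    parts = [int(p) for p in text.strip().lstrip("vV").split(".")]
    parts += [0] * (3 - len(parts))  # pad 'v7.16' to (7, 16, 0)
    return (parts[0], parts[1], parts[2])


def has_longer_timeout(client_version: str) -> bool:
    """True if the client version is at or above the assumed fixed release."""
    return parse_version(client_version) >= MIN_VERSION_WITH_FIX
```

Comparing tuples of integers avoids the trap of comparing version strings directly, where "7.2" would sort after "7.16".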
ID: 71125
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 71128 - Posted: 26 Jul 2024, 13:32:05 UTC

An afterthought which occurred to me during my lunchtime walk. v7.14.2 is the last version which was compiled and distributed for 32-bit versions of Windows. People running those would be stuck without an upgrade route. But I've looked, and all the tasks mentioned in this thread are running under 64-bit versions of Windows.

A 64-bit version of Windows will run 32-bit versions of BOINC, and my little Celeron is an example of a machine which could only run 32-bit versions. I bought it for debugging that fault. BOINC doesn't seem to report which bitness is in use (unless it's buried deeper in the scheduler contacts), so this is probably a dead-end sidetrack.
ID: 71128
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71129 - Posted: 26 Jul 2024, 13:43:00 UTC - in response to Message 71128.  
Last modified: 26 Jul 2024, 13:43:06 UTC

Interesting point. I have to use an older version of BOINC because they abandoned 32-bit builds a while ago, and I need those because WaH is still 32-bit. I've checked, and I compile and link against BOINC v7.20.2. This is the latest version that still includes the Visual Studio 32-bit project files. It does have the code fix from the PR you mentioned. So the WaH side of things is OK?
ID: 71129
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 71130 - Posted: 26 Jul 2024, 14:04:05 UTC - in response to Message 71129.  
Last modified: 26 Jul 2024, 14:06:03 UTC

So far as I know. The error that kept my Celeron (and similar low-power devices) locked into 32-bit mode was a crashing bug in a very old 64-bit version of the external SSL library that was being distributed with the early 64-bit versions of BOINC.

Conceptually, it would be very easy to download the v7.14.2 release sources, apply the trivial relaxed time limit patch, and make a v7.14.3 client available to anyone who comes up with a sensible reason for using it. But the sort of volunteers that return tasks with these errors are probably not, shall we say, enthusiastic debuggers.
ID: 71130