Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25
Author | Message |
---|---|
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
If the model blows up from a numerical problem that would obviously count as 'invalid' (as opposed to any restart problem). If it completes then it's not possible to tell from a single run. It would need many runs to look at the statistics of the model results and see if there is a bias in the results compared to runs on bare metal. Not invalid as in rejected by the software but invalid as in useless for the science. What I am not sure of Glenn is whether even if the science data is the same between both runs, whether tasks that complete under WINE but not using Windows are still invalid or even if there is a way of checking that? What do you mean by 'invalid'? The WINE issue was more straightforward. We expected the model to crash with a memory error when it was run but under WINE it *always* ran, so there's clearly something special about that environment for memory errors. It doesn't necessarily mean working tasks would be 'wrong'. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Do you have 'Leave non-GPU tasks in memory when suspended' enabled under 'General' (or 'Memory/Disk') in boincmgr? It's restarting the model from a shutdown that risks the model failing like this. None of my "two minute crashes" have been the result of re-start after a shutdown. If not, if the client suspends the tasks (non-BOINC CPU usage too high, for instance), the tasks might be kicked out of memory and would then have to restart from the restart files on disk. Some of the tasks will fail despite this because they are being deliberately perturbed; some of the forecasts will be physically unrealistic for the model to handle. |
Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
The WINE issue was more straightforward. We expected the model to crash with a memory error when it was run but under WINE it *always* ran, so there's clearly something special about that environment for memory errors. It doesn't necessarily mean working tasks would be 'wrong'. That is what I suspected the answer would be. I don't even know if there are enough hosts using WINE to do that statistical work even if we could distinguish between hosts using WINE and those using Windows. In a bit under an hour I should be able to compare the first zips between tasks running under WINE and Windows in a VM. I will let you know what I find and PM if I get stuck with anything or come up with anything I think significant. Edit: both _1.zip files are identical. I actually ran diff on the .nc files contained in the zip individually first rather than on the two zip files which would have saved me a couple of minutes faffing around. |
Joined: 5 Jun 09 Posts: 97 Credit: 3,736,027 RAC: 4,083 |
Do you have 'Leave non-GPU tasks in memory when suspended' enabled under 'General' (or 'Memory/Disk') in boincmgr? I rarely suspend BOINC, but until learning that this current batch does not shut down and then restart properly, I was shutting down BOINC in the minutes before shutting the computer off. I do however have "leave non-GPU tasks in memory when suspending" set ON. Edit to add: my sole computer is running Windows 10, has 32GB memory and 8 "real" cores, plus another 8 with hyper-threading turned on (virtualisation is also turned on, but no virtual machine is running just now). |
Joined: 12 Apr 21 Posts: 317 Credit: 14,825,218 RAC: 19,877 |
I am sure a lot of the hard fails are simply due to this and not because of an inherent problem with the model perturbations. Not sure whether CPDN will decide to rerun them or not yet. Yeah, 12 of my 32 failures so far have been due to BOINC restarts. Some due to an unintentional PC shutdown and others due to BOINC seemingly crashing (came to check on things and found BOINC wasn't running). Sucks as most of those have run for ~12 days and had no more than a day to go to finish. Hopefully the remaining 21 will finish successfully, as I have only had 5 finish successfully so far. |
Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
"and others due to BOINC seemingly crashing" BOINC crashing when left unattended is a new one for me. The only times I can remember BOINC itself crashing as opposed to tasks is the recent bug that would sometimes make it crash switching between advanced and simple view then back again. I think that has been fixed now but not 100% certain. |
Joined: 12 Apr 21 Posts: 317 Credit: 14,825,218 RAC: 19,877 |
"and others due to BOINC seemingly crashing" BOINC crashing when left unattended is a new one for me. The only times I can remember BOINC itself crashing as opposed to tasks is the recent bug that would sometimes make it crash switching between advanced and simple view then back again. I think that has been fixed now but not 100% certain. This happens to me once in a while, I'll come check on it and find BOINC isn't running. I can't remember when the first time was, this year or last. I think this was the first time with the latest version (7.24.1). It was also the first time with CPDN running which was costly due to loss of tasks and a lot of processing time. I haven't tried to investigate it in any way yet. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,703,308 RAC: 9,860 |
"... BOINC itself crashing as opposed to tasks is the recent bug that would sometimes make it crash switching between advanced and simple view then back again ..." Both Dave and I participated in a conversation with the developers on this one: #4784. The code is fixed and tested, but the release process has stalled, and it may not be in public use yet. But for clarity: it was only the Manager which crashed. The client, and hence the science applications, kept running. I've also not had any problems with the client crashing by itself. My main problems have been:
I think it's been suggested that the second can happen if the host memory becomes too full. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
If it happens again it would be good to check in the system logs to see why it failed. I'm assuming this is the Windows client only and not Linux (I've never seen it myself). BOINC crashing when left unattended is a new one for me. The only times I can remember BOINC itself crashing as opposed to tasks is the recent bug that would sometimes make it crash switching between advanced and simple view then back again. I think that has been fixed now but not 100% certain. This happens to me once in a while, I'll come check on it and find BOINC isn't running. I can't remember when the first time was, this year or last. I think this was the first time with the latest version (7.24.1). It was also the first time with CPDN running which was costly due to loss of tasks and a lot of processing time. I haven't tried to investigate it in any way yet. |
Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
Both Dave and I participated in a conversation with the developers on this one: #4784. The code is fixed and tested, but the release process has stalled, and it may not be in public use yet. Just tried switching back and forth between simple and advanced views 10 times on 7.24.1 without inducing a crash but it didn't happen every time before. I am not sure how many times I need to test it to be sure? |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,703,308 RAC: 9,860 |
I am not sure how many times I need to test it to be sure? When I started that ticket, it was near immediate - one or two cycles at most. I forget how many times I tried it when reporting a successful fix (lower down - after a second problem with the size of text boxes), but at least two - that was probably enough. |
Joined: 12 Apr 21 Posts: 317 Credit: 14,825,218 RAC: 19,877 |
If it happens again it would be good to check in the system logs to see why it failed. I'm assuming this is the Windows client only and not Linux (I've never seen it myself). BOINC crashing when left unattended is a new one for me. The only times I can remember BOINC itself crashing as opposed to tasks is the recent bug that would sometimes make it crash switching between advanced and simple view then back again. I think that has been fixed now but not 100% certain. This happens to me once in a while, I'll come check on it and find BOINC isn't running. I can't remember when the first time was, this year or last. I think this was the first time with the latest version (7.24.1). It was also the first time with CPDN running which was costly due to loss of tasks and a lot of processing time. I haven't tried to investigate it in any way yet. Yes, it's a Windows 10 PC (I use WSL2 for any Linux BOINC work, which I haven't run for months now). I'd look, but I don't really know where to look or what to look for. I did look at Reliability History and Event Viewer after posting that first post but couldn't find anything. It's definitely not related to switching between views, as I don't switch and always use the Advanced view. It's also not just the Manager crashing, as that'd be easy to tell (BOINC start up, CPU temperature changes). Nor is it due to a system reboot caused by some critical system component crash or power failure, as that'd also be easy to tell. I have RyzenMaster & MSI Afterburner that start up first to turn on undervolt settings before other things like BOINC start, and I have to manually apply the settings & close those programs before anything else proceeds, so I can tell when there was a system restart. |
Joined: 1 Apr 12 Posts: 3 Credit: 15,024,721 RAC: 9,374 |
Is this similar to what's being discussed? Here is what was displayed in a pop-up dialog on October 19, 2023: BOINC Manager - Connection Error Invalid client RPC password. Try reinstalling BOINC. The BOINC Manager was blank - no project, task or other data displayed. I used the preferred shutdown method for BOINC and restarted my computer. The climate project and tasks displayed after the restart. The two trickles waiting to send displayed errors. BOINC resumed computing the tasks. About 30 minutes later, it occurred again and all of the tasks for the project crashed. https://www.cpdn.org/result.php?resultid=22347881 15 (0x0000000F) Unknown error code https://www.cpdn.org/result.php?resultid=22347188 https://www.cpdn.org/result.php?resultid=22336571 Suspended CPDN Monitor - Suspend request from BOINC... 10:45:57 (12972): BOINC client no longer exists - exiting 10:45:57 (12972): timer handler: client dead, exiting CPDN Monitor - No 'heartbeat' from BOINC... |
Joined: 12 Apr 21 Posts: 317 Credit: 14,825,218 RAC: 19,877 |
Sardis73, looks different than mine. For me, both Client & Manager crash; in your case it seems to be just the Client. I've seen the Invalid Client RPC password error before but it usually happens right away when you start BOINC. The fact that yours happens some time after everything has started and been running for a while is new and a bit puzzling. |
Joined: 9 Sep 04 Posts: 228 Credit: 30,750,791 RAC: 3,898 |
Hello again... Most probably it's the Windows memory management... Greetings to all believers in science! I've a rock-solid computer that is more than 11 years old and still works at its best. The same errors are appearing over here with this computer. Every fresh start of BOINC that restarts several CPDN models causes all of the models to die. But the good news: they carry on, because I always restarted from backup. So none of the models is faulty; they keep working and will also finish successfully. So far no other errors, but BOINC Manager version 7.24.1 isn't very reliable. Cheers, and have a peaceful day, Bonsai911 |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
About 30 minutes later, it occurred again and all of the tasks for the project crashed. This task shows the error 'system cannot find the drive specified'. I think that's why the client & tasks died. I've seen that multiple times on the Windows tasks and there was some discussion about it earlier (maybe this thread?). I forget the outcome of the discussion but I wonder whether it indicates a failing drive? Or maybe one that's getting hammered by other process(es). Perhaps look at the drive's SMART diagnostics to check what's going on with it. <![CDATA[ <message> The system cannot find the drive specified. (0xf) - exit code 15 (0xf)</message> <stderr_txt> 10:46:05 (12540): BOINC client no longer exists - exiting 10:46:05 (12540): timer handler: client dead, exiting CPDN Monitor - No 'heartbeat' from BOINC... --- CPDN Visiting Scientist |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,703,308 RAC: 9,860 |
The phrase "BOINC client no longer exists" is, I think, written by the CPDN wrapper - called in from the BOINC API library. The only time it's used in BOINC code is https://github.com/BOINC/boinc/blob/master/api/boinc_api.cpp#L508, where its usage is determined by a test on client_pid. I take that to mean that the client is no longer running in memory - it wouldn't tell us anything about the binary file being stored on disk. So I would take it that "The system cannot find the drive specified" is detected first by the client, and causes that to crash: the subsequent CPDN exit would be a result of that, not caused by it. |
Joined: 9 Sep 04 Posts: 228 Credit: 30,750,791 RAC: 3,898 |
The same ^↑↑↑↑↑ occurred over here on my computer, but I think SMART diagnostics isn't helpful at all, because it happened on my newest (in use for less than one year) and so-far error-free solid-state drive. I'm also monitoring my drives with three real-time programs, and there has been no error so far on any drive. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"The same ^↑↑↑↑↑ occurred over here on my computer, but I think SMART diagnostics isn't helpful at all," The error "The system cannot find the drive specified" is coming from Windows and points to a problem accessing the drive for whatever reason. It's probably intermittent. Doing a Google search on the error message shows plenty of hits with various suggestions for why it's happening and the remedies (device drivers, virus checkers, etc.), e.g. https://www.thewindowsclub.com/the-system-cannot-find-the-drive-specified-fixed |
Joined: 23 Feb 05 Posts: 7 Credit: 1,423,261 RAC: 213 |
Speaking of one year deadlines, I've just been handed the HadSM4 task below from last November that has finally timed out from the original BOINCer. Does anyone know whether it is still of use or would it just be a waste of electricity? https://www.cpdn.org/workunit.php?wuid=12154819 |
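
Editor's note on the WINE vs Windows comparison earlier in the thread: the two _1.zip uploads were checked by running diff on the .nc files inside each zip. For anyone wanting to repeat that check, here is a minimal sketch in Python, using only the standard library. The function names `member_digests` and `compare_zips` are made up for this illustration and are not part of BOINC or the CPDN tooling, and the file paths in the usage note are placeholders.

```python
import hashlib
import zipfile


def member_digests(zip_path):
    """Map each file name inside the zip to the SHA-256 of its uncompressed bytes."""
    digests = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data to compare
            with archive.open(info) as member:
                digests[info.filename] = hashlib.sha256(member.read()).hexdigest()
    return digests


def compare_zips(path_a, path_b):
    """Return (only_in_a, only_in_b, differing) as sorted lists of member names."""
    a = member_digests(path_a)
    b = member_digests(path_b)
    only_in_a = sorted(set(a) - set(b))
    only_in_b = sorted(set(b) - set(a))
    # A member "differs" when both zips contain it but the contents do not match.
    differing = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return only_in_a, only_in_b, differing
```

Calling `compare_zips("wine_1.zip", "windows_1.zip")` (placeholder paths) returns three empty lists when every member matches. Comparing the members rather than the zip files themselves can be useful because a zip container also stores timestamps and compression details, so two archives may differ byte for byte even when the files inside are identical. As noted above, identical output from a single pair of runs still says nothing about statistical bias across many runs.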
©2024 cpdn.org