Thread 'Completing a WU? Impossible. What am i doing wrong?'

Author	Message
Luca Joined: 20 Jan 23 Posts: 3 Credit: 338,203 RAC: 1,606	Message 70004 - Posted: 28 Oct 2023, 19:17:19 UTC
Apart from PrimeGrid, this project has the longest WUs I've ever seen. I'm fine with that and I would love to contribute. But there's a huge problem. Every time I open and close BOINC, or turn off or restart my PC, I risk losing the WU I was working on. Just now I was at around 40%; I restarted and the WU disappeared again. What am I doing wrong? I can't leave it on and working 24/7.

rob Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073	Message 70005 - Posted: 28 Oct 2023, 19:40:02 UTC - in response to Message 70004.
Sadly, this is a known problem with some (many?) of the recent CPDN batches of work. It's all to do with the way the task saves and restores the file required to do a restart, which is not working as intended...

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 70007 - Posted: 28 Oct 2023, 20:05:24 UTC - in response to Message 70005.
If you can suspend to disk instead of just turning off, that seems to at least greatly reduce the problem. I have two tasks of the current batch running in a VM, and saving the machine state several times hasn't lost them so far, though I hope I don't need to do so again before they finish.

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,490,541 RAC: 15,784	Message 70008 - Posted: 29 Oct 2023, 9:34:10 UTC
This particular Weather@Home batch has problems restarting from its checkpoint files. Tasks often fail when attempting to restart. A task has to restart every time boinc is shut down (PC power-off), or when computation is suspended and the task is moved out of memory. As said previously, if you can, it's good practice to set 'Leave non-GPU tasks in memory while suspended', as this means any task from any project doesn't need to restart from its checkpoint files on disk, which means the task will finish quicker. For Weather@Home it reduces the possibility of failure. It's a known problem which we're working on.
---
CPDN Visiting Scientist
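The 'Leave non-GPU tasks in memory while suspended' option mentioned above can also be set through the BOINC client's local preferences override file, `global_prefs_override.xml`, using the `<leave_apps_in_memory>` element. The following is only a minimal Python sketch that builds such a fragment. The helper name `build_override_xml` is made up for illustration, and the sketch deliberately does not write to disk: where the BOINC data directory lives depends on the installation, and overwriting an existing override file would discard any other local preferences in it.

```python
import xml.etree.ElementTree as ET


def build_override_xml(leave_in_memory: bool = True) -> str:
    """Return a global_prefs_override.xml fragment as a string.

    build_override_xml is a hypothetical helper, not part of BOINC.
    Only the leave_apps_in_memory preference is included here.
    """
    root = ET.Element("global_preferences")
    pref = ET.SubElement(root, "leave_apps_in_memory")
    # BOINC preferences use 1/0 for boolean values.
    pref.text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the fragment so it can be reviewed before being merged by hand
    # into an existing override file.
    print(build_override_xml())
```

After the file is edited, the client has to re-read it, for example with `boinccmd --read_global_prefs_override` or by restarting the client. Ticking the same option in the BOINC Manager computing preferences should have the same effect and is the simpler route for most users.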

wateroakley Joined: 6 Aug 04 Posts: 195 Credit: 28,374,828 RAC: 10,749	Message 70009 - Posted: 29 Oct 2023, 11:33:36 UTC - in response to Message 70004.
"... I can't leave it on and working 24/7."
'Sleep' rather than 'power off' will keep the Windoze system in memory and works OK for me with WAH2 tasks. Additionally, manually control 'Windoze Update' with 'Pause' for as long as it will let you, and restart the updates after the tasks have finished. 'Hibernate' should behave similarly, saving to disc, although I have not tested the feature with WAH2 tasks.

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,490,541 RAC: 15,784	Message 70010 - Posted: 29 Oct 2023, 14:57:35 UTC - in response to Message 70009. Last modified: 29 Oct 2023, 14:58:18 UTC
"... I can't leave it on and working 24/7. 'Sleep' rather than 'power off' will keep the Windoze system in memory and works OK for me with WAH2 tasks. Additionally, manually control 'Windoze Update' with 'Pause' for as long as it will let you and restart the updates after the tasks have finished. 'Hibernate' should behave similarly, saving to disc, although I have not tested the feature with WAH2 tasks"
I tried that some time ago and it did not work (Windows 11); I still lost WaH tasks on waking from sleep. My guess was that the system was still pushing tasks out of RAM to swap for whatever reason. The failure rate I have experienced is about 1 in 5 tasks on a client start, so it might just be luck? This has been a problem with WaH for a while. There's a bug in the code that we're looking for.

Bastian Baum Joined: 18 Aug 18 Posts: 1 Credit: 5,043,538 RAC: 10,197	Message 70011 - Posted: 29 Oct 2023, 19:58:51 UTC Last modified: 29 Oct 2023, 20:09:00 UTC
But if it's a bug in the code, there's something that puzzles me. Recently I got a task from Batch 994 on one of my old systems (Intel Core2 Duo CPU T5900 from 2008, 32-Bit-Architecture + Windows 10 64-Bit + Boinc 7.22.2). Of course it's a slow system, but I finished the task successfully in 40 days and 15 hours with 34+ Windows restarts! I am monitoring some apps with this system during the day and I usually shut it down every evening. You can count the "Quit request from BOINC" entries in the stderr.txt file here: https://www.cpdn.org/result.php?resultid=22326043
Now I have another task from Batch 996 on that system, and so far I have shut it down and restarted it 8 or 9 times. This task seems to be stable, too. Look here: https://www.cpdn.org/result.php?resultid=22347602
This is just a subjective impression, but the newer my systems get, the higher the rate of tasks that crash during a restart. I have a small server with an Intel Xeon Scalable CPU from 2019 with some VMs on it, and it looks as if I can finish 3-6 of the roughly 40 tasks from Batch 996 that I caught (loss rate 85 - 92.5 % during two Windows-update restarts). My newest system, with an AMD Ryzen 5 5625U CPU from 2023 (Windows 11 + Boinc 7.24.1), got 14 tasks and I lost all of them during the first restart (loss rate: 100%).
Of course I know that the analysis is more complex than just looking at loss rates and CPUs; I ignored the RAM, for example. Maybe the small sample of about 60 tasks is misleading me and I was simply lucky with the tasks on my older machine. But I am wondering why tasks can survive 34 or more restarts from checkpoints on a slow machine, but crash on newer, faster machines. If there is a structural bug in the code, shouldn't it affect all systems at a nearly equal rate?
Or do the restart crashes depend on how fast the files are loaded into RAM, or is "old" app code not compatible with "new" system architectures (32-bit vs. 64-bit, for example)? Would it be a temporary solution for users to use older systems to avoid too many crashes of the recent unstable tasks?
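The restart counts and loss rates quoted in the post above can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, the sample log text is invented (the real stderr.txt layout is not shown in the thread), and only the marker string "Quit request from BOINC" is taken from the post.

```python
def count_quit_requests(stderr_text: str,
                        marker: str = "Quit request from BOINC") -> int:
    """Count how often the quit marker appears in a task's stderr text."""
    return stderr_text.count(marker)


def loss_rate(received: int, finished: int) -> float:
    """Percentage of received tasks that did not finish."""
    if received <= 0:
        raise ValueError("received must be positive")
    return 100.0 * (received - finished) / received


if __name__ == "__main__":
    # Invented sample text; a real stderr.txt will look different.
    sample = (
        "model step\n"
        "Quit request from BOINC\n"
        "restarting\n"
        "Quit request from BOINC\n"
    )
    print(count_quit_requests(sample))  # restarts seen in the sample

    # Figures from the post: about 40 tasks received, 3 to 6 finished.
    print(loss_rate(40, 6), loss_rate(40, 3))
    # Newest machine: 14 tasks received, none finished.
    print(loss_rate(14, 0))
```

With samples this small (about 60 tasks in total) the percentages carry wide uncertainty, which is the caveat the post itself raises.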

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,490,541 RAC: 15,784	Message 70016 - Posted: 30 Oct 2023, 14:27:09 UTC - in response to Message 70011. Last modified: 30 Oct 2023, 14:27:24 UTC
Yes, I've noticed this apparent behaviour too. I think this is down to the way the code is compiled; the compiler is told to add in conditional code for different instruction sets depending on what chip it finds. For instance, instructions for SSE2/3/4.* are included. When the code executes, it will use different assembly instructions depending on what capabilities it finds on the chip. Older chips therefore will not necessarily be executing the same assembly instructions as more recent chips. That's great for speed but a bugger for debugging. The executable sent out with WaH is also old. It's not been recompiled for many years and that could also be introducing issues.
---
CPDN Visiting Scientist

AndreyOR Joined: 12 Apr 21 Posts: 317 Credit: 14,890,678 RAC: 18,887	Message 70020 - Posted: 30 Oct 2023, 20:42:50 UTC - in response to Message 70016.
"Yes, I've noticed this apparent behaviour too. I think this is down to the way the code is compiled; the compiler is told to add in conditional code for different instruction sets depending what chip it finds. For instance, instructions for SSE2/3/4.* are included. When the code executes it will use different assembly instructions depending on what capabilities it finds on the chip. Older chips therefore will not necessarily be using the same assembler as more recent chips. That's great for speed but a bugger for debugging. The executable sent out with WaH is also old. It's not been recompiled for many years and that could also be introducing issues."
The above sounds like a more likely explanation than some of the previous ones given, at least to me. Have you or anyone else tried recompiling the executable and running CPDN via the Anonymous Platform setup (which allows you to use your own executables)? That could give some useful information for finding the problem. The Anonymous Platform setup is described here: https://boinc.berkeley.edu/wiki/Anonymous_platform

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,490,541 RAC: 15,784	Message 70021 - Posted: 31 Oct 2023, 8:46:15 UTC - in response to Message 70020. Last modified: 31 Oct 2023, 8:46:37 UTC
I'm working on the model standalone in Linux, then we'll rebuild in Windows and finally move it to the boinc platform. One step at a time. Richard introduced me to the anonymous platform a while ago but CPDN do not support it.

Drago75 Joined: 8 Jan 22 Posts: 9 Credit: 1,780,471 RAC: 3,152	Message 70023 - Posted: 31 Oct 2023, 10:11:53 UTC - in response to Message 70021.
This may have been answered already in another thread, but my question fits this subject. While crunching, the WUs send intermediate progress checkpoints back to your server; I believe they are referred to as "trickles", for which we are awarded credit. If a WU fails after that trickle-save, does that mean that the WU is sent out to another volunteer from that point or from scratch?
Tom

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,490,541 RAC: 15,784	Message 70024 - Posted: 31 Oct 2023, 10:26:37 UTC - in response to Message 70023.
"This may have been answered already in another thread but my question fits this subject. While crunching the wus they send intermediate progress checkpoints back to your server, I believe they are refered to as "trickles" for which we are awarded credit. If a wu fails after that trickle-save does that mean that that wu is sent out to another volunteer from that point or from scratch?"
It starts from scratch, at the beginning of the model run. You get credit for the computing work done up to the last trickle, which won't be far from the point where the model failed. It's better to do it that way than to try to send out the latest checkpoint/restart files to the next user, as it keeps the results consistent. There's no guarantee one machine will produce the same results as the next.
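One general reason why two machines can produce slightly different results from the same model is that floating-point addition is not associative: the same numbers summed in a different order, for instance because a different instruction set or optimisation path is used, can round differently. Whether this is the dominant effect for the CPDN models is an assumption here and is not stated in the thread; the Python sketch below only illustrates the general mechanism.

```python
import math

# The same three numbers, added with two different groupings.
a, b, c = 0.1, 0.2, 0.3

left_grouped = (a + b) + c   # add a and b first
right_grouped = a + (b + c)  # add b and c first

# The two results differ in the last bit of the double.
print(left_grouped == right_grouped)
print(repr(left_grouped), repr(right_grouped))

# They still agree to within normal floating-point tolerance.
print(math.isclose(left_grouped, right_grouped))
```

In a long chaotic simulation, differences of this size can grow over many time steps, which is one plausible reason for restarting a failed task from the beginning on a single machine rather than mixing checkpoints between hosts.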

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 70025 - Posted: 31 Oct 2023, 11:16:40 UTC - in response to Message 70024.
"There's no guarantee one machine will produce the same results as the next."
I know this is true for the Hadley models. Is it also true for OIFS? I should add that, as expected, all the output files came out identical between a Wine installation of BOINC and one running under Windows in a VM. I have now deleted the cloned profile to avoid accidentally confusing the BOINC server software.