Message boards : Number crunching : Completing a WU? Impossible. What am i doing wrong?
Joined: 20 Jan 23 Posts: 3 Credit: 338,203 RAC: 1,606
Apart from PrimeGrid, this project has the longest WUs I've ever seen. I'm fine with that and I would love to contribute. But there's a huge problem: every time I open and close BOINC, or turn off or restart my PC, I risk losing the WU I was working on. Just now I was at around 40%, I restarted, and the WU disappeared again. What am I doing wrong? I can't leave it on and working 24/7.
Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073
Sadly this is a known problem with some (many?) of the recent CPDN batches of work. It comes down to the way the task saves and restores the files required for a restart not working as intended...
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944
If you can suspend to disk instead of just turning off, that seems to at least greatly reduce the problem. I have two tasks of the current batch running in a VM, and saving the machine state several times hasn't lost them so far, though I hope I don't need to do so again before they finish.
Joined: 29 Oct 17 Posts: 1049 Credit: 16,490,541 RAC: 15,784
This particular Weather@Home batch has problems restarting from its checkpoint files; tasks often fail when attempting to restart. A task has to restart every time BOINC is shut down (e.g. PC power-off), or when computation is suspended and the task is moved out of memory. As said previously, if you can, it's good practice to set 'Leave non-GPU tasks in memory while suspended': this means *any* task from any project doesn't need to restart from its checkpoint files on disk, so tasks finish quicker, and for Weather@Home it reduces the chance of failure. It's a known problem which we're working on. --- CPDN Visiting Scientist
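For reference, that same preference can also be set outside the BOINC Manager GUI via global_prefs_override.xml in the BOINC data directory. A minimal sketch using the standard BOINC preference name (this is generic BOINC configuration, not CPDN-specific advice):

```xml
<!-- global_prefs_override.xml in the BOINC data directory.
     leave_apps_in_memory corresponds to the Manager option
     'Leave non-GPU tasks in memory while suspended'. -->
<global_preferences>
    <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
```

After editing the file, it can be applied without restarting via Options → Read config files in the Manager, or `boinccmd --read_global_prefs_override` on the command line.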
Joined: 6 Aug 04 Posts: 195 Credit: 28,374,828 RAC: 10,749
> ... I can't leave it on and working 24/7.

'Sleep' rather than 'power off' will keep the Windoze system in memory and works OK for me with WAH2 tasks. Additionally, manually pause 'Windoze Update' for as long as it will let you, and restart the updates after the tasks have finished. 'Hibernate' should behave similarly, saving to disc, although I have not tested that feature with WAH2 tasks.
Joined: 29 Oct 17 Posts: 1049 Credit: 16,490,541 RAC: 15,784
> 'Sleep' rather than 'power off' will keep the Windoze system in memory and works OK for me with WAH2 tasks. [...] 'Hibernate' should behave similarly, saving to disc, although I have not tested the feature with WAH2 tasks

I tried that some time ago and it did not work (Windows 11); I still lost WaH tasks upon wake after sleep. My guess was the system was still pushing tasks out of RAM to swap for whatever reason. My experienced failure rate is about 1 in 5 tasks on a client start, so it might just be luck?

This has been a problem with WaH for a while. There's a bug in the code that we're looking for.
Joined: 18 Aug 18 Posts: 1 Credit: 5,043,538 RAC: 10,197
But if it's a bug in the code, there's something that puzzles me. Recently I got a task from Batch 994 on one of my old systems (Intel Core2 Duo T5900 CPU from 2008, 32-bit architecture + Windows 10 64-bit + BOINC 7.22.2). Of course it's a slow system, but I finished the task successfully in 40 days and 15 hours with 34+ Windows restarts! I am monitoring some apps with this system during the day and I usually shut it down every evening. You can count the "Quit request from BOINC" entries in the stderr.txt file here: https://www.cpdn.org/result.php?resultid=22326043

Now I have another task from Batch 996 on the system, and so far I have shut it down and restarted the system 8 or 9 times. This task seems to be stable too. Look here: https://www.cpdn.org/result.php?resultid=22347602

This is just a subjective impression, but the newer my systems get, the higher the rate of tasks crashing during a restart. I have a small server with an Intel Xeon Scalable CPU from 2019 with some VMs on it, and it seems I can finish only 3-6 of the roughly 40 tasks from Batch 996 that I caught (loss rate 85-92.5% over two Windows-Update restarts). My newest system, with an AMD Ryzen 5 5625U CPU from 2023 (Windows 11 + BOINC 7.24.1), got 14 tasks, and I lost all of them during the first restart (loss rate: 100%).

Of course I know the analysis is more complex than just looking at loss rates and CPUs; I ignored the RAM, for example. Maybe the minimal task sample size of about 60 tasks misleads my thoughts and I only had luck with the tasks on my older machine. But I am wondering: why can tasks survive 34 or more restarts from checkpoints on a slow machine, yet crash on newer, faster machines? If there is a structural bug in the code, shouldn't it affect all systems at a nearly equal rate? Or do the restart crashes depend on how fast the files are loaded into RAM, or is it "old" app code not compatible with "new" system architectures (32-bit vs. 64-bit, for example)?

Would it be a temporary workaround for users to run older systems to avoid so many crashes of the recent unstable tasks?
Joined: 29 Oct 17 Posts: 1049 Credit: 16,490,541 RAC: 15,784
Yes, I've noticed this apparent behaviour too. I think this is down to the way the code is compiled: the compiler is told to add conditional code for different instruction sets depending on what chip it finds. For instance, instructions for SSE2/3/4.* are included, and when the code executes it will use different assembly instructions depending on what capabilities it finds on the chip. Older chips therefore will not necessarily be running the same assembler as more recent chips. That's great for speed but a bugger for debugging. The executable sent out with WaH is also old; it's not been recompiled for many years, and that could also be introducing issues. --- CPDN Visiting Scientist
Joined: 12 Apr 21 Posts: 317 Credit: 14,890,678 RAC: 18,887
> I think this is down to the way the code is compiled; the compiler is told to add in conditional code for different instruction sets depending what chip it finds. [...] That's great for speed but a bugger for debugging.

That sounds like a more likely explanation than some of the previous ones given, at least to me. Have you or anyone else tried recompiling the executable and running CPDN via the Anonymous Platform setup (which lets you use your own executables)? That could give some useful information for finding the problem. Anonymous Platform setup is described here: https://boinc.berkeley.edu/wiki/Anonymous_platform
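For anyone curious what that setup involves: with the anonymous platform, the client reads an app_info.xml file in the project's directory describing your own build. A minimal hypothetical sketch (the app and file names here are invented for illustration; they would need to match the project's application name and your executable, per the wiki page above):

```xml
<!-- app_info.xml in the project directory under the BOINC data dir.
     Names are hypothetical examples, not real CPDN app names. -->
<app_info>
    <app>
        <name>wah2</name>
    </app>
    <file_info>
        <name>wah2_custom_build</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>wah2</app_name>
        <version_num>100</version_num>
        <file_ref>
            <file_name>wah2_custom_build</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
```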
Joined: 29 Oct 17 Posts: 1049 Credit: 16,490,541 RAC: 15,784
I'm working on the model standalone in Linux; then we'll rebuild it in Windows and finally move it to the BOINC platform. One step at a time. Richard introduced me to the anonymous platform a while ago, but CPDN do not support it.
Joined: 8 Jan 22 Posts: 9 Credit: 1,780,471 RAC: 3,152
This may have been answered already in another thread, but my question fits this subject. While crunching, the WUs send intermediate progress checkpoints back to your server; I believe they are referred to as "trickles", for which we are awarded credit. If a WU fails after such a trickle is saved, does that mean the WU is sent out to another volunteer from that point, or from scratch? Tom
Joined: 29 Oct 17 Posts: 1049 Credit: 16,490,541 RAC: 15,784
> If a wu fails after that trickle-save does that mean that that wu is sent out to another volunteer from that point or from scratch?

It starts from scratch, i.e. the beginning of the model run. You get credit for the computing work done up to the last trickle, which won't be far from the point where the model fails. It's better to do it that way rather than try to send the latest checkpoint/restart files to the next user, as it keeps the results consistent: there's no guarantee one machine will produce the same results as the next.
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944
> There's no guarantee one machine will produce the same results as the next.

I know this is true for the Hadley models. Is it also true for OIFS? I should add that, as expected, under WINE all the output files came out identical between a Wine installation of BOINC and one running under Windows in a VM. I have now deleted the cloned profile to avoid accidentally confusing the BOINC server software.
©2024 cpdn.org