climateprediction.net (CPDN) home page
Thread 'Completing a WU? Impossible. What am i doing wrong?'

Luca
Joined: 20 Jan 23
Posts: 3
Credit: 338,203
RAC: 1,606
Message 70004 - Posted: 28 Oct 2023, 19:17:19 UTC

Apart from PrimeGrid, this project has the longest WUs I've ever seen.

I'm fine with that and I would love to contribute.

But there's a huge problem. Every time I open and close BOINC, or turn off or restart my PC, I risk losing the WU I was working on.

Just now I was at around 40%; I restarted and the WU disappeared again. What am I doing wrong?

I can't leave it on and working 24/7.
ID: 70004
rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,736,855
RAC: 4,073
Message 70005 - Posted: 28 Oct 2023, 19:40:02 UTC - in response to Message 70004.  

Sadly, this is a known problem with some (many?) of the recent CPDN batches of work. It comes down to the way the task saves and restores the files required for a restart not working as intended.
ID: 70005
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70007 - Posted: 28 Oct 2023, 20:05:24 UTC - in response to Message 70005.  

If you can suspend to disk instead of just turning off, that seems to at least greatly reduce the problem. I have two tasks of the current batch running in a VM, and saving the machine state several times hasn't lost them so far, though I hope I don't need to do so again before they finish.
ID: 70007
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,490,541
RAC: 15,784
Message 70008 - Posted: 29 Oct 2023, 9:34:10 UTC

This particular Weather@Home batch has problems restarting from its checkpoint files. Tasks often fail when attempting to restart. A task has to restart every time BOINC is shut down (e.g. at PC power-off), or when computation is suspended and the task is moved out of memory.

As said previously, if you can it's good practice to set 'Leave non-GPU tasks in memory while suspended', as this means *any* task from any project doesn't need to restart from its checkpoint files on disk, which means the task will finish quicker. For Weather@Home it reduces the possibility of failure.
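
For anyone who prefers a file to the tick box in BOINC Manager: as far as I know, the same preference can be set through `global_prefs_override.xml` in the BOINC data directory, using the `leave_apps_in_memory` element. The sketch below only builds the XML text. `build_override` is a made-up helper name, and the location of the data directory depends on your installation.

```python
import xml.etree.ElementTree as ET


def build_override(leave_in_memory: bool = True) -> str:
    """Return XML text for a minimal global_prefs_override.xml.

    build_override is a hypothetical helper for illustration only;
    it does not touch the BOINC client or write any file.
    """
    root = ET.Element("global_preferences")
    flag = ET.SubElement(root, "leave_apps_in_memory")
    # "1" keeps suspended tasks in RAM, "0" lets them be removed from memory
    flag.text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_override())
```

After saving the output as `global_prefs_override.xml` in the data directory, running `boinccmd --read_global_prefs_override` (or restarting the client) should make the client pick it up. Note that this only helps while BOINC itself keeps running; a full shutdown still forces a restart from the checkpoint files.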

It's a known problem which we're working on.
---
CPDN Visiting Scientist
ID: 70008
wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,374,828
RAC: 10,749
Message 70009 - Posted: 29 Oct 2023, 11:33:36 UTC - in response to Message 70004.  

... I can't leave it on and working 24/7.
'Sleep' rather than 'power off' will keep the Windoze system in memory and works OK for me with WAH2 tasks. Additionally, manually control 'Windoze Update' with 'Pause' for as long as it will let you and restart the updates after the tasks have finished. 'Hibernate' should behave similarly, saving to disc, although I have not tested the feature with WAH2 tasks
ID: 70009
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,490,541
RAC: 15,784
Message 70010 - Posted: 29 Oct 2023, 14:57:35 UTC - in response to Message 70009.  
Last modified: 29 Oct 2023, 14:58:18 UTC

... I can't leave it on and working 24/7.
'Sleep' rather than 'power off' will keep the Windoze system in memory and works OK for me with WAH2 tasks. Additionally, manually control 'Windoze Update' with 'Pause' for as long as it will let you and restart the updates after the tasks have finished. 'Hibernate' should behave similarly, saving to disc, although I have not tested the feature with WAH2 tasks
I tried that some time ago and it did not work (Windows 11); I still lost WaH tasks on waking from sleep. My guess was that the system was still pushing tasks out of RAM to swap for whatever reason. The failure rate I've seen is about 1 in 5 tasks per client start, so it might just be luck?

This has been a problem with WaH for a while. There's a bug in the code that we're looking for.
ID: 70010
Bastian Baum
Joined: 18 Aug 18
Posts: 1
Credit: 5,043,538
RAC: 10,197
Message 70011 - Posted: 29 Oct 2023, 19:58:51 UTC
Last modified: 29 Oct 2023, 20:09:00 UTC

But if it's a bug in the code, there's something that puzzles me.

Recently I got a task from Batch 994 on one of my old systems (Intel Core2 Duo CPU T5900 from 2008, 32-Bit-Architecture + Windows 10 64-Bit + Boinc 7.22.2). Of course it's a slow system but I finished the task successfully in 40 days and 15 hours with 34+ Windows-Restarts!
I am monitoring some Apps with this system during the day and I usually shut it down every evening.
You can count the "Quit request from BOINC" in the stderr.txt file here: https://www.cpdn.org/result.php?resultid=22326043
Now I have another task from Batch 996 on the system, and so far I have shut it down and restarted it 8 or 9 times.
This task seems to be stable, too. Look here: https://www.cpdn.org/result.php?resultid=22347602
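
A minimal sketch of that count, assuming the contents of stderr.txt have been copied into a string. The marker text is the one quoted above; `count_quit_requests` and the sample lines are invented for illustration and are not real task output.

```python
def count_quit_requests(stderr_text: str,
                        marker: str = "Quit request from BOINC") -> int:
    """Count the lines of a task's stderr output that contain the marker."""
    return sum(1 for line in stderr_text.splitlines() if marker in line)


if __name__ == "__main__":
    # Invented sample text; a real stderr.txt will look different.
    sample = "\n".join([
        "placeholder line A",
        "Quit request from BOINC",
        "placeholder line B",
        "Quit request from BOINC",
    ])
    print(count_quit_requests(sample))
```

Each matching line corresponds to one shutdown that the task had to recover from via its checkpoint files.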

This is just a subjective impression, but the newer my systems get, the higher the rate of tasks that crash during a restart.
I have a small server with an Intel Xeon Scalable CPU from 2019 with some VMs on it, and it seems that I can finish 3-6 of the roughly 40 tasks from Batch 996 that I caught (loss rate 85-92.5% across two Windows Update restarts).

My newest system, with an AMD Ryzen 5 5625U CPU from 2023 (Windows 11 + Boinc 7.24.1), got 14 tasks, and I lost all of them during the first restart (loss rate: 100%).
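
For reference, the loss rates quoted above follow from simple arithmetic on the task counts. `loss_rate` is just an illustrative helper name.

```python
def loss_rate(received: int, finished: int) -> float:
    """Percentage of received tasks that did not finish."""
    if received <= 0:
        raise ValueError("received must be positive")
    return 100.0 * (received - finished) / received


if __name__ == "__main__":
    # Xeon server: about 40 tasks received, 3 to 6 finished
    print(loss_rate(40, 6), loss_rate(40, 3))
    # Ryzen laptop: 14 tasks received, none finished
    print(loss_rate(14, 0))
```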

Of course I know that the analysis is more complex than just looking at loss rates and CPUs; I ignored the RAM, for example. Maybe the small sample of about 60 tasks is misleading me and I was simply lucky with the tasks on my older machine. But I am wondering why tasks can survive 34 or more restarts from checkpoints on a slow machine, but crash on newer, faster machines?
If there is a structural bug in the code, shouldn't this affect all systems at a nearly equal rate?
Or do the restart crashes depend on how fast the files are loaded into RAM, or is the "old" app code not compatible with "new" system architectures (32-bit vs. 64-bit, for example)?
Would it be a temporary solution for users to run the recent unstable tasks on older systems to avoid too many crashes?
ID: 70011
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,490,541
RAC: 15,784
Message 70016 - Posted: 30 Oct 2023, 14:27:09 UTC - in response to Message 70011.  
Last modified: 30 Oct 2023, 14:27:24 UTC

Yes, I've noticed this apparent behaviour too. I think this is down to the way the code is compiled; the compiler is told to add in conditional code for different instruction sets depending what chip it finds. For instance, instructions for SSE2/3/4.* are included. When the code executes it will use different assembly instructions depending on what capabilities it finds on the chip. Older chips therefore will not necessarily be using the same assembler as more recent chips. That's great for speed but a bugger for debugging.
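
A toy sketch of that idea, not the WaH code: one program carries two summation routines and picks one at run time from the CPU features it detects. The feature name `"sse2"`, the function names, and the two-lane accumulation (a stand-in for a vectorised loop) are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Set


def sum_sequential(values: Sequence[float]) -> float:
    """Scalar path: add the values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_two_lanes(values: Sequence[float]) -> float:
    """Vector-style path: two partial sums, combined at the end."""
    lanes = [0.0, 0.0]
    for i, v in enumerate(values):
        lanes[i % 2] += v  # even indices in lane 0, odd indices in lane 1
    return lanes[0] + lanes[1]


def pick_kernel(cpu_features: Set[str]) -> Callable[[Sequence[float]], float]:
    """Choose a routine from the (hypothetical) detected feature set."""
    return sum_two_lanes if "sse2" in cpu_features else sum_sequential


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16, 1.0]
    print(pick_kernel(set())(data))     # scalar path
    print(pick_kernel({"sse2"})(data))  # two-lane path
```

On this input the scalar path returns 1.0 and the two-lane path returns 2.0, because floating-point addition is not associative and the two routines add the numbers in a different order. Differences of this kind are one way the same executable can behave differently on older and newer chips.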

The executable sent out with WaH is also old. It's not been recompiled for many years and that could also be introducing issues.
---
CPDN Visiting Scientist
ID: 70016
AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,890,678
RAC: 18,887
Message 70020 - Posted: 30 Oct 2023, 20:42:50 UTC - in response to Message 70016.  

Yes, I've noticed this apparent behaviour too. I think this is down to the way the code is compiled; the compiler is told to add in conditional code for different instruction sets depending what chip it finds. For instance, instructions for SSE2/3/4.* are included. When the code executes it will use different assembly instructions depending on what capabilities it finds on the chip. Older chips therefore will not necessarily be using the same assembler as more recent chips. That's great for speed but a bugger for debugging.

The executable sent out with WaH is also old. It's not been recompiled for many years and that could also be introducing issues.

The above sound like more likely explanations than some of the previous ones given, at least to me.

Have you or anyone else tried recompiling the executable and running CPDN via the Anonymous Platform setup (which allows you to use your own executables)? That could give some useful info in finding the problem. Anonymous Platform setup is described here: https://boinc.berkeley.edu/wiki/Anonymous_platform
ID: 70020
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,490,541
RAC: 15,784
Message 70021 - Posted: 31 Oct 2023, 8:46:15 UTC - in response to Message 70020.  
Last modified: 31 Oct 2023, 8:46:37 UTC

I'm working on the model standalone in Linux, then we'll rebuild in Windows and finally move it to the boinc platform. One step at a time.

Richard introduced me to the anonymous platform a while ago but CPDN do not support it.
ID: 70021
Drago75
Joined: 8 Jan 22
Posts: 9
Credit: 1,780,471
RAC: 3,152
Message 70023 - Posted: 31 Oct 2023, 10:11:53 UTC - in response to Message 70021.  

This may have been answered already in another thread, but my question fits this subject. While crunching, the WUs send intermediate progress checkpoints back to your server; I believe they are referred to as "trickles", for which we are awarded credit. If a WU fails after such a trickle has been sent, is that WU sent out to another volunteer from that point or from scratch?

Tom
ID: 70023
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,490,541
RAC: 15,784
Message 70024 - Posted: 31 Oct 2023, 10:26:37 UTC - in response to Message 70023.  

This may have been answered already in another thread, but my question fits this subject. While crunching, the WUs send intermediate progress checkpoints back to your server; I believe they are referred to as "trickles", for which we are awarded credit. If a WU fails after such a trickle has been sent, is that WU sent out to another volunteer from that point or from scratch?
It starts from scratch, the beginning of the model run. You get credit for computing work done up to the last trickle, which won't be far from the point the model fails.

It's better to do it that way rather than try to send out the latest checkpoint/restart files to the next user, as it keeps the results consistent. There's no guarantee one machine will produce the same results as the next.
ID: 70024
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 70025 - Posted: 31 Oct 2023, 11:16:40 UTC - in response to Message 70024.  

There's no guarantee one machine will produce the same results as the next.
I know this is true for the Hadley models. Is it also true for OIFS?

I should add that as expected, under WINE all the output files came out as identical between a Wine installation of BOINC and one running under Windows in a VM. I have now deleted the cloned profile to avoid accidentally confusing the BOINC server software.
ID: 70025
©2024 cpdn.org