Thread 'incoherent progress numbers'

Friedrich S.
Joined: 22 Jan 05
Posts: 45
Credit: 4,608,831
RAC: 828
Message 71876 - Posted: 22 Nov 2024, 15:51:01 UTC

Hello,

I have a question regarding a WU I am running.
From the beginning, the numbers for "Progress", "Time elapsed" and "Estimated remaining time to completion" did not make sense.
It started with the progress being shown as approx. 1% per elapsed day of crunching. At the same time, the total time to completion was shown as about 25-30 days.
Now the numbers have become even crazier. Right now they are:
Progress: 1.997%
Time elapsed: 11d 13:02:00 (but processor time is only 8:41, even though the computer runs several hours a day)
Estimated remaining time: 16d 16:24:25
Is there any way for me to find out whether this unit is working correctly?

Application: Weather At Home 2 (wah2) (region independent) 8.32
Name: wah2_eas25_g2lv_201712_24_1023_012320594
State: Active
Received: 24 Oct 2024, 13:42:08
Deadline: 2 Jan 2025, 12:42:07
Estimated computation size: 3,801,388 GFLOPs
CPU time: 08:41:35
CPU time since last checkpoint: 00:18:05
Elapsed time: 11d 13:04:11
Estimated time remaining: 16d 16:23:17
Progress: 2.004%
Memory required: 306.58 MB
Working set size: 289.04 MB
Directory: slots/14
Process ID: 13364
Progress rate: 0.360% per hour
Executable: wah2_8.32_windows_intelx86.exe

Thank you very much!

Friedrich
I love CPDN!
--
ID: 71876
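As a rough cross-check of the figures in this post, the reported progress and progress rate can be turned into an implied time to completion. This is only a sketch: the function name is made up for illustration, it assumes a constant rate, and BOINC's own estimate uses more inputs than this. On these assumptions, neither the reported progress rate nor the "1% per day" observation matches a 25-30 day total.

```python
def hours_remaining(progress_pct: float, rate_pct_per_hour: float) -> float:
    """Hours left if the task kept advancing at a constant rate (an assumption)."""
    return (100.0 - progress_pct) / rate_pct_per_hour


# Figures quoted in the post: 2.004% done, progress rate 0.360% per hour.
remaining_days = hours_remaining(2.004, 0.360) / 24

# "Approx. 1% per elapsed day" would imply roughly this many days in total.
implied_total_days = 100.0 / 1.0

print(f"{remaining_days:.1f} days remaining at 0.360% per hour")
print(f"{implied_total_days:.0f} days in total at 1% per day")
```

The mismatch between these numbers and the displayed estimate is what suggests the task is repeating work rather than simply running slowly.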
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 71877 - Posted: 22 Nov 2024, 19:05:49 UTC - in response to Message 71876.  
Last modified: 22 Nov 2024, 19:08:41 UTC

If you look at the task page and trickle listing for that task, https://main.cpdn.org/result.php?resultid=22519941, you'll see that it was running along okay, with trickles every 1.5-3 days through Nov 12. It then restarted sometime between the 12th and 15th: on the 15th you got a first duplicate trickle, and then another duplicate trickle on the 17th. Since then there have been no trickles, and I assume it restarted again from the beginning, given that you say progress is at ~2% now.

I'm not sure what is going on. There are some odd repeat trickles in some of your previous tasks as well, even the successful runs, for example in this one: https://main.cpdn.org/result.php?resultid=22478248.

Perhaps someone else has an idea of what is going on there. You could suspend the task until someone else comes along with a better suggestion.
ID: 71877
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71878 - Posted: 22 Nov 2024, 21:55:28 UTC - in response to Message 71877.  

It may not be related but tomorrow I will look at the page for a task I aborted that wasn't making progress.
ID: 71878
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 71879 - Posted: 22 Nov 2024, 22:22:22 UTC - in response to Message 71876.  
Last modified: 22 Nov 2024, 22:24:12 UTC

Hi Friedrich,
I looked at the task page for your workunit and the log shows the model is constantly restarting. I suspect the task is frequently suspended and kicked out of memory. You can change this.

Open boincmgr and under the 'Options' menu, select 'Computing preferences'.
On the 'Computing' tab, near the bottom you will see 'Leave non-GPU tasks in memory while suspended'.
Make sure that is checked on (I think yours is off).
Then click Save.

That will fix it. We recommend having this on for all CPDN workunits.

When the task is suspended the process gets kicked out of memory. When it restarts it has to go back to the last set of checkpoint files and repeat some of the forecast. If that happens frequently, the task can spend a lot of time repeating forecast steps it's already done. That's why I think you see a lot of compute time for not much progress.

Writing checkpoint files is quite an expensive step for these models so we don't do it as often as other projects do.

You could also take a look at the other boincmgr computing options and reduce the amount of time the task is suspended.

Hope that helps.
---
CPDN Visiting Scientist
ID: 71879
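For anyone who prefers to check the setting outside boincmgr: to my understanding, the 'Leave non-GPU tasks in memory while suspended' option corresponds to the `leave_apps_in_memory` tag in BOINC's `global_prefs_override.xml`. The sketch below parses a made-up sample of such a file; the sample content, the function name, and the choice to treat a missing tag as "off" are all assumptions for illustration, and the real file's location depends on the installation.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; a real global_prefs_override.xml lives in the
# BOINC data directory and usually holds many more settings.
SAMPLE_PREFS = """
<global_preferences>
    <leave_apps_in_memory>0</leave_apps_in_memory>
    <max_ncpus_pct>50.0</max_ncpus_pct>
</global_preferences>
"""


def leaves_apps_in_memory(prefs_xml: str) -> bool:
    """Return True if the preference is present and set to a non-zero value."""
    root = ET.fromstring(prefs_xml)
    node = root.find("leave_apps_in_memory")
    if node is None or node.text is None:
        return False  # assumption: treat a missing tag as "off"
    return float(node.text.strip()) != 0


print("leave apps in memory:", leaves_apps_in_memory(SAMPLE_PREFS))
```

If this reports "off" for a real preferences file, suspended CPDN tasks are removed from memory and have to resume from their last checkpoint files, as described above.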
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 71880 - Posted: 23 Nov 2024, 1:52:57 UTC - in response to Message 71879.  

That might explain the repeated trickles in the task that finished with a status of success. But the one that is running now got up through 9 trickles, then went back to trickle #1 and #2, and now appears to be starting over again.
ID: 71880
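The pattern described here (trickle numbers climbing to 9 and then dropping back to 1 and 2) can be spotted mechanically. The sketch below is only an illustration: the function name is invented, and the sequence is a hypothetical list built from the description in this post, not data read from the task page.

```python
def find_restarts(trickle_numbers):
    """Return positions where a trickle number fails to increase.

    A repeat or a drop suggests the model went back to an earlier point.
    """
    return [
        i
        for i in range(1, len(trickle_numbers))
        if trickle_numbers[i] <= trickle_numbers[i - 1]
    ]


# Hypothetical sequence mirroring the post: 9 trickles, then #1 and #2 again.
sequence = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2]
restart_positions = find_restarts(sequence)
print("restart detected at positions:", restart_positions)
```

A drop all the way back to trickle #1 is a different signal from a single repeated trickle, which is the distinction drawn in the post above.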
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71881 - Posted: 23 Nov 2024, 8:05:32 UTC
Last modified: 23 Nov 2024, 8:20:01 UTC

This is my one: the first trickle repeated multiple times. I'm not quite sure why it might have been running out of memory, as I had previously been running 12 tasks in the VM, whereas this was happening with only 8, and the Windows VM has 24GB allocated to it. I shall do some more digging if it happens again.

Edit: BOINC is set to use 95% of available memory both when computer is in use and not in use.
ID: 71881
Friedrich S.
Joined: 22 Jan 05
Posts: 45
Credit: 4,608,831
RAC: 828
Message 71882 - Posted: 23 Nov 2024, 10:02:24 UTC - in response to Message 71880.  

Good morning,

thanks to all of you for your advice.
I had a problem earlier with CPDN WUs not being computed properly, where a lot of the issues you mention were the cause (too many WUs with not enough cores, ...).

But since then I have been using special prefs whenever I get CPDN WUs:
- CPDN has a workshare of 1000 (out of 1050 total), which means that CPDN is running continuously and only other projects are suspended.
- use at most 50% of the CPU
- leave non-GPU tasks in memory
So CPDN should not be dumped from memory at any time; I have never seen it shown as suspended.
The computer, when turned on, runs continuously for at least 3-4 hours, which should be long enough to save some time steps and pick up from there.

The only reason for memory problems I can see is that I only have 16GB of RAM and switch between computer in use (use max. 75% of memory) and not in use (use max. 90% of memory). Maybe something happens when switching between these two states...? (I now leave it at 75% all the time.)

What do you think?

Thank you very much!

Friedrich
I love CPDN!
--
ID: 71882
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71883 - Posted: 23 Nov 2024, 10:44:19 UTC
Last modified: 23 Nov 2024, 10:45:30 UTC

The only reason for memory problems I can see is that I only have 16GB of RAM

How many tasks are you running at once? In theory at least, even using 12 real cores, that should be enough RAM, assuming you are not running anything other than BOINC that uses a lot.
ID: 71883
Friedrich S.
Joined: 22 Jan 05
Posts: 45
Credit: 4,608,831
RAC: 828
Message 71884 - Posted: 23 Nov 2024, 12:34:22 UTC - in response to Message 71883.  

I have set the limit to 3 CPDN tasks, but currently only have a single one running.
I love CPDN!
--
ID: 71884
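A back-of-the-envelope check of the memory figures mentioned in the thread (16GB of RAM, limits of 75% and 90%, a task limit of 3, and roughly 307 MB shown for the running task). This is a sketch under stated assumptions: the function name is invented, the per-task figure is a single snapshot that may be below the model's peak usage, and other programs on the machine are ignored.

```python
def ram_budget_gb(total_gb: float, max_pct: float) -> float:
    """RAM BOINC may use under a 'use at most N% of memory' preference."""
    return total_gb * max_pct / 100


TOTAL_RAM_GB = 16.0
TASK_MEM_GB = 306.58 / 1024  # snapshot figure shown for the running wah2 task
TASK_LIMIT = 3               # the CPDN task limit mentioned in the thread

needed_gb = TASK_LIMIT * TASK_MEM_GB
for label, pct in (("in use", 75), ("not in use", 90)):
    budget = ram_budget_gb(TOTAL_RAM_GB, pct)
    print(f"{label}: budget {budget:.1f} GB, "
          f"{TASK_LIMIT} tasks need about {needed_gb:.2f} GB")
```

On these numbers the budget is far larger than the tasks need under either setting, which is consistent with the suspicion that RAM limits are not the cause.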
Bryn Mawr
Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 71885 - Posted: 23 Nov 2024, 12:38:18 UTC - in response to Message 71882.  

Restarting the machine every 3 to 4 hours is incredibly wasteful for CPDN, could you not Sleep or Hibernate instead?
ID: 71885
Friedrich S.
Joined: 22 Jan 05
Posts: 45
Credit: 4,608,831
RAC: 828
Message 71886 - Posted: 23 Nov 2024, 20:12:20 UTC - in response to Message 71885.  

I know that it would be better to have it running 24/7, but due to the energy cost...

But as it checkpoints approx. every 45 minutes, it should nevertheless progress much better than it did this time. And most of the older WU did. It is just the second WU that shows this different behavior.
But I am assuming that at every checkpoint it saves the work achieved. Or is that a wrong assumption? And if so, how often is the achieved work being saved?

Thanks!

Friedrich
ID: 71886
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71887 - Posted: 23 Nov 2024, 20:49:44 UTC - in response to Message 71886.  

The time between checkpoints can vary quite a bit between task types and processor speeds. My understanding is that it is based on progress rather than time. If these particular tasks are checkpointing every 45 minutes on your machine, then unless you check before shutting down, there will be occasions when you do so just before a checkpoint, and close to 45 minutes' worth of computation will need to be repeated. This will significantly increase your electricity bill per task completed. As Bryn suggests, sleep or hibernation would be better. Both use very little electricity. Indeed, on my current system, even if I don't do that and just stop all work, the box consumes so little power that the "intelligent" multi-socket I use, with the computer plugged into the master socket, does not draw enough to keep the other sockets live. Before I realised this, I thought I had a problem with my then brand-new computer, on which I had just installed the latest XUbuntu.
ID: 71887
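The cost of shutting down between checkpoints can be estimated roughly. The sketch below treats the 45-minute checkpoint spacing as fixed (a simplification, since the spacing is said to depend on progress rather than time) and assumes the shutdown moment is uniformly random within an interval, so the average loss is half an interval. The function name is made up for illustration.

```python
def expected_loss_fraction(checkpoint_min: float, session_min: float) -> float:
    """Expected share of a session's compute that must be repeated.

    Assumes fixed checkpoint spacing and a shutdown time that is uniformly
    random within an interval, so half an interval is lost on average.
    """
    return (checkpoint_min / 2) / session_min


for hours in (3, 4):
    frac = expected_loss_fraction(45, hours * 60)
    print(f"{hours} h session: about {frac:.1%} of compute repeated")
```

On these assumptions roughly a tenth of each 3-4 hour session is repeated work. That is wasteful, but it would not by itself be expected to send a task back to its first trickle.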
Friedrich S.
Joined: 22 Jan 05
Posts: 45
Credit: 4,608,831
RAC: 828
Message 71888 - Posted: 23 Nov 2024, 20:53:20 UTC - in response to Message 71886.  

I really don't understand that WU.
I have had my computer running for almost 2 days now, and it has started over, with a trickle on 11/15/24 and another today for timestep 11,819?
What causes a WU to completely restart?

Friedrich
ID: 71888
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 71889 - Posted: 23 Nov 2024, 21:10:52 UTC

What causes a WU to completely restart?
I wish I knew. I can't add more than what Glenn wrote. He understands the code and mechanisms far better than I am ever likely to. The task I linked to repeated the first trickle several times, and the computer was running continually during that time, with 24GB of RAM allocated to the VM running Windows 10 and nothing memory-intensive running on the host machine. BOINC and the tasks I was running were all that was running on the guest. I have still to go through the stderr on that task, but suspect it will need someone with greater knowledge than I possess to pick anything out.
ID: 71889
Ingleside
Joined: 5 Aug 04
Posts: 127
Credit: 24,490,630
RAC: 21,281
Message 71890 - Posted: 24 Nov 2024, 0:28:51 UTC - in response to Message 71888.  

In reply to Friedrich S.'s message of 23 Nov 2024:
What causes a WU to completely restart?

In my experience, restarting from the same (or an earlier) point means one of:
1: The model didn't run long enough to reach the next checkpoint.
2: The model crashed, but instead of erroring out it looks as if it is continuing to run while not doing any real work. Restarting will go back to the last checkpoint before the crash and the model will crash again, remaining permanently stuck until you finally notice that one of the models hasn't trickled for multiple days while all other models on the same computer have.
3: Either corrupt checkpoint file(s), or CPDN bugged out and didn't use the checkpoint file(s), instead restarting from the model start.
4: You're restoring the model from a backup and re-running part of it.

While 1 can happen with all models, it should definitely not give duplicate trickles for different numbers of time-steps.
2 happened, at least for me, with the v8.24 application, but again you should definitely not go back and repeat 2 trickles unless there's something really wrong with the application.
3 I've had with the v8.32 application, but at least for me it only restarted the same model from zero once. I can't remember if it was only one model or multiple models that restarted from zero.

4 is basically a self-inflicted error and not really something CPDN can do anything with.

In case 3 is due to a corrupt checkpoint file, using double checkpoint files, similar to how BOINC uses a double client_state.xml, should decrease error rates. Sure, keeping double checkpoint files will increase disk usage, but the total bytes written to disk should be the same.
ID: 71890
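The double checkpoint-file idea can be sketched generically as follows. This is not how the CPDN application handles its checkpoints; it is a minimal illustration of the general technique, with invented function names, an invented `.bak`/`.tmp` naming scheme, and a hypothetical validity marker. The previous checkpoint is kept by renaming rather than copying, so each save still writes the data only once.

```python
import os
import tempfile
from typing import Callable, Optional


def save_checkpoint(path: str, data: bytes) -> None:
    """Write a new checkpoint, keeping the previous one as path + '.bak'."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())             # push the bytes to disk before renaming
    if os.path.exists(path):
        os.replace(path, path + ".bak")  # previous checkpoint becomes the backup
    os.replace(tmp, path)                # rename the finished file into place


def load_checkpoint(path: str, is_valid: Callable[[bytes], bool]) -> Optional[bytes]:
    """Return the newest checkpoint that passes is_valid, or None."""
    for candidate in (path, path + ".bak"):
        try:
            with open(candidate, "rb") as f:
                data = f.read()
        except FileNotFoundError:
            continue
        if is_valid(data):
            return data
    return None  # nothing usable: the caller would have to start from zero


def looks_valid(data: bytes) -> bool:
    return data.startswith(b"CKPT")      # hypothetical validity marker


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "model.ckpt")
    save_checkpoint(ckpt, b"CKPT step 100")
    save_checkpoint(ckpt, b"CKPT step 200")
    with open(ckpt, "wb") as f:          # simulate a corrupted latest checkpoint
        f.write(b"garbage")
    recovered = load_checkpoint(ckpt, looks_valid)

print(recovered)
```

With a scheme like this, a corrupt latest checkpoint costs one checkpoint interval of work instead of a restart from zero, at the price of keeping two checkpoint files on disk.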
