climateprediction.net (CPDN) home page
Thread 'incoherent progress numbers'

Friedrich S.
Joined: 22 Jan 05
Posts: 41
Credit: 4,607,175
RAC: 889
Message 71876 - Posted: 22 Nov 2024, 15:51:01 UTC

Hello,

I have a question regarding a WU I am running.
From the beginning, the numbers for "Progress", "Time elapsed" and "Estimated remaining time to completion" did not make sense.
At first, progress advanced by approx. 1% per elapsed day of crunching, while the estimated total time to completion was about 25-30 days.
Now the numbers have become even stranger. Right now they are:
Progress: 1.997%
Time elapsed: 11d 13:02:00 (but processor time is only 8:41, even though the computer runs several hours a day)
Estimated remaining time: 16d 16:24:25
Is there any way for me to find out whether this unit is working correctly?

Application: Weather At Home 2 (wah2) (region independent) 8.32
Name: wah2_eas25_g2lv_201712_24_1023_012320594
Status: Active
Received: 24 Oct 2024, 13:42:08
Report deadline: 2 Jan 2025, 12:42:07
Estimated computation size: 3,801,388 GFLOPs
CPU time: 08:41:35
CPU time since last checkpoint: 00:18:05
Elapsed time: 11d 13:04:11
Estimated time remaining: 16d 16:23:17
Progress: 2.004%
Memory required: 306.58 MB
Working set size: 289.04 MB
Directory: slots/14
Process ID: 13364
Progress rate: 0.360% per hour
Executable: wah2_8.32_windows_intelx86.exe
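
As a rough consistency check, the figures listed above can be put side by side with a little arithmetic. The sketch below only restates the reported values; the function names are made up for illustration, and it makes no assumption about how the BOINC client itself derives its progress rate or remaining-time estimate.

```python
def to_hours(days=0, hours=0, minutes=0, seconds=0):
    """Convert a d/h/m/s duration to decimal hours."""
    return days * 24 + hours + minutes / 60 + seconds / 3600


def summarize(elapsed_h, cpu_h, progress_pct):
    """Relate reported progress to CPU time and to wall-clock time."""
    return {
        # Fraction of the elapsed time that was actually spent computing.
        "cpu_share_of_elapsed": cpu_h / elapsed_h,
        # Progress gained per hour of CPU time.
        "progress_pct_per_cpu_hour": progress_pct / cpu_h,
        # Progress gained per elapsed (wall-clock) day.
        "progress_pct_per_elapsed_day": progress_pct / (elapsed_h / 24),
    }


# Values copied from the task properties in this post.
stats = summarize(
    elapsed_h=to_hours(days=11, hours=13, minutes=4, seconds=11),
    cpu_h=to_hours(hours=8, minutes=41, seconds=35),
    progress_pct=2.004,
)

for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

With these inputs the reported CPU time is only about 3% of the reported elapsed time, which is the mismatch the question is about.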

Thank you very much!

Friedrich
I love CPDN!
--
ID: 71876

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 71877 - Posted: 22 Nov 2024, 19:05:49 UTC - in response to Message 71876.  
Last modified: 22 Nov 2024, 19:08:41 UTC

If you look at the task page and trickle listing for that task, https://main.cpdn.org/result.php?resultid=22519941, you'll see that it was running along okay, with trickles every 1.5-3 days through Nov 12. It then restarted sometime between the 12th and 15th: on the 15th you got a first duplicate trickle, and then another duplicate trickle on the 17th. Since then there have been no trickles, and I assume it restarted again from the beginning, given you say progress is at ~2% now.

I'm not sure what is going on. There are some odd repeat trickles in some of your previous tasks as well, even the successful runs, for example in this one: https://main.cpdn.org/result.php?resultid=22478248.

Perhaps someone else has an idea of what is going on there. You could suspend the task until someone else comes along with a better suggestion.
ID: 71877

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,997,390
RAC: 21,721
Message 71878 - Posted: 22 Nov 2024, 21:55:28 UTC - in response to Message 71877.  

It may not be related but tomorrow I will look at the page for a task I aborted that wasn't making progress.
ID: 71878

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 71879 - Posted: 22 Nov 2024, 22:22:22 UTC - in response to Message 71876.  
Last modified: 22 Nov 2024, 22:24:12 UTC

Hi Friedrich,
I looked at the task page for your workunit and the log shows the model is constantly restarting. I suspect the task is frequently suspended and kicked out of memory. You can change this.

Open boincmgr and under the 'Options' menu, select 'Computing preferences'.
On the 'Computing' tab, near the bottom you will see 'Leave non-GPU tasks in memory while suspended'.
Make sure that is checked on (I think yours is off).
Then click Save.

That will fix it. We recommend having this on for all CPDN workunits.
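
For anyone who prefers editing the preferences file to clicking through the manager, the same setting can be sketched in code. This assumes, and the thread does not state it, that the checkbox corresponds to a `leave_apps_in_memory` element inside the `global_preferences` root of the client's `global_prefs_override.xml`; the helper name is hypothetical, and it only transforms XML text rather than touching any file.

```python
import xml.etree.ElementTree as ET


def enable_leave_in_memory(xml_text: str) -> str:
    """Return override XML with leave_apps_in_memory set to 1.

    Assumption: the preference is stored as
    <leave_apps_in_memory> under a <global_preferences> root.
    """
    if xml_text.strip():
        root = ET.fromstring(xml_text)
    else:
        # No override text yet: start from an empty preferences root.
        root = ET.Element("global_preferences")

    node = root.find("leave_apps_in_memory")
    if node is None:
        node = ET.SubElement(root, "leave_apps_in_memory")
    node.text = "1"
    return ET.tostring(root, encoding="unicode")


before = (
    "<global_preferences>"
    "<leave_apps_in_memory>0</leave_apps_in_memory>"
    "</global_preferences>"
)
print(enable_leave_in_memory(before))
```

The client would still need to re-read its preferences (or be restarted) before a changed file takes effect, so the manager route described above remains the simpler one.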

When the task is suspended the process gets kicked out of memory. When it restarts it has to go back to the last set of checkpoint files and repeat some of the forecast. If that happens frequently, the task can spend a lot of time repeating forecast steps it's already done. That's why I think you see a lot of compute time for not much progress.
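
The cost of going back to the last checkpoint can be illustrated with a toy model. This is only a sketch under simplifying assumptions: work is counted in abstract "steps", checkpoints are written at fixed step intervals, and every suspension removes the process from memory after a fixed number of steps. None of these numbers come from the actual wah2 model.

```python
def steps_computed(total_steps, checkpoint_every, session_steps,
                   keep_in_memory, max_sessions=100_000):
    """Count the steps calculated (including repeats) to finish a run.

    Returns None if the run never finishes within max_sessions.
    """
    done = 0       # model steps completed so far
    computed = 0   # steps actually calculated, repeats included
    for _ in range(max_sessions):
        run = min(session_steps, total_steps - done)
        done += run
        computed += run
        if done >= total_steps:
            return computed
        if not keep_in_memory:
            # Process was evicted: fall back to the last checkpoint.
            done -= done % checkpoint_every
    return None


# Hypothetical run: 1000 steps, a checkpoint every 100 steps,
# and a suspension after every 150 steps of computing.
print(steps_computed(1000, 100, 150, keep_in_memory=True))
print(steps_computed(1000, 100, 150, keep_in_memory=False))
# Sessions shorter than the checkpoint interval never get anywhere.
print(steps_computed(1000, 100, 50, keep_in_memory=False))
```

In this toy setup the evicted run recomputes a third of every session, and when each session is shorter than the checkpoint interval the task rolls back to the start every time, which resembles a task whose progress keeps returning to a low percentage.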

Writing checkpoint files is quite an expensive step for these models so we don't do it as often as other projects do.

You could also take a look at the other boincmgr computing options and reduce the amount of time the task is suspended.

Hope that helps.
---
CPDN Visiting Scientist
ID: 71879

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 71880 - Posted: 23 Nov 2024, 1:52:57 UTC - in response to Message 71879.  


That might explain the repeated trickles in the task that finished with a status of success. But the one that is running now got up through 9 trickles, then went back to trickles #1 and #2, and now appears to be starting over again.
ID: 71880