Message boards : Number crunching : incoherent progress numbers
Author | Message |
---|---|
Send message Joined: 22 Jan 05 Posts: 41 Credit: 4,607,175 RAC: 889 |
Hello, I have a question regarding a WU I am running. From the beginning, the numbers for "Progress", "Time elapsed" and "Estimated remaining time to completion" did not make sense. It started with progress being shown as approx. 1% per elapsed day of crunching, while the total time to completion was about 25-30 days. Now the numbers have become even stranger. Right now they are: Progress: 1.997%; Time elapsed: 11d 13:02:00 (but processor time is only 8:41, even though the computer runs several hours a day); Estimated remaining time: 16d 16:24:25. Is there any way for me to find out whether this unit is working correctly? Task properties: Application: Weather At Home 2 (wah2) (region independent) 8.32; Name: wah2_eas25_g2lv_201712_24_1023_012320594; Status: Active; Received: 24 Oct 2024 13:42:08; Report deadline: 2 Jan 2025 12:42:07; Estimated computation size: 3,801,388 GFLOPs; CPU time: 08:41:35; CPU time since last checkpoint: 00:18:05; Elapsed time: 11d 13:04:11; Estimated time remaining: 16d 16:23:17; Progress: 2.004%; Virtual memory size: 306.58 MB; Working set size: 289.04 MB; Directory: slots/14; Process ID: 13364; Progress rate: 0.360% per hour; Executable: wah2_8.32_windows_intelx86.exe. Thank you very much! Friedrich I love CPDN! -- |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
If you look at the task page and trickle listing for that task (https://main.cpdn.org/result.php?resultid=22519941), you'll see that it was running along okay, with trickles every 1.5-3 days through Nov 12. It then restarted sometime between the 12th and 15th, because on the 15th you got a first duplicate trickle, and then another duplicate trickle on the 17th. Since then there have been no trickles, and I assume it restarted again from the beginning, given that you say progress is at ~2% now. I'm not sure what is going on. There are some odd repeat trickles in some of your previous tasks as well, even the successful runs, for example in this one: https://main.cpdn.org/result.php?resultid=22478248. Perhaps someone else has an idea of what is going on there. You could suspend the task until someone else comes along with a better suggestion. |
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,997,390 RAC: 21,721 |
It may not be related but tomorrow I will look at the page for a task I aborted that wasn't making progress. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Hi Friedrich, I looked at the task page for your workunit and the log shows the model is constantly restarting. I suspect the task is frequently suspended and kicked out of memory. You can change this. Open boincmgr and under the 'Options' menu, select 'Computing preferences'. On the 'Computing' tab, near the bottom, you will see 'Leave non-GPU tasks in memory while suspended'. Make sure that is checked (I think yours is off), then click Save. That will fix it. We recommend having this on for all CPDN workunits. With the option off, the process gets kicked out of memory whenever the task is suspended. When it restarts, it has to go back to the last set of checkpoint files and repeat some of the forecast. If that happens frequently, the task can spend a lot of time repeating forecast steps it has already done. That's why I think you see a lot of compute time for not much progress. Writing checkpoint files is quite an expensive step for these models, so we don't do it as often as other projects do. You could also take a look at the other boincmgr computing options and reduce the amount of time the task is suspended. Hope that helps. --- CPDN Visiting Scientist |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
In reply to Glenn Carver's message of 22 Nov 2024: Hi Friedrich, That might explain the repeated trickles in the task that finished with a status of success. But the one that is running now got up through 9 trickles, then went back to trickles #1 and #2, and now appears to be starting over again. |
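
To illustrate the checkpoint explanation given earlier in the thread, here is a minimal Python sketch of a task that is evicted from memory at the end of every computing session and has to resume from its last checkpoint. It is only a toy model, not the wah2 code or anything from BOINC: the function name `simulate`, its parameters, and all the numbers (a 600-minute task, a checkpoint every 60 minutes of model work, sessions of 90 or 45 minutes) are assumptions chosen for illustration. The thread does not state the real checkpoint interval.

```python
from typing import Optional, Tuple


def simulate(total_min: int, checkpoint_min: int, session_min: int,
             keep_in_memory: bool = False) -> Optional[Tuple[int, int]]:
    """Toy model of a task that is suspended after every session.

    All values are hypothetical minutes of model work / CPU time.
    Returns (cpu_minutes_spent, sessions_needed), or None if the task
    can never finish because no session reaches a new checkpoint.
    """
    saved = 0    # work that survives a suspension
    cpu = 0      # CPU minutes spent so far, including repeated work
    session = 0
    while True:
        session += 1
        remaining = total_min - saved
        if session_min >= remaining:
            # The task finishes inside this session.
            return cpu + remaining, session
        cpu += session_min
        reached = saved + session_min
        if keep_in_memory:
            # Suspended process stays in memory: nothing is lost.
            new_saved = reached
        else:
            # Process is evicted: only whole checkpoint intervals survive.
            new_saved = (reached // checkpoint_min) * checkpoint_min
        if new_saved == saved:
            # Every session ends before the next checkpoint is written.
            return None
        saved = new_saved


if __name__ == "__main__":
    print(simulate(600, 60, 90))                       # (870, 10)
    print(simulate(600, 60, 90, keep_in_memory=True))  # (600, 7)
    print(simulate(600, 60, 45))                       # None
```

In this toy model, 90-minute sessions with eviction cost 870 CPU minutes for 600 minutes of useful work, and sessions shorter than the checkpoint interval never get anywhere, which is one way a task could accumulate CPU time while its progress stays near the start. It does not explain the case raised in the last reply, where a task appears to go back to the very beginning.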
©2024 cpdn.org