Message boards : Number crunching : incoherent progress numbers
Author | Message |
---|---|
Send message Joined: 22 Jan 05 Posts: 42 Credit: 4,607,175 RAC: 889 |
Hello, I have a question regarding a WU I am running. From the beginning, the numbers for "Progress", "Time elapsed" and "Estimated remaining time to completion" did not make sense. It started with the progress being shown as approx. 1% per elapsed day of crunching, while the total time to completion was about 25-30 days. Now the numbers have become even stranger. Right now it is: Progress: 1.997%; Time elapsed: 11d 13:02:00 (but processor time is only 8:41, even though the computer runs several hours a day); Estimated remaining time: 16d 16:24:25. Is there any way for me to find out whether this unit is working correctly? Task properties (translated from the German BOINC Manager): Application: Weather At Home 2 (wah2) (region independent) 8.32; Name: wah2_eas25_g2lv_201712_24_1023_012320594; State: Running; Received: 24.10.2024 13:42:08; Report deadline: 02.01.2025 12:42:07; Estimated computation size: 3,801,388 GFLOPs; CPU time: 08:41:35; CPU time since last checkpoint: 00:18:05; Elapsed time: 11d 13:04:11; Estimated time remaining: 16d 16:23:17; Fraction done: 2.004%; Virtual memory size: 306.58 MB; Working set size: 289.04 MB; Directory: slots/14; Process ID: 13364; Progress rate: 0.360% per hour; Executable: wah2_8.32_windows_intelx86.exe. Thank you very much! Friedrich I love CPDN! -- |
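For readers who want to check the mismatch themselves, here is a minimal Python sketch using only the figures quoted in the post above. It does not reproduce how BOINC derives its own estimate (that is not stated in this thread); it only shows that the progress rate implied by CPU time, the rate implied by elapsed time, and the reported progress rate do not agree with each other.

```python
# Figures copied from the task properties in the post above.
def to_hours(days, hh, mm, ss):
    """Convert a d hh:mm:ss duration to hours."""
    return days * 24 + hh + mm / 60 + ss / 3600

progress_pct = 2.004                      # Fraction done
cpu_hours = to_hours(0, 8, 41, 35)        # CPU time 08:41:35
elapsed_hours = to_hours(11, 13, 4, 11)   # Elapsed time 11d 13:04:11
reported_rate = 0.360                     # Progress rate, % per hour

# Progress per hour implied by each clock.
rate_per_cpu_hour = progress_pct / cpu_hours
rate_per_elapsed_hour = progress_pct / elapsed_hours

# Remaining time if the reported rate were sustained from here on.
remaining_days_at_reported_rate = (100 - progress_pct) / reported_rate / 24

print(f"implied by CPU time:     {rate_per_cpu_hour:.4f} %/h")
print(f"implied by elapsed time: {rate_per_elapsed_hour:.4f} %/h")
print(f"reported:                {reported_rate:.4f} %/h")
print(f"remaining at reported rate: {remaining_days_at_reported_rate:.1f} days")
```

The three rates differ by well over an order of magnitude between the extremes, which is consistent with a task whose progress has been reset rather than one that is simply slow.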
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
If you look at the task page and trickle listing for that task https://main.cpdn.org/result.php?resultid=22519941, you'll see that it was running along okay, with trickles every 1.5-3 days through Nov 12, and then restarted sometime between the 12th and 15th because on the 15th, you got a first duplicate trickle, and then another duplicate trickle on the 17th. Since then, there have been no trickles and I assume it restarted again at the beginning given you say progress is at ~2% now. I'm not sure what is going on. There are some odd repeat trickles in some of your previous tasks as well, even the successful runs, for example in this one: https://main.cpdn.org/result.php?resultid=22478248. Perhaps someone else has an idea of what is going on there. You could suspend the task until someone else comes along with a better suggestion. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,002,360 RAC: 21,497 |
It may not be related but tomorrow I will look at the page for a task I aborted that wasn't making progress. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Hi Friedrich, I looked at the task page for your workunit and the log shows the model is constantly restarting. I suspect the task is frequently suspended and kicked out of memory. You can change this. Open boincmgr and under the 'Options' menu, select 'Computing preferences'. On the 'Computing' tab, near the bottom you will see 'Leave non-GPU tasks in memory while suspended'. Make sure that is checked on (I think yours is off). Then click Save. That will fix it. We recommend having this on for all CPDN workunits. When the task is suspended the process gets kicked out of memory. When it restarts it has to go back to the last set of checkpoint files and repeat some of the forecast. If that happens frequently, the task can spend a lot of time repeating forecast steps it's already done. That's why I think you see a lot of compute time for not much progress. Writing checkpoint files is quite an expensive step for these models so we don't do it as often as other projects do. You could also take a look at the other boincmgr computing options and reduce the amount of time the task is suspended. Hope that helps. --- CPDN Visiting Scientist |
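The effect described above can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not how the wah2 model actually schedules checkpoints: it assumes a checkpoint is written at fixed intervals of model work, and that whenever the process is removed from memory all work since the last checkpoint is lost. The function name, the interval and the session lengths are hypothetical.

```python
def simulate(session_hours, checkpoint_interval, sessions, process_survives):
    """Return (cpu_hours_spent, hours_of_model_work_retained).

    Assumption: a checkpoint exists at every multiple of
    `checkpoint_interval` hours of model work. If the process does not
    survive between sessions, progress rolls back to the last checkpoint.
    """
    cpu = 0.0
    progress = 0.0
    for _ in range(sessions):
        cpu += session_hours
        progress += session_hours
        if not process_survives:
            # Work done since the last checkpoint is lost and must be repeated.
            progress = (progress // checkpoint_interval) * checkpoint_interval
    return cpu, progress

# Hypothetical numbers: sessions shorter than the checkpoint interval.
print(simulate(1.5, 2.0, 10, process_survives=False))  # (15.0, 0.0)
print(simulate(1.5, 2.0, 10, process_survives=True))   # (15.0, 15.0)
# Sessions longer than the interval still lose part of each session.
print(simulate(3.0, 2.0, 10, process_survives=False))  # (30.0, 20.0)
```

In the first case every session ends before a checkpoint is reached, so CPU time accumulates while retained progress stays at zero. The same rollback applies when the computer is shut down, since keeping a task in memory cannot survive a power-off.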
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
In reply to Glenn Carver's message of 22 Nov 2024: Hi Friedrich, That might explain the repeated trickles in the task that finished with a status of success. But the one that is running now got up through 9 trickles, then went back to trickles #1 and #2 after that, and now appears to be starting over again. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,002,360 RAC: 21,497 |
This is my one: the first trickle repeated multiple times. I'm not quite sure why it might have been running out of memory, as I had previously been running 12 tasks in the VM, this was happening with only 8, and the Windows VM has 24GB allocated to it. I shall do some more digging if it happens again. Edit: BOINC is set to use 95% of available memory both when the computer is in use and when it is not. |
Send message Joined: 22 Jan 05 Posts: 42 Credit: 4,607,175 RAC: 889 |
Good morning, thanks to all of you for your advice. I had a problem earlier with CPDN WUs not being computed properly, where many of the issues you suggested were the cause (too many WUs for too few cores, ...). Since then I have been using special prefs whenever I get CPDN WUs: - CPDN has a resource share of 1000 (out of 1050 total), which means that CPDN runs continuously and only other projects are suspended. - use at most 50% of the CPUs - leave non-GPU tasks in memory. So CPDN should not be dumped from memory at any time; I have never seen it shown as suspended. The computer, when turned on, runs continuously for at least 3-4 hours, which should be long enough to save some time steps and pick up from there. The only reason for memory problems I can see is that I only have 16GB of RAM and switch between computer in use (use max. 75% of memory) and not in use (use max. 90% of memory). Maybe something happens when switching between these two states...? (I now leave it at 75% all the time.) What do you think? Thank you very much! Friedrich I love CPDN! -- |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,002,360 RAC: 21,497 |
"The only reason for memory problems I can see is that I only have 16GB of RAM" How many tasks are you running at once? In theory at least, even using 12 real cores, that should be enough RAM, assuming you are not running anything other than BOINC that uses a lot. |
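As a rough back-of-the-envelope check of the RAM question, the Python sketch below uses the 16GB and the 75%/90% memory limits quoted in this thread, together with the larger of the two per-task memory figures from the task properties in the first post (306.58 MB). It assumes every task needs about that much and ignores memory used by the operating system, other programs and other BOINC projects, so it gives an upper bound rather than a safe operating number. The helper name is hypothetical.

```python
import math

def max_tasks(total_ram_gb, ram_pct, per_task_mb):
    """Upper bound on tasks that fit in the share of RAM BOINC may use."""
    budget_mb = total_ram_gb * 1024 * ram_pct / 100
    return math.floor(budget_mb / per_task_mb)

PER_TASK_MB = 306.58  # per-task figure quoted in the first post

in_use = max_tasks(16, 75, PER_TASK_MB)       # limit while computer is in use
not_in_use = max_tasks(16, 90, PER_TASK_MB)   # limit while computer is idle
twelve_tasks_mb = 12 * PER_TASK_MB            # the 12-task case mentioned above

print(in_use, not_in_use, round(twelve_tasks_mb, 2))
```

On these figures, 12 tasks would need roughly 3.6GB, well inside either limit, which matches the expectation that RAM alone should not be the constraint here.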