Thread 'incoherent progress numbers'

Friedrich S. Send message Joined: 22 Jan 05 Posts: 45 Credit: 4,608,831 RAC: 828	Message 71876 - Posted: 22 Nov 2024, 15:51:01 UTC
Hello, I have a question regarding a WU I am running. From the beginning, the numbers for "Progress", "Time elapsed" and "Estimated remaining time to completion" did not make sense. It started with the progress being shown as approx. 1% per elapsed day of crunching, while the total time to completion was about 25-30 days. Now the numbers have become even stranger. Right now they are:
Progress: 1.997%
Time elapsed: 11d 13:02:00 (but processor time is only 8:41, even though the computer runs several hours a day)
Estimated remaining time: 16d 16:24:25
Is there any way for me to find out whether this unit is working correctly?
Task properties (translated from the German BOINC Manager):
Application: Weather At Home 2 (wah2) (region independent) 8.32
Name: wah2_eas25_g2lv_201712_24_1023_012320594
State: Running
Received: 24 Oct 2024 13:42:08
Report deadline: 02 Jan 2025 12:42:07
Estimated computation size: 3,801,388 GFLOPs
CPU time: 08:41:35
CPU time since checkpoint: 00:18:05
Elapsed time: 11d 13:04:11
Estimated time remaining: 16d 16:23:17
Fraction done: 2.004%
Virtual memory size: 306.58 MB
Working set size: 289.04 MB
Directory: slots/14
Process ID: 13364
Progress rate: 0.360% per hour
Executable: wah2_8.32_windows_intelx86.exe
Thank you very much! Friedrich I love CPDN! -- ID: 71876 · Reply Quote
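A quick way to see how far apart the reported figures are is to turn each one into a progress rate. The sketch below only uses the numbers quoted in the post above; the variable names are illustrative, and it makes no assumption about how the BOINC client itself derives its "progress rate" or remaining-time estimate.

```python
from datetime import timedelta

# Figures copied from the task properties quoted in the post above.
progress_pct = 2.004
cpu_time = timedelta(hours=8, minutes=41, seconds=35)
elapsed = timedelta(days=11, hours=13, minutes=4, seconds=11)
remaining = timedelta(days=16, hours=16, minutes=23, seconds=17)
reported_rate = 0.360  # percent per hour, as shown by the client


def hours(delta: timedelta) -> float:
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600.0


# Progress per hour according to three different clocks.
rate_per_cpu_hour = progress_pct / hours(cpu_time)
rate_per_elapsed_hour = progress_pct / hours(elapsed)
rate_implied_by_remaining = (100.0 - progress_pct) / hours(remaining)

# How much longer the wall clock has run than the CPU clock.
elapsed_to_cpu_ratio = hours(elapsed) / hours(cpu_time)

print(f"reported by client:       {reported_rate:.3f} %/h")
print(f"progress / CPU time:      {rate_per_cpu_hour:.3f} %/h")
print(f"progress / elapsed time:  {rate_per_elapsed_hour:.4f} %/h")
print(f"implied by remaining est: {rate_implied_by_remaining:.3f} %/h")
print(f"elapsed / CPU time:       {elapsed_to_cpu_ratio:.1f}x")
```

Measured against CPU time the task advances at roughly 0.23% per hour, which is close to what the remaining-time estimate implies, while measured against elapsed time it is below 0.01% per hour. The elapsed-time counter is therefore the figure that is out of line with the rest.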

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 71877 - Posted: 22 Nov 2024, 19:05:49 UTC - in response to Message 71876. Last modified: 22 Nov 2024, 19:08:41 UTC If you look at the task page and trickle listing for that task https://main.cpdn.org/result.php?resultid=22519941, you'll see that it was running along okay, with trickles every 1.5-3 days through Nov 12, and then restarted sometime between the 12th and 15th because on the 15th, you got a first duplicate trickle, and then another duplicate trickle on the 17th. Since then, there have been no trickles and I assume it restarted again at the beginning given you say progress is at ~2% now. I'm not sure what is going on. There are some odd repeat trickles in some of your previous tasks as well, even the successful runs, for example in this one: https://main.cpdn.org/result.php?resultid=22478248. Perhaps someone else has an idea of what is going on there. You could suspend the task until someone else comes along with a better suggestion. ID: 71877 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 71878 - Posted: 22 Nov 2024, 21:55:28 UTC - in response to Message 71877. It may not be related but tomorrow I will look at the page for a task I aborted that wasn't making progress. ID: 71878 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681	Message 71879 - Posted: 22 Nov 2024, 22:22:22 UTC - in response to Message 71876. Last modified: 22 Nov 2024, 22:24:12 UTC Hi Friedrich, I looked at the task page for your workunit and the log shows the model is constantly restarting. I suspect the task is frequently suspended and kicked out of memory. You can change this. Open boincmgr and under the 'Options' menu, select 'Computing preferences'. On the 'Computing' tab, near the bottom you will see 'Leave non-GPU tasks in memory while suspended'. Make sure that is checked on (I think yours is off). Then click Save. That will fix it. We recommend having this on for all CPDN workunits. When the task is suspended the process gets kicked out of memory. When it restarts it has to go back to the last set of checkpoint files and repeat some of the forecast. If that happens frequently, the task can spend a lot of time repeating forecast steps it's already done. That's why I think you see a lot of compute time for not much progress. Writing checkpoint files is quite an expensive step for these models so we don't do it as often as other projects do. You could also take a look at the other boincmgr computing options and reduce the amount of time the task is suspended. Hope that helps. --- CPDN Visiting Scientist ID: 71879 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 71880 - Posted: 23 Nov 2024, 1:52:57 UTC - in response to Message 71879. In reply to Glenn Carver's message of 22 Nov 2024: Hi Friedrich, I looked at the task page for your workunit and the log shows the model is constantly restarting. I suspect the task is frequently suspended and kicked out of memory. You can change this. Open boincmgr and under the 'Options' menu, select 'Computing preferences'. On the 'Computing' tab, near the bottom you will see 'Leave non-GPU tasks in memory while suspended'. Make sure that is checked on (I think yours is off). Then click Save. That will fix it. We recommend having this on for all CPDN workunits. When the task is suspended the process gets kicked out of memory. When it restarts it has to go back to the last set of checkpoint files and repeat some of the forecast. If that happens frequently, the task can spend a lot of time repeating forecast steps it's already done. That's why I think you see a lot of compute time for not much progress. Writing checkpoint files is quite an expensive step for these models so we don't do it as often as other projects do. You could also take a look at the other boincmgr computing options and reduce the amount of time the task is suspended. Hope that helps. That might explain the repeated trickles in the task that finished with a status of success. But the one that is running now, got up through 9 trickles, then went back to trickle #1 and #2 after that, and now appears to be starting over again. ID: 71880 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 71881 - Posted: 23 Nov 2024, 8:05:32 UTC Last modified: 23 Nov 2024, 8:20:01 UTC This is my one: the first trickle repeated multiple times. I'm not sure why it might have been running out of memory, as I had previously been running 12 tasks in the VM, this was happening with only 8, and the Windows VM has 24GB allocated to it. I shall do some more digging if it happens again. Edit: BOINC is set to use 95% of available memory both when the computer is in use and when it is not. ID: 71881 · Reply Quote

Friedrich S. Send message Joined: 22 Jan 05 Posts: 45 Credit: 4,608,831 RAC: 828	Message 71882 - Posted: 23 Nov 2024, 10:02:24 UTC - in response to Message 71880. Good morning, thanks to all of you for your advice. I had a problem earlier with CPDN WUs not being computed properly, where many of the issues you suggested were the cause (too many WUs with not enough cores, ...). Since then I have been using special prefs whenever I get CPDN WUs: - CPDN has a workshare of 1000 (out of 1050 total), which means that CPDN runs continuously and only other projects are suspended. - Use at most 50% of the CPU. - Leave non-GPU tasks in memory. So CPDN should not be dumped from memory at any time; I have never seen it shown as suspended. The computer, when turned on, runs continuously for at least 3-4 hours, which should be long enough to save some time steps and pick up from there. The only reason for memory problems I can see is that I only have 16GB of RAM and switch between computer in use (use max. 75% of memory) and not in use (use max. 90% of memory). Maybe something happens when switching between these two states...? (I now leave it at 75% all the time.) What do you think? Thank you very much! Friedrich I love CPDN! -- ID: 71882 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 71883 - Posted: 23 Nov 2024, 10:44:19 UTC Last modified: 23 Nov 2024, 10:45:30 UTC "The only reason for memory problems I can see is that I only have 16GB of RAM" How many tasks are you running at once? In theory at least, even using 12 real cores, that should be enough RAM, assuming you are not running anything other than BOINC that uses a lot. ID: 71883 · Reply Quote

Friedrich S. Send message Joined: 22 Jan 05 Posts: 45 Credit: 4,608,831 RAC: 828	Message 71884 - Posted: 23 Nov 2024, 12:34:22 UTC - in response to Message 71883. I have set the limit to 3 CPDN tasks, but currently have only a single one running. I love CPDN! -- ID: 71884 · Reply Quote

Bryn Mawr Send message Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228	Message 71885 - Posted: 23 Nov 2024, 12:38:18 UTC - in response to Message 71882. Restarting the machine every 3 to 4 hours is incredibly wasteful for CPDN, could you not Sleep or Hibernate instead? ID: 71885 · Reply Quote

Friedrich S. Send message Joined: 22 Jan 05 Posts: 45 Credit: 4,608,831 RAC: 828	Message 71886 - Posted: 23 Nov 2024, 20:12:20 UTC - in response to Message 71885. I know that it would be better to have it running 24/7, but due to the energy cost... But as it checkpoints approx. every 45 minutes, it should nevertheless progress much better than it did this time. And most of the older WU did. It is just the second WU that shows this different behavior. But I am assuming that at every checkpoint it saves the work achieved. Or is that a wrong assumption? And if so, how often is the achieved work being saved? Thanks! Friedrich ID: 71886 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 71887 - Posted: 23 Nov 2024, 20:49:44 UTC - in response to Message 71886. In reply to Friedrich S.'s message of 23 Nov 2024: "I know that it would be better to have it running 24/7, but due to the energy cost... But as it checkpoints approx. every 45 minutes, it should nevertheless progress much better than it did this time. And most of the older WU did. It is just the second WU that shows this different behavior. But I am assuming that at every checkpoint it saves the work achieved. Or is that a wrong assumption? And if so, how often is the achieved work being saved? Thanks! Friedrich" The time between checkpoints can vary quite a bit between task types and processor speeds. My understanding is that it is based on progress rather than elapsed time. If these particular tasks checkpoint every 45 minutes on your machine, then unless you check before shutting down, you will sometimes shut down just before a checkpoint, and close to 45 minutes' worth of computation will have to be repeated. This will significantly increase your electricity cost per completed task. As Bryn suggests, sleep or hibernation would be better. Both use very little electricity. Indeed, even if I just stop all work without sleeping or hibernating, my current box draws so little power that the "intelligent" multi-socket strip I use, with the computer plugged into the master socket, does not see enough load to keep the other sockets live. Before I realised this, I thought I had a problem with my then brand-new computer, on which I had just installed the latest Xubuntu. ID: 71887 · Reply Quote
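The cost of shutting down between checkpoints can be estimated with a little arithmetic. The sketch below is a rough model and not anything CPDN or BOINC computes: it assumes the shutdown moment falls at a random point between two checkpoints, so on average half a checkpoint interval is lost per session, and it treats a session shorter than one interval as entirely lost. The function name `expected_overhead` is made up for illustration; the 45-minute interval and the 3-4 hour sessions are the figures mentioned in this thread.

```python
def expected_overhead(session_minutes: float, checkpoint_minutes: float) -> float:
    """Fraction of one session's compute expected to be repeated afterwards.

    Rough model: the shutdown happens at a uniformly random point between
    two checkpoints, so the expected loss is half a checkpoint interval.
    A session that is no longer than one interval is counted as fully lost.
    """
    if session_minutes <= 0 or checkpoint_minutes <= 0:
        raise ValueError("durations must be positive")
    if session_minutes <= checkpoint_minutes:
        return 1.0
    expected_loss = checkpoint_minutes / 2.0
    return expected_loss / session_minutes


# Sessions of 3, 3.5 and 4 hours with a checkpoint roughly every 45 minutes.
for session in (180, 210, 240):
    share = expected_overhead(session, 45)
    print(f"{session} min session: about {share:.1%} of the work is repeated")
```

Under these assumptions the repeated work is on the order of a tenth of each session. That is a real cost in electricity per completed task, but an overhead of this size would not by itself explain a task that goes back to its first trickles.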

Friedrich S. Send message Joined: 22 Jan 05 Posts: 45 Credit: 4,608,831 RAC: 828	Message 71888 - Posted: 23 Nov 2024, 20:53:20 UTC - in response to Message 71886. I really don't understand that WU. I had my computer running for almost 2 days now and it has started over with a trickle on 11/15/24 and today for timestep 11,819? What causes a WU to completely restart? Friedrich ID: 71888 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 71889 - Posted: 23 Nov 2024, 21:10:52 UTC "What causes a WU to completely restart?" I wish I knew. I can't add more than what Glenn wrote. He understands the code and mechanisms far better than I am ever likely to. The task I linked to repeated the first trickle several times, and the computer was running continually during that time, with 24GB of RAM allocated to the VM running Windows 10 and nothing memory-intensive running on the host machine. BOINC and its tasks were all that was running on the guest. I have still to go through the stderr on that task, but I suspect it will need someone with greater knowledge than I possess to pick anything out. ID: 71889 · Reply Quote

Ingleside Send message Joined: 5 Aug 04 Posts: 127 Credit: 24,487,317 RAC: 21,130	Message 71890 - Posted: 24 Nov 2024, 0:28:51 UTC - in response to Message 71888. In reply to Friedrich S.'s message of 23 Nov 2024: "What causes a WU to completely restart?"
In my experience, restarting from the same (or an earlier) point means one of:
1: The model didn't run long enough to reach the next checkpoint.
2: The model crashed, but instead of erroring out it looks as if the model is still running while it isn't doing any real work. Restarting goes back to the last checkpoint before the crash and the model crashes again, remaining permanently stuck until you finally notice that one of the models hasn't trickled for several days while all the other models on the same computer have.
3: Either corrupt checkpoint file(s), or CPDN bugged out and didn't use the checkpoint file(s) and instead restarted from the model start.
4: You're restoring the model from a backup and re-running part of it.
While 1 can happen with all models, it should definitely not give duplicate trickles for different numbers of time-steps. 2, at least for me, happened with the v8.24 application, but again you should definitely not go back and repeat 2 trickles unless there's something really wrong with the application. 3 I've had with the v8.32 application, but at least for me it only restarted the same model from zero once. I can't remember whether it was only one model or several that restarted from zero. 4 is basically a self-inflicted error and not really something CPDN can do anything about.
If 3 is due to a corrupt checkpoint file, using double checkpoint files, similar to how BOINC uses double client_state.xml, should decrease error rates. Sure, keeping double checkpoint files will increase disk usage, but the total bytes written to disk should be the same. ID: 71890 · Reply Quote
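The double-checkpoint idea suggested above can be sketched in a few lines. This is only an illustration of the general technique and not how the wah2 application or BOINC actually store their state: the file names, the JSON format, the SHA-256 check and the functions `save_checkpoint` and `load_checkpoint` are all assumptions made for the example. The previous checkpoint is renamed rather than copied, so each save writes the new data only once.

```python
import hashlib
import json
import os
from pathlib import Path
from typing import Optional

# Hypothetical file names, chosen for this example only.
CURRENT = "checkpoint.json"
BACKUP = "checkpoint.bak.json"
TEMP = "checkpoint.tmp.json"


def _digest(state: dict) -> str:
    """Checksum of the state in a canonical JSON form."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def save_checkpoint(directory: Path, state: dict) -> None:
    """Write a new checkpoint and keep the previous one as a backup."""
    record = json.dumps({"sha256": _digest(state), "state": state}, sort_keys=True)
    temp = directory / TEMP
    with open(temp, "w", encoding="utf-8") as handle:
        handle.write(record)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data is on disk first
    current = directory / CURRENT
    if current.exists():
        os.replace(current, directory / BACKUP)  # old checkpoint becomes backup
    os.replace(temp, current)  # new checkpoint becomes current


def load_checkpoint(directory: Path) -> Optional[dict]:
    """Return the newest checkpoint that passes its checksum, else None."""
    for name in (CURRENT, BACKUP):
        try:
            record = json.loads((directory / name).read_text(encoding="utf-8"))
            state = record["state"]
            if record["sha256"] == _digest(state):
                return state
        except (OSError, ValueError, KeyError, TypeError):
            continue  # missing or corrupt file: try the next candidate
    return None
```

With this layout a damaged current file costs one checkpoint interval of repeated work instead of a restart from zero, because `load_checkpoint` falls back to the backup. A return value of `None` means no usable checkpoint exists and the caller has to start from the beginning.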