Thread 'Tasks stuck for days'

Author	Message
Ryan Munro Joined: 9 Nov 20 Posts: 6 Credit: 6,943,088 RAC: 2,922	Message 71513 - Posted: 20 Sep 2024, 10:46:48 UTC
I have a few tasks stuck at around 86% with a reported elapsed time of 14d. They have actually run for longer than that, as the display has shown 14d for a while now. The remaining time shows nothing. Should I leave them or abort them?
Properties from the example unit:
Application: Weather At Home 2 (wah2) (region independent) 8.29
Name: wah2_eas25_a14t_201212_24_1015_012278165
State: Suspended - user request
Received: 16/04/2024 11:54:52
Report deadline: 25/06/2024 11:54:51
Estimated computation size: 3,801,388 GFLOPs
CPU time: 10d 20:05:20
CPU time since checkpoint: ---
Elapsed time: 14d 18:29:19
Estimated time remaining: ---
Fraction done: 86.824%
Virtual memory size: 185.28 MB
Working set size: 75.98 MB
Directory: slots/39
Process ID: 26168
Progress rate: 0.360% per hour
Executable: wah2_8.29_windows_intelx86.exe

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4558 Credit: 19,039,635 RAC: 18,944	Message 71514 - Posted: 20 Sep 2024, 10:58:01 UTC
"Suspended - user request"
Did you suspend the task to see whether setting it to run again would get it going?

Ryan Munro Joined: 9 Nov 20 Posts: 6 Credit: 6,943,088 RAC: 2,922	Message 71521 - Posted: 20 Sep 2024, 16:46:08 UTC - in response to Message 71514.
I suspend BOINC during the day and let it run overnight.

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 71523 - Posted: 20 Sep 2024, 17:03:31 UTC - in response to Message 71513. Last modified: 20 Sep 2024, 17:06:30 UTC
Hi Ryan,
Abort them. The task you showed is from batch 1015 (notice the _1015_ in the name). That batch was closed back in July because there was a problem with that version of the Weather@Home app. Abort any other tasks from batch 1015 as well.
Glenn
--- CPDN Visiting Scientist
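A minimal sketch of picking out tasks from a closed batch by name. It assumes, based only on the single task name quoted in this thread, that the batch number is the second-to-last underscore-separated field; the function names are hypothetical and nothing here is part of BOINC or the CPDN apps.

```python
def batch_from_task_name(task_name: str) -> int:
    """Return the batch number embedded in a task name.

    Assumption: the batch is the second-to-last '_'-separated field,
    as in 'wah2_eas25_a14t_201212_24_1015_012278165' (batch 1015).
    """
    fields = task_name.split("_")
    return int(fields[-2])


def tasks_in_closed_batches(task_names, closed_batches):
    """Return the task names whose batch number is in closed_batches."""
    closed = set(closed_batches)
    return [name for name in task_names if batch_from_task_name(name) in closed]
```

Calling `tasks_in_closed_batches(names, {1015})` on the task names shown in the BOINC Manager would list the candidates to abort.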

Ryan Munro Joined: 9 Nov 20 Posts: 6 Credit: 6,943,088 RAC: 2,922	Message 71524 - Posted: 20 Sep 2024, 18:16:05 UTC - in response to Message 71523.
Will do, thanks :)

wujj123456 Joined: 14 Sep 08 Posts: 130 Credit: 44,254,664 RAC: 9,487	Message 71546 - Posted: 22 Sep 2024, 18:49:14 UTC
I just had something similar on newer batches after an unplanned power outage. Usually when this happens, tasks either resume successfully or fail with a computation error soon after restart. This time, however, I have 4 WUs that neither errored out nor made any progress. I noticed this when checking CPU utilization, because they were not consuming any CPU cycles. After waiting for 15 minutes or so, I restarted the BOINC client to force a restart, but they were stuck in the same way. From the logs, they all have "Model crashed" in the output.
The WUs that I had to abort manually are the following:
https://main.cpdn.org/result.php?resultid=22468845
https://main.cpdn.org/result.php?resultid=22503701
https://main.cpdn.org/result.php?resultid=22505634
https://main.cpdn.org/result.php?resultid=22498243
For comparison, these are the ones that errored out, which is the normal behavior that occasionally happens after an unplanned outage (still much less often than with 8.24):
https://main.cpdn.org/result.php?resultid=22504490
https://main.cpdn.org/result.php?resultid=22476951

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 71551 - Posted: 23 Sep 2024, 12:05:50 UTC - in response to Message 71546. Last modified: 23 Sep 2024, 12:06:52 UTC
I looked at the first task on that list and it failed with this error:
Model crashed: READHIST: End of file in READ from history file for namelist NLIHISTO
Any reference to a history file means one of the models was trying to restart when the client started the task. If it hits end of file, it usually means the previous process did not finish writing the history (restart) file for some reason, or the file has been corrupted on disk. I've seen this behaviour in testing and logged an issue for it. I don't know whether it's a code issue that has always been there or one I've introduced. Without looking at Ryan's earlier tasks I can't say if it's the same behaviour.
--- CPDN Visiting Scientist

wujj123456 Joined: 14 Sep 08 Posts: 130 Credit: 44,254,664 RAC: 9,487	Message 71554 - Posted: 23 Sep 2024, 22:15:05 UTC - in response to Message 71551.
Thanks for taking a look. In my samples, this line indeed appeared only in the jobs that were stuck. I found the same log line in one of Ryan's tasks too: https://main.cpdn.org/result.php?resultid=22426672. From what you described, the task is likely not salvageable at that point. If it could report an error instead of hanging around, that would be rather helpful. For now, I guess that if another unplanned shutdown happens, I need to double-check whether any WU has stopped making progress.
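A rough sketch of that double-check, assuming you have already collected each task's output text yourself (how and where to read it from is not covered here). The marker string is taken from the error quoted earlier in this thread; the function name and the dictionary layout are hypothetical.

```python
# Marker text copied from the error message quoted in this thread.
RESTART_ERROR_MARKER = "Model crashed: READHIST"


def tasks_with_restart_error(outputs: dict[str, str]) -> list[str]:
    """Return, sorted, the task names whose output contains the restart error.

    `outputs` maps a task name to the text of its captured output.
    """
    return sorted(
        name for name, text in outputs.items() if RESTART_ERROR_MARKER in text
    )
```

Any task returned by this check is a candidate for a manual abort, since the thread suggests such tasks do not recover.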

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 71559 - Posted: 25 Sep 2024, 10:11:13 UTC - in response to Message 71554.
OK, thanks for checking Ryan's. I will investigate further why tasks with those errors are not stopping, though not immediately as I'm working on HadAM4 at the moment.
Yep, those tasks are not recoverable. In an operational environment the system would save all the history (restart) files as it runs through the forecast, so if the latest one is corrupt it can drop back to a previous good one. But that's too expensive in storage for home use.
--- CPDN Visiting Scientist

wujj123456 Joined: 14 Sep 08 Posts: 130 Credit: 44,254,664 RAC: 9,487	Message 71560 - Posted: 26 Sep 2024, 1:11:07 UTC - in response to Message 71559. Last modified: 26 Sep 2024, 1:11:23 UTC
"In an operational environment the system would save all the history (restart) files as it runs through the forecast, so if the latest one is corrupt it can drop back to a previous good one."
Ah, you don't need to save all the history to deal with interrupted file writes. You only need space for one additional checkpoint, temporarily. Say the existing checkpoint is saved in folder A. The new checkpoint is written to folder B and fsynced, and then `mv B A` should effectively give you a transaction on Linux. The resulting folder A should always be in a consistent state, and the program always loads from A. I'm just throwing out the idea, and I don't know what guarantees Windows gives. I'm not saying you ever need to change this either, given that the success rate seems to be pretty high now.
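A minimal sketch of the write-then-rename idea in Python, for illustration only; it is not how the CPDN apps write their restart files. It uses a single checkpoint file rather than a folder, because on POSIX a rename onto an existing non-empty directory fails, and a plain `mv B A` moves B inside A when A already exists. The function name and file prefix are hypothetical. On POSIX, `os.replace` swaps the file atomically; no equivalent guarantee is assumed here for Windows.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a reader sees the old or the new file,
    never a partially written one (POSIX rename semantics)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the new contents to disk first
        os.replace(tmp_path, path)  # then swap it in over the old checkpoint
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also sync the directory so the rename itself survives a power cut.
    # O_DIRECTORY is not available on Windows, so this step is skipped there.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The extra disk space needed is one checkpoint, and only until the rename completes, which matches the point made above.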