Tasks stuck for days

Ryan Munro

Joined: 9 Nov 20
Posts: 6
Credit: 6,941,431
RAC: 2,975
Message 71513 - Posted: 20 Sep 2024, 10:46:48 UTC

I have a few tasks stuck at around 86% with a reported elapsed time of 14d. They have actually been running longer than that, since the display has shown 14d for a while now, and the remaining time shows nothing.
Should I leave them or abort them?

Properties from the example unit:

Application: Weather At Home 2 (wah2) (region independent) 8.29
Name: wah2_eas25_a14t_201212_24_1015_012278165
State: Suspended - user request
Received: 16/04/2024 11:54:52
Report deadline: 25/06/2024 11:54:51
Estimated computation size: 3,801,388 GFLOPs
CPU time: 10d 20:05:20
CPU time since checkpoint: ---
Elapsed time: 14d 18:29:19
Estimated time remaining: ---
Fraction done: 86.824%
Virtual memory size: 185.28 MB
Working set size: 75.98 MB
Directory: slots/39
Process ID: 26168
Progress rate: 0.360% per hour
Executable: wah2_8.29_windows_intelx86.exe
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,962,600
RAC: 21,639
Message 71514 - Posted: 20 Sep 2024, 10:58:01 UTC

"Suspended - user request"

Did you suspend the task in order to see if setting it to run again would get it going?
Ryan Munro

Joined: 9 Nov 20
Posts: 6
Credit: 6,941,431
RAC: 2,975
Message 71521 - Posted: 20 Sep 2024, 16:46:08 UTC - in response to Message 71514.  

I suspend BOINC during the day and let it run overnight.
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,391,077
RAC: 15,319
Message 71523 - Posted: 20 Sep 2024, 17:03:31 UTC - in response to Message 71513.  
Last modified: 20 Sep 2024, 17:06:30 UTC

Hi Ryan,

Abort them. That task you showed is from batch 1015 (notice the _1015_ in the name). That batch was closed way back in July because there was a problem with that version of the Weather@Home app.

Abort any other tasks from 1015.
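
For anyone with a long task list, here is a minimal sketch of spotting tasks from a given batch by name. It relies only on the `_1015_` pattern pointed out above; the helper names `in_batch` and `tasks_in_batch` and the second task name are made up for illustration, and how you obtain the list of names is left to you.

```python
def in_batch(task_name: str, batch: int) -> bool:
    """True if the task name contains the batch number as its own field.

    Assumption: the batch shows up as "_<batch>_" in the name, as in the
    example task from this thread.
    """
    return f"_{batch}_" in task_name


def tasks_in_batch(task_names, batch):
    """Keep only the task names that appear to belong to the given batch."""
    return [name for name in task_names if in_batch(name, batch)]


names = [
    "wah2_eas25_a14t_201212_24_1015_012278165",  # task from this thread
    "wah2_eas25_a14t_201212_24_1020_012300000",  # hypothetical name for contrast
]

for name in tasks_in_batch(names, 1015):
    print("candidate to abort:", name)
```

A plain substring check can in principle match another field that happens to hold the same digits, so treat the result as a list to review rather than one to abort blindly.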

Glenn
---
CPDN Visiting Scientist
Ryan Munro

Joined: 9 Nov 20
Posts: 6
Credit: 6,941,431
RAC: 2,975
Message 71524 - Posted: 20 Sep 2024, 18:16:05 UTC - in response to Message 71523.  

Will do, thanks :)
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,530,782
RAC: 58,330
Message 71546 - Posted: 22 Sep 2024, 18:49:14 UTC

I just had something similar on newer batches due to an unplanned power outage. Usually when such incidents happen, tasks either resume successfully or error out with a computation error soon after restart.

However, this time I have 4 WUs that neither errored out nor made any progress. I noticed this when checking CPU utilization because they were not consuming any CPU cycles. After waiting for 15 minutes or so, I tried restarting the BOINC client to force a restart, but they were stuck in the same way. From the logs, they all have "Model crashed" in the output.

The WUs that I had to manually abort are the following:
https://main.cpdn.org/result.php?resultid=22468845
https://main.cpdn.org/result.php?resultid=22503701
https://main.cpdn.org/result.php?resultid=22505634
https://main.cpdn.org/result.php?resultid=22498243

For comparison, these are the ones that errored out, which is the normal behavior that occasionally follows an unplanned outage (though much less often than with 8.24):
https://main.cpdn.org/result.php?resultid=22504490
https://main.cpdn.org/result.php?resultid=22476951
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,391,077
RAC: 15,319
Message 71551 - Posted: 23 Sep 2024, 12:05:50 UTC - in response to Message 71546.  
Last modified: 23 Sep 2024, 12:06:52 UTC

I looked at the first task on that list and it failed with this error:

Model crashed: READHIST: End of file in READ from history file for namelist NLIHISTO
Any reference to a history file means one of the models was trying to restart when the client started the task. If it hits end of file, it usually means the previous process did not finish writing the history (restart) file for some reason, or the file has been corrupted on disk. I've seen this behaviour in testing and logged an issue for it. I don't know if it's a code issue that's always been there or one I've introduced.

Without looking at Ryan's earlier tasks I can't say if it's the same behaviour.
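
To compare tasks against this failure, a small sketch of checking a task's output text for the crash line quoted above. The marker string is copied from the error in this thread; the function name `has_history_read_failure` and the sample text are made up for illustration, and fetching the output (for example by copying it from the result page) is left out.

```python
# Marker copied from the crash line quoted in this thread.
READHIST_EOF = "Model crashed: READHIST: End of file"


def has_history_read_failure(output_text: str) -> bool:
    """True if the task output contains the READHIST end-of-file crash line."""
    return READHIST_EOF in output_text


# Hypothetical output text: only the "Model crashed" line comes from the thread.
sample = (
    "some earlier output\n"
    "Model crashed: READHIST: End of file in READ from history file "
    "for namelist NLIHISTO\n"
)

if has_history_read_failure(sample):
    print("history (restart) file could not be read; task is likely stuck")
```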
---
CPDN Visiting Scientist
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,530,782
RAC: 58,330
Message 71554 - Posted: 23 Sep 2024, 22:15:05 UTC - in response to Message 71551.  

Thanks for taking a look. In my samples, this line indeed appeared only in the jobs that were stuck. I found Ryan's task with the same log line too: https://main.cpdn.org/result.php?resultid=22426672.

From what you described, the task is likely not salvageable at that point. If it could report an error instead of hanging around, that would be rather helpful. I guess for now, if any unplanned shutdown happens again, I need to double-check whether any WU has stopped making progress.
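
A rough sketch of that double-check: note the CPU time of each running task, wait a while, note it again, and flag the ones that did not move. The duration format follows the "10d 20:05:20" style shown in the task properties earlier in the thread; the function names, task names, and snapshot dictionaries are hypothetical, and the tasks have to be running (not suspended) between the two snapshots for the comparison to mean anything.

```python
import re
from datetime import timedelta

# Matches "10d 20:05:20" as well as "01:15:00" (the day part is optional).
_DURATION = re.compile(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})")


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '10d 20:05:20' into a timedelta."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def stalled_tasks(before: dict, after: dict) -> list:
    """Names of tasks whose CPU time did not increase between two snapshots."""
    return sorted(
        name
        for name, cpu in after.items()
        if name in before and parse_duration(cpu) <= parse_duration(before[name])
    )


# Hypothetical snapshots taken some minutes apart: task name -> CPU time.
before = {"task_a": "10d 20:05:20", "task_b": "01:15:00"}
after = {"task_a": "10d 20:05:20", "task_b": "01:30:00"}

print(stalled_tasks(before, after))  # ['task_a']
```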
Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,391,077
RAC: 15,319
Message 71559 - Posted: 25 Sep 2024, 10:11:13 UTC - in response to Message 71554.  

Ok, thanks for checking Ryan's. I will investigate further why tasks with those errors are not stopping, though not immediately as I'm working on HadAM4 at the moment.

Yep, those tasks are not recoverable. In an operational environment the system would save all the history (restart) files as it runs through the forecast, so if the latest one is corrupt it can drop back to a previous good one. But that's too expensive in storage for home use.
---
CPDN Visiting Scientist
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,530,782
RAC: 58,330
Message 71560 - Posted: 26 Sep 2024, 1:11:07 UTC - in response to Message 71559.  
Last modified: 26 Sep 2024, 1:11:23 UTC

"In an operational environment the system would save all the history (restart) files as it runs through the forecast, so if the latest one is corrupt it can drop back to a previous good one."

Ah, you don't need to save all history to deal with interrupted file writes. You only need space for one additional checkpoint temporarily. Say the existing checkpoint is saved in folder A. The new checkpoint is written to folder B and fsynced, and then `mv B A` should effectively give you a transaction on Linux. The resulting folder A should always be in a consistent state, and the program always loads from A.

Just throwing out the idea; I don't know what guarantees Windows provides. I'm not saying you ever need to change this either, given that the success rate seems to be pretty high now.
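
A minimal sketch of the same idea for a single checkpoint file, in Python purely for illustration (this is not how the model writes its history files, and `write_checkpoint_atomically` and `restart.dat` are made-up names). It uses a single file rather than a folder because `mv B A` with an existing directory A moves B inside A instead of replacing it, whereas renaming one file over another is atomic on POSIX. `os.replace` also overwrites the target on Windows, but I am not certain what crash guarantees it gives there.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, data: bytes) -> None:
    """Write data to path so that path always holds a complete checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new contents to disk first
        os.replace(tmp_path, path)  # then swap it in with a single rename
    except BaseException:
        # Leave the old checkpoint untouched and drop the partial file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    if os.name == "posix":
        # Also sync the directory entry so the rename itself survives a crash.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    checkpoint = os.path.join(workdir, "restart.dat")
    write_checkpoint_atomically(checkpoint, b"state after step 1")
    write_checkpoint_atomically(checkpoint, b"state after step 2")
    with open(checkpoint, "rb") as f:
        print(f.read())
```

If the process dies before the rename, the old checkpoint is still intact; if it dies after, the new one is complete. The extra disk space needed is one checkpoint, and only while the write is in progress.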
