stuck task (1006)

Message boards : Number crunching : stuck task (1006)

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70629 - Posted: 8 Mar 2024, 10:31:04 UTC

I have a task that is stuck on 98.058%. Task properties shows --- against time since last checkpoint. Restarting client and manager makes no difference.
1. Is there anything I can do to bump start the task?
2. Is there anything I can look at to try and work out what has gone wrong? Nothing in event log or event log backup from before re-starting.

I shall wait till after the second task has completed before looking at whether I can start the wah2_8.29_windows_intelx86.exe manually. (Or I could just abort.)
ID: 70629

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,376,996
RAC: 14,200
Message 70630 - Posted: 8 Mar 2024, 11:30:10 UTC - in response to Message 70629.  
Last modified: 8 Mar 2024, 11:33:39 UTC

Hi Dave,

Are you sure the task processes are actually running? That's the first thing to check. There are 3 processes per task. Count the number of wah2_8.29_windows_intelx86.exe, wah2am* & wah2rm* processes you have running in Task Manager (or Resource Monitor). If the counts don't match the number of tasks boincmgr shows as running, let me know.
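
The count can be sketched in a few lines of Python. This is only an illustration: the list of process names is assumed to have been copied out of Task Manager by hand, and the helper names `count_task_processes` and `counts_match` are hypothetical, not part of BOINC or the wah2 application.

```python
# Name prefixes of the three per-task processes mentioned in the post.
PREFIXES = ("wah2_8.29_windows_intelx86", "wah2am", "wah2rm")


def count_task_processes(process_names):
    """Count how many process names start with each of the three prefixes."""
    counts = {prefix: 0 for prefix in PREFIXES}
    for name in process_names:
        lowered = name.lower()
        for prefix in PREFIXES:
            if lowered.startswith(prefix):
                counts[prefix] += 1
                break
    return counts


def counts_match(process_names, running_tasks):
    """True if each kind of process appears exactly once per running task."""
    counts = count_task_processes(process_names)
    return all(n == running_tasks for n in counts.values())
```

Calling `counts_match(names, 2)` with the names seen on a machine running two tasks gives a quick yes/no before digging into the log files.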

If they match then have a look in the model log files to see what's going on. First, find the folder for the task in question. Using one of mine as an example, if boincmgr shows the name of the task as 'wah2_eas25_a3pf_200912_24_1007_012269659_0', go to your boinc data folder (might be hidden), then projects\climateprediction.net, and you should see a folder of the same name but without the trailing _0 (task try number).
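
As a rough sketch of that naming rule, assuming the task name always ends in `_<try number>` as it appears in boincmgr, the folder can be derived like this. The function names are hypothetical, and the data folder location has to be supplied by the user.

```python
import re
from pathlib import PureWindowsPath


def task_folder_name(task_name):
    """Drop the trailing _<try number> from a task name shown by boincmgr.

    Only pass the full task name: a name that has already been stripped
    would lose its last numeric field as well.
    """
    return re.sub(r"_\d+$", "", task_name)


def task_folder_path(boinc_data_dir, task_name):
    """Build <data dir>\\projects\\climateprediction.net\\<task folder>."""
    return (
        PureWindowsPath(boinc_data_dir)
        / "projects"
        / "climateprediction.net"
        / task_folder_name(task_name)
    )
```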

In the task folder you should see a text file: stdout_mon.txt. This contains a print of the timesteps completed. Check the 'Date modified' column: was the file updated recently? It's normally updated every few minutes.
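
The same check can be scripted. A minimal sketch follows; the 30-minute threshold is an arbitrary assumption (the post only says the file is normally updated every few minutes), and the helper names are hypothetical.

```python
import os
import time


def minutes_since_modified(path, now=None):
    """Minutes elapsed since the file at `path` was last modified."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 60.0


def looks_stale(path, threshold_minutes=30, now=None):
    """True if the file has not been touched for longer than the threshold."""
    return minutes_since_modified(path, now) > threshold_minutes
```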

Open the file up. It'll contain lines like this:
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131015 A - 12/11/2010 17:45 - H:M:S=0061:51:37 AVG= 1.70 DLT= 0.30
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131016 A - 12/11/2010 18:00 - H:M:S=0061:51:37 AVG= 1.70 DLT= 0.28
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131017 A - 12/11/2010 18:15 - H:M:S=0061:51:38 AVG= 1.70 DLT= 0.47
Your names & numbers will be different, but that's a timestep log of how far the model has got. You can reload this file every few minutes to see if the lines have changed. What you're looking for is changes to the middle of the line:
.. A - 12/11/2010 18:15 ....
That's the current model date & time.
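
For anyone who prefers to script the comparison, here is a small sketch that pulls the timestep counter and model date out of lines in the format shown above. The regular expression is inferred from those three sample lines only, so treat the layout as an assumption; the function names are hypothetical. The model date is kept as text rather than parsed, since the thread does not say which calendar the model uses.

```python
import re

# Layout inferred from the sample lines: task - PH n TS nnnnnnn X - date time - ...
LINE_RE = re.compile(
    r"^(?P<task>\S+) - PH\s+(?P<phase>\d+)\s+TS\s+(?P<timestep>\d+)\s+\S+ - "
    r"(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2})"
)


def parse_timestep_line(line):
    """Return the fields of one timestep line, or None if it is another kind of line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "task": m.group("task"),
        "phase": int(m.group("phase")),
        "timestep": int(m.group("timestep")),
        "model_time": f"{m.group('date')} {m.group('time')}",
    }


def last_timestep(text):
    """Find the last timestep line in the contents of stdout_mon.txt."""
    for line in reversed(text.splitlines()):
        parsed = parse_timestep_line(line)
        if parsed is not None:
            return parsed
    return None


def has_progressed(earlier_text, later_text):
    """Compare two snapshots of the file taken a few minutes apart."""
    before, after = last_timestep(earlier_text), last_timestep(later_text)
    if after is None:
        return False
    return before is None or after["timestep"] > before["timestep"]
```

Reading the file twice, a few minutes apart, and passing both snapshots to `has_progressed` answers the same question as reloading it by hand.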

If that shows the model is not progressing, despite the process running, that's unusual. Never seen that before.

Zip up the files: stdout_mon.txt, stdout_rm.txt, stdout_um.txt in the task folder, together with stderr.txt in the task's slot folder and email the zip to me. I'll take a look and see what's going on.
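
If it helps, the zip can be built with Python's standard `zipfile` module. This is a sketch under the assumption that the task folder and slot folder paths are already known; `zip_task_logs` is a hypothetical helper name. Files that do not exist are skipped rather than treated as an error.

```python
import zipfile
from pathlib import Path

# The three logs in the task folder named in the post.
TASK_LOGS = ("stdout_mon.txt", "stdout_rm.txt", "stdout_um.txt")


def zip_task_logs(task_dir, slot_dir, zip_path):
    """Zip the task's log files plus stderr.txt from the slot folder.

    Returns the list of file names that were actually added.
    """
    wanted = [Path(task_dir) / name for name in TASK_LOGS]
    wanted.append(Path(slot_dir) / "stderr.txt")
    added = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in wanted:
            if path.is_file():
                zf.write(path, arcname=path.name)
                added.append(path.name)
    return added
```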

p.s. had you made any hand edits to the client_state.xml file at all?
pp.s. it's not possible to 'hand-start' the task. It has to be done under boinc.
---
CPDN Visiting Scientist
ID: 70630

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70631 - Posted: 8 Mar 2024, 14:18:23 UTC - in response to Message 70630.  

Just two tasks in the vm running windows (tiny10)

Two wah2_8.29_windows_intelx86.exe processes are running. Both say they are using 0.5MB RAM but only one has any disk usage.
Two wah2am processes are running, one using just 0.8MB RAM, the other 152.1MB.
Two wah2rm processes show as running, one averaging about 26% cpu usage (VM has 4 cpus allocated). That one shows about 257MB of RAM. The other shows 0% cpu usage and just 1MB of RAM.

Pretty obvious that the task crashed I think. Three or four instances of this in stderr.

Model crashed: READHIST: End of file in READ from history file for namelist NLIHISTO tmp/xadae.pipe_dummy


I couldn't find anything informative in the other files you mentioned. There was a power outage for five minutes which is probably relevant. If you still want the files to check I can send them but I am not hopeful of finding much. I am just glad none of my Linux testing branch tasks suffered!
ID: 70631

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,376,996
RAC: 14,200
Message 70632 - Posted: 8 Mar 2024, 19:41:00 UTC - in response to Message 70631.  

Could you please send the text files? I'd like to have a look.

The monitor process has obviously disappeared but the models should have stopped as well as they are each checking the other is still running. Not sure what's happening there.

Thanks.
---
CPDN Visiting Scientist
ID: 70632

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4535
Credit: 18,961,772
RAC: 21,888
Message 70633 - Posted: 8 Mar 2024, 21:32:31 UTC - in response to Message 70632.  
Last modified: 9 Mar 2024, 17:34:29 UTC

Will send in morning. No hand edits to client_state.xml or other BOINC files.

Edit: Done. It will be interesting if there is anything significant. I was a bit surprised not to lose anything else from an unplanned powerdown. I did notice that some of the aborted tasks still had their folders sitting there to delete. Not a biggie as I check these things periodically anyway and as you said, that has been fixed or will be before further batches go out.
ID: 70633

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,376,996
RAC: 14,200
Message 70634 - Posted: 9 Mar 2024, 22:33:22 UTC - in response to Message 70633.  
Last modified: 9 Mar 2024, 22:33:51 UTC

Having looked at the output, I think what's happened is the power-cut killed the task as it was writing out the history files (or checkpoint files if you want to call them that). That left them incomplete so when the model restarted it couldn't read the input it needed.

The puzzle for me is why the two model processes didn't get killed as well. I should be able to create a test with a corrupted history file and see if I get the same behaviour.
---
CPDN Visiting Scientist
ID: 70634

rfbrooks
Joined: 31 Aug 04
Posts: 10
Credit: 6,793,812
RAC: 15,545
Message 70916 - Posted: 11 May 2024, 21:56:33 UTC

Hey guys - I have a task that seems to be stuck as well, but it is batch 1015. It's Task# 22427152, and has been at 92.788% completed for at least the last few days. I'm running Windows 11 on an Intel i7 machine. I restarted the machine a couple days ago, which reset the "Estimated time remaining" to about 2 days (instead of zero), but the CPU time has remained the same the whole time. I checked stdout_mon.txt and there don't seem to be any logs - see copied text below. No stderr.txt in the project folder. I had several other tasks on the same machine that completed successfully.

Any ideas on next steps? This is low priority; I realize you have a LOT going on right now.

stdout_mon.txt:

worker: Created shared memory region key = wah2_eas25_a1i5_201312_24_1015_012278645 of size 73278744 bytes (version 608)
Run for 2 Years and 0 Months
pShMem->PRECIS_LATITUDE 185
pShMem->PRECIS_LONGITUDE 285
pShMem->EWSPACEA 0.220000
pShMem->NSSPACEA 0.220000
pShMem->FRSTLATA 19.100000
pShMem->FRSTLONA 328.500000
pShMem->POLELATA 55.500000
pShMem->POLELONA 308.000000
pShMem->L_RUN_REGION 1
pShMem->UPLOAD_INTERVAL 0
ulTotalPhaseTimestep 276864
Starting model ID wah2_eas25_a1i5_201312_24_1015_012278645 Phase 1
Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2am3m2_um_8.29_windows_intelx86.exe" wah2_eas25_a1i5_201312_24_1015_012278645 generic_phase1_spinup_eas25_global_aabaka ic19610127_11_N96 AERclim_ancil_168months_CMIP6-IPSL-CM6A-LR_SST_2009-01-01_2022-12-30_v2404 AERclim_ancil_168months_CMIP6-IPSL-CM6A-LR_SIC_2009-01-01_2022-12-30_v2404 SO2DMS_N96_cmip6hist-ssp245_2009-2020 oxi.addfa ozone_cmip6hist-ssp245_N96_1979_2031
Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.29_windows_intelx86.exe" wah2_eas25_a1i5_201312_24_1015_012278645
executeModelProcess: MonID=17500, GCM_PID=18344, RCM_PID=14596

end stdout_mon.txt

Properties from BOINC Manager:
Application
Weather At Home 2 (wah2) (region independent) 8.29
Name
wah2_eas25_a1i5_201312_24_1015_012278645
State
Running
Received
4/15/2024 8:29:44 AM
Report deadline
6/24/2024 8:29:44 AM
Estimated computation size
3,801,388 GFLOPs
CPU time
9d 22:53:27
CPU time since checkpoint
---
Elapsed time
24d 03:30:08
Estimated time remaining
---
Fraction done
92.788%
Virtual memory size
185.32 MB
Working set size
74.48 MB
Directory
slots/11
Process ID
17500
Executable
wah2_8.29_windows_intelx86.exe
ID: 70916

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,376,996
RAC: 14,200
Message 70917 - Posted: 12 May 2024, 22:30:59 UTC - in response to Message 70916.  

stdout_mon.txt:
.....
executeModelProcess: MonID=17500, GCM_PID=18344, RCM_PID=14596
Those 3 numbers are the process IDs (PIDs) for the task. Have a look in Task Manager or Resource Monitor and see if those 3 process numbers are actually running. I suspect only 17500 is present and the other two are not. That indicates the model has died but the BOINC side of things has not realised.
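
A short sketch of that check: extract the three PIDs from the most recent `executeModelProcess` line and compare them with a list of PIDs copied from Task Manager. The line format is taken from the log quoted above; the function names and the idea of supplying the running PIDs as a plain list are assumptions made for illustration.

```python
import re

PID_RE = re.compile(
    r"executeModelProcess: MonID=(\d+), GCM_PID=(\d+), RCM_PID=(\d+)"
)


def latest_pids(stdout_mon_text):
    """Return the PIDs from the last executeModelProcess line, or None."""
    matches = PID_RE.findall(stdout_mon_text)
    if not matches:
        return None
    mon, gcm, rcm = (int(value) for value in matches[-1])
    return {"MonID": mon, "GCM_PID": gcm, "RCM_PID": rcm}


def missing_pids(expected, running_pids):
    """Return the expected PIDs that are absent from the running list."""
    running = set(running_pids)
    return {label: pid for label, pid in expected.items() if pid not in running}
```

An empty result from `missing_pids` means all three processes are present, in which case the next question is whether they are using any CPU.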

In which case, I suggest shutting down the boinc client (not the machine) and restarting it. Hopefully that will clear it.

Cheers.
---
CPDN Visiting Scientist
ID: 70917

rfbrooks
Joined: 31 Aug 04
Posts: 10
Credit: 6,793,812
RAC: 15,545
Message 70918 - Posted: 13 May 2024, 2:45:49 UTC - in response to Message 70917.  

Thanks for the update, Glenn. Checked Task Manager and all 3 process IDs were in a Running state, but not consuming any CPU. I tried stopping the client and restarting, which reset the remaining time to 1d17h+, but I can see that it's still not using any CPU. Again, not a high priority.
ID: 70918

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,376,996
RAC: 14,200
Message 70919 - Posted: 13 May 2024, 20:58:52 UTC - in response to Message 70918.  

So after client restart, all 3 processes are there but apparently not consuming CPU? Is that right?

Let's check a few things. You mentioned the stdout_mon.txt file in the task's working directory. Can you please go back to that file and see what the last line is? I'm interested to see if there are any lines indicating the model is running timesteps. If it is, you'll see lines like this:
wah2_eas25_h000_208912_24_d643_000011038 - PH 1 TS 0025346 A - 07/02/2090 00:30 - H:M:S=0010:12:20 AVG= 1.45 DLT= 0.00

The global model output log is in the text file dataout/xadae.out (relative to the stdout_mon.txt file). The regional model log is dataout/xacxf.out. These files might be quite long, or they'll be empty. If long, then please just post the last 10 lines from each file.

Also could you let me know the sizes of the following 3 files, also in 'dataout': atmos_restart.day, region_restart.day, shmem_restart.day. I'd like to compare them with known working filesizes.
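
A minimal sketch for collecting those sizes, assuming the path to the task's 'dataout' folder is known. `restart_file_sizes_kb` is a hypothetical helper; it reports whole kilobytes rounded down, so the figures may differ by 1 KB from what Explorer displays.

```python
from pathlib import Path

# The three restart files named in the post.
RESTART_FILES = ("atmos_restart.day", "region_restart.day", "shmem_restart.day")


def restart_file_sizes_kb(dataout_dir):
    """Map each restart file name to its size in KB, or None if it is missing."""
    sizes = {}
    for name in RESTART_FILES:
        path = Path(dataout_dir) / name
        sizes[name] = path.stat().st_size // 1024 if path.is_file() else None
    return sizes
```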

What might be happening is all the processes start but the global & regional models haven't started running timesteps. They hand-shake via shared memory; the shmem_restart.day is a dump of the shared memory block. If that file is damaged, then it could be that both models are waiting for the other to finish.

If any of these restart files are damaged, there's nothing you can do to recover it as we don't keep more than 1 set of restarts to save on filespace. I think the task is probably a lost cause unfortunately, but it would be useful to understand what's happening.

---
CPDN Visiting Scientist
ID: 70919

rfbrooks
Joined: 31 Aug 04
Posts: 10
Credit: 6,793,812
RAC: 15,545
Message 70920 - Posted: 13 May 2024, 21:53:43 UTC - in response to Message 70919.  

That's right ... all 3 processes running but not consuming CPU.

The text copied above from stdout_mon.txt was the entire contents of the file. It has been updated now that I restarted, but the last line is now:

Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.29_windows_intelx86.exe" wah2_eas25_a1i5_201312_24_1015_012278645
executeModelProcess: MonID=7160, GCM_PID=16392, RCM_PID=21956

dataout/xadae.out (entire contents):
=====================================================
GCOM Version
=====================================================

Start of UM Job : 22:24:27 on 12/05/2024
Setting UM_NAM_MAX_SECONDS to 0
*********************************************************************************
Model aborted with error code - 1 Routine and message:-
READHIST: End of file in READ from history file for namelist NLIHISTO
*********************************************************************************


dataout/xacxf.out (last 10 of 138410 lines):
CPDN: RM Change rcvd comm1 3
CPDN: RM Change rcvd comm1 3
CPDN: RM Change rcvd comm1 3
CPDN: RM Change rcvd comm1 3
CPDN: RM Change rcvd comm1 3
CPDN: RM Change rcvd comm1 3
CPDN: RM Change rcvd comm1 3
CPDN: RM Change rcvd comm1 3
CPDN: RM Change rcvd comm1 3
CPDN: RM Change rcvd comm1 3

File sizes:
atmos_restart.day 56808 KB
region_restart.day 108774 KB
shmem_restart.day 20937 KB
ID: 70920

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,376,996
RAC: 14,200
Message 70921 - Posted: 13 May 2024, 22:40:20 UTC - in response to Message 70920.  

*********************************************************************************
Model aborted with error code - 1 Routine and message:-
READHIST: End of file in READ from history file for namelist NLIHISTO
*********************************************************************************
Right, thanks. That's the problem. The global model has tried to read its history file and failed. However, it should have exited at that point but didn't for some odd reason. Not sure why. Because the global model process didn't exit, the others didn't either (they each check the others are still running).

I'd need to look in the code to remind me which file it's trying to read. It'll be one of: dataout/xadae.phist or dataout/xadae.thist. Those two text files should be identical, each contains 173 lines and the top of the file should look similar to:

 &NLIHISTO
 MODEL_DATA_TIME =        1840,          12,           1, 3*0,
 RUN_MEANCTL_RESTART     =           0,
 RUN_INDIC_OP    =           0,
 RUN_RESUBMIT_TARGET     =           0,           3,          19, 3*0,
 FT_LASTFIELD    = 40*0,         119,         152,         114, 78*0,         442, 8*0
 /
What does your file look like? The error message suggests the file is empty or truncated.
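
A quick way to answer that without opening the files in an editor is sketched below. The checks are assumptions drawn from the description above (the file should start with `&NLIHISTO`, and the two copies should be identical); `diagnose_history_file` and `histories_identical` are hypothetical names, not part of the model code.

```python
from pathlib import Path


def diagnose_history_file(path):
    """Give a rough one-word verdict on a .phist or .thist file."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    data = p.read_bytes()
    if len(data) == 0:
        return "empty"
    if data.strip(b"\x00") == b"":
        # Non-zero size, but nothing in it except NUL bytes.
        return "all NUL bytes"
    if not data.lstrip().startswith(b"&NLIHISTO"):
        return "unexpected header"
    return "plausible"


def histories_identical(phist_path, thist_path):
    """True if the two history files are byte-for-byte the same."""
    return Path(phist_path).read_bytes() == Path(thist_path).read_bytes()
```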
---
CPDN Visiting Scientist
ID: 70921

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,376,996
RAC: 14,200
Message 70922 - Posted: 14 May 2024, 13:36:43 UTC - in response to Message 70921.  

I've looked into this and can see what's happening in the code. The file the model is trying to read is either missing or has zero contents. That should cause the model process to die but I've identified the error in the code which means that doesn't happen. So thanks for reporting the issue, it's a big help.

That error is unrecoverable so feel free to Abort the task.
---
CPDN Visiting Scientist
ID: 70922

rfbrooks
Joined: 31 Aug 04
Posts: 10
Credit: 6,793,812
RAC: 15,545
Message 70923 - Posted: 14 May 2024, 22:11:39 UTC

Both files are 14KB, and each contains 14239 NUL characters.
ID: 70923

rfbrooks
Joined: 31 Aug 04
Posts: 10
Credit: 6,793,812
RAC: 15,545
Message 70924 - Posted: 14 May 2024, 22:39:48 UTC - in response to Message 70923.  

Thanks for checking on it. I'll abort the task.
ID: 70924

Glenn Carver
Joined: 29 Oct 17
Posts: 1048
Credit: 16,376,996
RAC: 14,200
Message 70926 - Posted: 15 May 2024, 8:29:41 UTC - in response to Message 70923.  

Both files are 14KB, and each contains 14239 NUL characters.
Ok, thanks for letting me know. Not sure how that could have happened. I fixed a memory overwrite bug that was causing the model to crash at the start of the new year, but it wasn't anywhere near the code at the start of the model run. I can fix the hang but there might be something else that's causing a file full of nothing.

I've created an issue for this and I should get time to fix it before the next version goes out. Thanks for the report.
---
CPDN Visiting Scientist
ID: 70926

©2024 cpdn.org