Message boards : Number crunching : stuck task (1006)
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I have a task that is stuck on 98.058%. Task properties shows --- against time since last checkpoint. Restarting client and manager makes no difference. 1. Is there anything I can do to bump start the task? 2. Is there anything I can look at to try and work out what has gone wrong? Nothing in event log or event log backup from before re-starting. I shall wait till after the second task has completed before looking at whether I can start the wah2_8.29_windows_intelx86.exe manually. (Or I could just abort.) |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
Hi Dave, Are you sure the task processes are actually running? That's the first thing to check. There are 3 processes per task. Count the number of wah2_8.29_windows_intelx86.exe, wah2am* & wah2rm* processes you have running in Task Manager (or Resource Manager). If the numbers don't match the number of tasks boincmgr shows, let me know. If they match then have a look in the model log files to see what's going on. First, find the folder for the task in question. Using one of mine as an example, if boincmgr shows the name of the task as 'wah2_eas25_a3pf_200912_24_1007_012269659_0', go to your boinc data folder (might be hidden), then projects\climateprediction.net, and you should see a folder of the same name but without the trailing _0 (task try number). In the task folder you should see a text file: stdout_mon.txt. This contains a print of the timesteps completed. Check the 'Date modified' column: was the file updated recently? It's normally updated every few minutes. Open the file up. It'll contain lines like this: wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131015 A - 12/11/2010 17:45 - H:M:S=0061:51:37 AVG= 1.70 DLT= 0.30 wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131016 A - 12/11/2010 18:00 - H:M:S=0061:51:37 AVG= 1.70 DLT= 0.28 wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131017 A - 12/11/2010 18:15 - H:M:S=0061:51:38 AVG= 1.70 DLT= 0.47 Your names & numbers will be different, but that's a timestep log of how far the model has got. You can reload this file every few minutes to see if the lines have changed. What you're looking for is changes to the middle of the line: .. A - 12/11/2010 18:15 .... That's the current model date & time. If that shows the model is not progressing, despite the process running, that's unusual. Never seen that before. Zip up the files: stdout_mon.txt, stdout_rm.txt, stdout_um.txt in the task folder, together with stderr.txt in the task's slot folder and email the zip to me. I'll take a look and see what's going on. p.s. had you made any hand edits to the client_state.xml file at all? pp.s. it's not possible to 'hand-start' the task. It has to be done under boinc. --- CPDN Visiting Scientist |
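The progress check described in the post above can be sketched in Python. This is only an illustration: the regular expression is inferred from the three sample lines quoted in the post, and the names `last_timestep` and `has_progressed` are hypothetical, not part of BOINC or the CPDN code. The idea is to read stdout_mon.txt twice, a few minutes apart, and compare the last timestep line.

```python
import re

# Pattern inferred from the sample lines quoted in the thread, e.g.
# "<task> - PH 1 TS 0131017 A - 12/11/2010 18:15 - H:M:S=0061:51:38 AVG= 1.70 DLT= 0.47"
TIMESTEP_RE = re.compile(
    r"- PH\s+(\d+)\s+TS\s+(\d+)\s+\S+\s+-\s+(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+-"
)


def last_timestep(text):
    """Return (phase, timestep, model_date) of the last timestep line, or None."""
    matches = TIMESTEP_RE.findall(text)
    if not matches:
        return None
    phase, timestep, model_date = matches[-1]
    # The model date is kept as a string: it is only compared, never parsed.
    return int(phase), int(timestep), model_date


def has_progressed(earlier_text, later_text):
    """True if the later reading ends on a different timestep line than the earlier one."""
    return last_timestep(earlier_text) != last_timestep(later_text)
```

If `last_timestep` returns None, the file holds no timestep lines at all, which suggests the models never started stepping.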
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Just two tasks in the VM running Windows (tiny10). One wah2_8.29_windows_intelx86.exe is using only ... Both say they are using 0.5MB RAM but only one has any disk usage. Two wah2am processes are running, one using just 0.8MB RAM, the other 152.1MB. Two wah2rm processes show as running, one averaging about 26% cpu usage (VM has 4 cpus allocated). That one shows about 257MB of RAM. The other shows 0% cpu usage and just 1MB of RAM. Pretty obvious that the task crashed, I think. Three or four instances of this in stderr: Model crashed: READHIST: End of file in READ from history file for namelist NLIHISTO tmp/xadae.pipe_dummy I couldn't find anything informative in the other files you mentioned. There was a power outage for five minutes which is probably relevant. If you still want the files to check I can send them but I am not hopeful of finding much. I am just glad none of my Linux testing branch tasks suffered! |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
Could you please send the text files? I'd like to have a look. The monitor process has obviously disappeared, but the models should have stopped as well, since each process checks that the others are still running. Not sure what's happening there. Thanks. --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Will send in morning. No hand edits to client_state.xml or other BOINC files. Edit: Done. It will be interesting if there is anything significant. I was a bit surprised not to lose anything else from an unplanned powerdown. I did notice that some of the aborted tasks still had their folders sitting there to delete. Not a biggie as I check these things periodically anyway and as you said, that has been fixed or will be before further batches go out. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
Having looked at the output, I think what's happened is the power-cut killed the task as it was writing out the history files (or checkpoint files if you want to call them that). That left them incomplete so when the model restarted it couldn't read the input it needed. The puzzle for me is why the two model processes didn't get killed as well. I should be able to create a test with a corrupted history file and see if I get the same behaviour. --- CPDN Visiting Scientist |
Send message Joined: 31 Aug 04 Posts: 10 Credit: 6,951,410 RAC: 12,756 |
Hey guys - I have a task that seems to be stuck as well, but it is batch 1015. It's Task# 22427152, and has been at 92.788% completed for at least the last few days. I'm running Windows 11 on an Intel i7 machine. I restarted the machine a couple days ago, which reset the "Estimated time remaining" to about 2 days (instead of zero), but the CPU time has remained the same the whole time. I checked stdout_mon.txt and there don't seem to be any logs - see copied text below. No stderr.txt in the project folder. I had several other tasks on the same machine that completed successfully. Any ideas on next steps? This is low priority; I realize you have a LOT going on right now. stdout_mon.txt: worker: Created shared memory region key = wah2_eas25_a1i5_201312_24_1015_012278645 of size 73278744 bytes (version 608) Run for 2 Years and 0 Months pShMem->PRECIS_LATITUDE 185 pShMem->PRECIS_LONGITUDE 285 pShMem->EWSPACEA 0.220000 pShMem->NSSPACEA 0.220000 pShMem->FRSTLATA 19.100000 pShMem->FRSTLONA 328.500000 pShMem->POLELATA 55.500000 pShMem->POLELONA 308.000000 pShMem->L_RUN_REGION 1 pShMem->UPLOAD_INTERVAL 0 ulTotalPhaseTimestep 276864 Starting model ID wah2_eas25_a1i5_201312_24_1015_012278645 Phase 1 Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2am3m2_um_8.29_windows_intelx86.exe" wah2_eas25_a1i5_201312_24_1015_012278645 generic_phase1_spinup_eas25_global_aabaka ic19610127_11_N96 AERclim_ancil_168months_CMIP6-IPSL-CM6A-LR_SST_2009-01-01_2022-12-30_v2404 AERclim_ancil_168months_CMIP6-IPSL-CM6A-LR_SIC_2009-01-01_2022-12-30_v2404 SO2DMS_N96_cmip6hist-ssp245_2009-2020 oxi.addfa ozone_cmip6hist-ssp245_N96_1979_2031 Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.29_windows_intelx86.exe" wah2_eas25_a1i5_201312_24_1015_012278645 executeModelProcess: MonID=17500, GCM_PID=18344, RCM_PID=14596 end stdout_mon.txt Properties from BOINC Manager: Application Weather At Home 2 (wah2) (region independent) 8.29 Name wah2_eas25_a1i5_201312_24_1015_012278645 State Running Received 4/15/2024 8:29:44 AM Report deadline 6/24/2024 8:29:44 AM Estimated computation size 3,801,388 GFLOPs CPU time 9d 22:53:27 CPU time since checkpoint --- Elapsed time 24d 03:30:08 Estimated time remaining --- Fraction done 92.788% Virtual memory size 185.32 MB Working set size 74.48 MB Directory slots/11 Process ID 17500 Executable wah2_8.29_windows_intelx86.exe |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
The 3 numbers on the executeModelProcess line at the end of your stdout_mon.txt (MonID, GCM_PID, RCM_PID) are the process IDs (PIDs) for the task. Have a look in Task Manager or Resource Manager and see if those 3 processes are actually running. I suspect only 17500 is present and the other two are not. That would indicate the model has died but the BOINC side of things has not realised. In which case, I suggest shutting down the boinc client (not the machine) and restarting it. Hopefully that will clear it. Cheers. --- CPDN Visiting Scientist |
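The PID check suggested above could be scripted roughly as follows. This is a sketch, not a CPDN tool: the line format is taken from the single executeModelProcess line quoted in this thread, the function names are hypothetical, and `running_pids_windows` assumes the standard Windows `tasklist` command with CSV output.

```python
import csv
import re
import subprocess

# Matches the line the monitor writes, as quoted in the thread:
# "executeModelProcess: MonID=17500, GCM_PID=18344, RCM_PID=14596"
PID_LINE_RE = re.compile(
    r"executeModelProcess:\s*MonID=(\d+),\s*GCM_PID=(\d+),\s*RCM_PID=(\d+)"
)


def task_pids(stdout_mon_text):
    """Return the PIDs from the last executeModelProcess line, or None."""
    matches = PID_LINE_RE.findall(stdout_mon_text)
    if not matches:
        return None
    mon, gcm, rcm = (int(p) for p in matches[-1])
    return {"monitor": mon, "global": gcm, "regional": rcm}


def missing_processes(pids, running):
    """Return the names of task processes whose PID is not in `running`."""
    return sorted(name for name, pid in pids.items() if pid not in running)


def running_pids_windows():
    """Collect running PIDs from the Windows `tasklist` command (CSV, no header)."""
    out = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = csv.reader(out.splitlines())
    # Column 1 of each CSV row is the PID.
    return {int(row[1]) for row in rows if len(row) > 1 and row[1].isdigit()}
```

On a Windows host, `missing_processes(task_pids(text), running_pids_windows())` lists which of the three processes are gone. A PID being present only rules out the case where a process has exited; it says nothing about whether the model is stepping.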
Send message Joined: 31 Aug 04 Posts: 10 Credit: 6,951,410 RAC: 12,756 |
Thanks for the update, Glenn. Checked Task Manager and all 3 process IDs were in a Running state, but not consuming any CPU. I tried stopping the client and restarting, which reset the remaining time to 1d17h+, but I can see that it's still not using any CPU. Again, not a high priority. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
So after client restart, all 3 processes are there but apparently not consuming CPU? Is that right? Let's check a few things. You mentioned the stdout_mon.txt file in the task's working directory. Can you please go back to that file and see what the last line is? I'm interested to see if there are any lines indicating the model is running timesteps. If it is, you'll see lines like this: wah2_eas25_h000_208912_24_d643_000011038 - PH 1 TS 0025346 A - 07/02/2090 00:30 - H:M:S=0010:12:20 AVG= 1.45 DLT= 0.00 The global model output log is in the text file: dataout/xadae.out (relative to the stdout_mon.txt file). The regional model log is dataout/xacxf.out. These files might be quite long, or they'll be empty. If long, then please just post the last 10 lines from each file. Also could you let me know the sizes of the following 3 files, also in 'dataout': atmos_restart.day, region_restart.day, shmem_restart.day. I'd like to compare them with known working filesizes. What might be happening is all the processes start but the global & regional models haven't started running timesteps. They hand-shake via shared memory; the shmem_restart.day is a dump of the shared memory block. If that file is damaged, then it could be that both models are waiting for the other to finish. If any of these restart files are damaged, there's nothing you can do to recover it as we don't keep more than 1 set of restarts to save on filespace. I think the task is probably a lost cause unfortunately, but it would be useful to understand what's happening. --- CPDN Visiting Scientist |
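The information requested above (restart file sizes and the last 10 lines of each model log) could be gathered with a short Python helper like the one below. The file names come from the post; the function names and the choice to report a missing file as None are assumptions made for illustration.

```python
from pathlib import Path

# Restart files named in the post above, all inside the task's 'dataout' folder.
RESTART_FILES = ("atmos_restart.day", "region_restart.day", "shmem_restart.day")


def restart_sizes(dataout):
    """Map each restart file name to its size in bytes, or None if it is missing."""
    folder = Path(dataout)
    sizes = {}
    for name in RESTART_FILES:
        path = folder / name
        sizes[name] = path.stat().st_size if path.is_file() else None
    return sizes


def tail(path, n=10):
    """Return the last n lines of a text log, replacing undecodable bytes."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-n:]
```

For example, `tail(Path(task_dir) / "dataout" / "xacxf.out")` would return the last 10 lines of the regional model log, where `task_dir` stands for the task folder under projects\climateprediction.net.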
Send message Joined: 31 Aug 04 Posts: 10 Credit: 6,951,410 RAC: 12,756 |
That's right ... all 3 processes running but not consuming CPU. The text copied above from stdout_mon.txt was the entire contents of the file. It has been updated now that I restarted, but the last line is now: Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.29_windows_intelx86.exe" wah2_eas25_a1i5_201312_24_1015_012278645 executeModelProcess: MonID=7160, GCM_PID=16392, RCM_PID=21956 dataout/xadae.out (entire contents): ===================================================== GCOM Version ===================================================== Start of UM Job : 22:24:27 on 12/05/2024 Setting UM_NAM_MAX_SECONDS to 0 ********************************************************************************* Model aborted with error code - 1 Routine and message:- READHIST: End of file in READ from history file for namelist NLIHISTO ********************************************************************************* dataout/xacxf.out (last 10 of 138410 lines): CPDN: RM Change rcvd comm1 3 CPDN: RM Change rcvd comm1 3 CPDN: RM Change rcvd comm1 3 CPDN: RM Change rcvd comm1 3 CPDN: RM Change rcvd comm1 3 CPDN: RM Change rcvd comm1 3 CPDN: RM Change rcvd comm1 3 CPDN: RM Change rcvd comm1 3 CPDN: RM Change rcvd comm1 3 CPDN: RM Change rcvd comm1 3 File sizes: atmos_restart.day 56808 KB region_restart.day 108774 KB shmem_restart.day 20937 KB |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
Right, thanks. That's the problem. The global model has tried to read its history file and failed. However, it should have exited at that point but didn't for some odd reason. Not sure why. Because the global model process didn't exit, the others didn't either (they each check the others are still running). I'd need to look in the code to remind me which file it's trying to read. It'll be one of: dataout/xadae.phist or dataout/xadae.thist. Those two text files should be identical, each contains 173 lines and the top of the file should look similar to: &NLIHISTO MODEL_DATA_TIME = 1840, 12, 1, 3*0, RUN_MEANCTL_RESTART = 0, RUN_INDIC_OP = 0, RUN_RESUBMIT_TARGET = 0, 3, 19, 3*0, FT_LASTFIELD = 40*0, 119, 152, 114, 78*0, 442, 8*0 / What does your file look like? The error message suggests the file is empty or truncated. --- CPDN Visiting Scientist |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
I've looked into this and can see what's happening in the code. The file the model is trying to read is either missing or has zero contents. That should cause the model process to die but I've identified the error in the code which means that doesn't happen. So thanks for reporting the issue, it's a big help. That error is unrecoverable so feel free to Abort the task. --- CPDN Visiting Scientist |
Send message Joined: 31 Aug 04 Posts: 10 Credit: 6,951,410 RAC: 12,756 |
Both files are 14KB, and each contains 14239 NUL characters. |
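A file in the state reported here (the right size on disk but filled with NUL bytes) can be told apart from an empty or truncated one with a quick heuristic. The Python sketch below assumes only what this thread states: that a healthy history file is text beginning with an `&NLIHISTO` namelist closed by a `/` terminator. The function name and the result labels are hypothetical, and the check would be fooled by a `/` inside a string value.

```python
def diagnose_history(data):
    """Classify the raw bytes of a history file such as dataout/xadae.phist."""
    if len(data) == 0:
        return "empty"
    if data.count(b"\x00") == len(data):
        return "all NUL bytes"
    text = data.decode("ascii", errors="replace")
    if "&NLIHISTO" not in text:
        return "no &NLIHISTO namelist"
    # A Fortran namelist group is closed by a '/' terminator.
    body = text.split("&NLIHISTO", 1)[1]
    if "/" not in body:
        return "truncated: namelist not terminated"
    return "plausible"
```

Calling it as `diagnose_history(open(path, "rb").read())` on both the .phist and .thist files would show whether either copy survived.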
Send message Joined: 31 Aug 04 Posts: 10 Credit: 6,951,410 RAC: 12,756 |
Thanks for checking on it. I'll abort the task. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
"Both files are 14KB, and each contains 14239 NUL characters." Ok, thanks for letting me know. Not sure how that could have happened. I fixed a memory overwrite bug that was causing the model to crash at the start of the new year, but it wasn't anywhere near the code at the start of the model run. I can fix the hang but there might be something else that's causing a file full of nothing. I've created an issue for this and I should get time to fix it before the next version goes out. Thanks for the report. --- CPDN Visiting Scientist |
©2024 cpdn.org