Message boards : Number crunching : finish file present too long
Message board moderation
Author | Message |
---|---|
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Two tasks claim to have completed successfully for the 1019 NZ25 batch but actually failed with this error: <message> finish file present too long</message>... as shown by this task: https://www.cpdn.org/cpdnboinc/result.php?resultid=22454885 Does someone who knows boinc better than me understand what triggers this? The task has clearly failed, I suspect because the model has fallen over. But then something appears to go wrong on the boinc side and triggers this error message which I've not seen before. Both tasks fail on the same machine, which is probably a clue. Anyone familiar with this message and what exactly is causing it? Why they got logged as successful is also an issue. --- CPDN Visiting Scientist |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
The best I can offer at the moment is https://github.com/BOINC/boinc/issues/3017 - which doesn't help much. Just says 'you are not alone'. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Ok, that is useful. Thanks Richard. At least I now understand what the error message means. The user in question is also running quite an old version of the client, ~7.11, since that error message has been edited in the latest version. But that means the client didn't have the bugfix you pointed to where the wait time was extended from 10secs to 5 mins. So maybe the user Aborted the task, the controller process kills the model, finish file is created, but because the machine was swapping(?) the processes were slow to die and the 10 sec wait on the finish file triggered the client error. I still don't understand why this caused the task to be reported a success since it triggered the debugger but never mind. Since the code has since changed and it's not anything I can do on the CPDN side, I will leave it here. Thanks. --- CPDN Visiting Scientist |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
This error has popped up again on the latest EAS25 batch. I'm posting this in case it's useful later for anyone searching for more information on this bug. The task in question is: https://www.cpdn.org/cpdnboinc/result.php?resultid=22462930. If the stderr has long gone, the key part of the task stderr log is: <core_client_version>7.14.2</core_client_version> <![CDATA[ <message> finish file present too long</message> <stderr_txt> spended CPDN Monitor - Suspend request from BOINC......(lots of suspend requests) ... Suspended CPDN Monitor - Suspend request from BOINC... Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 19636, selfPID = 24304, iMonCtr = 1 Model crash detected, will try to restart... Leaving CPDN_Main::Monitor... monitor:finished called ... tidying up. monitor:finished: Uploading out files... Queuing intermediate upload for CPDN/BOINC: cpdnout_out.zip Detaching shared memory... Done. monitor:finished: Closed output file : stdout_<>.txt modelResultFiles : Removing : wah2_eas25_a3cd_200612_24_1020_012305460 in C:\ProgramData\BOINC/projects/climateprediction.net monitor:finished: handing over to boinc_finish(RetVal=0) 21:29:20 (24304): called boinc_finish(0) CPDN Monitor - Abort request from BOINC... monitor:finished called ... tidying up. monitor:finished: Uploading out files... Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0048C569 write attempt to address 0x0271712C Engaging BOINC Windows Runtime Debugger...The host is a 2950X with 32Gb RAM. It's running quite an old client, as noted before in previous posts. What's odd about this output is the repeated extra lines from the CPDN monitor process AFTER seeming to handover to the boinc client. I think what happens here is the monitor process calls boinc_finish (time stamp 21:29). But something goes wrong in the boinc client with the finish_file and then the client appears to issue an Abort request to the monitor process. This then goes into the 'tidying up' phase again, tries to Upload output files again, which doesn't work as it's already done this. Am uncertain how the mechanism works for the client to re-enter the monitor process this way but it looks like this is the cause of the error. The other issue here is why this gets flagged as a successful completion when the debugger has been initiated. --- CPDN Visiting Scientist |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
You might also consider that in conjunction with the number of "Suspend request from BOINC". BOINC doesn't really like those, and has code to handle 'too many' requests: https://github.com/BOINC/boinc/blob/master/client/app_control.cpp#L364 That has a hard limit of 100, where your cited task has 882 in the portion we can see - more have scrolled off the top of the file. This task probably got away with it because of the comment in the code: // handle a task that exited prematurely (i.e. no finish file) // // keep count of premature exits; // if this happens 100 times w/o a checkpoint, abort jobBut if a premature exit is defined as one without a finish file, I sniff the possibility of a link between the two. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I do not have any answers, but here is some data. I have one 1019 task running on my Linux machine. It has uploaded three trickles so far. Task 22463632 Name wah2_nz25_2296_209805_25_1019_012300899_1 Workunit 12300899 Created 22 Jul 2024, 18:59:57 UTC Sent 22 Jul 2024, 19:00:03 UTC Report deadline 30 Oct 2024, 19:00:03 UTC Computer ID 1511241 Since it is _1, I looked at what the previous guy got., Task 22456746 Name wah2_nz25_2296_209805_25_1019_012300899_0 Workunit 12300899 Created 24 Jun 2024, 13:30:32 UTC Sent 24 Jun 2024, 18:35:02 UTC Report deadline 2 Oct 2024, 18:35:02 UTC Received 22 Jul 2024, 18:59:56 UTC Server state Over Outcome Computation error Client state Compute error Exit status 0 (0x00000000) Computer ID 1542760 Run time 2 days 1 hours 39 min 1 sec CPU time 23 hours 11 min 5 sec Validate state Invalid Credit 4,163.15 Device peak FLOPS 5.60 GFLOPS Application version Weather At Home 2 (wah2) v8.24 windows_intelx86 Peak working set size 264.26 MB Peak swap size 225.38 MB Peak disk usage 0.01 MB Stderr <core_client_version>8.0.2</core_client_version> <![CDATA[ <stderr_txt> forrtl: Handlen er ikke gyldig. Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=48448, selfPID=48448, iMonCtr=2 Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=48448, selfPID=48224, iMonCtr=1 Model crash detected, will try to restart... Leaving CPDN_ain::Monitor... 20:51:49 (48224): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>wah2_nz25_2296_209805_25_1019_012300899_0_r1222875874_6.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error> <file_xfer_error> <file_name>wah2_nz25_2296_209805_25_1019_012300899_0_r1222875874_7.zip</file_name> <error_code>-240 (stat() failed)</error_code> It goes on like that. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
I've looked at the code ref, thanks. I'm not sure that's triggered for multiple Suspend requests though? It specifically mentions calls to 'exit(0)' which wouldn't be the case if the task was suspended in memory. That would just be a SIGSTOP signal sent to the process. This is what happened when the finish_file went to the wrong directory. The task called 'exit(0)', the client can't see the finish_file so the client then restarts it. That code is in the function you pointed to. CPDN tasks don't use checkpointing as boinc understands it. The model handles its own checkpointing; the client has no control. So it's not that. --- CPDN Visiting Scientist |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
The more I look at that task, the weirder it seems. We have: Sent 22 Jul 2024, 16:14:01 UTC Received 23 Jul 2024, 10:47:26 UTCso it was on the user's machine for about 18.5 hours. In that time, it ran for: Run time 1 hours 45 min 38 sec CPU time 1 hours 44 min 28 secNot a high proportion, but we don't know when the first attempt started. We have at least 882 restarts, in somewhere between 1.75 and 18.5 hours. That makes limits of between 8.4 and 0.8 attempts per minute. The only way I can think of pausing an app at that sort of frequency is to set BOTH a thermal limit of less than 100% CPU usage AND remove tasks from memory when suspended. That shouldn't be possible, should it? Don't worry too much about the checkpoint interval. All BOINC projects are like that - apps checkpoint when they're able to, and BOINC can't order "checkpoint NOW". The most BOINC can do is to say "don't do it too often", and apps should respect that. The other thing to check is whether CPDN apps report back to the BOINC client when they're checkpointed, which should reset that counter - I'll try and find that in the code. [* see edit] All my main machines have picked up tasks from this run, so I don't want to mess with them right now. But I do have one last testbed, currently idle, so I might see if I can make that thrash a generic task (not CPDN!) in and out of memory at that speed. [* edit] Looks like messages are handled at line 1535 of the same file, with checkpoints at line 1572 onwards. We can check if they're reported here with a debug log message. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
We have at least 882 restarts, in somewhere between 1.75 and 18.5 hours. That makes limits of between 8.4 and 0.8 attempts per minute. The only way I can think of pausing an app at that sort of frequency is to set BOTH a thermal limit of less than 100% CPU usage AND remove tasks from memory when suspended. That shouldn't be possible, should it?The task is staying in memory, otherwise we'd see the normal model startup messages about checking namelists etc after the suspend message. Might be suspension when non-BOINC usage is above xxx%. If it's a server the person might be using spare cycles for boinc work? The machine looks busy because of the delay on removing the finish_file. The bit I don't understand is why this task was flagged as success. |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
The bit I don't understand is why this task was flagged as success.I see unsurprisingly no credit has been granted for the task as no trickle ups have occurred. I do see the computer in question which has a higher floating point speed than my Ryzen9 (It is a 12 core threadripper) has quite a high failure rate, all the ones (13/36 including in progress tasks from current batch) I looked at had nothing in the stderr on the task pages. They had all produced at least one trickle. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Richard, here's an example of the output from a task that's repeatedly being Suspended by the client but not being held in memory whilst suspended. You get a group of 'startup' messages from the model each time. https://www.cpdn.org/cpdnboinc/result.php?resultid=22461413 There must be some semi-intelligent way of detecting this in the monitoring code, though I am unsure what the best approach would be to deal with it. One of my bug-bears with boinc is that it's not possible to send information from the task back up to the client to present to the user. Or send a task that checks what settings the client has where the scheduler can check if 'keep in memory whilst suspended' is on. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
Yes, I see that - and it confirms your comment from yesterday that removing tasks from memory would leave a much bigger footprint in stderr. Scrub that conjecture. I can confirm that the BOINC client is aware of whether, and when, CPDN tasks checkpoint. It shows up in the task 'properties' in BOINC Manager, and can be displayed in the Event Log. So the line about only allowing exit 100 requests per checkpoint applies. But there are two things we don't know about the "Suspended CPDN Monitor - Suspend request from BOINC..." line at this stage: what the time interval between requests is, and exactly what is being requested by BOINC, and why. That line isn't a regular feature of BOINC - it's coming from the bespoke CPDN wrapper. It might be helpful to add a time stamp to each occurrence, at least temporarily. If the wrapper receives any sort of reason code, that could be added too. I've woken up my remaining Celeron test machine, and it's running a single instance of a batch 1022 task - nothing else. I'm regarding that as a sacrificial test case, which we can poke and prod with conjectures as and when we think of them. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
I had the same thought. I have added a timestamp to the messages coming from the W@H monitor code in the next version. Adding a reason code is a good idea, I'll do that. Hadn't realised it was possible. Thanks. Although the monitor code 'logs' a checkpoint with the boinc client, that's not necessarily when the model itself does it. That is under the control of the model itself. --- CPDN Visiting Scientist |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
I'm not sure if BOINC provides a reason code in this case - I haven't dug that deeply into the code. But if it does, it would be nice to see it. [But there must be something to flag the difference between 'pause and hold', 'quit and terminate', 'quit and abort', and 'terminate with extreme prejudice'] It doesn't matter a jot if CPDN ignores the interval request set by BOINC - it has to be done at a moment appropriate for the science. But it does matter that the fact that it happens is reported back to BOINC (mainly via the timestamp for 'last checkpoint' changing). And it does, so we're good. |
Send message Joined: 5 Aug 04 Posts: 126 Credit: 24,426,020 RAC: 23,705 |
Fooling-around a little with various preferences, the BOINC log at least does show the difference between suspending due to "computer in use", "CPU is busy", "no recent user activity" and "Exclusive app XXX is running". Where's also the line with "Number of usable CPUs has changed from X to Y." that can happen if uses different CPU % between active and inactive. If any of this information would also be available to applications I've no idea. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Ok thanks. When I've finished with these new batches, I will look in the boinc code and see what can be added to the output. |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
|
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
I think we have our QED. My little low-power Celeron box has just produced its first trickle: 24/07/2024 16:07:13 | climateprediction.net | [cpu_sched] Starting task wah2_eas25_n1c6_201312_24_1022_012313909_0 using wah2_ri version 832 in slot 0 25/07/2024 17:35:54 | climateprediction.net | [checkpoint] result wah2_eas25_n1c6_201312_24_1022_012313909_0 checkpointed 25/07/2024 18:14:30 | climateprediction.net | Sending scheduler request: To send trickle-up message. 25/07/2024 18:14:47 | climateprediction.net | Started upload of wah2_eas25_n1c6_201312_24_1022_012313909_0_r238175385_1.zip 25/07/2024 18:16:36 | climateprediction.net | Finished upload of wah2_eas25_n1c6_201312_24_1022_012313909_0_r238175385_1.zip (99139925 bytes) 25/07/2024 18:27:20 | climateprediction.net | [checkpoint] result wah2_eas25_n1c6_201312_24_1022_012313909_0 checkpointedThat gives an idea of speed: over a day before the first trickle, and over 50 minutes between checkpoints. Then I set 50% CPU usage (BOINC's thermal throttling control), and stderr.txt now starts like this: modelGetExecutables: check control files, strTemp0 & 1 : D:\BOINCdata/projects/climateprediction.net/wah2_eas25_n1c6_201312_24_1022_012313909/jobs/xadae.namelists D:\BOINCdata/projects/climateprediction.net/wah2_eas25_n1c6_201312_24_1022_012313909/jobs/xacxf.namelists modelGetExecutables: unzipping control files : strInput & strTmp wah2_eas25_n1c6_201312_24_1022_012313909.zip wah2_eas25_n1c6_201312_24_1022_012313909/jobs gstrDump[0] = generic_phase1_spinup_eas25_global_aabaka_f gstrDump[1] = generic_phase1_spinup_eas25_regional_aabaka_f global model: command string: "D:\BOINCdata/projects/climateprediction.net/wah2am3m2_um_8.32_windows_intelx86.exe" wah2_eas25_n1c6_201312_24_1022_012313909 generic_phase1_spinup_eas25_global_aabaka_f ic19610227_14_N96 NATclim_ancil_168months_CMIP6-ACCESS-CM2_SST_2009-01-01_2022-12-30_v2404b NATclim_ancil_168months_CMIP6-ACCESS-CM2_SIC_2009-01-01_2022-12-30_v2404b so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5 regional model: command string: "D:\BOINCdata/projects/climateprediction.net/wah2rm3m2t_um_8.32_windows_intelx86.exe" wah2_eas25_n1c6_201312_24_1022_012313909 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. executeModelProcess: MonID=4072, GCM_PID=8144, RCM_PID=6952 Queuing intermediate upload for CPDN/BOINC: cpdnout1.zip Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... ...and so on. I counted 376 requests in 12 minutes before I got it under control again. It suggests that is one cause of the exceeding long lists in stderr, which simply get in the way and waste space - to the exclusion of more useful reports. If we can find a reason code to match, I'd suggest suppressing this particular form of report. And to advise CPDN users not to use throttling if at all possible - reduce the number of cores in use instead. That doesn't solve the 'finish file' problem, of course, but it may help us see more clearly what else is going on. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
I had a look at what the code for OpenIFS does. We can find 'finish_file' errors here too: https://www.cpdn.org/cpdnboinc/result.php?resultid=22438218 again, it's an older boinc client 7.14. I think while I'm still hunting bugs I would like more information from the models not less. OIFS puts out a 'resuming' message which would be useful (for me anyway). Plus timestamps on the messages. I take your point it can make for a lot of noise, but that's the problem when debugging these things remotely. I have often toyed with the idea of having a 'developer control' XML file that can be placed in the project/task dir that allows me (or anyone) to control things such as debug message levels. --- CPDN Visiting Scientist |
Send message Joined: 15 May 09 Posts: 4537 Credit: 19,001,532 RAC: 21,726 |
And another All seem to be with boinc7.14.2 This one on the same machine as one of the two from #1020. Would there be any mileage in looking at the release notes of boinc in two or three subsequent versions to see if anything was fixed that might have caused the problem? Just a thought about another angle of attack. |
©2024 cpdn.org