Message boards : Number crunching : oifs_43r3_ps v1.05 slot dir cleanup
Author | Message |
---|---|
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,322,658 RAC: 1,085 |
Glenn, you mentioned in another thread that the current OpenIFS application leaves some superfluous files in the slot directory if the task is restarted several times. (And you mentioned that this will be addressed in an upcoming application update.) It was said that it is possible to remove older files manually; just the newest one in the respective set of files must be left intact. Since this info is buried in the various subtopics of the more general threads, I opened this new one.

I now compared the contents of the slot directory of a task which was started only once in its entire lifetime with that of a task which was started two times. The latter task had the following files in pairs:

    BLS20110501000000_000002230000.1_1 (41 MBytes)
    BLS20110501000000_000003230000.1_1 (41 MBytes)
    LAW20110501000000_000002230000.1_1 (1.6 MBytes)
    LAW20110501000000_000003230000.1_1 (1.6 MBytes)
    srf00030000.0001 (768 MBytes)
    srf00040000.0001 (768 MBytes)

while the former task had just one BLS*, LAW*, and srf* file, respectively. The newest of the BLS*, LAW*, and srf* files would be replaced with a differently named file while the task is running. So are these the files of which all but the newest could be deleted if disk space is tight?

________

Oh, and one more thing: The task which was started two times happened to have the following files in its slot directory:

    -rw-r--r-- 1 boinc boinc 19 Jan 17 20:31 boinc_ufs_upload_file_0.zip
    -rw-r--r-- 1 boinc boinc  0 Jan 17 20:39 boinc_ufs_upload_file_1.zip
    -rw-r--r-- 1 boinc boinc 19 Jan 17 20:49 boinc_ufs_upload_file_2.zip
    -rw-r--r-- 1 boinc boinc 19 Jan 17 20:55 boinc_ufs_upload_file_3.zip
    -rw-r--r-- 1 boinc boinc 19 Jan 17 21:02 boinc_ufs_upload_file_4.zip
    -rw-r--r-- 1 boinc boinc 19 Jan 17 21:12 boinc_ufs_upload_file_5.zip
    -rw-r--r-- 1 boinc boinc 19 Jan 17 21:14 boinc_ufs_upload_file_6.zip

Note the 0 size of *_1.zip. And its stderr.txt contains a few occurrences of

    handle_upload_file_status: can't parse boinc_ufs_upload_file_1.zip

Does this mean that the *_1.zip file wasn't properly created, and that the task will fail with a computation error after it has completed the rest of the computation? |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,726,716 RAC: 12,672 |
"Glenn, [...]"

Yes, that's correct. I've just passed the updated model executable to CPDN, who are about to run a test batch. It will stop these restart files accumulating if the model is restarted, but at the expense of a higher risk of failures, as the model will no longer have backup restarts/checkpoints it can use.

"It was said that it is possible to remove older files manually; just the newest one in the respective set of files must be left intact."

Also correct.

"I now compared the contents of the slot directory of a task which was started only once in its entire lifetime, with that of a task which was started two times. The latter task had following files in pairs:"

Exactly. So you could safely delete srf00030000.0001 (the other BLS & LAW files are quite small).

The disk bound set for the task allows for 3 sets of 'srf+BLS+LAW' files to be created during the run (i.e. 2 restarts). However, because some people did not have 'leave non-GPU tasks in memory' set, and had a CPU percentage below 100%, the model was restarting frequently and blowing past this limit. It shouldn't be a problem after these batches, though 'leave non-GPU tasks in memory' is still strongly recommended to ensure smooth running.

HTH. |
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,726,716 RAC: 12,672 |
"Oh, and one more thing. The task which was started two times happened to have the following files in its slot directory:"

That's interesting. I've not seen that before. Can you please point me to the task result webpage for this (if you remember)? That's not a string we check for in the failure analysis, so it would be good to catch it and find out how common it is.

I am not very familiar with the BOINC side of things. I *think* the ufs files are created by the BOINC client itself, not by any of our code. I'll ask CPDN, as this is more their domain of expertise. |
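For anyone who wants to check whether their own tasks show the same message, here is a minimal sketch that searches each slot's stderr.txt for it. The function name `find_parse_errors` is made up for illustration, and the default path `/var/lib/boinc/slots` is an assumption based on the Linux paths used elsewhere in this thread; adjust it to your installation.

```shell
#!/bin/bash
# Sketch: list slot stderr.txt files that contain the upload-status parse message.
# find_parse_errors is a hypothetical helper name, not part of BOINC or CPDN.

find_parse_errors() {
    local slots_dir="$1"
    # -l prints only the names of matching files; -F treats the pattern as a
    # fixed string rather than a regular expression.
    grep -lF "handle_upload_file_status: can't parse" "$slots_dir"/*/stderr.txt 2>/dev/null || true
}

# Default path is an assumption (Debian/Ubuntu-style BOINC data directory).
find_parse_errors "${1:-/var/lib/boinc/slots}"
```

This only reports where the message occurs; it does not say anything about whether the task will eventually fail.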
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,726,716 RAC: 12,672 |
From what I can see, the boinc_ufs_upload_file*.zip files are related to the status of uploads managed by the BOINC client. I have 21 uploads on one of my machines at various percentages of 'failed' transfer, and that number exactly matches the number of boinc_ufs_upload*.zip files. So it's something the BOINC client does and is not under any control of the CPDN task. Quite why you have a zero-size file I don't know; it's a BOINC issue. |
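To see how common zero-size upload status files are, something like the following could be run. It is only a sketch: `find_empty_upload_files` is a hypothetical name, and the default `/var/lib/boinc/slots` path is assumed from the script paths in this thread.

```shell
#!/bin/bash
# Sketch: report zero-byte boinc_ufs_upload_file_*.zip files in the slot directories.
# find_empty_upload_files is a hypothetical helper name.

find_empty_upload_files() {
    local slots_dir="$1"
    [ -d "$slots_dir" ] || return 0
    # -size 0 matches only zero-byte files, because find rounds sizes up to
    # whole 512-byte blocks (a 19-byte file counts as 1 block).
    find "$slots_dir" -maxdepth 2 -type f -name 'boinc_ufs_upload_file_*.zip' -size 0
}

# Default path is an assumption; pass your own slots directory as the first argument.
find_empty_upload_files "${1:-/var/lib/boinc/slots}"
```

The slot number in each reported path can then be matched to the task name in the BOINC manager before the task finishes.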
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,322,658 RAC: 1,085 |
Thank you for looking at this. Unfortunately I did not make the connection between slot number and task identity at the time. I'll see if I can find another one of these. I thought of setting up a little periodic cleanup but haven't put it together yet, because it's better to avoid suspension in the first place in order to lower the risk of task failures, more so than because of the disk space requirement. |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,322,658 RAC: 1,085 |
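I put a trivial cleanup script together after all.

    #!/bin/bash
    # Delete all but the newest BLS*, LAW*, and srf* file in each BOINC slot directory.
    echo "=== before ==="
    df -h /var/lib/
    echo
    for d in /var/lib/boinc/slots/*/
    do
        for p in BLS LAW srf
        do
            # Skip this prefix if the slot has no matching files.
            ls "${d}${p}"* >/dev/null 2>&1 || continue
            # List matches newest first; these file names contain no whitespace.
            f=($(ls -t "${d}${p}"*))
            # Keep f[0] (the newest) and remove the rest.
            for ((i=1; i<${#f[@]}; i++))
            do
                rm -f "${f[i]}"
            done
        done
    done
    echo
    echo "=== after ==="
    df -h /var/lib/

|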
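Before letting a script like the one above delete anything, it may be worth previewing what it would remove. The following is a dry-run sketch under the same assumptions (slot directories under `/var/lib/boinc/slots`, file names without whitespace); `list_stale_restart_files` is a name invented here for illustration.

```shell
#!/bin/bash
# Sketch: print, but do not delete, every BLS*/LAW*/srf* file except the newest
# one of each kind in a slot directory.
# list_stale_restart_files is a hypothetical helper name.

list_stale_restart_files() {
    local slot_dir="$1" p
    for p in BLS LAW srf
    do
        # ls -t sorts newest first; tail -n +2 drops that first (newest) entry.
        ls -t "$slot_dir"/"$p"* 2>/dev/null | tail -n +2
    done
}

# Default path is an assumption taken from the script paths in this thread.
for d in "${1:-/var/lib/boinc/slots}"/*/
do
    list_stale_restart_files "${d%/}"
done
```

If the printed list looks right, the same files are the ones the cleanup would remove.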
Send message Joined: 29 Oct 17 Posts: 1052 Credit: 16,726,716 RAC: 12,672 |
The issue of accumulating srf* LAW* BLS* files has been fixed for future batches. You won't need the script after these current ones.

However, there is still an issue of tmp folders being left behind in the projects/climateprediction.net directory. You will see folders with names similar to this: oifs_43r3_ps_12200299. The task uses this folder to temporarily store the model output files in order to zip them up and upload them. Unfortunately, if the task fails with the dreaded 'double corruption' error, it doesn't get the chance to tidy this up. Just check for any old folders periodically. I don't get that many. |
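The periodic check for leftover folders could look roughly like this. It is a sketch, not a tested tool: `list_old_tmp_folders` is a hypothetical name, the default project path assumes a `/var/lib/boinc` data directory, and the two-day age threshold is an arbitrary choice. It only lists candidates, since a folder may belong to a task that is still running.

```shell
#!/bin/bash
# Sketch: list oifs_43r3_ps_* folders in the project directory that have not
# been modified for a while. Nothing is deleted.
# list_old_tmp_folders is a hypothetical helper name.

list_old_tmp_folders() {
    local project_dir="$1" min_age_days="${2:-2}"
    [ -d "$project_dir" ] || return 0
    # -mtime +N matches entries last modified more than N days ago
    # (find counts whole 24-hour periods).
    find "$project_dir" -maxdepth 1 -type d -name 'oifs_43r3_ps_*' -mtime +"$min_age_days"
}

# Default path is an assumption; pass your own project directory as the first argument.
list_old_tmp_folders "${1:-/var/lib/boinc/projects/climateprediction.net}"
```

Check that no running task matches a listed folder name before removing it by hand.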
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"However, is there still an issue of tmp folders being left behind in the projects/climateprediction.net directory. You see folders with names similar to this: oifs_43r3_ps_12200299. The task uses this to temporarily store the model output files in order to zip them up and upload them. Unfortunately, if the task fails with the dreaded 'double corruption' error it doesn't get the chance to tidy this up. Just check for any old folders periodically. I don't get that many."

I have never gotten any.

OpenIFS 43r3 Perturbed Surface 1.01 x86_64-pc-linux-gnu
    Number of tasks completed: 35
    Max tasks per day: 39
    Number of tasks today: 0
    Consecutive valid tasks: 35
    Average processing rate: 27.78 GFLOPS
    Average turnaround time: 1.65 days

OpenIFS 43r3 Baroclinic Lifecycle 1.07 x86_64-pc-linux-gnu
    Number of tasks completed: 13
    Max tasks per day: 17
    Number of tasks today: 0
    Consecutive valid tasks: 13
    Average processing rate: 7.51 GFLOPS
    Average turnaround time: 0.61 days

OpenIFS 43r3 Perturbed Surface 1.05 x86_64-pc-linux-gnu
    Number of tasks completed: 218
    Max tasks per day: 222
    Number of tasks today: 1
    Consecutive valid tasks: 218
    Average processing rate: 28.27 GFLOPS
    Average turnaround time: 3.45 days

|
©2024 cpdn.org