Thread 'oifs_43r3_ps v1.05 slot dir cleanup'

xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 67822 - Posted: 17 Jan 2023, 21:19:41 UTC
Last modified: 17 Jan 2023, 21:35:42 UTC

Glenn,
you mentioned in another thread that the current OpenIFS application leaves some superfluous files in the slot directory if the task is restarted several times. (And you mentioned that this will be addressed in an upcoming application update.) It was said that it is possible to remove older files manually; just the newest one in the respective set of files must be left intact.

Since this info is buried in the various subtopics of the more general threads, I opened this new one.

I now compared the contents of the slot directory of a task which was started only once in its entire lifetime, with that of a task which was started two times.

The latter task had the following files in pairs:
BLS20110501000000_000002230000.1_1 (41 MBytes)
BLS20110501000000_000003230000.1_1 (41 MBytes)
LAW20110501000000_000002230000.1_1 (1.6 MBytes)
LAW20110501000000_000003230000.1_1 (1.6 MBytes)
srf00030000.0001 (768 MBytes)
srf00040000.0001 (768 MBytes)
while the former task had just one BLS*, LAW*, and srf* file, respectively.

The newest of the BLS*, LAW*, and srf* files would be replaced with a differently named file while the task is running.

So are these the files of which all but the newest could be deleted if disk space is tight?


________

Oh, and one more thing:

The task which was started two times happened to have the following files in its slot directory:
-rw-r--r-- 1 boinc boinc        19 Jan 17 20:31 boinc_ufs_upload_file_0.zip
-rw-r--r-- 1 boinc boinc         0 Jan 17 20:39 boinc_ufs_upload_file_1.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 20:49 boinc_ufs_upload_file_2.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 20:55 boinc_ufs_upload_file_3.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 21:02 boinc_ufs_upload_file_4.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 21:12 boinc_ufs_upload_file_5.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 21:14 boinc_ufs_upload_file_6.zip
Note the 0 size of *_1.zip.
And its stderr.txt contains a few occurrences of
handle_upload_file_status: can't parse boinc_ufs_upload_file_1.zip
Does this mean that the *_1.zip file wasn't properly created, and that the task will fail with a computation error after it completes the rest of the computation?
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,726,716
RAC: 12,672
Message 67998 - Posted: 23 Jan 2023, 16:36:05 UTC - in response to Message 67822.  

Glenn,
you mentioned in another thread that the current OpenIFS application leaves some superfluous files in the slot directory if the task is restarted several times. (And you mentioned that this will be addressed in an upcoming application update.)

Yes, that's correct. I've just passed the updated model executable to CPDN, who are about to run a test batch. It will stop these restart files accumulating if the model is restarted, but at the expense of a higher risk of failures, as the model will no longer have backup restarts/checkpoints it can use.

It was said that it is possible to remove older files manually; just the newest one in the respective set of files must be left intact.

Also correct.

I now compared the contents of the slot directory of a task which was started only once in its entire lifetime, with that of a task which was started two times. The latter task had the following files in pairs:
BLS20110501000000_000002230000.1_1 (41 MBytes)
BLS20110501000000_000003230000.1_1 (41 MBytes)
LAW20110501000000_000002230000.1_1 (1.6 MBytes)
LAW20110501000000_000003230000.1_1 (1.6 MBytes)
srf00030000.0001 (768 MBytes)
srf00040000.0001 (768 MBytes)
while the former task had just one BLS*, LAW*, and srf* file, respectively.
So are these the files of which all but the newest could be deleted if disk space is tight?
Exactly. So you could safely delete srf00030000.0001 (the other BLS & LAW files are quite small). The disk bound set for the task allows for 3 sets of 'srf+BLS+LAW' files to be created during the run (i.e. 2 restarts). However, because some people did not have 'leave non-GPU tasks in memory' set and had CPU usage limited to less than 100%, the model was restarting frequently and blowing past this limit.

Shouldn't be a problem after these batches. Though 'leave non-GPU in memory' is still strongly recommended to ensure smooth running. HTH.
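
A minimal dry-run sketch of the rule above (keep only the newest file of each BLS*, LAW*, and srf* set), for anyone who wants to see what would be affected before deleting anything. The function name list_superseded_restarts is made up for illustration, and the slots path in the usage note is an assumption about a typical Linux install. "Newest" is judged by modification time.

```shell
#!/bin/bash
# Dry run: print restart files that are older than the newest file of
# their set (BLS*, LAW*, srf*) in every slot. Nothing is deleted.
# list_superseded_restarts is a hypothetical helper name.

list_superseded_restarts() {
    slots_dir=$1
    for d in "$slots_dir"/*/; do
        for p in BLS LAW srf; do
            newest=
            # First pass: find the most recently modified file of this set.
            for f in "$d$p"*; do
                [ -f "$f" ] || continue
                if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
                    newest=$f
                fi
            done
            # Second pass: print every file of the set except the newest.
            for f in "$d$p"*; do
                [ -f "$f" ] || continue
                if [ "$f" != "$newest" ]; then
                    printf '%s\n' "$f"
                fi
            done
        done
    done
}
```

Usage might look like `list_superseded_restarts /var/lib/boinc/slots` (adjust the path to your BOINC data directory). The output can be reviewed by eye before any manual removal.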
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,726,716
RAC: 12,672
Message 68001 - Posted: 23 Jan 2023, 16:49:05 UTC - in response to Message 67822.  

Oh, and one more thing. The task which was started two times happened to have the following files in its slot directory:
-rw-r--r-- 1 boinc boinc        19 Jan 17 20:31 boinc_ufs_upload_file_0.zip
-rw-r--r-- 1 boinc boinc         0 Jan 17 20:39 boinc_ufs_upload_file_1.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 20:49 boinc_ufs_upload_file_2.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 20:55 boinc_ufs_upload_file_3.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 21:02 boinc_ufs_upload_file_4.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 21:12 boinc_ufs_upload_file_5.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 21:14 boinc_ufs_upload_file_6.zip
Note the 0 size of *_1.zip.
And its stderr.txt contains a few occurrences of
handle_upload_file_status: can't parse boinc_ufs_upload_file_1.zip
Does this mean that the *_1.zip file wasn't properly created, and that the task will fail with a computation error after it completes the rest of the computation?

That's interesting. I've not seen that before. Can you please point me to the task result webpage for this (if you remember)? That's not a string we check for in the failure analysis; it would be good to catch it to find out how common it is.

I am not very familiar with the BOINC side of things. I *think* the ufs files are created by the BOINC client itself, not any of our code. I'll ask CPDN as this is more their domain of expertise.
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,726,716
RAC: 12,672
Message 68004 - Posted: 23 Jan 2023, 18:34:39 UTC - in response to Message 68001.  

From what I can see, the boinc_ufs_upload_file*.zip files are related to the status of uploads managed by the BOINC client. I have 21 uploads on one of my machines at various percentages of 'failed' transfers, and the number exactly matches the number of boinc_ufs_upload*.zip files. So it's something the BOINC client does and not under any control by the CPDN task. Quite why you have a zero-size file I don't know; it's a BOINC issue.
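
To find out how common the zero-size status file is, something like the following sketch could scan all slots for empty boinc_ufs_upload_file_*.zip files and count the matching message in each stderr.txt. The function name check_upload_status_files is invented here, and the slots path in the usage note is an assumption. The file pattern and the message text are the ones quoted in this thread.

```shell
#!/bin/bash
# Report empty boinc_ufs_upload_file_*.zip files and count the
# "can't parse" messages in each slot's stderr.txt.
# check_upload_status_files is a hypothetical helper name.

check_upload_status_files() {
    slots_dir=$1
    for d in "$slots_dir"/*/; do
        for f in "$d"boinc_ufs_upload_file_*.zip; do
            [ -f "$f" ] || continue
            # -s is true for a non-empty file, so the else branch is the empty case.
            if [ -s "$f" ]; then
                :
            else
                printf 'empty: %s\n' "$f"
            fi
        done
        if [ -f "${d}stderr.txt" ]; then
            # grep -c exits non-zero when the count is 0, hence "|| true".
            n=$(grep -cF "handle_upload_file_status: can't parse" "${d}stderr.txt" || true)
            if [ "$n" -gt 0 ]; then
                printf 'parse errors: %s in %s\n' "$n" "${d}stderr.txt"
            fi
        fi
    done
}
```

Usage might look like `check_upload_status_files /var/lib/boinc/slots`. The sketch only reports; whether an empty status file leads to a task failure is not established in this thread.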
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 68010 - Posted: 23 Jan 2023, 21:04:42 UTC

Thank you for looking at this.
Unfortunately I did not make the connection between slot number and task identity at the time. I'll see if I can find another one of these.
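
One possible way to make that connection next time is sketched below. It rests on an assumption that should be verified on your own host: that each slot directory holds an init_data.xml written by the BOINC client, containing a <result_name> element with the task name. The function name slot_task_names is invented for illustration.

```shell
#!/bin/bash
# Print "slot<TAB>task name" for every slot that has an init_data.xml.
# Assumption: init_data.xml contains a <result_name>...</result_name> element.
# slot_task_names is a hypothetical helper name.

slot_task_names() {
    slots_dir=$1
    for d in "$slots_dir"/*/; do
        f=${d}init_data.xml
        [ -f "$f" ] || continue
        # Extract the text between the first pair of result_name tags.
        name=$(sed -n 's:.*<result_name>\(.*\)</result_name>.*:\1:p' "$f" | head -n 1)
        printf '%s\t%s\n' "$(basename "$d")" "${name:-unknown}"
    done
}
```

Usage might look like `slot_task_names /var/lib/boinc/slots`. If the element is missing on your client version, the sketch prints "unknown" for that slot instead of a task name.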

I thought of setting up a little periodic cleanup but haven't put it together yet, because it's better to avoid suspension in the first place in order to lower the risk of task failures, more so than because of the disk space requirement.
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 68020 - Posted: 24 Jan 2023, 19:50:54 UTC

I put a trivial cleanup script together after all.
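#!/bin/bash
# Delete all but the newest BLS*, LAW*, and srf* file in each BOINC slot.

# Let unmatched globs expand to nothing instead of to themselves.
shopt -s nullglob

echo "=== before ==="
df -h /var/lib/
echo

for d in /var/lib/boinc/slots/*/
do
	for p in BLS LAW srf
	do
		files=("$d$p"*)
		(( ${#files[@]} > 1 )) || continue
		# Find the most recently modified file of this set.
		newest=${files[0]}
		for f in "${files[@]}"
		do
			[[ $f -nt $newest ]] && newest=$f
		done
		# Remove every other file of the set.
		for f in "${files[@]}"
		do
			[[ $f == "$newest" ]] || rm -f -- "$f"
		done
	done
done

echo
echo "=== after ==="
df -h /var/lib/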
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1052
Credit: 16,726,716
RAC: 12,672
Message 68221 - Posted: 7 Feb 2023, 20:17:34 UTC - in response to Message 68020.  

The issue of accumulating srf* LAW* BLS* files has been fixed for future batches. You won't need the script after these current ones.

However, there is still an issue of tmp folders being left behind in the projects/climateprediction.net directory. You see folders with names similar to this: oifs_43r3_ps_12200299. The task uses this to temporarily store the model output files in order to zip them up and upload them. Unfortunately, if the task fails with the dreaded 'double corruption' error it doesn't get the chance to tidy this up. Just check for any old folders periodically. I don't get that many.
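
The periodic check could be sketched roughly as follows. The function name list_stale_tmp_dirs, the 7-day default, and the project path in the usage note are assumptions. The folder name pattern follows the example given above. The sketch only lists candidates, since a folder may still belong to a running task.

```shell
#!/bin/bash
# List oifs_43r3_ps_* folders directly inside the project directory whose
# modification time is more than a given number of whole days in the past.
# list_stale_tmp_dirs and the default of 7 days are illustrative choices.

list_stale_tmp_dirs() {
    project_dir=$1
    min_age_days=${2:-7}
    # -maxdepth 1 keeps find from descending into the folders themselves.
    find "$project_dir" -maxdepth 1 -type d -name 'oifs_43r3_ps_*' \
        -mtime +"$min_age_days" -print
}
```

Usage might look like `list_stale_tmp_dirs /var/lib/boinc/projects/climateprediction.net 7`. Before removing anything it lists, check that no task currently in progress matches the folder.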
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68222 - Posted: 7 Feb 2023, 23:00:04 UTC - in response to Message 68221.  

However, there is still an issue of tmp folders being left behind in the projects/climateprediction.net directory. You see folders with names similar to this: oifs_43r3_ps_12200299. The task uses this to temporarily store the model output files in order to zip them up and upload them. Unfortunately, if the task fails with the dreaded 'double corruption' error it doesn't get the chance to tidy this up. Just check for any old folders periodically. I don't get that many.


I have never gotten any.

OpenIFS 43r3 Perturbed Surface 1.01 x86_64-pc-linux-gnu
Number of tasks completed 	35
Max tasks per day 	39
Number of tasks today 	0
Consecutive valid tasks 	35
Average processing rate 	27.78 GFLOPS
Average turnaround time 	1.65 days

OpenIFS 43r3 Baroclinic Lifecycle 1.07 x86_64-pc-linux-gnu
Number of tasks completed 	13
Max tasks per day 	17
Number of tasks today 	0
Consecutive valid tasks 	13
Average processing rate 	7.51 GFLOPS
Average turnaround time 	0.61 days

OpenIFS 43r3 Perturbed Surface 1.05 x86_64-pc-linux-gnu
Number of tasks completed 	218
Max tasks per day 	222
Number of tasks today 	1
Consecutive valid tasks 	218
Average processing rate 	28.27 GFLOPS
Average turnaround time 	3.45 days



©2024 cpdn.org