find similar directories under the current working directory

Here's a great "one-liner" for finding similar directories (a simpler variant below is split across three lines for ease of review).

Similar means:

  • same number of files, with the exact same names
  • the files have the same size (in blocks) on disk.

This identifies "really close" directories as well as exact duplicates. Once you have a list of close directories, you can use 'diff -r dir1 dir2' to see whether two close directories are exact duplicates.
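For instance, on a throwaway tree (directory and file names here are made up for the demo), diff -r recurses into both directories, and no output plus exit status 0 means they are byte-for-byte identical:

```shell
# Build a throwaway example: two identical directories.
tmp=$(mktemp -d)
mkdir "$tmp/a" "$tmp/b"
echo hello > "$tmp/a/f1"
echo hello > "$tmp/b/f1"

# diff -r recurses into both trees; no output and exit status 0
# means the directories are exact duplicates.
diff -r "$tmp/a" "$tmp/b" && echo "exact duplicates"

rm -rf "$tmp"
```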

for d in `find . -type d -print`; do echo `cd $d;ls -1ARs .|cksum` $d;done| awk '{c[$1"_"$2]++; s[$1"_"$2]=s[$1"_"$2] " " $3} END {for (i in c) {if (c[i]>1) print s[i]}}'

Here's an example of the input that awk is getting:

[root@host test]# for d in `find . -type d -print`; do echo `cd $d;ls -1ARs .|cksum` $d;done
857993900 101 .
1678407360 20 ./b
688927247 28 ./c
1678407360 20 ./a

Each line of input to awk is: checksum, size (bytes), directory name. The "size" is the number of bytes in the output of the "ls -1ARs ." command; it has nothing to do with any size on disk.
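You can see those two fields directly: cksum always prints a CRC followed by the byte count of whatever it read, so in the pipeline above the second field is just the length of the directory listing.

```shell
# cksum prints "CRC length"; the length is the byte count of its
# input -- here, the six bytes of "hello\n".
printf 'hello\n' | cksum
# second field of the output is 6
```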

The awk program says:

For each line of input, increment the counter in the c array at an index built from $1, an underscore, and $2; for the first example line that index will be "857993900_101". Then, append a space and $3 onto the end of s at the same index.

When all lines have been read in and processed, the END clause is run.

The for loop iterates over the c array, setting "i" to each index in turn. We only print "s[i]" when "c[i]" is greater than one, meaning two or more directories shared the same "checksum_size" key.
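To see it in action, here is that awk program run on the example input above (with the underscore written as an explicit string, so the index really is checksum_size). The directories ./b and ./a share the key 1678407360_20, so they come out together; note the leading space left by the append.

```shell
# Group the example lines by "checksum_size" and print any group
# that occurs more than once.
printf '%s\n' \
  '857993900 101 .' \
  '1678407360 20 ./b' \
  '688927247 28 ./c' \
  '1678407360 20 ./a' |
awk '{c[$1"_"$2]++; s[$1"_"$2]=s[$1"_"$2] " " $3}
     END {for (i in c) {if (c[i]>1) print s[i]}}'
# prints: " ./b ./a"
```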

A simpler version, not as optimized as the above:

for d in `find . -type d -print`; do \
echo `cd $d;ls -1ARs .|cksum|sed 's/ /_/'` $d;done| \
awk '{c[$1]++; s[$1]=s[$1] " " $2} END {for (i in c) {if (c[i]>1) print s[i]}}'
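A quick end-to-end check of this version on a throwaway tree (the directory and file names are made up for the demo): a and b hold identical files, c differs, so the pipeline reports a and b together and stays silent about c.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/a" "$tmp/b" "$tmp/c"
echo same  > "$tmp/a/f"
echo same  > "$tmp/b/f"
echo other > "$tmp/c/g"

# Run the pipeline from the tree's root; it prints ./a and ./b on one
# line (in filesystem order) and never mentions ./c .
(cd "$tmp"
for d in `find . -type d -print`; do \
echo `cd $d;ls -1ARs .|cksum|sed 's/ /_/'` $d;done| \
awk '{c[$1]++; s[$1]=s[$1] " " $2} END {for (i in c) {if (c[i]>1) print s[i]}}')

rm -rf "$tmp"
```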

And here's a much more costly, but down-to-the-byte accurate, version that checksums the actual file contents:

for d in `find . -type d -print`; do echo `cd $d;cksum $(find . -type f -print)|cksum|sed 's/ /_/'` $d;done|awk '{c[$1]++; s[$1]=s[$1] " " $2} END {for (i in c) {if (c[i]>1) print s[i]}}'
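What the extra cost buys you: two directories whose listings (names and block counts) are identical, but whose contents differ by a single byte, are told apart only by this content-based version. A sketch, with made-up file names:

```shell
tmp=$(mktemp -d)
mkdir "$tmp/a" "$tmp/b"
printf 'aaaa\n' > "$tmp/a/f"   # same name, same size...
printf 'aaab\n' > "$tmp/b/f"   # ...one byte different

# "ls -1ARs" cannot tell these apart, but a cksum of the contents can:
# the per-file CRCs differ, so the cksum-of-cksums differs too.
k1=$(cd "$tmp/a"; cksum $(find . -type f -print)|cksum)
k2=$(cd "$tmp/b"; cksum $(find . -type f -print)|cksum)
[ "$k1" != "$k2" ] && echo "contents differ"

rm -rf "$tmp"
```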

For further reading, here are more details on how this works:

https://www.quora.com/Linux/Which-Linux-or-Windows-utility-application-helps-to-find-duplicated-folders/answer/Paul-Reiber

Topic revision: r2 - 2012.11.13 - PaulReiber
 
Copyright © is by author. All material on this collaboration platform is the property of its contributing author.