Shell – How to prepare files for rsync on a case insensitive filesystem

case sensitivityfilenamesrsyncshell-scriptzsh

I am transferring a large number of files on a HFS+ filesystem.

The files are currently on ext2 partitions.

I have conflicts due to case insensitivity of the target partition (HFS+).

I want to identify the files that have duplicates filenames once they are in lower case, and delete them if they are actually duplicates.

I also found that I will have duplicate folder names if I convert everyhing to lower case. Basically these hard drives contain years of unsorted data, and I happen to have this problem with folder names too.

Does this seem reasonable:

find . -type f | while read f; do echo $f:l; done | sort | uniq -d

$f:l is ZSH for convert to lower case.

Now I want to keep only one instance of each file that have duplicates.
How to do this efficiently ?

I do not want to find files with duplicate content, unless they have the same lower case filename. I will deal with duplicates later.

Best Answer

The second step in your pipeline is slightly broken (it mangles backslashes and leading and trailing whitespace) and is a complicated way of doing this. Use tr to convert to lowercase. You shouldn't limit the search to files: directories can collide too.

find . | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort | LC_ALL=C uniq -d

Note that this only works if file names don't contain newlines. Under Linux, switch to null bytes as the separator to cope with newlines.

find . -print0 | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -z | LC_ALL=C uniq -dz

This prints the lowercase versions of file names, which isn't really conducive to doing something about the files.

If you're using zsh, forget about find: zsh has everything you need built in.

setopt extended_glob
for x in **/*; do
  conflicts=($x:h/(#i)$x:t)
  if (($#conflicts > 1)); then
    ## Are all the files identical regular files?
    h=()
    for c in $conflicts; do 
      if [[ -f $c ]]; then
        h+=(${$(md5sum <$c)%% *})
      else
        h=(not regular)
        break
      fi
    done
    if (( ${#${(@u)h}} == 1 )); then
      # Identical regular files, keep only one
      rm -- ${conflicts[1,-2]}
    else
      echo >&2 "Conflicting files:"
      printf >&2 '    %s\n' $conflicts
    fi
  fi
done

Related Solutions

How do case-insensitive filesystems display both upper and lower case file names

A case-insensitive filesystem just means that whenever the filesystem has to ask "does A refer to the same file/directory as B?" it compares the names of files/directories ignoring differences in upper/lowercase (exactly what upper/lowercase differences count depends on the filesystem—it's non-obvious once you get beyond ASCII). A case-sensitive filesystem does not ignore those differences.

A case-preserving filesystem stores file names as given. A non-case-preserving filesystem does not; it'll typically convert all letters to uppercase before storing them (theoretically, it could use lowercase, or RaNsOm NoTe case, or whatever, but AFAIK all real-world ones used uppercase).

You can put those two attributes together in any combination. I'm not sure if you can find non-case-preserving case-sensitive filesystems, but you could certainly create one. All the other combinations exist or existed in real systems, though.

So a case-preserving, case-insensitive filesystem (the most common type of case-insensitive filesystem nowadays) will store and return file names in whatever capitalization you created them or last renamed them, but when comparing two file names (to check if one exists, to open one, to delete one, etc.) it'll ignore case differences.

When you use a case-insensitive filesystem on a Unix box, various utilities will do weird things because Unix traditionally uses case-sensitive filesystems—so they're not expecting Document1 and document1 to be the same file.

In the pwd case, what you're seeing is that it by default just outputs the path you actually used to get to the directory. So if you got there via cd DirName, it'll use DirName in the output. If you got there via DiRnAmE, you'll see DiRnAmE in the output. Bash does this by keeping track of how you got to your current directory in the $PWD environment variable. Mainly this is for symlinks (if you cd into a symlink, you'll see the symlink in your pwd, even though it's actually not part of the path to your current directory). But it also gives the somewhat weird behavior you observe on case-insensitive filesystems. I suspect that pwd -P will give you the directory name using the case stored on disk, but haven't tested.

Shell – How to find duplicate lines in many large files

Since all input files are already sorted, we may bypass the actual sorting step and just use sort -m for merging the files together.

On some Unix systems (to my knowledge only Linux), it may be enough to do

sort -m *.words | uniq -d >dupes.txt

to get the duplicated lines written to the file dupes.txt.

To find what files these lines came from, you may then do

grep -Fx -f dupes.txt *.words

This will instruct grep to treat the lines in dupes.txt (-f dupes.txt) as fixed string patterns (-F). grep will also require that the whole line matches perfectly from start to finish (-x). It will print the file name and the line to the terminal.

Non-Linux Unices (or even more files)

On some Unix systems, 30000 file names will expand to a string that is too long to pass to a single utility (meaning sort -m *.words will fail with Argument list too long, which it does on my OpenBSD system). Even Linux will complain about this if the number of files are much larger.

Finding the dupes

This means that in the general case (this will also work with many more than just 30000 files), one has to "chunk" the sorting:

rm -f tmpfile
find . -type f -name '*.words' -print0 |
xargs -0 sh -c '
    if [ -f tmpfile ]; then
        sort -o tmpfile -m tmpfile "$@"
    else
        sort -o tmpfile -m "$@"
    fi' sh

Alternatively, creating tmpfile without xargs:

rm -f tmpfile
find . -type f -name '*.words' -exec sh -c '
    if [ -f tmpfile ]; then
        sort -o tmpfile -m tmpfile "$@"
    else
        sort -o tmpfile -m "$@"
    fi' sh {} +

This will find all files in the current directory (or below) whose names matches *.words. For an appropriately sized chunk of these names at a time, the size of which is determined by xargs/find, it merges them together into the sorted tmpfile file. If tmpfile already exists (for all but the first chunk), this file is also merged with the other files in the current chunk. Depending on the length of your filenames, and the maximum allowed length of a command line, this may require more or much more than 10 individual runs of the internal script (find/xargs will do this automatically).

The "internal" sh script,

if [ -f tmpfile ]; then
    sort -o tmpfile -m tmpfile "$@"
else
    sort -o tmpfile -m "$@"
fi

uses sort -o tmpfile to output to tmpfile (this won't overwrite tmpfile even if this is also an input to sort) and -m for doing the merge. In both branches, "$@" will expand to a list of individually quoted filenames passed to the script from find or xargs.

Then, just run uniq -d on tmpfile to get all line that are duplicated:

uniq -d tmpfile >dupes.txt

If you like the "DRY" principle ("Don't Repeat Yourself"), you may write the internal script as

if [ -f tmpfile ]; then
    t=tmpfile
else
    t=/dev/null
fi

sort -o tmpfile -m "$t" "$@"

t=tmpfile
[ ! -f "$t" ] && t=/dev/null
sort -o tmpfile -m "$t" "$@"

Where did they come from?

For the same reasons as above, we can't use grep -Fx -f dupes.txt *.words to find where these duplications came from, so instead we use find again:

find . -type f -name '*.words' \
    -exec grep -Fx -f dupes.txt {} +

Since there is no "complicated" processing to be done, we may invoke grep directly from -exec. The -exec option takes a utility command and will place the found names in {}. With + at the end, find will place as many arguments in place of {} as the current shell supports in each invocation of the utility.

To be totally correct, one may want to use either

find . -type f -name '*.words' \
    -exec grep -H -Fx -f dupes.txt {} +

find . -type f -name '*.words' \
    -exec grep -Fx -f dupes.txt /dev/null {} +

to be sure that filenames are always included in the output from grep.

The first variation uses grep -H to always output matching filenames. The last variation uses the fact that grep will include the name of the matching file if more than one file is given on the command line.

This matters since the last chunk of filenames sent to grep from find may actually only contain a single filename, in which case grep would not mention it in its results.

Bonus material:

Dissecting the `find`+`xargs`+`sh` command:

find . -type f -name '*.words' -print0 |
xargs -0 sh -c '
    if [ -f tmpfile ]; then
        sort -o tmpfile -m tmpfile "$@"
    else
        sort -o tmpfile -m "$@"
    fi' sh

find . -type f -name '*.words' will simply generate a list of pathnames from the current directory (or below) where each pathnames is that of a regular file (-type f) and that has a filename component at the end that matches *.words. If only the current directory is to be searched, one may add -maxdepth 1 after the ., before -type f.

-print0 will ensure that all found pathnames are outputted with a \0 (nul) character as delimiter. This is a character that is not valid in a Unix path and it enables us to process pathnames even if they contain newline characters (or other weird things).

find pipes its output to xargs.

xargs -0 will read the \0-delimited list of pathnames and will execute the given utility repeatedly with chunks of these, ensuring that the utility is executed with just enough arguments to not cause the shell to complain about a too long argument list, until there is no more input from find.

The utility invoked by xargs is sh with a script given on the command line as a string using its -c flag.

When invoking sh -c '...some script...' with arguments following, the arguments will be available to the script in $@, except for the first argument, which will be placed in $0 (this is the "command name" that you may spot in e.g. top if you are quick enough). This is why we insert the string sh as the first argument after the end of the actual script. The string sh is a dummy argument and could be any single word (some seem to prefer _ or sh-find).