If I understand correctly, you want to move files from the current directory and its subdirectories, recursively, to another directory, but only if the file command reports them as “Microsoft Word” files. That is, you're interested in the files for which file "$filename" | grep 'Microsoft Word' produces some output.
An easy way is to take things calmly and do it file by file. If you only want the files in the current directory, you can use a for loop and a wildcard pattern:
for f in *.doc; do
  if …
done
What's the condition? We want to test whether Microsoft Word appears in the output of file "$f". I use file -- to protect against files whose names begin with a dash (-).
for f in *.doc; do
  if file -- "$f" | grep -q 'Microsoft Word'; then
    …
  fi
done
All we need to do is add the command to move the files.
for f in *.doc; do
  if file -- "$f" | grep -q 'Microsoft Word'; then
    mv -- "$f" ../NewDirectory/
  fi
done
If you want to look for files in subdirectories as well, use the ** wildcard pattern for recursive globbing. In bash, it needs to be activated with shopt -s globstar (in ksh93, you need set -o globstar, and in zsh it works out of the box; other shells lack this feature). Beware that bash ≤4.2 follows symbolic links to directories when expanding **.
for f in **/*.doc; do
  if file -- "$f" | grep -q 'Microsoft Word'; then
    mv -- "$f" ../NewDirectory/
  fi
done
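For shells without recursive globbing, the same loop can be driven by find instead. This is only a sketch of that swapped-in technique, not part of the glob-based approach: the function name move_word_docs is made up for illustration, and it assumes that the destination lies outside the tree being searched, so that find does not come across files that were already moved.

```shell
#!/bin/sh
# move_word_docs DEST
# Hypothetical helper (the name is an assumption, not from the text):
# move every *.doc file below the current directory that file(1) describes
# as "Microsoft Word" into DEST. DEST should lie outside the current tree.
move_word_docs() {
    find . -type f -name '*.doc' -exec sh -c '
        dest=$1
        shift
        for f do
            # -b omits the file name from the output, so a file that merely
            # has "Microsoft Word" in its name is not matched by accident.
            if file -b -- "$f" | grep -q "Microsoft Word"; then
                mv -- "$f" "$dest"/
            fi
        done
    ' sh "$1" {} +
}
```

It would be called from the top of the tree to scan, for example as move_word_docs ../NewDirectory. Like the glob loop, it puts every matching file directly into the destination directory.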
Note that all moved files end up in ../NewDirectory/; no subdirectories are created. If you want to reproduce the directory tree, you can use string manipulation constructs to extract the directory part of the file name, and mkdir -p to create the target directory if necessary.
for f in ./**/*.doc; do
  if file "$f" | grep -q 'Microsoft Word'; then
    d="${f%/*}"
    mkdir -p ../NewDirectory/"$d"
    mv "$f" ../NewDirectory/"$d"
  fi
done
Rather than parse the output of file, which is somewhat fragile, you might prefer to parse the output of file -i, which prints standardized strings (MIME types).
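As a rough sketch of the MIME-based test, the check can be wrapped in a small function. The function name is_word_doc is hypothetical. It uses file -b --mime-type rather than plain file -i: --mime-type prints only the type, and -b leaves out the file name. The two MIME types below are the registered ones for .doc and .docx files, and it is an assumption that your version of file reports one of them, so check what it prints for a known Word document first.

```shell
#!/bin/sh
# is_word_doc FILE
# Hypothetical helper: succeed if file(1) reports a Word MIME type for FILE.
# What file(1) actually prints depends on its version and magic database,
# so the list of accepted types below is an assumption to adjust.
is_word_doc() {
    case $(file -b --mime-type -- "$1") in
        application/msword) return 0 ;;
        application/vnd.openxmlformats-officedocument.wordprocessingml.document) return 0 ;;
        *) return 1 ;;
    esac
}

for f in *.doc; do
    [ -e "$f" ] || continue    # the pattern matched nothing
    if is_word_doc "$f"; then
        # Dry run; replace with: mv -- "$f" ../NewDirectory/
        printf 'would move: %s\n' "$f"
    fi
done
```

The loop only prints what it would move, which makes it easy to review the selection before switching to mv.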
This calls for backreferences!
If you are ever referring to something you have already matched, and you want to match it again, use backreferences.
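grep -E '(..)(.*\1){<n - 1>}' <file>
.* matches any sequence of characters
(..) matches any two characters
\1 matches again whatever text the first group, in this case the (..) near the beginning, matched
Replace <n - 1> with the length of the sequence (the number of times the pair must occur) minus one, and <file> with the name of the file you want to search (or omit it to use stdin). The -E option is needed so that the parentheses and braces act as grouping and repetition operators rather than literal characters.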
This may not be the most efficient solution, but it works.
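Here is a small sketch of that pattern with a concrete n. The function name repeated_pair and the sample lines are made up for illustration. Note that backreferences in extended regular expressions are an extension that GNU grep supports; POSIX only specifies them for basic regular expressions.

```shell
#!/bin/sh
# repeated_pair N [FILE...]
# Hypothetical helper: print the lines in which some two-character sequence
# occurs at least N times. Reads stdin when no FILE is given.
repeated_pair() {
    n=$1
    shift
    # For N=3 the pattern passed to grep is (..)(.*\1){2}
    grep -E -e "(..)(.*\\1){$((n - 1))}" -- "$@"
}

# Only the first line contains a pair ("ab") three times.
printf '%s\n' 'abXXabYYab' 'abab' 'abcabc' | repeated_pair 3
# prints: abXXabYYab
```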
Best Answer
With GNU tools:
With standard tools, you can do:
But that would run up to two greps per file. To avoid running that many greps and still be portable while still allowing any character in file names, you could do:
The idea being to convert the output of find into a format suitable for xargs (that expects a blank (SPC/TAB/NL in the C locale, YMMV in other locales) separated list of words where single quotes, double quotes and backslashes can escape blanks and each other).
Generally you can't post-process the output of find -print, because it separates the file names with a newline character and doesn't escape the newline characters that are found in file names. For instance if we see:
We've got no way to know whether it's one file called b in a directory called a<NL>. (an a, a newline and a dot), or if it's the two files a and b in the current directory.
By using .//., because // cannot appear otherwise in a file path as output by find (because there's no such thing as a directory with an empty name and / is not allowed in a file name), we know that if we see a line that contains //, then that's the first line of a new file name. So we can use that awk command to escape all newline characters but those that precede those lines.
If we take the example above, find would output in the first case (one file):
Which awk escapes to:
So that xargs sees it as one argument. And in the second case (two files):
Which awk would leave as is, so xargs sees two arguments.
You need the LC_ALL=C so that sed, awk (and some implementations of xargs) work for arbitrary sequences of bytes (even those that don't form valid characters in the user's locale), to simplify the blank definition to just SPC and TAB, and to avoid problems with the different utilities interpreting differently those characters whose encoding contains the encoding of backslash.
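The // marker on its own can be tried out with a small sketch. This is not the find, sed, awk and xargs pipeline discussed above; it only counts files, using the fact that exactly one line per file contains //. The function name count_files and the scratch directory are assumptions made for the demonstration.

```shell
#!/bin/sh
# count_files: hypothetical helper that counts the regular files below the
# current directory. Every path printed by find starts with .//. so exactly
# one line per file contains //, even when a name contains newlines.
count_files() {
    LC_ALL=C find .//. -type f | LC_ALL=C grep -c '//'
}

# Demonstration in a scratch directory: a single file called b inside a
# directory whose name is an a, a newline and a dot.
demo=$(mktemp -d) || exit 1
(
    cd "$demo" || exit 1
    mkdir 'a
.'
    : > 'a
./b'
    printf 'lines printed by find . -type f: '
    find . -type f | grep -c ''
    printf 'files counted with the // marker: '
    count_files
)
rm -rf "$demo"
```

The first count is 2 because the single path is spread over two lines, while the second count is 1, the actual number of files.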