I have ~30k files. Each file contains ~100k lines. A line contains no spaces. The lines within an individual file are sorted and duplicate free.
My goal: I want to find all duplicate lines across two or more files, and also the names of the files that contained duplicated entries.
A simple solution would be this:
cat *.words | sort | uniq -c | grep -v -F '1 '
And then I would run:
grep 'duplicated entry' *.words
Do you see a more efficient way?
Best Answer
Since all input files are already sorted, we may bypass the actual sorting step and just use `sort -m` for merging the files together.

On some Unix systems (to my knowledge only Linux), it may be enough to merge the files and filter the result with `uniq -d` to get the duplicated lines written to the file `dupes.txt`. To find what files these lines came from, you may then run `grep` over the original files.
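The two commands this describes might look as follows (shown with a small demo setup standing in for the real 30000 files; on real data you would run just the `sort` and `grep` lines in the directory holding the `*.words` files):

```shell
# Demo setup: two small pre-sorted files standing in for the real
# *.words collection.
dir=$(mktemp -d) && cd "$dir"
printf 'apple\nbanana\ncherry\n' >a.words
printf 'banana\ndate\n'          >b.words

# Merge the already-sorted files; uniq -d keeps only repeated lines.
sort -m *.words | uniq -d >dupes.txt

# Show which files the duplicated lines came from.
grep -Fx -f dupes.txt *.words
```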
This will instruct `grep` to treat the lines in `dupes.txt` (`-f dupes.txt`) as fixed-string patterns (`-F`). `grep` will also require that the whole line matches perfectly from start to finish (`-x`). It will print the file name and the line to the terminal.

Non-Linux Unices (or even more files)
On some Unix systems, 30000 file names will expand to a string that is too long to pass to a single utility (meaning `sort -m *.words` will fail with `Argument list too long`, which it does on my OpenBSD system). Even Linux will complain about this if the number of files is much larger.

Finding the dupes

This means that in the general case (and this will also work with many more than just 30000 files), one has to "chunk" the sorting:
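A sketch of the chunked merge (demo setup included; the `find`/`xargs`/`sh` pipeline is the one dissected in the bonus section below):

```shell
# Demo setup: stand-ins for the real *.words files.
dir=$(mktemp -d) && cd "$dir"
printf 'apple\nbanana\n'  >a.words
printf 'banana\ncherry\n' >b.words

# Merge the files chunk by chunk into tmpfile, folding in the partial
# result from earlier chunks when tmpfile already exists.
rm -f tmpfile
find . -type f -name '*.words' -print0 |
xargs -0 sh -c '
    if [ -f tmpfile ]; then
        sort -o tmpfile -m tmpfile "$@"
    else
        sort -o tmpfile -m "$@"
    fi' sh
```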
Alternatively, creating `tmpfile` without `xargs`:
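A sketch of that alternative, running the same internal script from `find -exec ... {} +` instead of `xargs` (demo setup included):

```shell
# Demo setup: stand-ins for the real *.words files.
dir=$(mktemp -d) && cd "$dir"
printf 'apple\nbanana\n'  >a.words
printf 'banana\ncherry\n' >b.words

rm -f tmpfile
find . -type f -name '*.words' -exec sh -c '
    if [ -f tmpfile ]; then
        sort -o tmpfile -m tmpfile "$@"
    else
        sort -o tmpfile -m "$@"
    fi' sh {} +
```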
This will find all files in the current directory (or below) whose names match `*.words`. For an appropriately sized chunk of these names at a time, the size of which is determined by `xargs`/`find`, it merges them together into the sorted `tmpfile` file. If `tmpfile` already exists (for all but the first chunk), this file is also merged with the other files in the current chunk. Depending on the length of your filenames and the maximum allowed length of a command line, this may require more (or many more) than 10 individual runs of the internal script (`find`/`xargs` will do this automatically).

The "internal" `sh` script uses `sort -o tmpfile` to output to `tmpfile` (this won't overwrite `tmpfile` even if it is also an input to `sort`) and `-m` for doing the merge. In both branches, `"$@"` will expand to a list of individually quoted filenames passed to the script from `find` or `xargs`.
Then just run `uniq -d tmpfile >dupes.txt` to collect all lines that are duplicated into `dupes.txt`.

If you like the "DRY" principle ("Don't Repeat Yourself"), you may write the internal script so that the test for an existing `tmpfile` only selects the extra merge input, instead of repeating the whole `sort` command:
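One possible DRY version of the internal script (a sketch, not the author's exact wording; `/dev/null` serves as a trivially sorted, empty extra input when `tmpfile` does not exist yet):

```shell
# Pick the extra merge source: the previous partial result if present,
# otherwise an empty input that sort -m accepts harmlessly.
if [ -f tmpfile ]; then
    t=tmpfile
else
    t=/dev/null
fi
sort -o tmpfile -m "$t" "$@"
```

An equivalent shorter form would set `t=/dev/null; [ -f tmpfile ] && t=tmpfile` before the same `sort` call.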
Where did they come from?
For the same reasons as above, we can't use `grep -Fx -f dupes.txt *.words` to find where these duplications came from, so instead we use `find` again.

Since there is no "complicated" processing to be done, we may invoke `grep` directly from `-exec`. The `-exec` option takes a utility command and will place the found names in `{}`. With `+` at the end, `find` will place as many arguments in place of `{}` as the current shell supports in each invocation of the utility.

To be totally correct, one may want to use either `grep -H`
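Sketches of the basic invocation and the `-H` variant (demo setup included; `-H` is not in POSIX `grep`, but GNU and BSD grep support it):

```shell
# Demo setup: two files sharing the line "banana", plus the dupes list.
dir=$(mktemp -d) && cd "$dir"
printf 'apple\nbanana\n'  >a.words
printf 'banana\ncherry\n' >b.words
printf 'banana\n' >dupes.txt

# Basic form: grep invoked in chunks directly from find.
find . -type f -name '*.words' \
    -exec grep -Fx -f dupes.txt {} +

# With -H, grep prints the filename even for a single-file chunk.
find . -type f -name '*.words' \
    -exec grep -H -Fx -f dupes.txt {} +
```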
or
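The `/dev/null` variant might look like this (demo setup included; the empty extra file only pads `grep`'s operand count):

```shell
# Demo setup as before.
dir=$(mktemp -d) && cd "$dir"
printf 'apple\nbanana\n'  >a.words
printf 'banana\ncherry\n' >b.words
printf 'banana\n' >dupes.txt

# /dev/null ensures grep always sees at least two file operands,
# so every match is prefixed with its filename.
find . -type f -name '*.words' \
    -exec grep -Fx -f dupes.txt /dev/null {} +
```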
to be sure that filenames are always included in the output from `grep`.

The first variation uses `grep -H` to always output matching filenames. The last variation uses the fact that `grep` will include the name of the matching file if more than one file is given on the command line. This matters since the last chunk of filenames sent to `grep` from `find` may actually contain only a single filename, in which case `grep` would not mention it in its results.

Bonus material:
Dissecting the `find` + `xargs` + `sh` command:

`find . -type f -name '*.words'` will simply generate a list of pathnames from the current directory (or below), where each pathname is that of a regular file (`-type f`) with a filename component at the end that matches `*.words`. If only the current directory is to be searched, one may add `-maxdepth 1` after the `.`, before `-type f`.

`-print0` will ensure that all found pathnames are output with a `\0` (nul) character as delimiter. This is a character that is not valid in a Unix path, and it enables us to process pathnames even if they contain newline characters (or other weird things). `find` pipes its output to `xargs`.

`xargs -0` will read the `\0`-delimited list of pathnames and will execute the given utility repeatedly with chunks of these, ensuring that the utility is executed with just enough arguments to not cause the shell to complain about a too long argument list, until there is no more input from `find`.
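To watch the chunking happen, one can cap the chunk size artificially with `xargs -n` (purely a demonstration, not part of the solution):

```shell
# Demo setup: three empty stand-in files.
dir=$(mktemp -d) && cd "$dir"
touch a.words b.words c.words

# At most two pathnames per echo invocation:
# three files therefore produce two output lines.
find . -type f -name '*.words' -print0 | xargs -0 -n 2 echo
```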
The utility invoked by `xargs` is `sh`, with a script given on the command line as a string using its `-c` flag.

When invoking `sh -c '...some script...'` with arguments following, the arguments will be available to the script in `$@`, except for the first argument, which will be placed in `$0` (this is the "command name" that you may spot in e.g. `top` if you are quick enough). This is why we insert the string `sh` as the first argument after the end of the actual script. The string `sh` is a dummy argument and could be any single word (some seem to prefer `_` or `sh-find`).
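A minimal illustration of how `sh -c` distributes its trailing arguments between `$0` and `$@`:

```shell
sh -c 'echo "utility name: $0"; echo "arguments:    $@"' sh one two three
# -> utility name: sh
# -> arguments:    one two three
```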