I thought I would add jdupes, a recent enhanced fork of fdupes, which promises to be faster and more feature-rich than fdupes (e.g. a size filter):
jdupes . -rS -X size-:50m > myjdups.txt
This will recursively find duplicate files bigger than 50 MB in the current directory and write the resulting list to myjdups.txt.
Note that the output is not sorted by size, and since sorting appears not to be built in, I have adapted @Chris_Down's answer above to achieve this:
jdupes -r . -X size-:50m | {
while IFS= read -r file; do
[[ $file ]] && du "$file"
done
} | sort -n > myjdups_sorted.txt
The command line of find is made up of different kinds of options that are combined to form expressions.
The find option -delete is an action. That means it is executed for each file matched so far.
Placed as the first option after the paths, all files are matched... oops!
It is dangerous - but the man page at least has a big warning.
From man find:
ACTIONS
-delete
Delete files; true if removal succeeded. If the removal failed, an
error message is issued. If -delete fails, find's exit status will
be nonzero (when it eventually exits). Use of -delete automatically
turns on the -depth option.
Warnings: Don't forget that the find command line is evaluated as an
expression, so putting -delete first will make find try to delete
everything below the starting points you specified. When testing a
find command line that you later intend to use with -delete, you
should explicitly specify -depth in order to avoid later surprises.
Because -delete implies -depth, you cannot usefully use -prune and
-delete together.
From further up in man find
:
EXPRESSIONS
The expression is made up of options (which affect overall operation rather
than the processing of a specific file, and always return true), tests
(which return a true or false value), and actions (which have side effects
and return a true or false value), all separated by operators. -and is
assumed where the operator is omitted.
If the expression contains no actions other than -prune, -print is performed
on all files for which the expression is true.
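The implicit -and and implicit -print described above can be seen directly. Here is a small illustration (the pattern and size test are made up for the example):

```shell
# These two commands are equivalent: find inserts -and between
# expressions, and appends -print when no other action is given
find . -name '*.txt' -size +1k
find . -name '*.txt' -and -size +1k -and -print
```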
On trying out what a find command will do:
To see what a command like find . -name '*ar' -delete will delete, you can first replace the action -delete by a more harmless action, like -fls or -print:
find . -name '*ar' -print
This will print which files are affected by the action.
In this example, the -print can be left out. In that case, there is no action at all, so the most obvious one, -print, is added implicitly. (See the second paragraph of the section "EXPRESSIONS" cited above.)
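Putting the man page's advice together, a cautious workflow might look like this (a sketch reusing the '*ar' example; -depth is added during testing so the traversal order matches what -delete will later use):

```shell
# Dry run first: -depth mimics the traversal order that -delete enforces
find . -depth -name '*ar' -print

# Only after reviewing that list, swap -print for -delete
find . -depth -name '*ar' -delete
```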
Best Answer
It is slightly long, but it is a single command line. It looks at the contents of the files and compares them using a cryptographic hash (md5sum). As I said, this is a little long...
The find runs md5sum against all files in the current directory tree. Then the output is sorted by the md5 hash. Since whitespace could be in the filenames, the sed changes the first field separator (two spaces) to a vertical pipe (very unlikely to be in a filename).
The last awk command tracks three variables: lastid = the md5 hash from the previous entry, lastfile = the filename from the previous entry, and first = whether lastid is being seen for the first time.
The output includes the hash so you can see which files are duplicates of each other.
This does not indicate if files are hard links (same inode, different name); it will just compare the contents.
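The exact command is not reproduced here, but from the description above it could be reconstructed roughly as follows (a sketch, not the original; it assumes md5sum's two-space separator and that no filename contains a vertical pipe):

```shell
# Hash every file, group by hash, and print each group of duplicates
find . -type f -exec md5sum {} + |
  sort |
  sed 's/  /|/' |
  awk -F'|' '
    $1 == lastid {
      if (first) { print lastid, lastfile; first = 0 }  # emit the first file of the group
      print $1, $2                                      # emit the current duplicate
    }
    $1 != lastid { lastid = $1; lastfile = $2; first = 1 }
  '
```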
Update: corrected to compare based on just the basename of each file.
Here, the find just lists the filenames, and the sed takes the basename component of the pathname and creates a two-field table with the basename and the full pathname. The awk then creates a table ("found") of the pathnames seen, indexed by the basename and the item number; the "indices" array keeps track of how many of that basename have been seen. The "END" clause then prints out any duplicate basenames found.
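From that description, the pipeline might look roughly like this (a reconstruction, not the original command; it assumes pathnames without whitespace, since awk splits fields on it):

```shell
# Build a "basename  full-path" table, then report basenames seen more than once
find . -type f |
  sed 's|.*/\(.*\)|\1 &|' |
  awk '{
    indices[$1]++                   # how many times this basename was seen
    found[$1, indices[$1]] = $2     # remember each full pathname
  }
  END {
    for (name in indices)
      if (indices[name] > 1) {      # only basenames that occur more than once
        print name ":"
        for (i = 1; i <= indices[name]; i++)
          print "  " found[name, i]
      }
  }'
```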