Better way than cp to copy millions of files while preserving hard links

Tags: cp, file-copy, hard-link

In this story on the GNU coreutils mailing list, someone used cp to copy 430 million files while preserving hard links, and only just managed to get it to finish after 10 days.

The big problem was that, in order to preserve hard links, cp has to keep a hashtable of already copied files, which took up 17GB of memory towards the end and had the system thrashing on swap.

Is there some utility that would have handled the task better?

Best Answer

If the tar or rsync solutions fail, and if the directory to copy is the root of a filesystem, you can use the old dump/restore backup utilities (yes, they still work).
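
For reference, the tar and rsync approaches usually look something like the lines below; the /source and /target paths are only illustrative. GNU tar preserves hard links inside the archive, and rsync needs -H to do so (note that rsync's hard-link tracking also costs memory, so it may hit the same scaling problem as cp):

(cd /source && tar cf - .) | (cd /target && tar xpf -)
rsync -aH /source/ /target/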

dump duplicates the filesystem at a low level, without going through the kernel's filesystem interface, so it is quite fast.

The drawback is that dump is sensitive to modifications made to the source filesystem while it is copying. So it is better to unmount the filesystem, remount it read-only, or stop any application that could access it before starting the copy. If you respect that condition, the copy is reliable.
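
For example, to take the source filesystem out of service before the copy (the /mnt/source mount point is only illustrative):

umount /mnt/source                       # or: mount -o remount,ro /mnt/source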

Depending on the filesystem type, the name of the dump command can vary; for instance, XFS uses xfsdump.
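
As a rough sketch of the XFS variant, the equivalent pipeline would typically use xfsdump and xfsrestore, which take mount points rather than device names (the paths below are illustrative):

xfsdump -J - /mnt/source | xfsrestore -J - /mnt/target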

The following command is similar to the tar example:

dump 0uf - /dev/sdaX  | (cd /target && restore rf -)

The number is the dump level; 0 indicates a full copy, while higher levels produce incremental copies.
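
Once the level-0 dump has been restored, a later run can use a higher level to copy only what changed since then. A minimal sketch, assuming the same /dev/sdaX and /target as above and that the u flag recorded the level-0 run in /etc/dumpdates:

dump 1uf - /dev/sdaX | (cd /target && restore rf -)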
