Linux – How to efficiently dereference all symlinks in `find` *output* filenames

find, linux, symlink

I need paths to be both fully resolved and relative to a given directory. This must be done efficiently, since the number of paths is often greater than 100,000.

Situation: I have directories which contain mostly symlinks to other directories, as in

foo
 123 -> ../baz/123
 896 -> ../bar/896

(Note that foo does NOT contain only symlinks to directories, it also contains ordinary files which I have to catch too.)

These symlinked directories contain files. I want to get a list of those files, in the form

baz/123/some.file
bar/896/other.file

That is, when 'find' finds a symlink, I want it to dereference that path when it is reporting the contents.

So I am running this command from the parent directory of foo:

find -L foo -type f

But that doesn't work.

Honestly, you would expect the -L option, which claims to 'follow symbolic links', to implement this behaviour. What it actually does is descend into those directories and report the files inside them under their non-dereferenced names, i.e. the results look like

foo/123/some.file
foo/896/another.file
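To make the behaviour concrete, here is a minimal sketch that rebuilds the layout in a scratch directory and runs the same find command. The scratch directory and plain.file are assumptions added for illustration; only foo, baz/123 and bar/896 come from the question.

```shell
#!/bin/sh
# Build a throwaway copy of the layout from the question.
# plain.file stands in for the "ordinary files" that foo also contains.
tmp=$(mktemp -d)
mkdir -p "$tmp/foo" "$tmp/baz/123" "$tmp/bar/896"
: > "$tmp/baz/123/some.file"
: > "$tmp/bar/896/another.file"
: > "$tmp/foo/plain.file"
ln -s ../baz/123 "$tmp/foo/123"
ln -s ../bar/896 "$tmp/foo/896"

# -L makes find descend into the symlinked directories, but each file is
# reported under the path find used to reach it, not under its real location.
found=$(cd "$tmp" && find -L foo -type f | LC_ALL=C sort)
printf '%s\n' "$found"

rm -rf "$tmp"
```

Every reported path still starts with foo/, which is why the list cannot be compared directly against paths that are already fully resolved.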

The results will be used for set operations against a list of filepaths that are all 1. fully resolved and 2. relative to foo's parent directory, so every result must also fulfil those criteria. I can guarantee for these purposes that all links are resolvable, i.e. none are circular or excessively deep. Most, but not all, links point at directories rather than files.

At the moment, the best I can do is a Python script that rewrites any non-dereferenced paths into resolved ones. But since the number of files involved is in the 100000+ range, this is not very practical (and rather ridiculous, since find has already bothered to dereference them, it just didn't return the dereferenced paths).
(EDIT: See my comment on this post — I've found a non-solution (in that it does the job efficiently but in the wrong way — executing external commands).)

I'm convinced that I should be able to do this task just with find and no external commands, but I'm not finding the man page enlightening here — none of -L, -H, -P, -follow have the correct behaviour, nor does -printf %l. -exec is out for obvious reasons — it is not internal to find. Any ideas?

EDIT 2: at this point, Stephane has convinced me that there is no particularly good reason why find would have this functionality internally, so I'm willing to accept any reasonably efficient answer.

Best Answer

What you're asking for doesn't make much sense in the general case, so it's not surprising that find has no provision for it.

A symlink with a relative target is resolved relative to the directory that contains the symlink. So for instance, if, while traversing a directory tree and following symlinks, find encounters a/b/c/d, and a, a/b and a/b/c are all symlinks (relative or absolute, or symlinks to paths that themselves have symlink components), what should it do?
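A small sketch of that first point, using a made-up tree (a/b/link and x/target are illustrative names, not from the question). It shows that the text stored in a link only makes sense when read from the link's own directory:

```shell
#!/bin/sh
# Throwaway tree; all names here are illustrative.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/b" "$tmp/x"
: > "$tmp/x/target"
ln -s ../../x/target "$tmp/a/b/link"

# readlink without options prints the target text exactly as stored in the link.
stored=$(readlink "$tmp/a/b/link")
printf 'stored target: %s\n' "$stored"

# The stored text is interpreted starting from the link's directory (a/b),
# so a/b/../../x/target is what gets opened. -e follows the link, so it is
# true only if that resolution succeeds.
if [ -e "$tmp/a/b/link" ]; then
  resolves=yes
else
  resolves=no
fi
printf 'resolves: %s\n' "$resolves"

rm -rf "$tmp"
```

The same stored text read from any other starting directory would generally name a different file, which is why a path reached through several symlinked components has no single obvious "dereferenced relative" form.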

If you're looking for a find predicate or a GNU -printf % directive that expands to a symlink-free path to the file relative to the current directory or any directory, I'm afraid there's none.

If you're on Linux, you can get the absolute path of those files with:

find -L foo -type f -exec readlink -f {} \;

As you found out, there is at least one realpath command that accepts more than one path argument. Combined with the standard -exec cmd {} + syntax, that is going to be a lot more efficient, since it runs as few realpath commands as necessary:

find -L foo -type f -exec realpath {} +

find -L foo -type f -print0 | xargs -r0 realpath

might be quicker: if more than one realpath command is needed, find can keep looking for more files while the first realpath starts working, which might make it more efficient even on a single-processor system.

-print0 and xargs -r0 are not standard; they come from GNU, but are also found in a number of other implementations, such as most modern BSDs.
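To see why the {} + form matters with 100,000+ paths, here is a rough sketch that counts how many child processes find starts in each mode. The scratch directory and the 50 empty files are assumptions for illustration, and sh -c 'echo run' stands in for realpath so the count is visible.

```shell
#!/bin/sh
# Create 50 empty files in a scratch directory (illustrative data only).
tmp=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
  : > "$tmp/file$i"
  i=$((i + 1))
done

# With \; the command is started once per file, so "run" is printed 50 times.
per_file=$(find "$tmp" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# With + find packs as many paths as fit into each invocation; 50 short
# paths fit into a single command line.
batched=$(find "$tmp" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

printf 'per-file invocations: %s\nbatched invocations: %s\n' "$per_file" "$batched"

rm -rf "$tmp"
```

With a real file list the batched count grows only when the argument-length limit is reached, so the number of realpath processes stays small compared with one per file.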

Zsh has builtin support for it:

print -rl foo/***/*(-.:A)

If you don't care about the sorting order, you can disable sorting and make it a bit more efficient with:

print -rl foo/***/*(-.oN:A)

If you want to convert those to relative paths to the current directory, you could have a look at that SO question.

If you know that all those files have an absolute canonical path (none of whose components are symlinks) inside the current directory, you can simplify it to (still with zsh):

files=(foo/***/*(-.:A))
print -rl -- ${files#$PWD/}

Though that is short and convenient, and works whatever characters the filenames contain, I doubt it would be faster than find + realpath.

With the Debian realpath and GNU tools, you can do:

cd -P .
find -L foo -type f -exec realpath -z {} + | 
  gawk -v p="$PWD" -v l="${#PWD}" -v RS='\0' -vORS='\0' '
    substr($0, 1, l+1) == p "/" {$0 = substr($0, l+2)}; 1' |
  xargs -r0 whatever you want to do with them
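The prefix test in that gawk filter (strip "$PWD/" only when the path really starts with it) can also be sketched in plain POSIX sh. strip_base is a hypothetical helper name and the /srv/data paths are made up for illustration; this is not a replacement for the NUL-delimited pipeline above, just the same idea on single values.

```shell
#!/bin/sh
# strip_base BASE PATH (hypothetical helper): print PATH with a leading
# "BASE/" removed, or unchanged when PATH does not start with that prefix.
# Quoting "$1" inside the expansion keeps glob characters in BASE literal.
strip_base() {
  printf '%s\n' "${2#"$1"/}"
}

inside=$(strip_base /srv/data /srv/data/baz/123/some.file)
outside=$(strip_base /srv/data /elsewhere/other.file)
sibling=$(strip_base /srv/data /srv/database/x)

printf '%s\n%s\n%s\n' "$inside" "$outside" "$sibling"
```

Including the trailing slash in the prefix matters: without it, a sibling such as /srv/database would be mangled as well.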

As I realise now, recent versions of GNU coreutils include a realpath that has the exact feature you're looking for, so it's just a matter of

find -L foo -type f -print0 |
  xargs -r0 realpath -z --relative-base . |
  xargs -r0 whatever you want to do with them

(use --relative-to . instead of --relative-base . if you want relative paths even for files whose symlink-free path doesn't reside below the current working directory).
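As an end-to-end check, here is a sketch of the same pipeline run against a scratch copy of the layout from the question. It assumes GNU find, GNU xargs and a GNU coreutils realpath that supports --relative-base; plain.file is an added stand-in for the ordinary files in foo. The -z option is left out here so the result is easy to read, which is only safe because these sample names contain no newlines.

```shell
#!/bin/sh
# Scratch copy of the question's layout (plain.file is illustrative).
tmp=$(mktemp -d)
cd -P "$tmp" || exit 1
mkdir -p foo baz/123 bar/896
: > baz/123/some.file
: > bar/896/another.file
: > foo/plain.file
ln -s ../baz/123 foo/123
ln -s ../bar/896 foo/896

# find walks through the links; realpath then rewrites each path to its
# symlink-free form, relative to the current directory where possible.
resolved=$(find -L foo -type f -print0 |
  xargs -r0 realpath --relative-base . |
  LC_ALL=C sort)
printf '%s\n' "$resolved"

cd / && rm -rf "$tmp"
```

The files reached through foo/123 and foo/896 come back under baz/ and bar/, while the ordinary file keeps its foo/ path, so the list can be compared directly against already-resolved relative paths.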
