Why Not to Parse ls Command and What to Use Instead

lsshell

I consistently see answers quoting this link stating definitively "Don't parse ls!" This bothers me for a couple of reasons:

It seems the information in that link has been accepted wholesale with little question, though I can pick out at least a few errors in casual reading.
It also seems as if the problems stated in that link have sparked no desire to find a solution.

From the first paragraph:

…when you ask [ls] for a list
of files, there's a huge problem: Unix allows almost any character in
a filename, including whitespace, newlines, commas, pipe symbols, and
pretty much anything else you'd ever try to use as a delimiter except
NUL. … ls separates filenames with newlines. This is fine
until you have a file with a newline in its name. And since I don't
know of any implementation of ls that allows you to terminate
filenames with NUL characters instead of newlines, this leaves us
unable to get a list of filenames safely with ls.

Bummer, right? How ever can we handle a newline terminated listed dataset for data that might contain newlines? Well, if the people answering questions on this website didn't do this kind of thing on a daily basis, I might think we were in some trouble.

The truth is though, most ls implementations actually provide a very simple api for parsing their output and we've all been doing it all along without even realizing it. Not only can you end a filename with null, you can begin one with null as well or with any other arbitrary string you might desire. What's more, you can assign these arbitrary strings per file-type. Please consider:

LS_COLORS='lc=\0:rc=:ec=\0\0\0:fi=:di=:' ls -l --color=always | cat -A
total 4$
drwxr-xr-x 1 mikeserv mikeserv 0 Jul 10 01:05 ^@^@^@^@dir^@^@^@/$
-rw-r--r-- 1 mikeserv mikeserv 4 Jul 10 02:18 ^@file1^@^@^@$
-rw-r--r-- 1 mikeserv mikeserv 0 Jul 10 01:08 ^@file2^@^@^@$
-rw-r--r-- 1 mikeserv mikeserv 0 Jul 10 02:27 ^@new$
line$
file^@^@^@$
^@

See this for more.

Now it's the next part of this article that really gets me though:

$ ls -l
total 8
-rw-r-----  1 lhunath  lhunath  19 Mar 27 10:47 a
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a?newline
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a space

The problem is that from the output of ls, neither you or the
computer can tell what parts of it constitute a filename. Is it each
word? No. Is it each line? No. There is no correct answer to this
question other than: you can't tell.

Also notice how ls sometimes garbles your filename data (in our
case, it turned the \n character in between the words "a" and
"newline" into a ?question mark…

…

If you just want to iterate over all the files in the current
directory, use a for loop and a glob:

for f in *; do
    [[ -e $f ]] || continue
    ...
done

The author calls it garbling filenames when ls returns a list of filenames containing shell globs and then recommends using a shell glob to retrieve a file list!

Consider the following:

printf 'touch ./"%b"\n' "file\nname" "f i l e n a m e" |
    . /dev/stdin
ls -1q

f i l e n a m e  
file?name

IFS="
" ; printf "'%s'\n" $(ls -1q)

'f i l e n a m e'
'file
name'

POSIX defines the -1 and -q ls operands so:

-q – Force each instance of non-printable filename characters and <tab>s to be written as the question-mark ( '?' ) character. Implementations
may provide this option by default if the output is to a terminal
device.

-1 – (The numeric digit one.) Force output to be one entry per line.

Globbing is not without its own problems – the ? matches any character so multiple matching ? results in a list will match the same file multiple times. That's easily handled.

Though how to do this thing is not the point – it doesn't take much to do after all and is demonstrated below – I was interested in why not. As I consider it, the best answer to that question has been accepted. I would suggest you try to focus more often on telling people what they can do than on what they can't. You're a lot less likely, as I think, to be proven wrong at least.

But why even try? Admittedly, my primary motivation was that others kept telling me I couldn't. I know very well that ls output is as regular and predictable as you could wish it so long as you know what to look for. Misinformation bothers me more than do most things.

The truth is, though, with the notable exception of both Patrick's and Wumpus Q. Wumbley's answers (despite the latter's awesome handle), I regard most of the information in the answers here as mostly correct – a shell glob is both more simple to use and generally more effective when it comes to searching the current directory than is parsing ls. They are not, however, at least in my regard, reason enough to justify either propagating the misinformation quoted in the article above nor are they acceptable justification to "never parse ls."

Please note that Patrick's answer's inconsistent results are mostly a result of him using zsh then bash. zsh – by default – does not word-split $(command substituted) results in a portable manner. So when he asks where did the rest of the files go? the answer to that question is your shell ate them. This is why you need to set the SH_WORD_SPLIT variable when using zsh and dealing with portable shell code. I regard his failure to note this in his answer as awfully misleading.

Wumpus's answer doesn't compute for me – in a list context the ? character is a shell glob. I don't know how else to say that.

In order to handle a multiple results case you need to restrict the glob's greediness. The following will just create a test base of awful file names and display it for you:

{ printf %b $(printf \\%04o `seq 0 127`) |
sed "/[^[-b]*/s///g
        s/\(.\)\(.\)/touch '?\v\2' '\1\t\2' '\1\n\2'\n/g" |
. /dev/stdin

echo '`ls` ?QUOTED `-m` COMMA,SEP'
ls -qm
echo ; echo 'NOW LITERAL - COMMA,SEP'
ls -m | cat
( set -- * ; printf "\nFILE COUNT: %s\n" $# )
}

OUTPUT

`ls` ?QUOTED `-m` COMMA,SEP
??\, ??^, ??`, ??b, [?\, [?\, ]?^, ]?^, _?`, _?`, a?b, a?b

NOW LITERAL - COMMA,SEP
?
 \, ?
     ^, ?
         `, ?
             b, [       \, [
\, ]    ^, ]
^, _    `, _
`, a    b, a
b

FILE COUNT: 12

Now I'll safe every character that isn't a /slash, -dash, :colon, or alpha-numeric character in a shell glob then sort -u the list for unique results. This is safe because ls has already safed-away any non printable characters for us. Watch:

for f in $(
        ls -1q |
        sed 's|[^-:/[:alnum:]]|[!-\\:[:alnum:]]|g' |
        sort -u | {
                echo 'PRE-GLOB:' >&2
                tee /dev/fd/2
                printf '\nPOST-GLOB:\n' >&2
        }
) ; do
        printf "FILE #$((i=i+1)): '%s'\n" "$f"
done

OUTPUT:

PRE-GLOB:
[!-\:[:alnum:]][!-\:[:alnum:]][!-\:[:alnum:]]
[!-\:[:alnum:]][!-\:[:alnum:]]b
a[!-\:[:alnum:]]b

POST-GLOB:
FILE #1: '?
           \'
FILE #2: '?
           ^'
FILE #3: '?
           `'
FILE #4: '[     \'
FILE #5: '[
\'
FILE #6: ']     ^'
FILE #7: ']
^'
FILE #8: '_     `'
FILE #9: '_
`'
FILE #10: '?
            b'
FILE #11: 'a    b'
FILE #12: 'a
b'

Below I approach the problem again but I use a different methodology. Remember that – besides \0null – the / ASCII character is the only byte forbidden in a pathname. I put globs aside here and instead combine the POSIX specified -d option for ls and the also POSIX specified -exec $cmd {} + construct for find. Because find will only ever naturally emit one / in sequence, the following easily procures a recursive and reliably delimited filelist including all dentry information for every entry. Just imagine what you might do with something like this:

#v#note: to do this fully portably substitute an actual newline \#v#
#v#for 'n' for the first sed invocation#v#
cd ..
find ././ -exec ls -1ldin {} + |
sed -e '\| *\./\./|{s||\n.///|;i///' -e \} |
sed 'N;s|\(\n\)///|///\1|;$s|$|///|;P;D'

###OUTPUT

152398 drwxr-xr-x 1 1000 1000        72 Jun 24 14:49
.///testls///

152399 -rw-r--r-- 1 1000 1000         0 Jun 24 14:49
.///testls/?
            \///

152402 -rw-r--r-- 1 1000 1000         0 Jun 24 14:49
.///testls/?
            ^///

152405 -rw-r--r-- 1 1000 1000         0 Jun 24 14:49
.///testls/?
        `///
...

ls -i can be very useful – especially when result uniqueness is in question.

ls -1iq | 
sed '/ .*/s///;s/^/-inum /;$!s/$/ -o /' | 
tr -d '\n' | 
xargs find

These are just the most portable means I can think of. With GNU ls you could do:

ls --quoting-style=WORD

And last, here's a much simpler method of parsing ls that I happen to use quite often when in need of inode numbers:

ls -1iq | grep -o '^ *[0-9]*'

That just returns inode numbers – which is another handy POSIX specified option.

Best Answer

I am not at all convinced of this, but let's suppose for the sake of argument that you could, if you're prepared to put in enough effort, parse the output of ls reliably, even in the face of an "adversary" — someone who knows the code you wrote and is deliberately choosing filenames designed to break it.

Even if you could do that, it would still be a bad idea.

Bourne shell is not a good language. It should not be used for anything complicated, unless extreme portability is more important than any other factor (e.g. autoconf).

I claim that if you're faced with a problem where parsing the output of ls seems like the path of least resistance for a shell script, that's a strong indication that whatever you are doing is too complicated for shell and you should rewrite the entire thing in Perl or Python. Here's your last program in Python:

import os, sys
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
      ino = os.lstat(os.path.join(subdir, f)).st_ino
      sys.stdout.write("%d %s %s\n" % (ino, subdir, f))

This has no issues whatsoever with unusual characters in filenames -- the output is ambiguous in the same way the output of ls is ambiguous, but that wouldn't matter in a "real" program (as opposed to a demo like this), which would use the result of os.path.join(subdir, f) directly.

Equally important, and in stark contrast to the thing you wrote, it will still make sense six months from now, and it will be easy to modify when you need it to do something slightly different. By way of illustration, suppose you discover a need to exclude dotfiles and editor backups, and to process everything in alphabetical order by basename:

import os, sys
filelist = []
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        if f[0] == '.' or f[-1] == '~': continue
        lstat = os.lstat(os.path.join(subdir, f))
        filelist.append((f, subdir, lstat.st_ino))

filelist.sort(key = lambda x: x[0])
for f, subdir, ino in filelist: 
   sys.stdout.write("%d %s %s\n" % (ino, subdir, f))

UPDATE (2014-02-02)

Thanks to our very own @Anthon's determination in following the lack of this feature up, we have a slightly more formal reason as to why this feature is lacking, which reiterates what I explained earlier:

Re: [PATCH] ls: adding --zero/-z option, including tests

From:      Pádraig Brady
Subject:   Re: [PATCH] ls: adding --zero/-z option, including tests
Date:      Mon, 03 Feb 2014 15:27:31 +0000
Thanks a lot for the patch. If we were to do this then this is the interface we would use. However ls is really a tool for direct consumption by a human, and in that case further processing is less useful. For futher processing, find(1) is more suited. That is well described in the first answer at the link above.

So I'd be 70:30 against adding this.

My original answer

This is a bit of my personal opinion but I believe it to be a design decision in leaving that switch out of ls. If you notice the find command does have this switch:

-print0
      True; print the full file name on the standard output, followed by a 
      null character (instead of the newline character that -print uses).  
      This allows file  names  that  contain  newlines or other types of white 
      space to be correctly interpreted by programs that process the find 
      output.  This option corresponds to the -0 option of xargs.

By leaving that switch out, the designers were implying that you should not be using ls output for anything other than human consumption. For downstream processing by other tools, you should be using find instead.

Ways to use find

If you're just looking for the alternative methods you can find them here, titled: Doing it correctly: A quick summary. From that link these are likely the 3 more common patterns:

Simple find -exec; unwieldy if COMMAND is large, and creates 1 process/file:
```
find . -exec COMMAND... {} \;
```
Simple find -exec with +, faster if multiple files are okay for COMMAND:
```
find . -exec COMMAND... {} \+
```
Use find and xargs with \0 separators
(nonstandard common extensions -print0 and -0. Works on GNU, *BSDs, busybox)
```
find . -print0 | xargs -0 COMMAND
```

Further evidence?

I found this blog post from Joey Hess' blog titled: "ls: the missing options". One of the interesting comments in this post:

The only obvious lack now is a -z option, which should make output filenames be NULL terminated for consuption by other programs. I think this would be easy to write, but I've been extermely busy IRL (moving lots of furniture) and didn't get to it. Any takers to write it?

Further searching I found this in the commit logs from one of the additional switches that Joey's blog post mentions, "new output format -j", so it would seem that the blog post was poking fun at the notion of ever adding a -z switch to ls.

As to the other options, multiple people agree that -e is nearly almost useful, although none of us can quite find a reason to use it. My bug report neglected to mention that ls -eR is very buggy. -j is clearly a joke.

References

Sorting LS Output by Time – Handling Too Many Files

ls -t on its own will list all files in the current directory with that sorting, without ever needing to list them on the command line at all.

If you need the recursion behaviour of find, or to do some other tests on the files, you can have find generate timestamped entries, either through stat or through GNU find's -printf option and sort it. Something like:

find . -type f -printf '%T@ %p\0' | sort -zn

-printf '%T@ %p\0' generates null-separated Unix timestamp (%T@)-filename (%p) pairs. sort -z is also a non-standard GNU extension, which uses null-delimited records to be filename-safe. The sort option is supported in most of the BSDs too, but -printf is GNU-only as far as I know.

You can cut that output back into filenames only, or any other format you like.