Bash Command Line – How to Edit a List of Generated Files with Spaces in Names

Tags: bash, command line, vi, xargs

I've got a relatively small list of filenames generated from a pipeline based on find. The file names contain spaces and possibly punctuation but definitely no other non-printing characters or newlines.

For example,

Netherlands/Purge (GDPR) 2020-01-09.txt
Netherlands/Purge (GDPR) 2020-01-27.txt
Switzerland/New mailing 2020-01-27.txt

I want to edit these files as a set (vi file1 file2 file3 rather than vi file1; vi file2; vi file3), partly so that I can easily jump forwards and backwards between them.

I've started with "Using a generated list of filenames as argument list — with spaces", which has a standard find -print0 | xargs -0 mycommand solution. Unfortunately, this does not work when mycommand is an editor: although xargs can assemble the set of files to edit, stdin is already taken up by the pipeline, and I can't see a way to run an editor in place. I can't use find -exec vi {} + because I'm using a pipeline to validate the set of filenames, not just find itself.

My other option is to copy and paste, assembling the list of file names, surrounding them with quotes, and then prefixing the result with vi. For these three files it's trivial, but in the general case it's not an easily reusable solution:

vi 'Netherlands/Purge (GDPR) 2020-01-09.txt' 'Netherlands/Purge (GDPR) 2020-01-27.txt' 'Switzerland/New mailing 2020-01-27.txt'

Given a GNU/Linux platform with bash as my preferred shell (in case it matters), how can I edit a similarly generated list of files?

Best Answer

From the comments, I gather that your command is something like this:

find -type f -mtime +14 -mtime -22 -iname '*.xml' | while IFS= read -r x; do xmlstarlet sel -T -t -v '//magicElement' -n "$x" | grep -q magicValue && echo "$x"; done

Instead of piping to a while loop, you could use -exec sh -c '...' to filter the files:

find -type f -mtime +14 -mtime -22 -iname '*.xml' \
  -exec sh -c 'xmlstarlet sel -T -t -v "//magicElement" "$1" | grep -q magicValue' find-sh {} \; \
  -exec vi -- {} +

To try it out, consider three files with the following contents:

.
├── a:<magicElement>magicValue</magicElement>
├── b:<magicElement>magicValue</magicElement>
└── c:<magicElement>someOtherValue</magicElement>

$ find . -type f \
  -exec sh -c 'xmlstarlet sel -T -t -v "//magicElement" "$1" | grep -q magicValue' find-sh {} \; \
  -exec echo vi -- {} +

Output:

vi -- ./a ./b
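If the filter really has to stay a pipeline, another option in bash is to collect its output into an array first and then start the editor from the shell itself, so that vi inherits the terminal as stdin. This is only a minimal sketch, assuming (as the question states) that no name contains a newline. generate_list is a hypothetical stand-in for the real pipeline, and echo is left in front of vi so the command is printed rather than run:

```shell
# bash-specific: uses mapfile and process substitution.
# Hypothetical stand-in for the validating pipeline; prints one name per line.
generate_list() {
    printf '%s\n' \
        'Netherlands/Purge (GDPR) 2020-01-09.txt' \
        'Switzerland/New mailing 2020-01-27.txt'
}

# Each input line becomes exactly one array element (-t strips the newline),
# so spaces and punctuation inside a name are preserved.
mapfile -t files < <(generate_list)

# Only build the command when the list is not empty.
# Drop the "echo" to open the editor for real.
if [ "${#files[@]}" -gt 0 ]; then
    echo vi -- "${files[@]}"
fi
```

Because the pipeline feeds mapfile rather than the editor, stdin of the final vi command is still the terminal.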

Example

Say I have the following sample directory.

$ tree
.
|-- dir1
|   `-- a\ file1.txt
|-- dir2
|   `-- a\ file2.txt
|-- dir3
|   `-- a\ file3.txt
`-- myscript

3 directories, 4 files

Now let's say I have this for ./myscript.

#!/bin/bash

for i in "$@"; do
    echo "file: $i"
done

Now when I run the following command.

$ find . -type f -print0 | xargs -r0 ./myscript 
file: ./dir2/a file2.txt
file: ./dir3/a file3.txt
file: ./dir1/a file1.txt
file: ./myscript

Or when I use the 2nd form like so:

$ find . -type f -exec ./myscript {} +
file: ./dir2/a file2.txt
file: ./dir3/a file3.txt
file: ./dir1/a file1.txt
file: ./myscript

Details

find + xargs

The above 2 methods, though looking different, are essentially the same. The first takes the output from find, with each name terminated by a NUL character (\0) thanks to find's -print0 switch, and pipes it to xargs -0, which is specifically designed to read NUL-delimited input. That non-standard syntax was introduced by GNU find and xargs but is nowadays also found in a few others, such as most recent BSDs. With GNU xargs, the -r option is required to avoid calling myscript if find finds nothing; BSD xargs does not need it.
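To see what -r changes, here is a small sketch that feeds xargs an empty input. The message text is made up for the demonstration, and the result without -r depends on which xargs is installed:

```shell
# Empty input: with -r, xargs does not run the command at all.
with_r=$(printf '' | xargs -r0 echo "myscript was called")

# Without -r, GNU xargs still runs the command once with no arguments;
# BSD xargs does not, so this second result is platform dependent.
without_r=$(printf '' | xargs -0 echo "myscript was called")

printf 'with -r:    [%s]\n' "$with_r"
printf 'without -r: [%s]\n' "$without_r"
```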

NOTE: This approach assumes the argument list never becomes too long to fit on a single command line. If it does, xargs starts a second invocation of ./myscript with the remaining results from find.
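The splitting into several invocations can be simulated without a huge file list by capping the number of arguments per call with -n. This is a sketch with made-up names; the real limit is a byte count, not an argument count:

```shell
# Three NUL-terminated names, at most two arguments per invocation.
batches=$(printf '%s\0' 'a file1.txt' 'a file2.txt' 'a file3.txt' |
    xargs -0 -n 2 sh -c 'echo "invocation with $# argument(s)"' sh)

# Prints one line per invocation: one with 2 arguments, then one with 1.
printf '%s\n' "$batches"
```

For an editor this matters: a second invocation means a second, separate vi session.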

find with +

That's the standard way (though it was only added relatively recently (2005) to the GNU implementation of find). The ability to do what we're doing with xargs is literally built into find. So find will find a list of files and pass as many of them as will fit as arguments to the command specified after -exec (note that {} must come last, immediately before the +, in this case), running the command several times if needed.
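The difference between the two -exec terminators is easy to check by asking a tiny sh -c command how many arguments it received. A sketch using a throwaway directory (the file names are only examples):

```shell
tmp=$(mktemp -d)
touch "$tmp/a file1.txt" "$tmp/a file2.txt" "$tmp/a file3.txt"

# \; runs the command once per file, so each run sees exactly 1 argument.
per_file=$(find "$tmp" -type f -exec sh -c 'echo "$#"' sh {} \;)

# + batches the files, so a single run sees all 3 arguments.
batched=$(find "$tmp" -type f -exec sh -c 'echo "$#"' sh {} +)

printf 'with \\; :\n%s\n' "$per_file"
printf 'with + : %s\n' "$batched"
rm -rf "$tmp"
```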

Why no quoting?

In the first example we're taking a shortcut and avoiding the quoting issues completely by using NUL characters to separate the arguments. When xargs is given this list, it's instructed to split on the NULs, which effectively protects our individual command atoms.

In the second example we're keeping the results internal to find, so it knows what each file atom is and is guaranteed to handle them appropriately, thereby avoiding the whole business of quoting them.
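To make the point concrete, the following sketch counts how many arguments the command receives with and without NUL separation. It assumes a throwaway directory with two files whose names contain one space each:

```shell
tmp=$(mktemp -d)
touch "$tmp/a file1.txt" "$tmp/a file2.txt"

# Plain xargs splits on whitespace: "./a file1.txt" becomes "./a" and "file1.txt".
plain=$(cd "$tmp" && find . -type f | xargs sh -c 'echo "$#"' sh)

# With -print0 and -0, each name arrives as a single argument.
nul=$(cd "$tmp" && find . -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "whitespace-split: $plain arguments"
echo "NUL-split: $nul arguments"
rm -rf "$tmp"
```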

Maximum size of command line?

This question comes up from time to time, so as a bonus I'm adding it to this answer, mainly so I can find it in the future. You can use GNU xargs to see what the environment's limits are, like so:

$ xargs --show-limits
Your environment variables take up 4791 bytes
POSIX upper limit on argument length (this system): 2090313
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2085522
Size of command buffer we are actually using: 131072
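The raw kernel limit behind these numbers can also be queried with getconf. A small sketch; the figures will differ from system to system, and measuring the environment with env | wc -c is only an approximation of what xargs reports:

```shell
# Total bytes allowed for arguments plus environment on this system.
arg_max=$(getconf ARG_MAX)

# Approximate size of the current environment in bytes.
env_bytes=$(( $(env | wc -c) ))

printf 'ARG_MAX:     %s bytes\n' "$arg_max"
printf 'environment: %s bytes\n' "$env_bytes"
```

In the sample output above, 2090313 + 4791 + 2048 = 2097152 (2 MiB), which appears consistent with xargs starting from ARG_MAX and subtracting the environment size plus 2048 bytes of headroom.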

Command-Line – Get List of Subdirectories Containing a File with Specific String

find . -type f -name '*f*' | sed -r 's|/[^/]+$||' |sort |uniq

The above finds all files below the current directory (.) that are regular files (-type f) and have f somewhere in their name (-name '*f*'). Next, sed removes the file name, leaving just the directory name. Then, the list of directories is sorted (sort) and duplicates removed (uniq).

The sed command consists of a single substitute. It looks for matches to the regular expression /[^/]+$ and replaces anything matching that with nothing. The dollar sign means the end of the line. [^/]+ means one or more characters that are not slashes. Thus, /[^/]+$ means all characters from the final slash to the end of the line. In other words, this matches the file name at the end of the full path. Thus, the sed command removes the file name, leaving unchanged the name of the directory that the file was in.
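The substitution can be tried on its own, without find, by feeding it a few sample paths. The paths below are invented for the illustration, and -E is used because both GNU and BSD sed accept it:

```shell
# Strip the final "/name" component from each path, then de-duplicate.
dirs=$(printf '%s\n' './dir1/foo.txt' './dir1/far.txt' './dir2/sub/f' |
    sed -E 's|/[^/]+$||' | sort -u)

printf '%s\n' "$dirs"
```

Here the two files in ./dir1 collapse into one line, and ./dir2/sub/f is reduced to ./dir2/sub.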

Simplifications

sort supports a -u flag (it is specified by POSIX), which makes uniq unnecessary. For GNU sed:

find . -type f -name '*f*' | sed -r 's|/[^/]+$||' |sort -u

And, for MacOS sed:

find . -type f -name '*f*' | sed -E 's|/[^/]+$||' |sort -u

Also, if your find command supports -printf (a GNU extension), it is possible to have find print the directory names directly. This avoids the need for sed:

find . -type f -name '*f*' -printf '%h\n' | sort -u
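Where -printf is not available, one possible substitute is to let find call dirname on each match. This is only a sketch with an invented directory layout; it starts one dirname process per file, so it is slower on large trees:

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/dir1" "$tmp/dir2/sub"
touch "$tmp/dir1/foo.txt" "$tmp/dir1/far.txt" "$tmp/dir2/sub/f" "$tmp/dir2/other"

# dirname prints the directory part of each matching path;
# "other" has no f in its name, so ./dir2 itself is not listed.
dirs=$(cd "$tmp" && find . -type f -name '*f*' -exec dirname {} \; | sort -u)

printf '%s\n' "$dirs"
rm -rf "$tmp"
```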

More robust version (Requires GNU tools)

The above versions will be confused by file names that include newlines. A more robust solution is to do the sorting on NUL-terminated strings:

find . -type f -name '*f*' -printf '%h\0' | sort -zu | sed -z 's/$/\n/'
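If the directory list is going to be processed further rather than just displayed, the NUL-terminated output can be read straight into a bash array. A sketch assuming GNU find and sort, with a throwaway tree that includes a directory name containing a newline:

```shell
# bash-specific: uses arrays, read -d '' and process substitution.
tmp=$(mktemp -d)
nl_dir=$'with\nnewline'   # directory name containing a newline
mkdir -p "$tmp/plain" "$tmp/$nl_dir"
touch "$tmp/plain/f1" "$tmp/plain/f2" "$tmp/$nl_dir/f3"

dirs=()
# read -d '' consumes one NUL-terminated record at a time,
# so the embedded newline stays inside a single element.
while IFS= read -r -d '' dir; do
    dirs+=("$dir")
done < <(cd "$tmp" && find . -type f -name '*f*' -printf '%h\0' | sort -zu)

printf 'found %d directories\n' "${#dirs[@]}"
rm -rf "$tmp"
```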