Shell – Combine Files Containing Strings into One Document


Based on this script:

find . -name "*.txt" | grep 'LINUX/UNIX'

and

find . -name "*.txt" | grep 'LINUX/UNIX' | xargs cp <to a path>

from here, I can grep files for a certain string and copy the matches into one directory, but they are then kept as separate files. How can I cat these files into one coherent document?

Example
What I have in mind is the following: I have an archive of quotations spread out in separate files across hundreds of folders, with the folder names being the respective topics. So "philosophy/ontology/concepts/aletheia/notes.tex" will contain all my notes on the philosophical concept of aletheia, etc.

They all follow the same naming convention (the name is always notes.tex), so grepping them is easy. I can search them via grep, but I would like a script which not only finds them, but also concatenates all files containing the respective string into one large file.

Best Answer

To select regular files with names matching *.txt, in the current directory or below, that contain a particular string (as opposed to files whose contents match a particular regular expression), and to concatenate these files in the order they were found, you may use

find . -name '*.txt' -type f -exec grep -q -F 'LINUX/UNIX' {} \; -exec cat {} + >myfile

or

find . -name '*.txt' -type f -exec sh -c '
    for pathname do
        grep -q -F "LINUX/UNIX" "$pathname" && cat "$pathname"
    done' sh {} + >myfile

The grep utility is used here with its -q option. This makes it output nothing; instead, as soon as the given pattern matches, it terminates with a zero exit status, signalling "success". We use this exit status as a test in both of the commands above, to select only those files that contain the string LINUX/UNIX.
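The effect of -q can be seen in isolation; the filenames below are made up for the demonstration:

```shell
printf 'notes on LINUX/UNIX tools\n' > match.txt
printf 'nothing relevant here\n'     > nomatch.txt

# grep -q prints nothing; its exit status alone says whether the string
# was found (0 = found, non-zero = not found)
grep -q -F 'LINUX/UNIX' match.txt   && echo 'match.txt selected'
grep -q -F 'LINUX/UNIX' nomatch.txt || echo 'nomatch.txt skipped'
```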

The -F option makes grep interpret the pattern as a fixed string rather than as a regular expression. This potentially makes the command a bit faster, and it also means you can search for strings like *this* without having to treat the * character specially (as it's special in regular expressions).
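A quick illustration with a made-up two-line file: with -F the asterisks are matched literally, while without it an unescaped regex metacharacter (here a dot) can match more than intended:

```shell
printf 'emphasis looks like *this* here\nversion 5x00\n' > sample.txt

# With -F, the pattern is the literal string "*this*", asterisks included
grep -c -F '*this*' sample.txt            # prints 1

# Without -F, "5.00" is a regular expression: the dot matches the "x"
grep -c '5.00' sample.txt                 # prints 1 (matches "5x00")
grep -c -F '5.00' sample.txt || true      # prints 0 (no literal "5.00"; grep exits 1)
```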

Both commands write the concatenated file data to a file called myfile. If that file already exists, it will be truncated (emptied); otherwise it will be created. I intentionally picked an output filename that would not be found by the find command, i.e. one that does not end with .txt.
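Putting the first command to work on a miniature of such an archive (the directory and file names here are invented for the demonstration):

```shell
mkdir -p demo/topic-a demo/topic-b
printf 'LINUX/UNIX quote one\n'  > demo/topic-a/notes.txt
printf 'unrelated material\n'    > demo/topic-b/notes.txt
printf 'LINUX/UNIX quote two\n'  > demo/other.txt

# For each regular *.txt file: run the grep -q test, and only if it
# succeeds, hand the file to cat; all output is collected in myfile
find demo -name '*.txt' -type f \
    -exec grep -q -F 'LINUX/UNIX' {} \; -exec cat {} + >myfile

cat myfile   # the two matching files, concatenated in find's traversal order
```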


Note that the question currently contains code that filters the output of find with grep and then calls cp via xargs. This is not the asker's own code, and it has several issues. One issue is that it does not concatenate the contents of any files; another is that it applies grep to the pathnames output by find rather than to the contents of the files. See also Why is looping over find's output bad practice?, which is relevant here.

To use the format of the code in the question to actually solve the issue, i.e. letting find produce a list of pathnames and then, separately, having grep select the ones we're interested in, to finally cat these:

find . -name '*.txt' -type f -print0 |
xargs -0 grep -lZ -F 'LINUX/UNIX' |
xargs -0 cat >myfile

This passes the pathnames of the files whose names end in .txt from find to the first xargs as a nul-delimited list. The xargs utility invokes grep on these, and grep outputs the pathnames of the files that contain matches, again as a nul-delimited list. It's -l that makes grep output the pathnames of the matching files, and -Z that turns this into a nul-delimited rather than newline-delimited list.

This list is then read by the final xargs which invokes cat on each file. The concatenated result is written to myfile as before.
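For concreteness, here is the pipeline run on a small made-up tree (GNU grep and GNU xargs assumed; the directory names are invented):

```shell
mkdir -p archive/topic-a archive/topic-b
printf 'LINUX/UNIX quote\n'   > archive/topic-a/notes.txt
printf 'unrelated material\n' > archive/topic-b/notes.txt

# find emits a nul-delimited list; grep -lZ keeps it nul-delimited for
# the final xargs, which concatenates only the matching files
find archive -name '*.txt' -type f -print0 |
xargs -0 grep -lZ -F 'LINUX/UNIX' |
xargs -0 cat >myfile

cat myfile   # only the matching file's contents
```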

Note that this is a much more awkward way of solving the issue: there is potential for forgetting what format the file list is in between the stages of the pipeline, and it assumes that whoever runs the code is using a GNU system, or at least GNU tools (i.e. it's hopelessly non-portable).
