Shell – Combine Files Containing Strings into One Document


Based on this script:

find . -name "*.txt" | grep 'LINUX/UNIX'

and

find . -name "*.txt" | grep 'LINUX/UNIX' | xargs cp <to a path>

from here, I can grep files for a certain string and copy the matches into one directory, but they are then kept as separate files. How can I cat these files into one coherent document?

Example
What I have in mind is the following: I have an archive of quotations spread out in separate files across hundreds of folders, with the folder names being the respective topics. So "philosophy/ontology/concepts/aletheia/notes.tex" will contain all my notes on the philosophical concept of aletheia, etc.

They all follow the same naming convention (the name is always notes.tex), so grepping them is easy. I can search them via grep, but I would like a script which not only finds them, but also concatenates all files containing the respective string into one large file.

Best Answer

To select regular files with names matching *.txt, in the current directory or below, that contain a particular string (as opposed to files whose contents match a particular regular expression), and to concatenate these files in the order they were found, you may use

find . -name '*.txt' -type f -exec grep -q -F 'LINUX/UNIX' {} \; -exec cat {} + >myfile

or

find . -name '*.txt' -type f -exec sh -c '
    for pathname do
        grep -q -F "LINUX/UNIX" "$pathname" && cat "$pathname"
    done' sh {} + >myfile

The grep utility is used here with its -q option. This makes it output nothing; instead, as soon as the given pattern matches, it terminates with a zero exit status, signalling "success". We use this exit status as a test in both of the commands above, to select only those files that contain the string LINUX/UNIX.
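The effect of -q can be seen in isolation; the filenames below are made up for the demonstration:

```shell
printf 'notes on LINUX/UNIX tools\n' > match.txt
printf 'nothing relevant here\n'     > nomatch.txt

# grep -q prints nothing; its exit status alone says whether the string
# was found (0 = found, non-zero = not found)
grep -q -F 'LINUX/UNIX' match.txt   && echo 'match.txt selected'
grep -q -F 'LINUX/UNIX' nomatch.txt || echo 'nomatch.txt skipped'
```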

The -F option makes grep interpret the pattern as a fixed string rather than as a regular expression. This potentially makes the command a bit faster, and it also means you can search for strings like *this* without having to treat the * character specially (as it's special in regular expressions).
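A quick illustration with a made-up two-line file: with -F the asterisks are matched literally, while without it an unescaped regex metacharacter (here a dot) can match more than intended:

```shell
printf 'emphasis looks like *this* here\nversion 5x00\n' > sample.txt

# With -F, the pattern is the literal string "*this*", asterisks included
grep -c -F '*this*' sample.txt            # prints 1

# Without -F, "5.00" is a regular expression: the dot matches the "x"
grep -c '5.00' sample.txt                 # prints 1 (matches "5x00")
grep -c -F '5.00' sample.txt || true      # prints 0 (no literal "5.00"; grep exits 1)
```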

Both commands write the concatenated file data to a file called myfile. If that file already exists, it will be truncated (emptied); otherwise it will be created. I intentionally picked an output filename that would not be found by the find command, i.e. one that does not end with .txt.
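Putting the first command to work on a miniature of such an archive (the directory and file names here are invented for the demonstration):

```shell
mkdir -p demo/topic-a demo/topic-b
printf 'LINUX/UNIX quote one\n'  > demo/topic-a/notes.txt
printf 'unrelated material\n'    > demo/topic-b/notes.txt
printf 'LINUX/UNIX quote two\n'  > demo/other.txt

# For each regular *.txt file: run the grep -q test, and only if it
# succeeds, hand the file to cat; all output is collected in myfile
find demo -name '*.txt' -type f \
    -exec grep -q -F 'LINUX/UNIX' {} \; -exec cat {} + >myfile

cat myfile   # the two matching files, concatenated in find's traversal order
```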


Note that the question currently contains code that filters the output of find with grep and then calls cp via xargs. This is not the asker's own code, and it has several issues. One issue is that it does not concatenate the contents of any files; another is that it applies grep to the pathnames output by find rather than to the contents of the files. See also Why is looping over find's output bad practice?, which is relevant here.

To use the format of the code in the question to actually solve the issue, i.e. letting find produce a list of pathnames and then, separately, having grep select the ones we're interested in, to finally cat these:

find . -name '*.txt' -type f -print0 |
xargs -0 grep -lZ -F 'LINUX/UNIX' |
xargs -0 cat >myfile

This passes the pathnames of the files whose names end in .txt from find to the first xargs as a nul-delimited list. The xargs utility invokes grep on these, and grep outputs the pathnames of the files that contain matches, again as a nul-delimited list. It's -l that makes grep output the pathnames of the matching files, and -Z that turns this into a nul-delimited rather than newline-delimited list.

This list is then read by the final xargs which invokes cat on each file. The concatenated result is written to myfile as before.
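For concreteness, here is the pipeline run on a small made-up tree (GNU grep and GNU xargs assumed; the directory names are invented):

```shell
mkdir -p archive/topic-a archive/topic-b
printf 'LINUX/UNIX quote\n'   > archive/topic-a/notes.txt
printf 'unrelated material\n' > archive/topic-b/notes.txt

# find emits a nul-delimited list; grep -lZ keeps it nul-delimited for
# the final xargs, which concatenates only the matching files
find archive -name '*.txt' -type f -print0 |
xargs -0 grep -lZ -F 'LINUX/UNIX' |
xargs -0 cat >myfile

cat myfile   # only the matching file's contents
```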

Note that this is a much more awkward way of solving the issue: there is potential for forgetting what format the file list is in between the stages of the pipeline, and it assumes that whoever runs the code is using a GNU system, or at least GNU tools (i.e. it's hopelessly non-portable).
