Ubuntu – Find text file containing a given text ignoring new lines and spaces

grep

I have a string like "thisissometext". I want to find all text files inside a given directory (recursively) that contain this string, or any variation of it with whitespace and/or newlines in the middle. For example, a text file containing "this is sometext", "this\n issometext", or "this\n isso metext" should show up in the search. How can I do this?

Best Answer

With newer versions of GNU grep (those that have the -z option), you can use this one-liner:

find . -type f -exec grep -lz 'this[[:space:]]*is[[:space:]]*some[[:space:]]*text' {} +

This assumes the whitespace can appear only between the words.

If you just want to search all files recursively starting from the current directory, you don't need find; grep -r (recursive) is enough. find is useful when you need to be selective about which files to search, e.g. to exclude certain directories. So, simply:

grep -rlz 'this[[:space:]]*is[[:space:]]*some[[:space:]]*text' .

  • The main trick here is -z: it makes grep treat input lines as terminated by an ASCII NUL byte instead of a newline. A typical text file contains no NUL bytes, so the whole file is read as a single line and newlines can be matched like any other character.

  • The [[:space:]] character class matches any whitespace character, including space, tab, CR and LF, so we can use it to match whatever whitespace appears between the words.

  • grep -l prints only the names of the files that contain a match. If you want to print the matches as well, use -H instead of -l. Keep in mind that with -z the matching "line" is the whole file, so add -o if you want only the matched text.
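To see what -z changes in practice, here is a minimal sketch that can be pasted into a terminal. It assumes GNU grep; the temporary directory and the file names are made up for the demonstration and are not part of the original question.

```shell
# Throwaway directory with sample files; all file names are made up for the demo.
demo_dir=$(mktemp -d)
printf 'this is sometext\n'    > "$demo_dir/spaces.txt"
printf 'this\n issometext\n'   > "$demo_dir/newlines.txt"
printf 'this\n isso metext\n'  > "$demo_dir/split.txt"
printf 'nothing to see here\n' > "$demo_dir/other.txt"

pattern='this[[:space:]]*is[[:space:]]*some[[:space:]]*text'

# Without -z grep works line by line, so a match that spans a newline is missed.
without_z=$(cd "$demo_dir" && grep -rl "$pattern" . | sort)

# With -z each file is a single record, so [[:space:]]* can also consume the newline.
with_z=$(cd "$demo_dir" && grep -rlz "$pattern" . | sort)

printf 'without -z:\n%s\n\nwith -z:\n%s\n' "$without_z" "$with_z"

rm -rf "$demo_dir"
```

Only spaces.txt should be listed without -z, while both spaces.txt and newlines.txt should be listed with it. split.txt is not found in either case, because this pattern only allows whitespace between the words, not inside them.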

On the other hand, if the whitespace can appear anywhere, not just between the words, the pattern loses its good looks:

grep -rlz \
't[[:space:]]*h[[:space:]]*i[[:space:]]*s[[:space:]]*i[[:space:]]*'\
's[[:space:]]*s[[:space:]]*o[[:space:]]*m[[:space:]]*e[[:space:]]*'\
't[[:space:]]*e[[:space:]]*x[[:space:]]*t' .

With the -P (PCRE) option you can replace [[:space:]] with \s, which looks much nicer:

grep -rlzP 't\s*h\s*i\s*s\s*i\s*s\s*s\s*o\s*m\s*e\s*'\
't\s*e\s*x\s*t' .

The best option is to follow @steeldriver's suggestion and let sed generate the pattern for us:

grep -rlzP "$(sed 's/./\\s*&/2g' <<< "thisissometext")" .
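To check what sed actually produces, the pattern can be built and printed separately before it is handed to grep. The following is a rough sketch assuming GNU sed and a GNU grep built with PCRE support. It feeds the string through printf and a pipe instead of a here-string so that it also runs in shells without <<<. The variable names and the sample file are made up for the demonstration, and the search string is assumed to contain no regex metacharacters (those would have to be escaped first).

```shell
needle='thisissometext'

# GNU sed: the 2g flag applies the substitution to the 2nd match and every later
# one, so each character except the first gets \s* in front of it.
pattern=$(printf '%s\n' "$needle" | sed 's/./\\s*&/2g')
printf '%s\n' "$pattern"

# Try the generated pattern on a throwaway file (made-up name) in which the
# whitespace falls inside the words as well as between them.
demo_dir=$(mktemp -d)
printf 'this\n isso metext\n' > "$demo_dir/split.txt"
matched=$(cd "$demo_dir" && grep -rlzP "$pattern" .)
printf '%s\n' "$matched"
rm -rf "$demo_dir"
```

The printed pattern should be the same t\s*h\s*i\s*... sequence that was written out by hand above, and split.txt should be reported as a match.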