Ubuntu – How to find a string that spans multiple lines in a shell script

Tags: command line, scripts, text processing

I want to find the string

Time series prediction with ensemble models

in a PDF file using a shell script. I am using pdftotext "$file" - | grep "$string", where $file is the PDF file name and $string is the string above. This finds the line if the entire string is on a single line, but it cannot find the string when it is split across lines, like this:

Time series prediction with 
ensemble models

How can I resolve this? I am new to Linux, so a detailed explanation would be appreciated. Thanks in advance.

Best Answer

One possible way is to replace grep with pcregrep (available from the 'universe' repository), which supports multiline matches, and then, instead of searching for the literal string

Time series prediction with ensemble models

search for the Perl-compatible regular expression (PCRE)

Time\s+series\s+prediction\s+with\s+ensemble\s+models

where \s+ stands for one or more whitespace characters (including newlines). The latter step can be performed with the bash shell's built-in string substitution, which replaces every space in $string with \s+:

pdftotext "$file" - | pcregrep -M "${string// /\\s+}"
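To see what the substitution produces before handing it to pcregrep, the pattern can be built and printed on its own. This is a minimal sketch that assumes bash (the ${var//from/to} form is not available in plain sh); the variable name pattern is only illustrative.

```shell
#!/usr/bin/env bash
# The phrase to look for, written with ordinary single spaces.
string='Time series prediction with ensemble models'

# ${string// /X} replaces every space in $string with X.
# Inside double quotes, \\ becomes a single backslash, so X is \s+ here.
pattern="${string// /\\s+}"

# Prints: Time\s+series\s+prediction\s+with\s+ensemble\s+models
printf '%s\n' "$pattern"
```

Note that any regex metacharacters already present in $string (such as . or *) would keep their special meaning in the resulting pattern.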

If you can't use pcregrep, you may be able to get the output you want using plain grep with the -z switch. This tells grep to treat input "lines" as delimited by NUL characters rather than newlines, which in this case effectively makes it treat the whole input as a single line. The -P switch enables PCRE syntax (needed for \s) and -o prints only the matching part. For example, to print only the matches (without context):

pdftotext "$file" - | grep -zPo "${string// /\\s+}"
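The grep -z approach can be tried without a PDF at hand. The sketch below assumes bash and a GNU grep built with PCRE support (-P); the sample variable is a made-up stand-in for the output of pdftotext "$file" -. Because -z also makes grep end each output record with a NUL byte, tr is used to turn that into a newline for display.

```shell
#!/usr/bin/env bash
# Stand-in for the text that pdftotext would print; the phrase is
# split across two lines, with a trailing space before the line break.
sample=$'Some heading\nTime series prediction with \nensemble models\nAnother line\n'

string='Time series prediction with ensemble models'
pattern="${string// /\\s+}"

# -z: read the whole input as one record, -P: PCRE, -o: print only the match.
# tr converts the NUL terminator written by grep -z into a newline.
match=$(printf '%s' "$sample" | grep -zPo "$pattern" | tr '\0' '\n')
printf '%s\n' "$match"

# In a script, -q suppresses output and only sets the exit status.
if printf '%s' "$sample" | grep -zPq "$pattern"; then
    found=yes
else
    found=no
fi
echo "found: $found"
```

The printed match keeps the original line break, so it appears on two lines, while the exit status from -q is usually what a script needs to decide whether a file contains the phrase.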