PDF Text Processing – How to Get Page Numbers of a Pattern in PDF

awk, grep, pdf, pdfgrep, text processing

I find the page numbers of a multiline pattern in a PDF file by following the answers to "How shall I grep a multi-line pattern in a pdf file and in a text file?" and "How can I search a string in a pdf file, and find the physical page number of each page where the string appears?":

$ pdfgrep -Pn '(?s)image\s+?not\s+?available'  main_text.pdf 
49: image
   not
available
51: image
   not
available
53: image
   not
available
54: image
   not
available
55: image
   not
available

I would like to extract the page number only, but because the pattern is multiline, I get

$ pdfgrep -Pn '(?s)image\s+?not\s+?available'  main_text.pdf | awk -F":" '{print $1}'
49
   not
available
51
   not
available
53
   not
available
54
   not
available
55
   not
available

instead of

49
51
53
54
55

I wonder how I can extract only the page numbers, regardless of whether the pattern is multiline. Thanks.

Best Answer

It's a bit hacky, but since you are already using a Perl-compatible RE, you could use the \K "keep left" modifier to match everything in your expression (and anything else up to the next line end) but exclude it from the output:

pdfgrep -Pn '(?s)image\s+?not\s+?available.*?$\K'  main_text.pdf

The output will still include the : separator, however.
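As an alternative that leaves the pattern untouched, the pdfgrep output can be filtered afterwards. The following is only a sketch: page_numbers is a hypothetical helper name, and it assumes that every match starts with a "<page>:" prefix while continuation lines of a multiline match do not begin with digits followed by a colon. The sample input is copied from the question's output rather than produced by pdfgrep.

```shell
# Hypothetical helper: keep only lines that begin with "<digits>:"
# and print the digits. Continuation lines lack that prefix and are dropped.
page_numbers() {
    awk -F: '/^[0-9]+:/ { print $1 }'
}

# Sample input taken from the question (first two matches only).
sample_output=$(printf '49: image\n   not\navailable\n51: image\n   not\navailable\n')

pages=$(printf '%s\n' "$sample_output" | page_numbers)
printf '%s\n' "$pages"

# With the real file this would be used as:
#   pdfgrep -Pn '(?s)image\s+?not\s+?available' main_text.pdf | page_numbers
```

If a page can contain several matches, appending | sort -nu to the pipeline would collapse the repeated page numbers.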

Nested Braces

Let's take this as a test file with lots of nested braces:

a{b{c}d}e
1{2
}3{
}
5

Here is a sed command that deletes all text between curly brackets, including nested braces:

$ sed ':again;$!N;$!b again; :b; s/{[^{}]*}//g; t b' file2
ae
13
5

Explanation:

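:again;$!N;$!b again

This reads the whole file into the pattern space: N appends the next input line, and the branch back to the label again repeats that until the last line ($) has been read.

:b

This defines a label b.

s/{[^{}]*}//g

This removes text in braces as long as the text contains no inner braces.

t b

If the above substitute command resulted in a change, jump back to label b. In this way, the substitute command is repeated until all brace-groups are removed. Each pass strips the innermost groups, so a{b{c}d}e first becomes a{bd}e and then ae.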
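To try the loop end to end, the sketch below recreates the test file and runs the same commands. It splits the script into separate -e expressions, because terminating a label with a semicolon is a GNU sed extension and other sed implementations may read the rest of the line as part of the label name. The temporary directory is an assumption of this example; file2 is the file name used above.

```shell
# Recreate the test file from the text in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'a{b{c}d}e' '1{2' '}3{' '}' '5' > "$workdir/file2"

# Same commands as the one-liner, one -e per command so that each
# label ends at the end of its own expression.
result=$(sed -e ':again' -e '$!N' -e '$!b again' \
             -e ':b' -e 's/{[^{}]*}//g' -e 't b' "$workdir/file2")
printf '%s\n' "$result"

rm -rf "$workdir"
```

This prints ae, 13 and 5 on separate lines, matching the output shown above.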

Print Nth Line – How to Print Nth Line Before Each Matching Pattern

This needs a buffer of lines.

Give this a try:

awk -v N=4 -v pattern="example.*pattern" '{i=(1+(i%N)); if (NR>N && $0 ~ pattern) print buffer[i]; buffer[i]=$0;}' file

Set N to how many lines before the match the line to print is (N=4 prints the 4th line before each match).

Set pattern to the regex to search for.

buffer is an array of N elements. It is used to store the lines. Each time the pattern is found, the Nth line before the pattern is printed.
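The buffer behaviour can be checked on a small input. This is a sketch with made-up sample lines and N=2. It guards with NR > N so that the first N lines, which have no Nth predecessor, print nothing.

```shell
# Seven made-up input lines; lines 4 and 7 match the pattern.
result=$(printf '%s\n' alpha beta gamma 'example of a pattern' \
                       delta epsilon 'example second pattern' |
    awk -v N=2 -v pattern='example.*pattern' '
        # i cycles through 1..N, so buffer[i] still holds the line
        # read N lines earlier until it is overwritten below.
        { i = 1 + (i % N) }
        NR > N && $0 ~ pattern { print buffer[i] }
        { buffer[i] = $0 }
    ')
printf '%s\n' "$result"
```

With N=2 this prints beta (two lines before the first match) and delta (two lines before the second).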
