PDF Text Processing – How to Get Page Numbers of a Pattern in PDF

awk, grep, pdf, pdfgrep, text-processing

I can find the page numbers of a multiline pattern in a PDF file by following the answers to "How shall I grep a multi-line pattern in a pdf file and in a text file?" and "How can I search a string in a pdf file, and find the physical page number of each page where the string appears?":

$ pdfgrep -Pn '(?s)image\s+?not\s+?available'  main_text.pdf 
49: image
   not
available
51: image
   not
available
53: image
   not
available
54: image
   not
available
55: image
   not
available

I would like to extract the page number only, but because the pattern is multiline, I get

$ pdfgrep -Pn '(?s)image\s+?not\s+?available'  main_text.pdf | awk -F":" '{print $1}'
49
   not
available
51
   not
available
53
   not
available
54
   not
available
55
   not
available

instead of

49
51
53
54
55

How can I extract only the page numbers, regardless of whether the pattern is multiline? Thanks.

Best Answer

It's a bit hacky, but since you are already using a Perl-compatible regular expression, you could use the \K ("keep left") escape, which drops everything matched before it from the reported match. That lets you match everything in your expression (and anything else up to the next line end) but exclude it from the output:

pdfgrep -Pn '(?s)image\s+?not\s+?available.*?$\K'  main_text.pdf

The output will still include the : separator, however.
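
If the trailing separator is unwanted, one option is to post-filter the output with awk. The following is only a sketch: page_numbers is a hypothetical helper name, and it assumes that every match line starts with the page number followed by a colon, and that no continuation line of a multiline match happens to start with digits and a colon. The printf line just replays sample output from the question so the filter can be tried without a PDF.

```shell
# Hypothetical helper: reduce `pdfgrep -n` style output to bare page numbers.
# Assumption: each match begins with "<page>:"; continuation lines of a
# multiline match do not begin with digits followed by a colon.
page_numbers() {
    # Keep only lines that start with "<digits>:" and print the first field.
    awk -F: '/^[0-9]+:/ { print $1 }'
}

# Sample input copied from the output shown in the question.
printf '49: image\n   not\navailable\n51: image\n   not\navailable\n' | page_numbers
```

Because the filter discards every line that does not start with a page number, it should work with or without the \K trick, for example as pdfgrep -Pn '(?s)image\s+?not\s+?available' main_text.pdf | page_numbers. If a page can contain several matches, appending | sort -nu would collapse the duplicates.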
