I can find the page numbers on which a multiline pattern appears in a PDF file by following "How shall I grep a multi-line pattern in a pdf file and in a text file?" and "How can I search a string in a pdf file, and find the physical page number of each page where the string appears?":
$ pdfgrep -Pn '(?s)image\s+?not\s+?available' main_text.pdf
49: image
not
available
51: image
not
available
53: image
not
available
54: image
not
available
55: image
not
available
I would like to extract the page number only, but because the pattern is multiline, I get
$ pdfgrep -Pn '(?s)image\s+?not\s+?available' main_text.pdf | awk -F":" '{print $1}'
49
not
available
51
not
available
53
not
available
54
not
available
55
not
available
instead of
49
51
53
54
55
How can I extract only the page numbers, regardless of whether the pattern is multiline? Thanks.
Best Answer
It's a bit hacky, but since you are already using a Perl-compatible RE, you could use the \K "keep left" modifier to match everything in your expression (and anything else up to the next line end) but exclude it from the output. The output will still include the : separator, however.
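If the leftover : separator is a problem, one option is to post-process the output instead of (or in addition to) changing the pattern. The sketch below is not part of the answer above: page_numbers is a hypothetical helper name, and it assumes that every line starting a match has the form "N:" followed by anything (or nothing), and that continuation lines of a multiline match never start with digits followed by a colon.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): read pdfgrep -n style
# output on stdin and print only the page numbers.
# Assumption: match lines start with "<digits>:"; continuation lines do not.
page_numbers() {
    # -n suppresses default printing; the trailing p prints only lines where
    # the substitution succeeded, i.e. lines beginning with "<digits>:".
    # uniq collapses adjacent repeats when a page matches more than once.
    sed -n 's/^\([0-9][0-9]*\):.*$/\1/p' | uniq
}

# Intended usage with the command from the question:
#   pdfgrep -Pn '(?s)image\s+?not\s+?available' main_text.pdf | page_numbers

# Self-contained demonstration on sample text shaped like the output above.
printf '49: image\nnot\navailable\n51: image\nnot\navailable\n' | page_numbers
```

Because the filter keeps only lines that begin with a page number and a colon, it should behave the same whether the match text spans one line or several, and it also accepts lines that consist of nothing but "N:".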