Possible to extract title and pagenum of each page in a pdf file

pdf, text processing

Is there a way to extract the title and page number of each page in a PDF file, either with an existing application or by programming with a PDF library?

The title of each page is supposed to be the first line of the page, for example, in slides/presentation files.

The output is supposed to be a text file, with the following format:

title_of_first_page pagenum_of_first_page
title_of_second_page pagenum_of_second_page
...

Best Answer

The following script prints the first line of each page of the PDF file passed as its argument, followed by a space and the page number. It uses tools from Poppler (package poppler-utils on Debian or Ubuntu).

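#!/bin/bash
pdf="$1"
# Read the page count from the "Pages:" line of pdfinfo's output.
pages=$(pdfinfo "$pdf" | sed -nE 's/^Pages: +([0-9]+)$/\1/p')
for ((i = 1; i <= pages; i++)); do
    # Convert only page i to text on stdout and keep its first line.
    printf "%s %d\n" "$(pdftotext -f "$i" -l "$i" -layout "$pdf" - | head -n 1)" "$i"
done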
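
One caveat with taking the first line: with -layout, a page may start with blank or indented lines, so the title can come out empty or padded with spaces. Below is a minimal sketch of a variant that takes the first non-blank line of each page and trims it. The function names first_nonblank_line and print_page_titles are made up for this example, and it assumes pdfinfo and pdftotext from Poppler are installed. It uses a plain while loop so it also runs under sh.

```shell
#!/bin/bash

# Print the first non-blank line read from standard input, trimmed.
# Form feeds (which pdftotext emits as page breaks) are removed first,
# so a page containing only a form feed yields no output.
first_nonblank_line() {
    awk '{ gsub(/\f/, ""); sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }
         length($0) > 0 { print; exit }'
}

# Print "title pagenum" for every page of the PDF given as $1.
print_page_titles() {
    local pdf pages i title
    pdf="$1"
    # Second field of the "Pages:" line in pdfinfo's output.
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ { print $2 }')
    [ -n "$pages" ] || return 1
    i=1
    while [ "$i" -le "$pages" ]; do
        title=$(pdftotext -f "$i" -l "$i" -layout "$pdf" - | first_nonblank_line)
        printf '%s %d\n' "$title" "$i"
        i=$((i + 1))
    done
}
```

To get the text file asked for in the question, redirect the output, for example: print_page_titles slides.pdf > titles.txt (both file names are placeholders).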