Have you looked at Lucene or Sphinx? While you will need to parse the documents you want to index initially, once that's done, either one can search from the CLI.
For Lucene, there is some info available on doing this.
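For instance, Lucene ships demo classes for indexing and searching a directory of files straight from the shell. The class names below are real, but the classpath layout and exact flags vary by Lucene version, so treat this as a sketch:

java -cp 'lucene/*' org.apache.lucene.demo.IndexFiles -docs ./docs -index ./index
java -cp 'lucene/*' org.apache.lucene.demo.SearchFiles -index ./index -query 'your search terms'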
Sphinx is a bit more vague, but there is also some documentation available. You can pass structured XML data of your choice to Sphinx via the xmlpipe2 data source.
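An xmlpipe2 source is just a command that prints XML in Sphinx's docset format; as a sketch, the hypothetical script below emits a single one-field document (the field name is made up):

#!/bin/sh
# Hypothetical xmlpipe2 source: Sphinx runs this command and indexes the XML it prints.
cat <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
  <sphinx:field name="content"/>
</sphinx:schema>
<sphinx:document id="1">
  <content>text parsed out of your first document</content>
</sphinx:document>
</sphinx:docset>
EOF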
Lucene relies on Java, while Sphinx is built in C++ with no outside dependencies needed.
Either one is going to require a bit of work to do what you want, but it seems like a totally workable solution.
It's not clear what you mean by "quality loss". That could mean a lot of different things. Could you post some samples to illustrate? Perhaps cut the same section out of the poor quality and good quality versions (as a PNG to avoid further quality loss).
Perhaps you need to use -density to do the conversion at a higher dpi:

convert -density 300 file.pdf page_%04d.jpg

(You can prepend -units PixelsPerInch or -units PixelsPerCentimeter if necessary. My copy defaults to ppi.)
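For example, to be explicit about the units:

convert -units PixelsPerInch -density 300 file.pdf page_%04d.jpg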
Update: As you pointed out, gscan2pdf (the way you're using it) is just a wrapper for pdfimages (from poppler). pdfimages does not do the same thing that convert does when given a PDF as input.

convert takes the PDF, renders it at some resolution, and uses the resulting bitmap as the source image.

pdfimages looks through the PDF for embedded bitmap images and exports each one to a file. It simply ignores any text or vector drawing commands in the PDF.

As a result, if what you have is a PDF that's just a wrapper around a series of bitmaps, pdfimages will do a much better job of extracting them, because it gets you the raw data at its original size. You probably also want to use the -j option to pdfimages, because a PDF can contain raw JPEG data. By default, pdfimages converts everything to PNM format, and converting JPEG > PPM > JPEG is a lossy process.
So, try
pdfimages -j file.pdf page
You may or may not need to follow that with a convert to .jpg step (depending on what bitmap format the PDF was using).
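If you do end up with PNM files, a loop along these lines (assuming the page- prefix from the command above) would convert them:

for f in page-*.ppm; do convert "$f" "${f%.ppm}.jpg"; done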
I tried this command on a PDF that I had made myself from a sequence of JPEG images. The extracted JPEGs were byte-for-byte identical to the source images. You can't get higher quality than that.
.epub files are .zip files containing XHTML and CSS and some other files (including images, various metadata files, and maybe an XML file called toc.ncx containing the table of contents).

The following script uses unzip -p to extract toc.ncx to stdout, pipes it through the xml2 command, and then uses sed to extract just the text of each chapter heading. It takes one or more filename arguments on the command line.
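A minimal sketch of such a script (assuming toc.ncx can be matched anywhere in the archive with a wildcard, and that chapter titles appear at /ncx/navMap/navPoint/navLabel/text in xml2's flattened output):

#!/bin/sh
# For each epub named on the command line: print its filename,
# then its chapter titles indented by two spaces.
for f in "$@"; do
    printf '%s:\n' "$f"
    unzip -p "$f" '*toc.ncx' |   # extract toc.ncx to stdout
        xml2 |                   # flatten XML into /path=value lines
        sed -n 's!^/ncx/navMap/navPoint/navLabel/text=!  !p'
done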
It outputs the epub's filename followed by a :, then indents each chapter title by two spaces on the following lines. For example:
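(with a hypothetical book; the titles here are made up:)

mybook.epub:
  Chapter One
  Chapter Two
  Chapter Three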
If an epub file doesn't contain a toc.ncx, you'll see a pair of error lines for that particular book: the first is from unzip, the second from xml2. xml2 will also warn about other errors it finds - e.g. an improperly formatted toc.ncx file. Note that the error messages are on stderr, while the book's filename is still on stdout.
xml2 is available pre-packaged for Debian, Ubuntu and other Debian derivatives, and probably most other Linux distros too.

For simple tasks like this (i.e. where you just want to convert XML into a line-oriented format for use with sed, awk, cut, grep, etc), xml2 is simpler and easier to use than xmlstarlet.
BTW, if you want to print the epub's title as well, change the sed script to something along these lines (a sketch, assuming the title is at /ncx/docTitle/text in xml2's output):
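sed -n -e 's!^/ncx/docTitle/text=!!p' \
       -e 's!^/ncx/navMap/navPoint/navLabel/text=!  !p'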
or replace it with an awk script (again a sketch, with the same path assumptions):
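awk '
    # sub() returns 1 when it rewrites the line, so it doubles as the pattern;
    # stripping the prefix this way keeps any "=" characters inside the titles intact.
    sub(/^\/ncx\/docTitle\/text=/, "")                     { print; next }
    sub(/^\/ncx\/navMap\/navPoint\/navLabel\/text=/, "  ") { print }
'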