Get “Text Content” of Files

binary, conversion, text

I was wondering if there is a tool to scrape the text content of a file. Basically, I would like to put in a PDF, Word document, compressed file, or anything else and get out the text it contains.

There is strings, which prints out sequences of printable characters, but this doesn't work on many compressed formats. Its "dumb" nature also means that it often prints more than just the text. It would be a good fallback for an unknown file type, though.

There is the pdftotext command, which does pretty much exactly what I want, but it only works on PDFs.

Is there a tool that works on multiple formats or that ties more of these tools together?

Additional Information

A lot of people have been mentioning that their tools are non-standard. I would just like to point out that this is fine.

Best Answer

Apart from strings, I've never seen a Unix command-line tool that can pull text out of arbitrary files. grep, awk, and friends all assume the data is already in a textual format.

Formats such as PDF or Word documents encode their text in a binary format, which makes it inaccessible without other tools to parse it out. These tools are typically one-trick ponies that can deal with only one or a few of these binary file formats.

You can use the file command to identify the file type.

Example

$ file /usr/share/cups/data/default.pdf
/usr/share/cups/data/default.pdf: PDF document, version 1.5

See man file for more on its usage.
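
As a rough illustration of how file could tie several single-format tools together, here is a minimal sketch. The function names pick_extractor and extract_text are made up for this example, and it assumes file supports the --brief and --mime-type options (as the common Linux implementation does) and that pdftotext is installed. Only PDF and plain text get special handling; everything else falls back to strings.

```shell
#!/bin/sh
# Hypothetical helper: map a MIME type to the name of an extraction strategy.
# Only a few types are covered; extend the case list for other formats.
pick_extractor() {
    case "$1" in
        application/pdf) echo pdftotext ;;
        text/*)          echo cat ;;
        *)               echo strings ;;   # "dumb" fallback for unknown types
    esac
}

# Hypothetical wrapper: print the text content of one file to stdout.
extract_text() {
    # Ask file for the bare MIME type, e.g. "application/pdf".
    mime=$(file --brief --mime-type "$1") || return 1
    case "$(pick_extractor "$mime")" in
        pdftotext) pdftotext "$1" - ;;     # "-" sends the text to stdout
        cat)       cat "$1" ;;
        *)         strings "$1" ;;
    esac
}
```

With these functions defined in the current shell, running extract_text /usr/share/cups/data/default.pdf would hand the file to pdftotext, while an unrecognized binary would go through strings.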

Lucene, Solr, and other indexing tools

You might be able to find or adapt a tool that ships with an indexing tool such as Lucene (such as for PDF & MSOffice docs) that can parse the text out of at least a large subset of the files one typically deals with on a daily basis. I'd expect it to handle PDF, Word, and LibreOffice file formats for starters.

Other tools

These tools can at least partially read certain binary files, so I'm adding them here just so you're aware of them. They aren't all-encompassing but might be useful to you nonetheless.
