Get “Text Content” of Files

Tags: binary, conversion, text

I was wondering if there is a tool to scrape the text content of a file. Basically, I would like to put in a PDF, a Word document, a compressed file, or anything else and get out the text it contains.

There is strings, which prints sequences of printable characters, but it doesn't work on the many formats that are compressed. Its "dumb" nature also means that it often prints more than just the text. It would be a good fallback for an unknown file type, though.

There is the pdftotext command, which does pretty much exactly what I want, but it only works on PDFs.

Is there a tool that works on multiple formats or that ties more of these tools together?

Additional Information

A lot of people have been mentioning that their tools are non-standard. I would just like to point out that this is fine.

Best Answer

Apart from strings, I've never seen a Unix command-line tool that can pull text out of an arbitrary file. grep, awk, and friends all assume the data is already in a textual format.

Files such as PDFs or Word documents encode their textual data in a binary format, which makes it inaccessible without other tools to parse it out. These tools are typically one-trick ponies that can deal with only one or a few of these binary file formats.

You can use the file command to identify the file type.

Example

$ file /usr/share/cups/data/default.pdf
/usr/share/cups/data/default.pdf: PDF document, version 1.5

See man file for more on its usage.
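To tie this together, here is a minimal sketch of a dispatcher that asks file for the MIME type and then picks an extractor. The function name extract_text is made up for illustration, the list of types is far from complete, and it assumes file, strings, gzip, and (for PDFs) pdftotext are installed.

```shell
#!/bin/sh
# Hypothetical helper: print the text content of one file, choosing an
# extractor based on the MIME type reported by file(1).
extract_text() {
  f=$1
  # -b omits the filename from the output; --mime-type prints e.g. "application/pdf"
  mime=$(file -b --mime-type "$f")
  case $mime in
    text/*)
      cat "$f" ;;                      # already text, pass it through
    application/pdf)
      pdftotext "$f" - ;;              # "-" sends the text to stdout
    application/gzip|application/x-gzip)
      gzip -dc "$f" | strings ;;       # decompress, then fall back to strings
    *)
      strings "$f" ;;                  # unknown type: the "dumb" fallback
  esac
}
```

You would call it as extract_text some/file and add more case branches as you find tools for other formats.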

Lucene, Solr, and other indexing tools

You might be able to adapt or find a tool that's included with an indexing tool such as Lucene (for example, its support for PDF and MS Office documents) that can parse the text out of at least a large subset of the files one typically deals with on a daily basis. I'd expect it to be able to contend with PDF, Word, and LibreOffice file formats for starters.

Other tools

These tools can at least partially read certain binary files, so I'm adding them here just so you're aware of them. They aren't all-encompassing, but they might be useful to you nonetheless.

Related Solutions

Bash – convert bulk of text files to pdf with naming based upon header file

If you have a relatively simple file tree with only one level of directories, where each directory contains a list of files but no subdirectories, you should be able to do something like this (you can paste this directly into your terminal and hit Enter):

for dir in *; do    ## For each directory
 if [ "$(ls -A "$dir")" ]; then  ## If the dir is not empty
   for file in "$dir"/*; do      ## For each file in $dir
    i=0;                         ## initialize a counter
    ## Get the subject
    sub=$(grep ^Subject: "$file" | cut -d ':' -f 2-);
    ## get the date, and format it to MMDDYY_Hour:Min:Sec
    date=$(date -d "$(grep ^Date: "$file" | cut -d ':' -f 2-)" +%m%d%y_%H:%M:%S);
    ## the pdf's name will be <directory's name> _ <date> _ <subject>
    name="$dir"_"$date"_"$sub";
    ## if a file of this name exists
    while [ -e "$dir/$name".pdf ]; do
      let i++;                       ## increment the counter
      name="$dir"_"$date"_"$sub"$i;  ## append it to the pdf's name
    done;
    wkhtmltopdf "$file" "$dir"/"$name".pdf; ## convert html to pdf
  done
 fi
done

NOTES

This solution requires wkhtmltopdf:

Simple shell utility to convert html to pdf using the webkit rendering engine, and qt.

On Debian based systems you can install it with
```
sudo apt-get install wkhtmltopdf
```
It assumes there are no files in the top-level directory and only the desired HTML files in all subdirectories.
It can deal with file and directory names that contain spaces, newlines, and other unorthodox characters.
Given a file dir1/foo with the contents of the example you have posted, it will create a file called dir1/dir1_020513_20:59:53_Civilized Discourse Construction Kit10.pdf
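
If you want to reuse the "append a counter until the name is free" step from the loop above on its own, it could be pulled out roughly like this. This is only a sketch: the function name unique_pdf_name and its two arguments are my own invention, not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper: print a base name (without ".pdf") that does not
# yet exist in the given directory, appending 1, 2, 3, ... as needed.
unique_pdf_name() {
  dir=$1
  base=$2
  name=$base
  i=0
  # Keep bumping the counter while "<dir>/<name>.pdf" is already taken
  while [ -e "$dir/$name.pdf" ]; do
    i=$((i + 1))
    name="$base$i"
  done
  printf '%s\n' "$name"
}
```

For example, if dir1/report.pdf and dir1/report1.pdf already exist, unique_pdf_name dir1 report prints report2.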

Md5sum command binary and text mode

On GNU/Linux, the two modes always produce the same result

Yes, explicitly. From man md5sum:

Note: There is no difference between binary and text mode option on [sic] GNU system.

This is from the md5sum implementation that ships with GNU coreutils 8.21; I notice an older version (8.12) does not have this notice but I presume the same would be true anyway.
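
If you'd rather check this on your own system than take the man page's word for it, something like the following should do. It is a rough sketch that assumes the GNU coreutils md5sum (which accepts both -b and -t); the function name is made up.

```shell
#!/bin/sh
# Hypothetical check: succeed if md5sum reports the same digest for a file
# in binary (-b) and text (-t) mode.
same_md5_in_both_modes() {
  # The digest is the first space-separated field; the two modes only
  # differ in the marker printed before the filename ("*" for binary).
  bin=$(md5sum -b "$1" | cut -d ' ' -f 1)
  txt=$(md5sum -t "$1" | cut -d ' ' -f 1)
  [ "$bin" = "$txt" ]
}
```

Run it as same_md5_in_both_modes somefile && echo same. A file with CRLF line endings makes a good test case, since that is where the two modes could differ on systems that do distinguish them.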

Although AFAICT md5sum is not officially standardized (e.g., by POSIX), it is available on various platforms in various implementations and there is obviously some effort to make these compliant with one another for ease of use across systems.

In relation to this, the ISO/ANSI C Standard includes high level stream functions for accessing files. As part of the standard, these are available on any operating system which implements ISO C via a shared library or a compiler. Since pretty much all operating systems have this available (and are themselves most often written in C), it is a sort of universal language used to implement potentially very portable software.

Considering what it does, it would be totally feasible to write an md5sum that would compile and work on any operating system. I am not claiming this is true of the GNU coreutils version, but one of the high level file stream functions mentioned earlier is fopen(), which is mandated by ISO C to include a b switch used in opening a file to indicate it is being opened "as binary file". What that may mean or require of the system isn't stipulated by the standard; it's just required to exist so it can be used on systems where there may be some (any) reason for it.

There is no such reason on Linux/POSIX/*nix-style operating systems, so the switch does nothing. From the POSIX spec (a superset of ISO C) for fopen():

The character 'b' shall have no effect, but is allowed for ISO C standard conformance.

So, a completely portable md5sum implementation might use the ISO high level file stream functions, since there are no other methods for accessing files in ISO C (most platforms, including POSIX compliant ones, have their own lower level methods as well, but using these would not be portable because they are not in ISO C), and it should also implement the -b and -t flags to add or not add the b option to fopen() when it reads the file. On systems where that is meaningless, it won't make any difference.

Again, I'm not saying GNU's md5sum is written in such a completely portable way or derived from one that is, but obviously it is trying to comply, in its operability, with one that is. Note that having a flag which does nothing is not the same as not having the flag -- in the former case, it is specified to be okay but do nothing, whereas in the latter case using it could be an error or lead to undefined behaviour.