If you have a relatively simple file tree with only one level of directories, where each directory contains a list of files but no subdirectories, you should be able to do something like this (you can paste this directly into your terminal and hit Enter):
for dir in *; do ## For each directory
    if [ "$(ls -A "$dir")" ]; then ## If the dir is not empty
        for file in "$dir"/*; do ## For each file in $dir
            i=0; ## initialize a counter
            ## Get the subject
            sub=$(grep ^Subject: "$file" | cut -d ':' -f 2-);
            ## get the date, and format it to MMDDYY_Hour:Min:Sec
            date=$(date -d "$(grep ^Date: "$file" | cut -d ':' -f 2-)" +%m%d%y_%H:%M:%S);
            ## the pdf's name will be <directory's name> _ <date> _ <subject>
            name="$dir"_"$date"_"$sub";
            ## if a file of this name exists
            while [ -e "$dir/$name".pdf ]; do
                i=$((i+1)); ## increment the counter
                name="$dir"_"$date"_"$sub"$i; ## append it to the pdf's name
            done;
            wkhtmltopdf "$file" "$dir"/"$name".pdf; ## convert html to pdf
        done
    fi
done
NOTES
This solution requires wkhtmltopdf:
Simple shell utility to convert html to pdf using the webkit rendering engine, and qt.
On Debian based systems you can install it with
sudo apt-get install wkhtmltopdf
It assumes there are no files in the top level directory and only the desired html files in all subdirectories.
It can deal with file and directory names that contain spaces, newlines and other unorthodox characters.
Given a file dir1/foo with the contents of the example you have posted, it will create a file called dir1/dir1_020513_20:59:53_Civilized Discourse Construction Kit10.pdf
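To preview the names before converting anything, the header extraction from the loop above can be pulled into a small function. This is only a sketch: pdf_name is a hypothetical helper name, and it assumes GNU date and grep plus RFC 2822 style Date: headers. Like the loop, it keeps whatever whitespace follows the colon in the Subject: line.

```shell
# Hypothetical helper: print the PDF base name (without ".pdf" and without
# the duplicate counter) that the loop would build for one file.
# Assumes GNU date, whose -d option parses the Date: header.
pdf_name() {
    local dir=$1 file=$2 sub date
    # First Subject: header, everything after the first colon
    sub=$(grep -m 1 '^Subject:' "$file" | cut -d ':' -f 2-)
    # First Date: header, reformatted to MMDDYY_Hour:Min:Sec
    date=$(date -d "$(grep -m 1 '^Date:' "$file" | cut -d ':' -f 2-)" +%m%d%y_%H:%M:%S)
    printf '%s_%s_%s\n' "$dir" "$date" "$sub"
}
```

Running pdf_name dir1 dir1/foo prints the name for that single file, so you can check the date and subject parsing without invoking wkhtmltopdf.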
On GNU/Linux, the two modes always produce the same result
Yes, explicitly. From man md5sum:
Note: There is no difference between binary and text mode option on [sic] GNU system.
This is from the md5sum implementation that ships with GNU coreutils 8.21; I notice an older version (8.12) does not have this notice but I presume the same would be true anyway.
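If you want to check this on your own system, something like the following should do. It is a sketch that assumes GNU md5sum; md5_modes_agree is a hypothetical helper name, not part of coreutils.

```shell
# Hypothetical helper: succeed if md5sum reports the same digest for a file
# in binary (-b) and text (-t) mode.
md5_modes_agree() {
    local bin txt
    # The first space-separated field is the digest; the marker printed
    # before the file name ("*" for binary, " " for text) is ignored.
    bin=$(md5sum -b "$1" | cut -d ' ' -f 1)
    txt=$(md5sum -t "$1" | cut -d ' ' -f 1)
    [ "$bin" = "$txt" ]
}
```

For example, md5_modes_agree somefile && echo same reports whether both modes agree for somefile. A file containing CRLF line endings makes a reasonable test case, since those are the bytes a text mode would be expected to treat differently on systems where the distinction matters.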
Although AFAICT md5sum is not officially standardized (e.g., by POSIX), it is available on various platforms in various implementations and there is obviously some effort to make these compliant with one another for ease of use across systems.
In relation to this, the ISO/ANSI C Standard includes high level stream functions for accessing files. As part of the standard, these are available on any operating system which implements ISO C via a shared library or a compiler. Since pretty much all operating systems have this available (and are themselves most often written in C), it is a sort of universal language used to implement potentially very portable software.
Considering what it does, it would be totally feasible to write an md5sum that would compile and work on any operating system. I am not claiming this is true of the GNU coreutils version, but one of the high level file stream functions mentioned earlier is fopen(), which is mandated by ISO C to include a b switch used in opening a file to indicate it is being opened "as binary file". What that may mean or require of the system isn't stipulated by the standard; it's just required to exist so it can be used on systems where there may be some (any) reason for it.
There is no such reason on linux/POSIX/*nix-style operating systems, so the switch does nothing. From the POSIX spec (a superset of ISO C) for fopen():
The character 'b' shall have no effect, but is allowed for ISO C standard conformance.
So, a completely portable md5sum implementation might use the ISO high level file stream functions, since there are no other methods for accessing files in ISO C (most platforms, including POSIX compliant ones, have their own lower level methods as well, but using these would not be portable because they are not in ISO C), and it should also implement the -b and -t flags to add or not add the b option to fopen() when it reads the file. On systems where that is meaningless, it won't make any difference.
Again, I'm not saying GNU's md5sum is written in such a completely portable way or derived from one that is, but obviously it is trying to comply, in its operability, with one that is. Note that having a flag which does nothing is not the same as not having the flag -- in the former case, it is specified to be okay but do nothing, whereas in the latter case using it could be an error or lead to undefined behaviour.
Best Answer
Beyond strings, I've never seen a Unix command line tool beyond grep, awk, and friends that can parse data from any text file. All these tools assume the data is in a textual format.
Files such as PDF, or Word documents encode the textual data in a binary format, thus rendering it inaccessible without the aid of other tools to parse it out. These tools are typically one-trick ponies that are often only able to deal with one or a few of these binary file formats.
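As a rough illustration of what strings can and cannot do, here is a sketch that assumes the strings utility from GNU binutils is installed; readable_text is a hypothetical wrapper name. It only recovers text stored as plain runs of printable bytes, so text that a format stores compressed or otherwise encoded will not show up.

```shell
# Hypothetical wrapper: print the printable runs found in a file and keep
# only the lines that match a pattern.
readable_text() {
    local pattern=$1 path=$2
    # By default strings prints runs of at least 4 printable characters.
    strings "$path" | grep -e "$pattern"
}
```

Calling readable_text Hello some.bin would list any plain-text runs in some.bin that contain "Hello".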
You can use the file command to identify the file type.
Example
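A minimal sketch of using file to decide whether the text tools can be applied directly. is_text_file is a hypothetical helper name, and the exact MIME types reported depend on the version of file and its magic database.

```shell
# Hypothetical helper: succeed if file(1) classifies the path as text.
is_text_file() {
    # --mime-type prints a type such as "text/plain";
    # -b leaves out the "name:" prefix.
    case $(file --mime-type -b "$1") in
        text/*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

You could then run grep only on files for which is_text_file succeeds, and hand everything else to a format-specific tool.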
See man file for more on its usage.
Lucene, Solr, and other Indexing tools
You might be able to find or adapt a tool that's included with an indexing tool such as Lucene (for example, one for PDF or MS Office documents) that can parse this text out of at least a large subset of the files one would typically deal with on a daily basis. I'd expect it to be able to contend with PDF, Word, and LibreOffice file formats for starters.
Other tools
These tools can at least partially read certain binary files, so I'm adding them here just so you're aware of them. They aren't all-encompassing but might be useful to you nonetheless.