Bash – bulk-convert text files to PDF, naming each file from its header data

bash · command line · conversion · pdf · text processing

Knowing that "How to convert from text to .pdf" is already well answered here and here, I am looking for something more specific:

Using Claws Mail and its RSSyl plug-in to read RSS feeds, I have collected a lot of text files that I want to convert into .pdf files.

Problem: Every feed has its own folder, but inside each folder the files are 'just' numbered [1, 2, …, 456]. Every file starts with a header [followed by the message's content]:

Date: Tue,  5 Feb 2013 19:59:53 GMT
From: N/A
Subject: Civilized Discourse Construction Kit
X-RSSyl-URL: http://www.codinghorror.com/blog/2013/02/civilized-discourse-construction-kit.html
Message-ID: <http://www.codinghorror.com/blog/2013/02/civilized-discourse-construction-kit.html>
Content-Type: text/html; charset=UTF-8

<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<base href="http://www.codinghorror.com/blog/2013/02/civilized-discourse-construction-kit.html">
</head><body>
<p>URL: <a href="http://www.codinghorror.com/blog/2013/02/civilized-discourse-construction-kit.html">http://www.codinghorror.com/blog/2013/02/civilized-discourse-construction-kit.html</a></p>
<br>
<!-- RSSyl text start -->

Question: I am looking for a way to convert each file into a .pdf file and name it after the Subject given in the header. Super-awesome would be converting and renaming like this:

"folder.name"_"date"_"file name" with each information taken from the header data. As there are a few hundred files, I am looking for a batch processing way.

The files are HTML-formatted, but have no .htm[l] suffix.
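To illustrate with the sample header above: assuming its feed folder were named, say, codinghorror (a made-up name), the desired result would look something like

codinghorror_05Feb2013_Civilized Discourse Construction Kit.pdf

with the date in whatever format is convenient.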

Best Answer

If you have a relatively simple file tree with only one level of directories, where each directory contains files but no subdirectories, you should be able to do something like this (you can paste it directly into your terminal and hit Enter):

for dir in *; do                   ## For each directory
  if [ "$(ls -A "$dir")" ]; then   ## If the directory is not empty
    for file in "$dir"/*; do       ## For each file in $dir
      i=0                          ## Initialize a collision counter
      ## Get the subject from the header
      sub=$(grep ^Subject: "$file" | cut -d ':' -f 2-)
      ## Get the date and format it as MMDDYY_Hour:Min:Sec
      date=$(date -d "$(grep ^Date: "$file" | cut -d ':' -f 2-)" +%m%d%y_%H:%M:%S)
      ## The pdf's name will be <directory's name>_<date>_<subject>
      name="$dir"_"$date"_"$sub"
      while [ -e "$dir/$name".pdf ]; do ## While a pdf of this name exists,
        let i++                         ## increment the counter
        name="$dir"_"$date"_"$sub"$i    ## and append it to the pdf's name
      done
      wkhtmltopdf "$file" "$dir"/"$name".pdf ## Convert the html to pdf
    done
  fi
done
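To check what the header extraction yields before running the whole loop, you can try the two pipelines by hand on a single file (the path dir1/1 below is just an assumed example):

grep ^Subject: dir1/1 | cut -d ':' -f 2-
##  Civilized Discourse Construction Kit    (note the leading space kept by cut)
date -d "$(grep ^Date: dir1/1 | cut -d ':' -f 2-)" +%m%d%y_%H:%M:%S
## 020513_20:59:53    (the hour is converted from GMT to your local time zone)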

NOTES

  • This solution requires wkhtmltopdf:

    Simple shell utility to convert html to pdf using the webkit rendering engine, and qt.

    On Debian-based systems you can install it with

    sudo apt-get install wkhtmltopdf
    
  • It assumes there are no files in the top-level directory, and that the subdirectories contain only the desired html files.

  • It can deal with file and directory names that contain spaces, newlines and other unorthodox characters.

  • Given a file dir1/foo with the contents of the example you have posted, it will create a file called dir1/dir1_020513_20:59:53_ Civilized Discourse Construction Kit.pdf (note the leading space before "Civilized", carried over from the Subject line, and that the hour reflects your local time zone). If a pdf of that name already exists, the counter is appended, giving …Kit1.pdf, …Kit2.pdf, and so on. A dry-run variant that only prints these names is sketched after this list.
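If you would rather preview the names before converting anything, here is a minimal dry-run variant of the loop above (my own sketch, under the same assumptions) that only prints the target paths:

for dir in *; do
  if [ "$(ls -A "$dir")" ]; then
    for file in "$dir"/*; do
      sub=$(grep ^Subject: "$file" | cut -d ':' -f 2-)
      date=$(date -d "$(grep ^Date: "$file" | cut -d ':' -f 2-)" +%m%d%y_%H:%M:%S)
      echo "$dir/${dir}_${date}_${sub}.pdf"  ## Print the name instead of converting
    done
  fi
done

This skips the collision counter, so two messages with identical Date and Subject headers will show up as identical names.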
