Bash – bulk-convert text files to PDF, naming each file from its header data

bash · command line · conversion · pdf · text processing

Knowing that "How to convert from text to .pdf" is already well answered here and here, I am looking for something more specific:

Using Claws Mail and its RSSyl plug-in to read RSS feeds, I have collected a lot of text files that I want to convert into .pdf files.

Problem: Every feed has its own folder, but inside each folder the files are 'just' numbered [1, 2, …, 456]. Every file starts with a header [followed by the message's content]:

Date: Tue,  5 Feb 2013 19:59:53 GMT
From: N/A
Subject: Civilized Discourse Construction Kit
X-RSSyl-URL: http://www.codinghorror.com/blog/2013/02/civilized-discourse-construction-kit.html
Message-ID: <http://www.codinghorror.com/blog/2013/02/civilized-discourse-construction-kit.html>
Content-Type: text/html; charset=UTF-8

<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<base href="http://www.codinghorror.com/blog/2013/02/civilized-discourse-construction-kit.html">
</head><body>
<p>URL: <a href="http://www.codinghorror.com/blog/2013/02/civilized-discourse-construction-kit.html">http://www.codinghorror.com/blog/2013/02/civilized-discourse-construction-kit.html</a></p>
<br>
<!-- RSSyl text start -->

Question: I am looking for a way to convert each file into a .pdf file and name it after the Subject given in the header. Super-awesome would be converting and renaming like this:

"folder.name"_"date"_"file name" with each information taken from the header data. As there are a few hundred files, I am looking for a batch processing way.

The files are HTML-formatted, but have no .htm[l] suffix.
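To illustrate with the sample header above: assuming its feed folder were named, say, codinghorror (a made-up name), the desired result would look something like

codinghorror_05Feb2013_Civilized Discourse Construction Kit.pdf

with the date in whatever format is convenient.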

Best Answer

If you have a relatively simple file tree with only one level of directories, where each directory contains files but no subdirectories, you should be able to do something like this (you can paste it directly into your terminal and hit Enter):

for dir in *; do                   ## For each directory
  if [ "$(ls -A "$dir")" ]; then   ## If the directory is not empty
    for file in "$dir"/*; do       ## For each file in $dir
      i=0                          ## Initialize a collision counter
      ## Get the subject from the header
      sub=$(grep ^Subject: "$file" | cut -d ':' -f 2-)
      ## Get the date and format it as MMDDYY_Hour:Min:Sec
      date=$(date -d "$(grep ^Date: "$file" | cut -d ':' -f 2-)" +%m%d%y_%H:%M:%S)
      ## The pdf's name will be <directory's name>_<date>_<subject>
      name="$dir"_"$date"_"$sub"
      while [ -e "$dir/$name".pdf ]; do ## While a pdf of this name exists,
        let i++                         ## increment the counter
        name="$dir"_"$date"_"$sub"$i    ## and append it to the pdf's name
      done
      wkhtmltopdf "$file" "$dir"/"$name".pdf ## Convert the html to pdf
    done
  fi
done
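To check what the header extraction yields before running the whole loop, you can try the two pipelines by hand on a single file (the path dir1/1 below is just an assumed example):

grep ^Subject: dir1/1 | cut -d ':' -f 2-
##  Civilized Discourse Construction Kit    (note the leading space kept by cut)
date -d "$(grep ^Date: dir1/1 | cut -d ':' -f 2-)" +%m%d%y_%H:%M:%S
## 020513_20:59:53    (the hour is converted from GMT to your local time zone)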

NOTES

  • This solution requires wkhtmltopdf:

    Simple shell utility to convert html to pdf using the webkit rendering engine, and qt.

    On Debian-based systems you can install it with

    sudo apt-get install wkhtmltopdf
    
  • It assumes there are no files in the top-level directory, and that the subdirectories contain only the desired html files.

  • It can deal with file and directory names that contain spaces, newlines and other unorthodox characters.

  • Given a file dir1/foo with the contents of the example you have posted, it will create a file called dir1/dir1_020513_20:59:53_ Civilized Discourse Construction Kit.pdf (note the leading space before "Civilized", carried over from the Subject line, and that the hour reflects your local time zone). If a pdf of that name already exists, the counter is appended, giving …Kit1.pdf, …Kit2.pdf, and so on. A dry-run variant that only prints these names is sketched after this list.
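If you would rather preview the names before converting anything, here is a minimal dry-run variant of the loop above (my own sketch, under the same assumptions) that only prints the target paths:

for dir in *; do
  if [ "$(ls -A "$dir")" ]; then
    for file in "$dir"/*; do
      sub=$(grep ^Subject: "$file" | cut -d ':' -f 2-)
      date=$(date -d "$(grep ^Date: "$file" | cut -d ':' -f 2-)" +%m%d%y_%H:%M:%S)
      echo "$dir/${dir}_${date}_${sub}.pdf"  ## Print the name instead of converting
    done
  fi
done

This skips the collision counter, so two messages with identical Date and Subject headers will show up as identical names.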
