Parse multiple HTML files to text and name them after the parent directory

bash, bash-scripting, shell

I have a directory containing many folders, each with an .html file inside. I would like every HTML file converted to a new .txt file named after its parent directory.

Example1/
Index.html>Example1.txt

Example2/
Index.html>Example2.txt

Best Answer

You evidently want to convert HTML pages into plain text. For that I wouldn't strip the tags with a custom-built solution (e.g. some sed magic), but would use a tool designed for the purpose, such as html2text. From its web page:

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

To address your question of batch renaming:

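find . -mindepth 1 -maxdepth 1 -type d -print0 | while IFS= read -r -d '' dirname
do
  # -mindepth 1 skips "." itself; ${dirname##*/} is the bare name without the leading "./"
  python path/to/html2text/html2text.py "${dirname}/index.html" > "${dirname}/${dirname##*/}.txt"
done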

Here, find lists only the directories located directly in the current directory (it does not recurse), and read, in the while condition, assigns each one in turn to the variable $dirname. The commands between do and done are then executed for each directory; here they convert the file as you requested. As @slhck pointed out, this fairly complex construct (null-delimited output from -print0, read with -d '') is needed so that directory names containing whitespace won't break anything.
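Before converting anything, it can help to see which files such a loop would read and write. The following is only a dry-run sketch: plan_conversions is a hypothetical helper name, it assumes bash and a find that supports -mindepth, -maxdepth and -print0, and it prints the planned source and target paths instead of calling html2text.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of html2text): print "source -> target" for
# every immediate subdirectory of $1 (default: current directory).
# Nothing is converted; this only shows what the loop would do.
plan_conversions() {
  local root="${1:-.}" dir
  # -mindepth 1 leaves out the root directory itself.
  find "$root" -mindepth 1 -maxdepth 1 -type d -print0 |
    while IFS= read -r -d '' dir; do
      # ${dir##*/} removes everything up to the last slash, leaving the bare
      # directory name to use for the .txt file.
      printf '%s -> %s\n' "${dir}/index.html" "${dir}/${dir##*/}.txt"
    done
}

# Usage: plan_conversions            (current directory)
#        plan_conversions some/dir   (some/dir is a placeholder path)
```

Once the printed pairs look right, the printf line can be swapped for the actual html2text call.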

[Edit]: Another variant to convert all HTML files under the current directory:

find . -iname "*.html" -print0 | while IFS= read -r -d '' filename
  do python path/to/html2text/html2text.py "${filename}" > "${filename%.*}.txt"
done

-iname matches *.html case-insensitively.

${filename%.*}.txt strips the extension and appends .txt, i.e. if filename is some/path/index.html, ${filename%.*} is some/path/index and finally ${filename%.*}.txt is some/path/index.txt.
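The expansion can be tried on its own, without find or html2text. A small sketch, where txt_name is a hypothetical helper introduced only for illustration:

```shell
# Hypothetical helper: map a path to its .txt sibling by removing the
# shortest suffix that matches ".*" and appending ".txt".
txt_name() {
  printf '%s\n' "${1%.*}.txt"
}

txt_name "some/path/index.html"   # prints some/path/index.txt
txt_name "My Site/Index.HTML"     # prints My Site/Index.txt
txt_name "archive.tar.html"       # prints archive.tar.txt
```

Because % removes the shortest matching suffix, only the last extension is dropped, and the quoting keeps paths with spaces intact.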


If you use the Z shell (zsh), you can write a much cleaner for loop that doesn't break on whitespace:

for i (*(/)) python path/to/html2text/html2text.py "${i}/index.html" > "${i}/${i}.txt"

The trick here is that *(/) does filename generation, but only returns directories (/).

[Edit]: The variant that converts all HTML files under the current directory, also in zsh syntax (the option EXTENDED_GLOB must be set):

for i ((#i)**/*.html) {
   python path/to/html2text/html2text.py "$i" > "${i:r}.txt"
}

(#i) enables case-insensitive globbing and ** searches recursively, so the pattern returns all HTML files under the current working directory. (If symbolic links should be followed, use three stars, ***, instead of two.)

If you have more than one command inside the for loop, enclose them in curly brackets { ... } (unnecessary here, but they won't hurt).

${i:r} strips the extension from the variable $i (r for root: the modifier removes the extension and leaves the root name).
