Linux – HTML to UNFORMATTED plain text

html, linux

I'm looking for a way to convert a folder full of HTML files to plain text. What I want is for the text files to be as much as possible like what I'd get if I selected all the text in a web browser, copied it, and pasted the text into a plain text file.

NO, REALLY, I WANT UNFORMATTED PLAIN TEXT. All of the solutions that I'm finding produce Markdown or something that looks like it, or try to preserve layout, or use asterisks and underscores to indicate text formatting, or preserve the content of scripts in the output file, or do some clever goddam thing.

All I want is the words written by the author in the order that the author wrote them. I don't even care if the processing converts all of the list items in a list into a single paragraph, or even collapses the entire document into a single paragraph. Any of this is much better than giving me anything at all other than the actual language contained in the document.

I'd love a terminal application or Python script, but I'll take anything I can get.

Best Answer

html2text is a Python script that converts a page of HTML into equivalent Markdown-structured text. It can be downloaded and run on any operating system that has Python installed. A program named html2text is also in the repositories of many Linux distributions. Be aware that some distributions package a different, non-Python program under that name, so check its man page for the options it accepts. It can be run from the command line like this:

html2text -style pretty input.html  

This command not only converts the original HTML file to text, but also does a pretty good job of making the plain text output easy to read: headings look like headings, lists look like lists, and so on.

If you're having trouble automatically converting tables from webpages to unformatted text, a modern Markdown editor such as Typora or Mark Text can do it easily. Both are GUI applications for Windows/Mac/Linux. Comparing the two, Mark Text is better than Typora at accurately capturing everything on a webpage, while Typora has a more user-friendly editor, so I use both: Mark Text as a webpage grabber, and then I copy and paste the captured Markdown into Typora to edit it.
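
If Markdown-style output is still more formatting than you want, here is a minimal sketch that uses only Python's standard library (`html.parser` and `pathlib`). It keeps the visible words in document order, drops the contents of script, style and similar tags, and collapses everything into a single paragraph, which the question says is acceptable. The names `TextOnly`, `html_to_text` and `convert_folder` are made up for this illustration, the tag lists are a rough guess rather than a complete set, and the files are assumed to be UTF-8. Treat it as a starting point, not a finished tool.

```python
from html.parser import HTMLParser
from pathlib import Path

# Tags whose contents a browser does not show as page text (assumed list).
SKIP_TAGS = {"script", "style", "noscript", "template", "title"}
# Inline tags that should NOT split a word, e.g. "<b>H</b>ello" (assumed list).
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "u", "code",
               "sub", "sup", "small", "mark", "abbr"}


class TextOnly(HTMLParser):
    """Collects visible text and ignores all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a SKIP_TAGS element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag not in INLINE_TAGS:
            self.parts.append(" ")  # block boundary separates words

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the words of an HTML string as one whitespace-normalized line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


def convert_folder(src, dst):
    """Write one .txt file into dst for every .htm/.html file in src."""
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src).glob("*.htm*")):
        # Assumption: input is UTF-8; undecodable bytes are replaced.
        html = path.read_text(encoding="utf-8", errors="replace")
        out = dst / (path.stem + ".txt")
        out.write_text(html_to_text(html) + "\n", encoding="utf-8")
```

To use it, save the code to a file and call, for example, `convert_folder("pages", "text")`, where `pages` and `text` are placeholder folder names. If you would rather keep paragraph breaks, append `"\n"` instead of `" "` at block boundaries and collapse whitespace line by line.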
