I have a large text file created by combining lots of HTML files:
cat *.html > all_html_files.txt
Inside the text file are specific strings that I want to extract to another text file. For example:
book title>The Edge of the Round World< font 32 - extra
I want to extract all the text that occurs between the symbols > and <.
I want to extract The Edge of the Round World and all other strings in the document that appear between the same symbols.
I've tried to find a solution, but I can't adapt the commands I have found because I can't figure out exactly what to substitute; I can't quite follow the logic.
I have only recently become familiar with sed and awk, thanks to this forum.
Best Answer
...with GNU or BSD seds:
Here's something a little more complicated as a proof of concept:
The hardest part is filtering out all of the JavaScript.
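For the simple case in the question, here is a minimal sketch, not necessarily the command this answer originally used. It assumes each wanted string sits on a single line and contains neither > nor <, and that every script block opens and closes on different lines. The function names extract_between and strip_scripts are made up for illustration.

```shell
# Print every non-empty run of text found between a '>' and the next '<',
# one result per output line. Reads standard input.
extract_between() {
    # grep -o prints each match on its own line, e.g. ">The Edge of the Round World<";
    # sed then trims the leading '>' and the trailing '<'.
    grep -o '>[^<>][^<>]*<' | sed -e 's/^>//' -e 's/<$//'
}

# Drop every line from one containing "<script" through the next line
# containing "</script>". Assumption: the two tags are on different lines;
# a one-line <script>...</script> would make sed keep deleting until a
# later closing tag.
strip_scripts() {
    sed '/<script/,/<\/script>/d'
}
```

To write the results to a new file (extracted.txt is just an example name), something like `strip_scripts < all_html_files.txt | extract_between > extracted.txt` should work. Whitespace-only matches such as the space in `> <` are kept, so a final `grep '[^[:space:]]'` may be worth adding.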