Ubuntu – How to output web page html source code into a file

14.04, bash, command line, gdb, w3m

My goal is to save a web page's HTML source to a file. I'm using w3m to browse the web from the terminal.

When I run w3m <url> -dump in a terminal, the program prints the rendered web site non-interactively, but not its HTML source.

If you open a web site with w3m <url>, the terminal displays the site, and pressing v switches to the HTML source. I tried to script this, but without success. I thought the command-line argument -dump_source would help, but all I got was content that is not human-readable. I don't know what -dump_source is supposed to return.

Here is what I tried so far:

  • Using -T text/html with -dump: the output didn't change.
  • Using -T text/plain with -dump_source, hoping the unreadable output would be converted to plain text: no success. (I didn't understand what -T is for, even after reading the w3m manual with man w3m.)
  • Knowing that pressing v while w3m displays a web site switches from the page content to the HTML source, I tried attaching gdb to the w3m process and redirecting its stdin and stdout to my files (input.txt and output.txt, where input.txt contains a single v), but it didn't work. Doing the same with my own test program worked as expected. I followed what was described here. If I run ls -l /proc/<w3m_pid>/fd, where w3m_pid is the w3m process ID that I got from ps ax, I can see there are 3 file descriptors. If I try to redirect the third one, the program crashes and displays: Error occured: errorno=25
  • Redirecting standard I/O with w3m <url> < input.txt > output.txt did not work either.
  • w3m uses key bindings to navigate, so pressing v takes effect without hitting Enter, because the terminal is not buffering the input. With gdb attached to the w3m process, I tried to undo this with p system ("/bin/stty cooked"), but the w3m key bindings did not change.

My question is: why is redirecting I/O with gdb not working, and what can I do to get the HTML source code? Does w3m have an option to output the HTML source that I'm missing, or do I have to use another program?

PS: I need the HTML source code for a university homework assignment. With the HTML source, I can write a script that browses the web and saves each page to a file. Then I'm supposed to run those files through flex to extract statistics about things on the web, such as: how many times does the word stack appear in questions about the C language? This is my idea.
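As a rough illustration of the kind of statistic meant here, the sketch below counts a word in an already saved page. It uses grep rather than flex, purely to show the goal, and count_word is a hypothetical helper name.

```shell
# count_word WORD FILE: count case-insensitive, whole-word occurrences
# of WORD in FILE. This is a plain text search over the raw HTML, so
# matches inside tags and attributes are counted too.
count_word() {
    # -o prints each match on its own line, so wc -l counts matches
    # rather than matching lines; -w restricts matches to whole words.
    grep -o -i -w -- "$1" "$2" | wc -l
}
```

For example, count_word stack source-html would print the number of times the word stack occurs in the file source-html.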

Any suggestions would be appreciated.

W3M version: 0.5.3+debian-15

GDB version: 7.7.1

Ubuntu version: 14.04

Thanks in advance!

Best Answer

Why not use curl?

curl web-address > file-source

This writes the page's source code to the file. For example:

curl http://askubuntu.com/questions/822139/how-to-output-web-page-html-source-code-into-a-file > source-html
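For use in a script, the same idea can be wrapped in a small function. This is only a sketch: fetch_source is a hypothetical helper name, and the extra curl options (-f, -s, -S, -L, -o) are additions that the original command does not use.

```shell
# fetch_source URL OUTFILE: save the raw HTML source of URL into OUTFILE.
fetch_source() {
    # -f   fail on HTTP errors instead of saving the server's error page
    # -sS  hide the progress meter but still print error messages
    # -L   follow redirects
    # -o   write the response body to the given file
    curl -fsSL -o "$2" "$1"
}
```

Calling fetch_source http://askubuntu.com/questions/822139/how-to-output-web-page-html-source-code-into-a-file source-html should then leave the page source in source-html, and the function's exit status can be checked to detect a failed download.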