My goal is to write a page's HTML source to a file. I'm using w3m to browse the web from the terminal.
When I run the following command in a terminal: w3m <url> -dump
the program prints the web site non-interactively, but not its HTML source.
If you open a web site with w3m <url>, the terminal displays the site, and if you press v, the program displays the HTML source. I tried to script this, but without success. I thought the command-line argument -dump_source would help, but all I got was content that is not human-readable. I don't know what -dump_source is supposed to return.
Here is what I have tried so far:
- Using -T text/html with -dump in a terminal, but the output didn't change.
- Using -T text/plain with -dump_source in a terminal, hoping that the non-human-readable output would be converted to plain text, but with no success. (I didn't understand what -T is used for, even after reading the w3m manual by typing man w3m in a terminal.)
- Knowing that pressing v while w3m is displaying a web site switches from the page content to the HTML source, I tried to use gdb to attach to the w3m process and redirect its stdin and stdout to my files (input.txt and output.txt, where input.txt contains a single v), but I had no success. Doing the same on my test program worked as expected. I followed what was described here. If I run ls -l /proc/<w3m_pid>/fd, where w3m_pid is the w3m process ID that I got from ps ax, I can see that there are 3 file descriptors. If I try to redirect the third one, the program crashes and displays: Error occured: errorno=25
- Redirecting the standard I/O with w3m <url> < input.txt > output.txt also did not work.
- w3m uses key bindings to navigate the web, which means that when you press v there is no need to hit Enter; the terminal is not buffering the input. Using gdb attached to the w3m process, I tried to undo this with p system ("/bin/stty cooked"), but the w3m key binding behavior did not change.
My question is: why is redirecting I/O with gdb not working, and what can I do to get the HTML source? Does w3m have an option to output the HTML source that I'm missing, or do I have to use another program?
PS: I need the HTML source for a university homework assignment. With the HTML source I can create a script that browses the web and writes each page to a file; then I'm supposed to use those outputs with flex to extract statistics about things on the web, such as: how many times does the word stack appear in questions about the C language? This is my idea.
Any suggestions would be appreciated.
W3M version: 0.5.3+debian-15
GDB version: 7.7.1
Ubuntu version: 14.04
Thanks in advance!
Best Answer
Why can't you use curl? It will output the source code into the file.
Like this
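A minimal sketch of the curl approach, assuming a placeholder URL and output file name: the fetch_source helper, https://example.com/ and page.html are illustrative names, not taken from the original post. It relies on curl's -o option to write the response body to a file, -L to follow redirects, and -sS to hide the progress meter while still reporting errors.

```shell
#!/bin/sh
# fetch_source URL FILE
# Hypothetical helper: save the raw HTML source of URL into FILE using curl.
fetch_source() {
    # -s : silent, no progress meter
    # -S : still print an error message if the transfer fails
    # -L : follow HTTP redirects
    # -o : write the response body to the given file instead of stdout
    curl -sSL -o "$2" "$1"
}

# Example usage (placeholder URL and file name, both assumptions):
# fetch_source "https://example.com/" page.html
```

Plain shell redirection should work as well: curl <url> > file sends the source from standard output into the file.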