I tried to scrape a Google image search results page with curl from Terminal, but it doesn't return the HTML source that I see with "View Page Source" in Firefox. I tried both "curl [url]" and "curl -L [url]". Both returned a short HTML page that includes "Your client does not have permission to get URL" and "from this server". How can I get the same HTML source that Firefox shows, from a shell script?
Part of the short HTML I got in Terminal read:
Please see Google's Terms of Service posted at
http://www.google.com/terms_of_service.html
If you believe
that you have received this response in error, please report
your problem. However, please make sure to take a look at our Terms of
Service (http://www.google.com/terms_of_service.html). In your email,
please send us the entire code displayed below.
Best Answer
The error message contains a broken link, but Google's current terms of service say:
(emphasis mine)
They're refusing your request for some reason. It could be that they've seen suspicious activity from your IP address, but most probably they've spotted that you're using curl instead of a regular browser (in which you would see the adverts).
You could make curl imitate such a browser by providing a common user-agent (e.g. from http://www.browser-info.net/useragents) to the -A option, but that would still be violating the ToS.
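For reference, here is a minimal sketch of how the -A option is used, for a site whose terms do permit automated access. The user-agent string, the function name fetch_as_browser, and the example.com URL are all placeholders I've made up for illustration; copy a real user-agent string from your own browser if you need one.

```shell
#!/bin/sh
# Hypothetical user-agent string (an assumption, not taken from the question);
# replace it with the one your own browser sends.
UA='Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'

# fetch_as_browser URL
# Prints the response body to stdout.
#   -s  hide the progress meter
#   -L  follow HTTP redirects
#   -A  send the given string as the User-Agent request header
fetch_as_browser() {
    curl -s -L -A "$UA" "$1"
}

# Example call (placeholder URL, left commented out so that sourcing this
# file makes no network request):
# fetch_as_browser 'https://example.com/' > page.html
```

Note that a user-agent header only changes how the client identifies itself; a server may still refuse the request on other grounds, such as the IP address.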