Shell – Accessing Google Translate via wget

google, shell-script, wget

I don't want to launch a whole browser (Firefox, Chrome, Opera…) just to look up a word in Google Translate, so I decided to write a shell script that uses wget to fetch the content of translate.google.hu and extracts the translation from the downloaded file. But I got stuck at the first step.

E.g. if I want to find the translation (from English to Hungarian) of the word 'Enthusiast', I would try

$ wget https://translate.google.hu/?hl=hu&tab=wT#en/hu/Enthusiast

but wget doesn't download the page that I get if I type

https://translate.google.hu/?hl=hu&tab=wT#en/hu/Enthusiast

into my browser's address bar. Instead, I get the following:

solid@skynet:~> wget https://translate.google.hu/?hl=hu&tab=wT#en/hu/Enthusiast

[1] 2143

solid@skynet:~> --2016-05-02 08:23:24--  https://translate.google.hu/?hl=hu
Resolving translate.google.hu (translate.google.hu)... 216.58.209.163, 2a00:1450:400d:806::2003
Connecting to translate.google.hu (translate.google.hu)|216.58.209.163|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2016-05-02 08:23:24 ERROR 403: Forbidden.

And I'm waiting, and waiting and waiting… finally I press ENTER:

[1]+  Exit 8                  wget https://translate.google.hu/?hl=hu

Could someone solve my problem?

(I'm using openSUSE Linux 13.2)

UPDATE Following [Alexander Batischev]'s advice, I tried

 $ wget 'https://translate.google.hu/?hl=hu&tab=wT#en/hu/Enthusiast'

This stopped wget from running in the background and passed it the proper address (instead of creating a local variable 'tab') ^.^'
But I still get the same 403 Forbidden error:

$ wget 'https://translate.google.hu/?hl=hu&tab=wT#en/hu/Enthusiast'

--2016-05-03 14:57:48--  https://translate.google.hu/?hl=hu&tab=wT
Resolving translate.google.hu (translate.google.hu)... 216.58.209.163, 2a00:1450:400d:806::2003
Connecting to translate.google.hu (translate.google.hu)|216.58.209.163|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2016-05-03 14:57:48 ERROR 403: Forbidden.

Best Answer

When you run this command:

wget https://translate.google.hu/?hl=hu&tab=wT#en/hu/Enthusiast

what really happens is:

  • wget runs with the URL "https://translate.google.hu/?hl=hu";
  • the ampersand makes wget run in the background;
  • a variable named tab is defined with the value wT#en/hu/Enthusiast.

The reason for all this is that the shell reserves some characters, the ampersand included, for special purposes. To prevent the shell from interpreting the ampersand, use quotes:

wget 'https://translate.google.hu/?hl=hu&tab=wT#en/hu/Enthusiast'
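
The difference between the two forms can be seen without touching the network. The following is a minimal sketch in which show_args is a hypothetical stand-in for wget that only prints the arguments it receives:

```shell
#!/bin/sh
# show_args is a hypothetical stand-in for wget: it prints each argument it gets.
show_args() { printf 'arg: %s\n' "$@"; }

# Unquoted: '&' ends the command and sends it to the background;
# what follows is parsed as a separate variable assignment.
show_args https://translate.google.hu/?hl=hu & tab=wT#en/hu/Enthusiast
wait    # let the background job finish before continuing
printf 'tab is now: %s\n' "$tab"

# Quoted: the whole URL reaches the command as a single argument.
show_args 'https://translate.google.hu/?hl=hu&tab=wT#en/hu/Enthusiast'
```

The first call only sees the URL up to ?hl=hu, and tab ends up holding wT#en/hu/Enthusiast; the quoted call receives the full address.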

With that resolved, you're still getting a "Forbidden" response.

It's a race between clients who want to bypass the interface and providers who don't want to let them. Google gets its revenue from ads, and it knows that your script won't display any, so it takes measures to block access from anything but a browser.

The only people who can tell you precisely why you have been "Forbidden" are Google engineers. That said, the simpler techniques are well known.

One of the easiest is blocking by "user agent string". This is a string identifying the make and version of the client (your browser or wget). It looks like this:

Wget/1.16.3 (linux-gnu)

The client sends this string with every request. The server can use it to tweak the appearance of the result, or to deny access, as in your case.

wget accepts a --user-agent flag that lets you specify the user agent string to send. To imitate your own browser, you can type "what is my user agent" into that same Google and copy the string from there :) Then, just pass it to wget like so:
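
To make the idea concrete, here is a toy sketch of the kind of check a server could perform. The function check_user_agent and its "Wget/*" rule are purely illustrative assumptions; nobody outside Google knows what rules it really applies.

```shell
#!/bin/sh
# check_user_agent UA: print the status a hypothetical server might return.
# The "Wget/*" rule is an illustrative assumption, not Google's real policy.
check_user_agent() {
    case "$1" in
        Wget/*) echo '403 Forbidden' ;;   # looks like wget: refuse
        *)      echo '200 OK' ;;          # anything else: serve the page
    esac
}

check_user_agent 'Wget/1.16.3 (linux-gnu)'
check_user_agent 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0'
```

The first call prints 403 Forbidden and the second prints 200 OK, which is why sending a browser-like string can change the outcome.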

wget --user-agent='Mozilla/5.0 (Windows NT 6.3; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0' \
    'https://translate.google.hu/?hl=hu&tab=wT#en/hu/Enthusiast'
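
For use inside a script, the command can be wrapped in small functions. This is only a sketch: the names UA, translate_url and fetch_translation are made up for illustration, and the user agent string is the example from above, which you would replace with your own browser's. Note also that, as the wget output in the question shows, the part after # is not included in the request, so the downloaded page may not contain the translation itself; extracting it is outside this sketch.

```shell
#!/bin/sh
# Assumption: example user agent string from the answer; substitute your own.
UA='Mozilla/5.0 (Windows NT 6.3; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0'

# translate_url WORD: print the Google Translate address for an en -> hu lookup.
translate_url() {
    printf 'https://translate.google.hu/?hl=hu&tab=wT#en/hu/%s\n' "$1"
}

# fetch_translation WORD: download the page quietly and write it to stdout.
# The command substitution is double-quoted so '&' and '#' stay literal.
fetch_translation() {
    wget -q --user-agent="$UA" -O - "$(translate_url "$1")"
}
```

With these definitions in place, running fetch_translation Enthusiast > page.html would save whatever the server returns for further processing.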