Wget HTTPS Download – Access with Username and Password

https, wget

I'm trying to recursively download a website that is normally available only when you are logged in.

I have a valid username and password, but I need to log in through a web interface, so using --user=user and --password=password doesn't help.

wget downloads only a single page containing the text:
Sorry this page is not available, maybe you've forgotten to login?

Is it possible to download the site with wget?

I can't use --user and --password even on the login page, because the site doesn't use the FTP/HTTP authentication described in man wget:

--user=user
--password=password
    Specify the username user and password password for both FTP and
    HTTP file retrieval.

The site uses a classic graphical login form instead.

I tried logging in with the POST method and saving the cookies: wget --save-cookies coookies --keep-session-cookies --post-data='j_username=usr&j_password=pwd' 'https://idp2.civ.cvut.cz/idp/Authn/UserPassword'. The coookies file ends up empty and the saved page is an error page.

The URL is https://idp2.civ.cvut.cz/idp/Authn/UserPassword. When I want to log in, the site redirects me to this page, and after a successful login it redirects me back to the page I came from or to the page I wanted to reach (for example: https://progtest.fit.cvut.cz/).

Best Answer

The session information is probably saved in a cookie to allow you to navigate to other pages after you have logged in.

If this is the case, you could do this in two steps:

  1. Use wget's --save-cookies mycookies.txt and --keep-session-cookies options on the login page of the website, along with your --user and --password options.
  2. Use wget's --load-cookies mycookies.txt option on the subsequent pages you are trying to retrieve.
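
The two steps could be sketched as a small shell function. This is a minimal sketch, not something tested against the site in question: the function name fetch_with_session, the environment variables WGET_USER and WGET_PASS, and the default cookie file name are assumptions made for illustration, and it only helps if the site accepts HTTP authentication on its login URL.

```shell
# Hypothetical helper: fetch_with_session LOGIN_URL TARGET_URL [COOKIE_FILE]
# WGET_USER and WGET_PASS are assumed to be set in the environment.
fetch_with_session() {
    login_url=$1
    target_url=$2
    jar=${3:-mycookies.txt}

    # Step 1: request the login page and store all cookies,
    # including session cookies that have no expiry date.
    wget --save-cookies "$jar" --keep-session-cookies \
         --user="$WGET_USER" --password="$WGET_PASS" \
         -O /dev/null "$login_url" || return 1

    # Step 2: send the stored cookies back when fetching the real page.
    wget --load-cookies "$jar" "$target_url"
}

# Example call (placeholder URLs):
#   fetch_with_session 'https://example.org/login' 'https://example.org/private/'
```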

EDIT

If the --user and --password options don't work, you must find out what information the login page sends to the server and mimic it:

  • For a GET request, you can add the GET parameters directly to the address wget must fetch (make sure you properly quote the &, = and other special characters). The URL would probably look something like https://the_url?user=foo&pass=bar.
  • For a POST request, you can use wget's --post-data=the_needed_info option to send the login information with the POST method.
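
Both variants could look roughly like the sketch below. The function names login_get and login_post are made up for illustration, and the parameter names user and pass are only the placeholders from the example URL above; the real names have to be read from the site's login form.

```shell
# Hypothetical helpers: login_get URL USER PASS / login_post URL USER PASS
login_get() {
    # The URL is passed as one quoted argument, so the shell does not
    # treat & as "run in the background".
    wget --save-cookies mycookies.txt --keep-session-cookies \
         "$1?user=$2&pass=$3"
}

login_post() {
    # --post-data makes wget send a POST request with this string as the body.
    wget --save-cookies mycookies.txt --keep-session-cookies \
         --post-data="user=$2&pass=$3" "$1"
}

# Example calls (placeholder URL and credentials):
#   login_get  'https://example.org/login' foo bar
#   login_post 'https://example.org/login' foo bar
```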

EDIT 2

It seems that you do need the POST method with j_username and j_password set. Try passing the --post-data='j_username=yourusername&j_password=yourpassword' option to wget.
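
One detail worth keeping in mind: wget transmits the --post-data string as given, so a username or password containing characters such as & or = has to be percent-encoded first. The helper below is a rough, ASCII-only sketch of that; the function names urlencode and shib_post_data are assumptions made for illustration, not part of wget.

```shell
# Hypothetical helper: percent-encode an ASCII string for use in form data.
# The body runs in a subshell so that LC_ALL=C does not leak to the caller.
urlencode() (
    LC_ALL=C
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;                    # unreserved: keep
            *) out="$out$(printf '%%%02X' "'$c")" ;;            # else: %XX
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# Hypothetical helper: build the body for the j_username/j_password form.
shib_post_data() {
    printf 'j_username=%s&j_password=%s\n' "$(urlencode "$1")" "$(urlencode "$2")"
}

# Example use:
#   wget --post-data="$(shib_post_data "$user" "$pass")" ...
```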

EDIT 3

Knowing the page of origin, I was able to understand a little more of what is happening. I cannot confirm that this works, though, because I don't have (nor do I want) valid credentials.

Here is what's happening:

  1. The page https://progtest.fit.cvut.cz/ sets a PHPSESSID cookie and presents you with login options.
  2. Clicking the login button sends a request to https://progtest.fit.cvut.cz/shibboleth-fit.php, which takes the PHPSESSID cookie (not sure if it uses it) and redirects you to the SSO engine with a URL crafted specifically for you, which looks like this: https://idp2.civ.cvut.cz/idp/profile/SAML2/Redirect/SSO?SAMLRequest=SOME_VERY_LONG_AND_UNIQUE_ID
  3. The SSO response sets a new cookie named _idp_authn_lc_key and redirects you to the page https://idp2.civ.cvut.cz:443/idp/AuthnEngine, which redirects you again to https://idp2.civ.cvut.cz:443/idp/Authn/UserPassword (the real login page).
  4. You enter your credentials and send the POST data j_username and j_password along with the cookie from the SSO response.
  5. ???

The first four steps can be done with wget like this:

origin='https://progtest.fit.cvut.cz/'

# Get the PHPSESSID cookie
wget --save-cookies phpsid.cki --keep-session-cookies "$origin"

# Get the _idp_authn_lc_key cookie
wget --load-cookies phpsid.cki  --save-cookies sso.cki --keep-session-cookies --header="Referer: $origin" 'https://progtest.fit.cvut.cz/shibboleth-fit.php'

# Send your credentials
wget --load-cookies sso.cki --save-cookies auth.cki --keep-session-cookies --post-data='j_username=usr&j_password=pwd' 'https://idp2.civ.cvut.cz/idp/Authn/UserPassword'

Note that wget follows redirections on its own, which helps quite a bit in this case.
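
Since the question reports an empty cookie file, it may help to check after each step which cookies actually ended up in phpsid.cki, sso.cki and auth.cki. The sketch below assumes the Netscape-style text format that wget writes with --save-cookies (seven tab-separated fields per cookie, with the name in the sixth); the function names cookie_names and has_cookie are made up for illustration.

```shell
# Hypothetical helper: list the cookie names stored in a wget cookie file.
cookie_names() {
    # Skip comment and blank lines; print the 6th tab-separated field (the name).
    awk -F '\t' '!/^#/ && NF >= 7 { print $6 }' "$1"
}

# Hypothetical helper: succeed if the file contains a cookie with this exact name.
has_cookie() {
    cookie_names "$1" | grep -qxF -- "$2"
}

# Example use after the second wget call:
#   has_cookie sso.cki _idp_authn_lc_key || echo 'SSO cookie missing' >&2
```

If an expected cookie is missing, the step that should have set it is the one to look at first.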
