I'm trying to recursively download a website which is normally available only when you log in.
I have a valid username and password, but the problem is that I need to log in through a web interface, so using --user=user and --password=password doesn't help.
wget downloads only one web page with the text:
Sorry this page is not available, maybe you've forgotten to login?
Is it possible to download it?
I can't use --user and --password even at the login page, because there is no FTP/HTTP file retrieval login as mentioned in man wget:
--user=user
--password=password
Specify the username user and password password for both FTP and
HTTP file retrieval.
Instead, there is a classic graphical login form.
I tried this:
wget --save-cookies coookies --keep-session-cookies --post-data='j_username=usr&j_password=pwd' 'https://idp2.civ.cvut.cz/idp/Authn/UserPassword'
This uses the POST method to log in and tries to save the cookies, but the coookies file is empty and the saved page is some error page.
The URL is https://idp2.civ.cvut.cz/idp/Authn/UserPassword. Actually, when I want to log in, the site redirects me to this page, and when I successfully log in, it redirects me back to the page where I was before or to the page I wanted to reach after logging in (for example, https://progtest.fit.cvut.cz/).
Best Answer
The session information is probably saved in a cookie to allow you to navigate to other pages after you have logged in.
If this is the case, you could do this in two steps:
1. Use wget's --save-cookies mycookies.txt and --keep-session-cookies options on the login page of the website, along with your --user and --password options.
2. Use wget's --load-cookies mycookies.txt option on the subsequent pages you are trying to retrieve.
EDIT
If the --password and --user options don't work, you must find out what information the login page sends to the server and mimic it:
- If it is a GET request, you can add the GET parameters directly to the address wget must fetch (make sure you properly quote the &, = and other special characters). The URL would probably look something like https://the_url?user=foo&pass=bar.
- If it is a POST request, you can use wget's --post-data=the_needed_info option to send the needed login info with the POST method.
EDIT 2
It seems that you indeed need the POST method with j_username and j_password set. Try passing the --post-data='j_username=yourusername&j_password=yourpassword' option to wget.
EDIT 3
Knowing the page of origin, I was able to understand a little more of what is happening. However, I cannot verify that this works because, well, I don't have (nor do I want) valid credentials.
That being said, here is what's happening:
1. https://progtest.fit.cvut.cz/ sets a PHPSESSID cookie and presents you with login options.
2. The login button sends a request to https://progtest.fit.cvut.cz/shibboleth-fit.php, which takes the PHPSESSID cookie (not sure if it uses it) and redirects you to the SSO engine with a specially crafted URL just for you, which looks like this: https://idp2.civ.cvut.cz/idp/profile/SAML2/Redirect/SSO?SAMLRequest=SOME_VERY_LONG_AND_UNIQUE_ID
3. The SSO engine sets a cookie named _idp_authn_lc_key and redirects you to the page https://idp2.civ.cvut.cz:443/idp/AuthnEngine, which redirects you again to https://idp2.civ.cvut.cz:443/idp/Authn/UserPassword (the real login page).
4. The login form sends j_username and j_password along with the cookie from the SSO response.
The first four steps can be done with wget.
Note that wget follows redirections by itself, which helps us quite a bit in this case.
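A minimal, untested sketch of how those steps might be chained with wget follows. The function name login_and_mirror, the COOKIE_FILE variable, the use of -O /dev/null and the final recursive call with its --level value are assumptions made for illustration, not something confirmed against the site. It also assumes the credentials contain no characters that need percent-encoding in --post-data, and that the identity provider asks for nothing more once the credentials are posted.

```shell
#!/bin/sh
# Hypothetical helper: chains the login steps described above.
# Untested against the real site; names and the final mirror step are assumptions.

COOKIE_FILE=${COOKIE_FILE:-mycookies.txt}

login_and_mirror() {
    lm_user=$1
    lm_pass=$2

    # Step 1: visit the start page so the PHPSESSID cookie is stored.
    wget --save-cookies "$COOKIE_FILE" --keep-session-cookies \
         -O /dev/null 'https://progtest.fit.cvut.cz/' || return 1

    # Steps 2-3: request the login entry point; wget follows the redirects
    # to the SSO engine, so its cookie ends up in the same file.
    wget --load-cookies "$COOKIE_FILE" --save-cookies "$COOKIE_FILE" \
         --keep-session-cookies -O /dev/null \
         'https://progtest.fit.cvut.cz/shibboleth-fit.php' || return 1

    # Step 4: post the credentials together with the cookies gathered so far.
    # Assumes user and password need no percent-encoding.
    wget --load-cookies "$COOKIE_FILE" --save-cookies "$COOKIE_FILE" \
         --keep-session-cookies -O /dev/null \
         --post-data="j_username=${lm_user}&j_password=${lm_pass}" \
         'https://idp2.civ.cvut.cz/idp/Authn/UserPassword' || return 1

    # Assumed follow-up: reuse the session cookies for the recursive download.
    wget --load-cookies "$COOKIE_FILE" --recursive --level=2 --no-parent \
         'https://progtest.fit.cvut.cz/'
}
```

It could then be called as login_and_mirror 'yourusername' 'yourpassword'. If the cookie file is still empty after the credentials are posted, the login flow probably expects something that these requests are not sending.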