Unable to copy/mirror a website page using WinHTTrack

Tags: httrack, mirroring, website

I am using HTTrack to copy/mirror a website and am facing a problem.

I am talking about this website. Suppose I want to capture this page along with all of its inner links (for example, problem 6.11 and problem 6.10 linked from that page). So I've tried the following:

  1. Enter the project name and URL:

(screenshot)

  2. Set the option "Can go both up and down":

(screenshot)
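For reference, I think the rough command-line equivalent of these two steps would be something like the sketch below. The URL and output folder are placeholders for the actual page, and as far as I know `-B` is the switch behind the "can go both up and down" option:

```
# Placeholder example of the same settings on the command line:
# mirror the page into "./mirror" and allow HTTrack to go both
# up and down the directory structure (-B).
httrack "http://www.example.com/problems/" -O "./mirror" -B
```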

Then I started the mirror. The process finished, but when I browse index.html the main page displays correctly, while the further links (the sub-pages mentioned earlier: problem 6.11, 6.10, etc.) do not display; only the file name "feed" is shown. (Try it yourself to see what is going wrong.)

How do I fix this issue?

Best Answer

I suggest you read the FAQ.

Here is a quote from the WinHTTrack website:

Question: Some sites are captured very well, others aren't. Why?

Answer: There are several reasons (and solutions) for a mirror to fail. Reading the log files (and this FAQ!) is generally a VERY good idea to figure out what occurred.

- Links within the site refer to external links, or to links located in other (or upper) directories, which are not captured by default. The use of filters is generally THE solution, as this is one of the most powerful options in HTTrack; see the questions/answers above and the sketch after this list.
- The website's 'robots.txt' rules forbid access to several parts of the website. You can disable them, but only with great care!
- HTTrack is filtered (by its default User-Agent identity). You can change the Browser User-Agent identity to an anonymous one (MSIE, Netscape..); here again, use this option with care, as this measure might have been put in place to avoid bandwidth abuse (see also the abuse FAQ!).
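For what it's worth, on the command line those three remedies roughly map to the options sketched below. This is only an illustrative sketch: the URL, output folder and filter pattern are placeholders, and you should confirm the exact switches with `httrack --help` on your version:

```
# Rough sketch (placeholder URL/paths): capture a page plus the sub-pages
# linked under the same path, ignore robots.txt and send a browser-like identity.
#   "+www.example.com/problems/*"  scan rule (filter) that also allows sub-pages
#   -s0                            never follow robots.txt rules (use with great care)
#   -F "..."                       change the User-Agent identity
httrack "http://www.example.com/problems/" -O "./mirror" "+www.example.com/problems/*" -s0 -F "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
```

In WinHTTrack, if I remember the GUI correctly, the same settings live under "Set options" in the Scan rules, Spider and Browser ID tabs.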

There are cases, however, that cannot (yet) be handled:

- Flash sites - no full support
- Intensive Java/JavaScript sites - might be bogus/incomplete
- Complex CGI with built-in redirects, and other tricks - very complicated to handle, and therefore might cause problems

- Parsing problems in the HTML code (cases where the engine is fooled, for example by a false comment (<!--) or a false end of comment (-->) being detected). Rare cases, but they might occur. A bug report is then generally good!

Note: For some sites, setting the "Force old HTTP/1.0 requests" option can be useful, as this option uses more basic requests (no HEAD request, for example). This will cause a performance loss, but will increase compatibility with some CGI-based sites.
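On the command-line side, I believe the corresponding switch is `-%h` (force HTTP/1.0 requests), but please verify it against `httrack --help` on your version, as I'm quoting it from memory:

```
# Assumed equivalent of "Force old HTTP/1.0 requests" (verify the flag first);
# URL and output folder are placeholders.
httrack "http://www.example.com/problems/" -O "./mirror" -%h
```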

P.S. There are many reasons why a website can't be captured 100%. I think on Super User we are very enthusiastic, but we won't reverse-engineer a website to discover which system is running behind it (that's my opinion).
