Linux – Trouble using wget or httrack to mirror archived website

httrack, linux, web-archive, wget

I am trying to use wget to create a local mirror of a website, but I am finding that I am not getting all of the linked pages.

Here is the website

http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/

I don't want all pages that begin with web.archive.org, but I do want all pages that begin with http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/.

When I use wget -r, in my file structure I find

web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/index.html,

but I don't have all files that are part of this database, e.g.

web.archive.org/web/20110808041151/http://cst-www.nrl.navy.mil/lattice/struk/d0c.html.
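For reference, the invocation is essentially:

wget -r 'http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/'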

Perhaps httrack would do better, but right now that's grabbing too much.

So, how can I grab a local copy of an archived website from the Internet Archive Wayback Machine?

Best Answer

While helpful, the prior responses fail to concisely, reliably, and repeatably answer the underlying question. In this post, we briefly detail the difficulties with each and then offer a modest httrack-based solution.

Background

Before we get to that, however, consider perusing mpy's well-written response. In their sadly neglected post, mpy rigorously documents the Wayback Machine's obscure (and honestly obfuscatory) archival scheme.

Unsurprisingly, it ain't pretty. Rather than sanely archiving sites into a single directory, the Wayback Machine ephemerally spreads a single site across two or more numerically identified sibling directories. To say that this complicates mirroring would be a substantial understatement.

Understanding the horrible pitfalls presented by this scheme is core to understanding the inadequacy of prior solutions. Let's get on with it, shall we?

Prior Solution 1: wget

The related StackOverflow question "Recover old website off waybackmachine" is probably the worst offender in this regard, recommending wget for Wayback mirroring. Naturally, that recommendation is fundamentally unsound.

In the absence of complex external URL rewriting (e.g., Privoxy), wget cannot be used to reliably mirror Wayback-archived sites. As mpy details under "Problem 2 + Solution," whatever mirroring tool you choose must allow you to non-transitively download only URLs belonging to the target site. By default, most mirroring tools transitively download all URLs belonging to both the target site and sites linked to from that site – which, in the worst case, means "the entire Internet."

A concrete example is in order. When mirroring the example domain kearescue.com, your mirroring tool must:

  • Include all URLs matching https://web.archive.org/web/*/http://kearescue.com. These are assets provided by the target site (e.g., https://web.archive.org/web/20140521010450js_/http_/kearescue.com/media/system/js/core.js).
  • Exclude all other URLs. These are assets provided by other sites merely linked to from the target site (e.g., https://web.archive.org/web/20140517180436js_/https_/connect.facebook.net/en_US/all.js).

Failing to exclude such URLs typically pulls in all or most of the Internet archived at the time the site was archived, especially for sites embedding externally-hosted assets (e.g., YouTube videos).

That would be bad. While wget does provide a command-line --exclude-directories option accepting one or more patterns matching URLs to be excluded, these are not general-purpose regular expressions; they're simplistic globs whose * syntax matches zero or more characters excluding /. Since the URLs to be excluded contain arbitrarily many / characters, wget cannot be used to exclude these URLs and hence cannot be used to mirror Wayback-archived sites. Period. End of unfortunate story.
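To make the limitation concrete, here is a sketch using the kearescue.com example from above; the exclusion pattern is illustrative only, not a working rule:

# Goal: exclude every URL *not* under */http://kearescue.com/* -- a rule
# wget's globs cannot express. The best one can do is blacklist external
# hosts one at a time:
wget --recursive \
    --exclude-directories='*/connect.facebook.net' \
    'https://web.archive.org/web/20140517175612/http://kearescue.com'
# Since '*' never matches '/', no pattern can span an arbitrary number of
# path components, and every newly discovered external host demands yet
# another pattern. The whitelist we actually want cannot be written.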

This issue has been on public record since at least 2009. It has yet to be resolved. Next!

Prior Solution 2: Scrapbook

Prinz recommends ScrapBook, a Firefox plugin. A Firefox plugin.

That was probably all you needed to know. While ScrapBook's Filter by String... functionality does address the aforementioned "Problem 2 + Solution," it does not address the subsequent "Problem 3 + Solution" – namely, the problem of extraneous duplicates.

It's questionable whether ScrapBook even adequately addresses the former problem. As mpy admits:

Although Scrapbook failed so far to grab the site completely...

Unreliable and overly simplistic solutions are non-solutions. Next!

Prior Solution 3: wget + Privoxy

mpy then provides a robust solution leveraging both wget and Privoxy. While wget is reasonably simple to configure, Privoxy is anything but reasonable. Or simple.

Due to the imponderable technical hurdle of properly installing, configuring, and using Privoxy, we have yet to confirm mpy's solution. It should work in a scalable, robust manner. Given the barriers to entry, this solution is probably more appropriate for large-scale automation than for the average webmaster attempting to recover a small- to medium-scale site.

Is wget + Privoxy worth a look? Absolutely. But most superusers might be better served by simpler, more readily applicable solutions.

New Solution: httrack

Enter httrack, a command-line utility implementing a superset of wget's mirroring functionality. httrack supports both pattern-based URL exclusion and simplistic site restructuring. The former solves mpy's "Problem 2 + Solution"; the latter, "Problem 3 + Solution."
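To illustrate the filter semantics in isolation first (a minimal sketch, using the kearescue.com example and assuming httrack's usual scan-rule behavior, where a later filter overrides an earlier one):

# Reject everything by default, then re-admit only URLs containing the target domain:
httrack 'https://web.archive.org/web/20140517175612/http://kearescue.com' \
    '-*' \
    '+*/kearescue.com/*'
# '-*'                  excludes every discovered URL;
# '+*/kearescue.com/*'  overrides that exclusion for the target site's assets.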

In the abstract example below, replace:

  • ${wayback_url} with the URL of the top-level directory archiving the entirety of your target site (e.g., 'https://web.archive.org/web/20140517175612/http://kearescue.com').
  • ${domain_name} with the domain name present in ${wayback_url}, excluding the http:// prefix (e.g., 'kearescue.com').

Here we go. Install httrack, open a terminal window, cd to the local directory you'd like your site to be downloaded to, and run the following command:

httrack \
    ${wayback_url} \
    '-*' \
    '+*/${domain_name}/*' \
    -N1005 \
    --advanced-progressinfo \
    --can-go-up-and-down \
    --display \
    --keep-alive \
    --mirror \
    --robots=0 \
    --user-agent='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' \
    --verbose

On completion, the current directory should contain one subdirectory for each filetype mirrored from that URL. This usually includes at least:

  • css, containing all mirrored CSS stylesheets.
  • html, containing all mirrored HTML pages.
  • js, containing all mirrored JavaScript.
  • ico, containing one mirrored favicon.
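As a rough sketch, the resulting layout looks something like this (directory names follow the list above; anything beyond them depends on the site being mirrored):

.
├── css/    mirrored stylesheets
├── html/   mirrored HTML pages
├── ico/    the mirrored favicon
└── js/     mirrored JavaScript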

Since httrack internally rewrites all downloaded content to reflect this structure, your site should now be browsable as is. If you prematurely halted the above command and would like to continue downloading, append the --continue option to the exact same command and retry.
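For example, resuming would look like this (a sketch; every option other than the appended --continue stays exactly as in the original command):

httrack \
    ${wayback_url} \
    '-*' \
    '+*/${domain_name}/*' \
    -N1005 \
    --mirror \
    --continue
    # ...plus the remaining options from the original command, unchanged.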

That's it. No external contortions, error-prone URL rewriting, or rule-based proxy servers required.

Enjoy, fellow superusers.
