I am trying to use wget to create a local mirror of a website, but I am finding that I am not getting all of the linked pages.

Here is the website:

http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/

I don't want all pages that begin with web.archive.org, but I do want all pages that begin with http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/.

When I use `wget -r`, in my file structure I find

web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/index.html,

but I don't have all files that are part of this database, e.g.

web.archive.org/web/20110808041151/http://cst-www.nrl.navy.mil/lattice/struk/d0c.html.
Perhaps httrack would do better, but right now that's grabbing too much.
So, by which means is it possible to grab a local copy of an archived website from the Internet Archive Wayback Machine?
Best Answer
While helpful, prior responses fail to concisely, reliably, and repeatably solve the underlying question. In this post, we briefly detail the difficulties with each and then offer a modest `httrack`-based solution.

Background

Before we get to that, however, consider perusing mpy's well-written response. In that sadly neglected post, mpy rigorously documents the Wayback Machine's obscure (and, honestly, obfuscatory) archival scheme.

Unsurprisingly, it ain't pretty. Rather than sanely archiving sites into a single directory, the Wayback Machine ephemerally spreads a single site across two or more numerically identified sibling directories. To say that this complicates mirroring would be a substantial understatement.

Understanding the horrible pitfalls presented by this scheme is core to understanding the inadequacy of prior solutions. Let's get on with it, shall we?
Prior Solution 1: wget
The related StackOverflow question "Recover old website off waybackmachine" is probably the worst offender in this regard, recommending `wget` for Wayback mirroring. Naturally, that recommendation is fundamentally unsound.

In the absence of complex external URL rewriting (e.g., `Privoxy`), `wget` cannot be used to reliably mirror Wayback-archived sites. As mpy details under "Problem 2 + Solution," whatever mirroring tool you choose must allow you to non-transitively download only URLs belonging to the target site. By default, most mirroring tools transitively download all URLs belonging to both the target site and sites linked to from that site – which, in the worst case, means "the entire Internet."

A concrete example is in order. When mirroring the example domain `kearescue.com`, your mirroring tool must:

* Include all URLs matching `https://web.archive.org/web/*/http://kearescue.com`. These are assets provided by the target site (e.g., `https://web.archive.org/web/20140521010450js_/http_/kearescue.com/media/system/js/core.js`).
* Exclude all other URLs (e.g., `https://web.archive.org/web/20140517180436js_/https_/connect.facebook.net/en_US/all.js`).

Failing to exclude such URLs typically pulls in all or most of the Internet archived at the time the site was archived, especially for sites embedding externally hosted assets (e.g., YouTube videos).
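The include/exclude rule above can be sketched as a toy URL classifier. The `is_target` helper and its simplified regex are assumptions for illustration only; real Wayback URL schemes vary more widely than this pattern allows:

```shell
# Toy classifier for the include/exclude rule described above.
# Illustrative only: real Wayback URLs take more forms than this regex covers.
is_target() {
    printf '%s\n' "$1" |
        grep -Eq '^https://web\.archive\.org/web/[^/]+/https?_?/kearescue\.com(/|$)'
}

target='https://web.archive.org/web/20140521010450js_/http_/kearescue.com/media/system/js/core.js'
external='https://web.archive.org/web/20140517180436js_/https_/connect.facebook.net/en_US/all.js'

# The target-site asset is included; the externally hosted asset is not.
is_target "$target"   && echo "include: $target"
is_target "$external" || echo "exclude: $external"
```

Any mirroring tool fit for this job must let you express exactly this kind of rule over full archived URLs, slashes and all.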
That would be bad. While `wget` does provide a command-line `--exclude-directories` option accepting one or more patterns matching URLs to be excluded, these are not general-purpose regular expressions; they're simplistic globs whose `*` syntax matches zero or more characters excluding `/`. Since the URLs to be excluded contain arbitrarily many `/` characters, `wget` cannot be used to exclude these URLs and hence cannot be used to mirror Wayback-archived sites. Period. End of unfortunate story.

This issue has been on public record since at least 2009. It has yet to be resolved. Next!
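To make this limitation concrete, here is a small sketch. The `glob_to_regex` helper is hypothetical (wget's real matcher lives in its C source); it merely reproduces the documented semantics of `*` not crossing `/`:

```shell
# Emulate wget's --exclude-directories glob semantics: '*' matches zero or
# more characters EXCLUDING '/'.  (Hypothetical helper, for illustration.)
glob_to_regex() {
    printf '%s' "$1" | sed -e 's/\./\\./g' -e 's|\*|[^/]*|g'
}

# A typical externally hosted asset directory under the Wayback Machine:
dir='/web/20140517180436js_/https_/connect.facebook.net/en_US'

# Even the catch-all exclusion '/web/*' cannot match it, because the single
# '*' would have to span several '/'-separated path components.
regex="^$(glob_to_regex '/web/*')\$"
if printf '%s\n' "$dir" | grep -Eq "$regex"; then
    result='excluded'
else
    result='not excluded'
fi
echo "$result"    # prints "not excluded": the glob fails to cover the URL
```

Since every archived external URL embeds the original URL's own slashes, no finite set of such globs can exclude them all.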
Prior Solution 2: ScrapBook
Prinz recommends `ScrapBook`, a Firefox plugin. A Firefox plugin.

That was probably all you needed to know. While `ScrapBook`'s `Filter by String...` functionality does address the aforementioned "Problem 2 + Solution," it does not address the subsequent "Problem 3 + Solution" – namely, the problem of extraneous duplicates.

It's questionable whether `ScrapBook` even adequately addresses the former problem, as mpy himself admits. Unreliable and overly simplistic solutions are non-solutions. Next!
Prior Solution 3: wget + Privoxy
mpy then provides a robust solution leveraging both `wget` and `Privoxy`. While `wget` is reasonably simple to configure, `Privoxy` is anything but reasonable. Or simple.

Due to the imponderable technical hurdle of properly installing, configuring, and using `Privoxy`, we have yet to confirm mpy's solution. It should work in a scalable, robust manner. Given the barriers to entry, this solution is probably more appropriate to large-scale automation than the average webmaster attempting to recover small- to medium-scale sites.

Is `wget` + `Privoxy` worth a look? Absolutely. But most superusers might be better served by simpler, more readily applicable solutions.

New Solution: httrack
Enter `httrack`, a command-line utility implementing a superset of `wget`'s mirroring functionality. `httrack` supports both pattern-based URL exclusion and simplistic site restructuring. The former solves mpy's "Problem 2 + Solution"; the latter, "Problem 3 + Solution."

In the abstract example below, replace:

* `${wayback_url}` by the URL of the top-level directory archiving the entirety of your target site (e.g., `'https://web.archive.org/web/20140517175612/http://kearescue.com'`).
* `${domain_name}` by the same domain name present in `${wayback_url}` excluding the prefixing `http://` (e.g., `'kearescue.com'`).
Here we go. Install `httrack`, open a terminal window, `cd` to the local directory you'd like your site to be downloaded to, and run the following command.

On completion, the current directory should contain one subdirectory for each filetype mirrored from that URL. This usually includes at least:
* `css`, containing all mirrored CSS stylesheets.
* `html`, containing all mirrored HTML pages.
* `js`, containing all mirrored JavaScript.
* `ico`, containing one mirrored favicon.

Since `httrack` internally rewrites all downloaded content to reflect this structure, your site should now be browsable as is without modification. If you prematurely halted the above command and would like to continue downloading, append the `--continue` option to the exact same command and retry.

That's it. No external contortions, error-prone URL rewriting, or rule-based proxy servers required.
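For reference, an invocation along these lines should behave as described. This is a hedged reconstruction, not a verbatim command: the `-O`, filter, and `--robots` options are documented `httrack` features, but the exact flags used in the original answer may have differed.

```shell
# Assumed example values; substitute your own per the instructions above.
wayback_url='https://web.archive.org/web/20140517175612/http://kearescue.com'
domain_name='kearescue.com'

# '-*' first excludes every URL; the '+' filter then re-includes only URLs
# mentioning the target domain, implementing the include/exclude logic
# described earlier.  '--robots=0' ignores robots.txt, which the Wayback
# Machine serves restrictively.  '-O .' mirrors into the current directory.
httrack "${wayback_url}" \
    -O . \
    '-*' \
    '+*'"${domain_name}"'*' \
    --robots=0
```

If the download stalls or is interrupted, rerun the same command with `--continue` appended, as noted above.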
Enjoy, fellow superusers.