How to recursively grep through compressed archives

archive find grep recursive zip

I'm trying to find out which modules on CPAN use Test::Version, so I've used minicpan to mirror it. My problem is that I need to iterate through the downloaded archives and grep the files inside them. Can anyone tell me how I might do this? Preferably in a way that tells me which file in the archive matched and what line the match is on.

(Note: they aren't all tarballs; some are zip files.)

Best Answer

OK, let's apply the Unix philosophy. What are the components of this task?

  • Text search: you need a tool to search text in a file, such as grep.
  • Recursive: you need a tool to go looking for files in a directory tree, such as find.
  • Archives: you need a tool to read them.

Most Unix programs operate on files. So to operate easily on archive components, you need to access them as files; in other words, you need to access the archives as directories.

The AVFS filesystem presents a view of the filesystem where every archive file /path/to/foo.zip is accessible as a directory ~/.avfs/path/to/foo.zip#. AVFS provides read-only access to most common archive file formats.
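To make the path mapping concrete, here is a minimal sketch of a helper that prints the AVFS directory view of an archive. The function name avfs_view is made up for illustration; it assumes AVFS is mounted at ~/.avfs and that the argument is an absolute path.

```shell
# Hypothetical helper (not part of AVFS): print the AVFS directory view
# of an archive. Assumes the AVFS mount point is ~/.avfs and that $1 is
# an absolute path to the archive.
avfs_view() {
    printf '%s\n' "$HOME/.avfs$1#"
}

# Example, once AVFS is mounted:
#   ls "$(avfs_view /path/to/foo.zip)"
```

The helper only builds the path; listing or reading it works only while AVFS is mounted.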

mountavfs
find ~/.avfs"$PWD" \( -name '*.zip' -o -name '*.tar.gz' -o -name '*.tgz' \) \
     -exec sh -c '
                  find "$0#" -name "*.pm" -exec grep "$1" {\} +
                 ' {} 'Test::Version' \;
fusermount -u ~/.avfs   # optional

Explanations:

  • Mount the AVFS filesystem.
  • Look for archive files in ~/.avfs$PWD, which is the AVFS view of the current directory.
  • For each archive, execute the specified shell snippet (with $0 = archive name and $1 = pattern to search).
  • $0# is the directory view of the archive $0.
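  • {\} rather than {} is needed inside the snippet because the outer find may substitute {} even when it appears in the middle of an -exec argument (some implementations do it, some don't). The inner shell reads {\} as {}.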
  • Optional: finally unmount the AVFS filesystem.
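Since the question also asks for line numbers, the inner search step could be written as follows. This is only a sketch: the function name grep_pm_tree is hypothetical, and it adds -n (line numbers), -F (treat the pattern as a fixed string) and a /dev/null argument (so grep prints the file name even if only one file is found), none of which are in the commands above.

```shell
# Hypothetical helper: search every *.pm file under directory $1 for the
# fixed string $2, printing "file:line:text" for each match.
grep_pm_tree() {
    # /dev/null is an extra, always-empty file operand; with two or more
    # operands grep prefixes each match with the file name.
    find "$1" -type f -name '*.pm' -exec grep -nF -- "$2" /dev/null {} +
}

# Example, once AVFS is mounted (the archive name is made up):
#   grep_pm_tree ~/.avfs"$PWD"/Foo-1.0.tar.gz# 'Test::Version'
```

Because the function takes an ordinary directory, it works the same way on an unpacked tree as on the AVFS view of an archive.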

Or in zsh ≥4.3:

mountavfs
grep 'Test::Version' ~/.avfs$PWD/**/*.(tgz|tar.gz|zip)(e\''
     reply=($REPLY\#/**/*.pm(.N))
'\')

Explanations:

  • ~/.avfs$PWD/**/*.(tgz|tar.gz|zip) matches archives in the AVFS view of the current directory and its subdirectories.
  • PATTERN(e\''CODE'\') applies CODE to each match of PATTERN. The name of the matched file is in $REPLY. Setting the reply array turns the match into a list of names.
  • $REPLY\# is the directory view of the archive.
  • $REPLY\#/**/*.pm matches .pm files in the archive.
  • The N glob qualifier makes the pattern expand to an empty list if there is no match; the . qualifier restricts the matches to regular files.