How to make wget download recursively, combining --accept with --exclude-directories

http wget

I'm trying to download some directories from an Apache server, but I need to skip some directories that contain huge files I don't care about.

The directory structure on the server looks roughly like this (simplified):

somedir/
├── atxt.txt
├── big_file.pdf
├── image.jpg
└── tmp
    └── tempfile.txt

So I want to get all the .txt and .jpg files, but I DON'T want the .pdf files or anything inside a tmp directory.

I've tried using --exclude-directories together with --accept and then with --reject, but in both attempts it keeps downloading the tmp dir and its contents.

These are the commands I've tried:

# with --reject
wget -nH --cut-dirs=2 -r --reject=pdf --exclude-directories=tmp \
         --no-parent  http://<host>/pub/somedir/

# with --accept
wget -nH --cut-dirs=2 -r --accept=txt,jpg --exclude-directories=tmp \
         --no-parent  http://<host>/pub/somedir/

Is there a way to do this?

How exactly is --exclude-directories supposed to work?

Best Answer

Rather than trying to do this with wget, I'd suggest a tool better suited to downloading complex "sets" of files selected by filters.

You can use httrack either to download entire directories of files (essentially mirroring everything from a site) or to download only what matches a filter, for example only files with a given extension such as .pdf.

You can read more about httrack's filter capability, which is what you'd need if you only wanted to download files that are named in a specific way.

Here are some examples of the wildcard capability:

  • *[file] or *[name] - any filename or name, i.e. any characters except /, ? and ;
  • *[path] - any path (and filename), i.e. any characters except ? and ;
  • *[a,z,e,r,t,y] - any letters among a,z,e,r,t,y
  • *[a-z] - any letters
  • *[0-9,a,z,e,r,t,y] - any characters among 0..9 and a,z,e,r,t,y

Example

$ httrack http://url.com/files/ '-*' '+1_[a-z].doc' -O /dir/to/output

The switches are as follows:

  • -* - remove everything from list of things to download
  • +1_[a-z].doc - download files named 1_a.doc, 1_b.doc, etc.
  • -O /dir/to/output - write results here
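
Applied to the directory layout in the question, the same idea might look like the sketch below. It is untested against a real server and rests on a few assumptions: ./somedir-mirror is a made-up output directory, <host> is the question's own placeholder, and the filters rely on httrack letting a later rule override an earlier one. The script only prints the command as a dry run; the real call is left commented out.

```shell
#!/bin/sh
# Assumed values: replace the placeholder host and the output directory.
url='http://<host>/pub/somedir/'
out='./somedir-mirror'    # hypothetical output directory

# Filters are read left to right, and a later rule overrides an earlier one:
#   -*        start by excluding everything
#   +*.txt    then allow .txt files
#   +*.jpg    and .jpg files
#   -*/tmp/*  finally exclude anything under a tmp directory
# Single quotes keep the shell from expanding * before httrack sees it.
set -- "$url" -O "$out" '-*' '+*.txt' '+*.jpg' '-*/tmp/*'

# Dry run: print the command instead of running it.
cmd="httrack $*"
printf '%s\n' "$cmd"

# Remove the leading '#' to start the download for real:
# httrack "$@"
```

The order of the filters matters here: putting -*/tmp/* last is what should keep tmp/tempfile.txt out even though it also matches +*.txt. I have not checked how -* interacts with deeper directory trees, so a larger site may need an extra + rule for the directories you do want.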