I'm trying to download some directories from an Apache server, but I need to skip some directories that contain huge files I don't care about.
The directory structure on the server looks roughly like this (simplified):
somedir/
├── atxt.txt
├── big_file.pdf
├── image.jpg
└── tmp
    └── tempfile.txt
So, I want to get all the .txt and .jpg files, but I DON'T want the .pdf files or anything that is in a tmp directory.
I've tried using --exclude-directories together with --accept, and then with --reject, but in both attempts it keeps downloading the tmp dir and its contents.
These are the commands I've tried:
# with --reject
wget -nH --cut-dirs=2 -r --reject=pdf --exclude-directories=tmp \
--no-parent http://<host>/pub/somedir/
# with --accept
wget -nH --cut-dirs=2 -r --accept=txt,jpg --exclude-directories=tmp \
--no-parent http://<host>/pub/somedir/
Is there a way to do this? How exactly is --exclude-directories supposed to work?
Best Answer
Rather than try to do this with wget, I'd suggest using a tool better suited to downloading complex "sets" of files or applying filters.
You can use httrack either to download entire directories of files (essentially mirroring everything from a site), or you can give httrack a filter along with specific file extensions, such as downloading only .pdf files.
You can read more about httrack's filter capability, which is what you'd need if you were interested in downloading only files that are named in a specific way. Here are some examples of the wildcard capability:
*[file] or *[name] - any filename or name, e.g. not /, ? and ; characters
*[path] - any path (and filename), e.g. not ? and ; characters
*[a,z,e,r,t,y] - any letters among a,z,e,r,t,y
*[a-z] - any letters
*[0-9,a,z,e,r,t,y] - any characters among 0..9 and a,z,e,r,t,y
Example
The switches are as follows:
-* - remove everything from list of things to download
+1_[a-z].doc - download files named 1_a.doc, 1_b.doc, etc.
-O /dir/to/output - write results here
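The switches above can be combined into a single invocation. What follows is only a minimal sketch: the URL http://example.com/files/ is a hypothetical placeholder and not taken from the original, and the filters are copied as listed above. It prints the assembled command rather than running it, so it can be checked first.

```shell
#!/bin/sh
# Hypothetical placeholder URL (an assumption); substitute the real site.
url="http://example.com/files/"
# Output directory, as in the switch list above.
out="/dir/to/output"

# Build the argument list:
#   -*             start by removing everything from the download list
#   +1_[a-z].doc   then re-add files named 1_a.doc, 1_b.doc, etc.
#   -O "$out"      write results to the output directory
# The filters are single-quoted so the shell does not expand the wildcards.
set -- httrack "$url" '-*' '+1_[a-z].doc' -O "$out"

# Print the command for review. To actually run it (requires httrack
# to be installed), replace this line with: "$@"
printf '%s\n' "$*"
```

Quoting matters here: left unquoted, -* and [a-z] could be expanded by the shell against files in the current directory before httrack ever sees them.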