Avoiding fragmentation
The secret is to not write uncompressed files on the disk to begin with.
Indeed, after you compress an already existing large file it will become horrendously fragmented due to the nature of the NTFS in-place compression algorithm.
Instead, you can avoid this drawback altogether by making the OS compress a file's content on the fly, before writing it to the disk. This way compressed files are written to the disk like any normal files, without unintentional gaps. For this purpose you need to create a compressed folder. (The same way you mark files to be compressed, you can mark folders to be compressed.) Afterwards, all files written to that folder will be compressed on the fly (i.e. written as streams of compressed blocks). Files compressed this way can still end up somewhat fragmented, but it will be a far cry from the mess that in-place NTFS compression creates.
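As a minimal command-line sketch of marking such a folder (the path is a made-up example):

```
:: Mark a folder as compressed so that anything written into it from now on
:: is NTFS-compressed on the fly; the folder path is hypothetical.
compact /c "D:\CompressedBackups"

:: Optionally also compress whatever already sits inside it (and its subfolders).
compact /c /s:"D:\CompressedBackups"
```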
Example
NTFS compression shrank a 232 MB system image to 125 MB:
- In-place compression created a whopping 2680 fragments!
- On-the-fly compression created 19 fragments.
Defragmentation
It's true that NTFS compressed files can pose a problem for some defragmentation tools. For example, a tool I normally use can't handle them efficiently; it slows down to a crawl. Fret not: the trusty old Contig from Sysinternals does the job of defragmenting NTFS compressed files quickly and effortlessly!
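A quick sketch of how that might look (the file path is a hypothetical example):

```
:: Report how fragmented the compressed file currently is...
contig -a "D:\CompressedBackups\system.img"

:: ...then make it contiguous again (-v prints what Contig is doing).
contig -v "D:\CompressedBackups\system.img"
```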
Given your details, I assume you have verified that your files really do have 99% of their data in common, with a contiguous (or almost contiguous) 1% difference between them.
First, you should use tar to make one archive with all your files inside it. For a test, I would create a .tar with 10 files, giving an archive of about 300 MB.
Then, using xz, you have to set it so that the dictionary is bigger than the size of one file. Since you don't say if you have memory restrictions, I'd go with xz -9. There's no point in not using all available memory.
I'd also try the --extreme modifier, to test whether it makes a difference.
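A minimal sketch of that test, assuming ten files of roughly 30 MB each with made-up names:

```
# Pack the (hypothetical) files into a single tar archive first.
tar -cf test.tar file01.bin file02.bin file03.bin   # ... through file10.bin

# Compress with the highest preset, keeping the original tar for comparison;
# -9e is shorthand for -9 --extreme.
xz -9 --keep test.tar
```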
Dictionary size
In one piece of documentation that I have available - site - it's said that the dictionary size is roughly equal to the decompressor memory usage, and that the -1 preset means a dictionary of 1 MiB, while -6 means 10 MiB (or 8 MiB in another part of the same manual). That's why you're not getting any advantage from tarring those files together: the dictionary is smaller than a single file, so xz never sees the redundancy between the files. Using -9 would make the decompressor memory usage (and thus the dictionary) 64 MiB, and I think that is what you wanted.
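If a preset still doesn't give a big enough dictionary, it can also be set explicitly; the size below is only an illustration, not a recommendation:

```
# Set the LZMA2 dictionary by hand on top of the -9e preset.
# Decompressing the result will need roughly this much memory.
xz --lzma2=preset=9e,dict=192MiB --keep test.tar
```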
Edit
Another possibility would be to use a different compressor. I'd go with 7zip, but I would still tar those files first and then 7zip the tarball.
Depending on your files' content, perhaps you could use 7zip with the PPMd method (instead of LZMA or LZMA2, which is the default and the same algorithm used by xz).
Not good: Zip (dict = 32 kB), Bzip (dict = 900 kB).
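A hedged sketch of that variant (archive and file names are made up):

```
# Tar the files first, then let 7z compress the tarball with PPMd
# instead of the default LZMA2.
tar -cf files.tar file01.bin file02.bin   # ... and so on
7z a -m0=PPMd files.tar.7z files.tar
```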
Best Answer
Use forensic imaging software like GUYMAGER (open source, on sourceforge.net). It has a nice UI that lets you quickly create a compressed disk image of an entire hard disk.
Use "Advanced forensic image (.aff)" This creates a single, compressed file (well, it also creates an .info file).
The default compression rate is 1 (fastest, but least compression). If you have a fast computer with lots of cores, you can change this by creating

/etc/guymager/local.cfg

and raising the compression level there: 9 is the best but slowest compression, 3 gives good compression with good performance.

Update
Mounting isn't as simple as it seems. First of all, you need AFFLIB (Ubuntu: aptitude install afflib-tools). Now you can get the raw disk image with

affuse <image> <mount-point>
But for some reason, mounting the raw image fails. parted says the first partition starts at 1048576 B, but mount fails with the usual useless mount error:

and dmesg says:
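For what it's worth, a rough sketch of the next thing one might try, based on the 1048576 B offset reported by parted; the image and mount-point names are hypothetical, and the exact name of the raw file exposed by affuse is an assumption:

```
# Expose the AFF image as a raw file via FUSE, then try mounting the first
# partition read-only at the byte offset reported by parted.
sudo mkdir -p /mnt/aff /mnt/disk
sudo affuse disk.aff /mnt/aff
sudo mount -o ro,loop,offset=1048576 /mnt/aff/disk.aff.raw /mnt/disk
```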