Shell – cat on big files does not work

catcompressionshell

I'm trying to concatenate four big files in two. The files *_1P.gz contain the same amount of lines as the corrisponding *_2P.gz.

The files A_1P.gz and A_2P.gz both contain 1104507560 lines.
The files B_1P.gz and B_2P.gz both contain 1182136972 lines.

However, cat A_1P.gz B_1P.gz > C_1P.gz| wc -l returns 186974687 lines, and cat A_2P.gz B_2P.gz > C_2P.gz| wc -l returns 182952523 lines, so both are not only way smaller than the two input files (they should be more than 2B lines long and they're less than 2M instead), but also they have a different number of lines. The command ran showing no errors whatsoever.

I can't understand what's happening, I generated those four big files with cat as well and it worked properly.

What could the problem be?
What other options do I have to concatenate gzipped files without using cat?

I'm working on a CentOS server. I still have 197G space, so that shouldn't be an issue (or it should show an error, at least).

Best Answer

Note that the files are compressed. You can't therefore use wc -l on the files directly to count the original number of lines in them without decompressing them first.

It's OK to use cat for concatenating these types of compressed files as the resulting file is a valid compressed file in itself. Uncompressing it later would result in a file that is the concatenation of the uncompressed data from the two files.

cat A_1P.gz B_1P.gz >C_1P.gz

To count the number of lines in C_1P.gz:

zcat C_1P.gz | wc -l

gunzip -c C_1P.gz | wc -l

gzip -dc C_1P.gz | wc -l

but note that we need to uncompress the file to count the lines, otherwise we'll be counting the "random" newlines that the file compression algorithm generates as part of the compressed data (these have nothing to do with the lines in your uncompressed file).

Related Solutions

(cp is to cat AS mv is to ?) mv multiple files into one file instead of cat * rm *

The largest obstacle to a tool like this existing is that unless each file's size (except the last one) being concatenated is exactly divisible by the block size (I'm a little uncertain about the right terminology here), you'll end up with "gaps" with garbage data between your concatenated files in the final file.

This is because file data is typically stored in blocks with specific sizes on the file system, such that a 618 byte file stored on a file system using 32 byte blocks would take up 618 / 32 = 19.3125 blocks, i.e. 19 full blocks, and about 1/3 of an additional block.

Assuming you wanted to combine two files like this without regarding my obstacle, you'd simply point the "new file" to the blocks of the first file, plus the blocks of the second file, right?

With that naïve approach, you'd end up with a file of 40 blocks, with its block 20 being 1/3 sensible and 2/3 garbage, and block 21 starting the second file's data.

With some file formats, you might be able to do some clever calculations and manipulations of file headers to basically tell the application that will be using the file to skip the garbage parts, but that's more of a band-aid solution than a proper one.

Echo vs Cat – Understanding Execution Time Differences

There are several things to consider here.

i=`cat input`

can be expensive and there's a lot of variations between shells.

That's a feature called command substitution. The idea is to store the whole output of the command minus the trailing newline characters into the i variable in memory.

To do that, shells fork the command in a subshell and read its output through a pipe or socketpair. You see a lot of variation here. On a 50MiB file here, I can see for instance bash being 6 times as slow as ksh93 but slightly faster than zsh and twice as fast as yash.

The main reason for bash being slow is that it reads from the pipe 128 bytes at a time (while other shells read 4KiB or 8KiB at a time) and is penalised by the system call overhead.

zsh needs to do some post-processing to escape NUL bytes (other shells break on NUL bytes), and yash does even more heavy-duty processing by parsing multi-byte characters.

All shells need to strip the trailing newline characters which they may be doing more or less efficiently.

Some may want to handle NUL bytes more gracefully than others and check for their presence.

Then once you have that big variable in memory, any manipulation on it generally involves allocating more memory and coping data across.

Here, you're passing (were intending to pass) the content of the variable to echo.

Luckily, echo is built-in in your shell, otherwise the execution would have likely failed with an arg list too long error. Even then, building the argument list array will possibly involve copying the content of the variable.

The other main problem in your command substitution approach is that you're invoking the split+glob operator (by forgetting to quote the variable).

For that, shells need to treat the string as a string of characters (though some shells don't and are buggy in that regard) so in UTF-8 locales, that means parsing UTF-8 sequences (if not done already like yash does), look for $IFS characters in the string. If $IFS contains space, tab or newline (which is the case by default), the algorithm is even more complex and expensive. Then, the words resulting from that splitting need to be allocated and copied.

The glob part will be even more expensive. If any of those words contain glob characters (*, ?, [), then the shell will have to read the content of some directories and do some expensive pattern matching (bash's implementation for instance is notoriously very bad at that).

If the input contains something like /*/*/*/../../../*/*/*/../../../*/*/*, that will be extremely expensive as that means listing thousands of directories and that can expand to several hundred MiB.

Then echo will typically do some extra processing. Some implementations expand \x sequences in the argument it receives, which means parsing the content and probably another allocation and copy of the data.

On the other hand, OK, in most shells cat is not built-in, so that means forking a process and executing it (so loading the code and the libraries), but after the first invocation, that code and the content of the input file will be cached in memory. On the other hand, there will be no intermediary. cat will read large amounts at a time and write it straight away without processing, and it doesn't need to allocate huge amount of memory, just that one buffer that it reuses.

It also means that it's a lot more reliable as it doesn't choke on NUL bytes and doesn't trim trailing newline characters (and doesn't do split+glob, though you can avoid that by quoting the variable, and doesn't expand escape sequence though you can avoid that by using printf instead of echo).

If you want to optimise it further, instead of invoking cat several times, just pass input several times to cat.

yes input | head -n 100 | xargs cat

Will run 3 commands instead of 100.

To make the variable version more reliable, you'd need to use zsh (other shells can't cope with NUL bytes) and do it:

zmodload zsh/mapfile
var=$mapfile[input]
repeat 10 print -rn -- "$var"

If you know the input doesn't contain NUL bytes, then you can reliably do it POSIXly (though it may not work where printf is not builtin) with:

i=$(cat input && echo .) || exit # add an extra .\n to avoid trimming newlines
i=${i%.} # remove that trailing dot (the \n was removed by cmdsubst)
n=10
while [ "$n" -gt 10 ]; do
  printf %s "$i"
  n=$((n - 1))
done

But that is never going to be more efficient than using cat in the loop (unless the input is very small).

Best Answer

Related Solutions

(cp is to cat AS mv is to ?) mv multiple files into one file instead of cat * rm *

Echo vs Cat – Understanding Execution Time Differences

Related Question