Shell – cat on big files does not work

cat, compression, shell

I'm trying to concatenate four big files into two. Each *_1P.gz file contains the same number of lines as the corresponding *_2P.gz file.

The files A_1P.gz and A_2P.gz both contain 1104507560 lines.
The files B_1P.gz and B_2P.gz both contain 1182136972 lines.

However, cat A_1P.gz B_1P.gz > C_1P.gz | wc -l returns 186974687 lines, and cat A_2P.gz B_2P.gz > C_2P.gz | wc -l returns 182952523 lines. Both results are not only far smaller than the two input files (they should be more than 2B lines long, and they are less than 200M instead), but they also differ from each other in their number of lines. The commands ran without showing any errors.

I can't understand what's happening: I generated those four big files with cat as well, and it worked properly.

  • What could the problem be?
  • What other options do I have to concatenate gzipped files without using cat?

I'm working on a CentOS server. I still have 197G of free space, so that shouldn't be an issue (or it should show an error, at least).

Best Answer

Note that the files are compressed. You therefore can't use wc -l on the files directly to count the original number of lines in them; they have to be decompressed first.

It's OK to use cat for concatenating these types of compressed files as the resulting file is a valid compressed file in itself. Uncompressing it later would result in a file that is the concatenation of the uncompressed data from the two files.

cat A_1P.gz B_1P.gz >C_1P.gz
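As a small-scale check of this claim, here is a minimal sketch that builds two tiny gzip files, joins them with cat, and counts the decompressed lines. The file names small_A.gz, small_B.gz and small_C.gz and the scratch directory are hypothetical stand-ins, not the files from the question.

```shell
# Scratch directory so that no real data is touched (hypothetical setup).
tmpdir=$(mktemp -d)

# Two small stand-in files: 3 lines and 2 lines before compression.
printf 'a1\na2\na3\n' | gzip -c > "$tmpdir/small_A.gz"
printf 'b1\nb2\n' | gzip -c > "$tmpdir/small_B.gz"

# Concatenate the compressed files directly, as in the answer.
cat "$tmpdir/small_A.gz" "$tmpdir/small_B.gz" > "$tmpdir/small_C.gz"

# gzip -t tests the integrity of the joined file without writing output.
gzip -t "$tmpdir/small_C.gz" && echo "small_C.gz is a valid gzip file"

# Count the lines of the decompressed data (tr strips padding that
# some wc implementations add).
joined_lines=$(gzip -dc "$tmpdir/small_C.gz" | wc -l | tr -d ' ')
echo "decompressed lines: $joined_lines"
```

The decompressed count should be the sum of the two inputs (3 + 2 lines here), which is the same check one would apply to the real files.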

To count the number of lines in C_1P.gz:

zcat C_1P.gz | wc -l

or

gunzip -c C_1P.gz | wc -l

or

gzip -dc C_1P.gz | wc -l

Note that we need to uncompress the file to count the lines. Otherwise, we would be counting the "random" newline bytes that happen to occur in the compressed data (these have nothing to do with the lines in your uncompressed file).
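To illustrate the difference, the following sketch compares wc -l on a compressed file with wc -l on its decompressed contents. The file name demo.gz is a hypothetical stand-in. The raw count depends on the compressed bytes, so no particular value should be expected for it.

```shell
# Scratch directory and a stand-in file with exactly 1000 lines.
tmpdir=$(mktemp -d)
seq 1 1000 | gzip -c > "$tmpdir/demo.gz"

# Counting newline bytes in the compressed data: a meaningless number.
raw_lines=$(wc -l < "$tmpdir/demo.gz" | tr -d ' ')

# Counting lines after decompression: the real line count.
real_lines=$(gzip -dc "$tmpdir/demo.gz" | wc -l | tr -d ' ')

echo "newline bytes in compressed file: $raw_lines"
echo "lines in decompressed data:       $real_lines"

# For the files in the question, the decompressed count of C_1P.gz
# should equal the sum of the two input counts given there.
expected_total=$((1104507560 + 1182136972))
echo "expected lines in C_1P.gz: $expected_total"
```

If gzip -dc C_1P.gz | wc -l reports the expected total (2286644532 for the counts given in the question), the concatenation worked as intended.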
