Shell – Gzip large amount of symlinked files

gzip, shell, symlink

I have a folder containing a large number of symlinked files. Each file is on the order of 10-11 GB (fastq files, to be specific). They come from a variety of source folders, but I made sure there is only one level of symlinks.

I'm trying to gzip them by simply doing:

gzip *.fastq

That results in a bunch of

too many levels of symbolic links

And thus fails.

However, when I do:

for i in *.fastq; do gzip -c "$i" > "$i.gz"; done

it does work. My question is simple. What is the difference between those? AFAIK, the only difference is that the second approach starts a new gzip process for each file, whereas the first one should do everything in one process. Can gzip only handle one symlinked file at a time? Doing the same on a test folder with normal files works both ways.

Best Answer

A quick check of the gzip source (specifically, gzip 1.6 as included in Ubuntu 14.04) shows that the observed behavior comes from the function open_and_stat, beginning at line 1037 of gzip.c:

static int
open_and_stat (char *name, int flags, mode_t mode, struct stat *st)
{
  int fd;

  /* Refuse to follow symbolic links unless -c or -f.  */
  if (!to_stdout && !force)
    {
      if (HAVE_WORKING_O_NOFOLLOW)
        flags |= O_NOFOLLOW;
      else
        {
#if HAVE_LSTAT || defined lstat
          if (lstat (name, st) != 0)
            return -1;
          else if (S_ISLNK (st->st_mode))
            {
              errno = ELOOP;
              return -1;
            }
#endif
        }
    }

  fd = OPEN (name, flags, mode);
  if (0 <= fd && fstat (fd, st) != 0)
    {
      int e = errno;
      close (fd);
      errno = e;
      return -1;
    }
  return fd;
}

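Note that the comment line states that gzip will not follow symlinks unless it is called with the -c or -f flags. Without those flags there are two paths to the same error. Where O_NOFOLLOW works, it is added to the open flags, and the open call itself fails with ELOOP when the name refers to a symlink. Otherwise, inside the #if ... #endif, the errno variable is set to ELOOP explicitly if the file to be compressed is a symlink. Either way, gzip reports the message for ELOOP, "Too many levels of symbolic links", even though no link loop exists.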
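The behavior described above can be reproduced in a scratch directory. This is a minimal sketch, assuming GNU gzip and a POSIX shell with mktemp available; the file names and the plain_status variable are made up for the illustration.

```shell
#!/bin/sh
# Work in a throwaway directory so no real data is touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf 'hello world\n' > realfile.txt
ln -s realfile.txt symlink.txt

# Without -c or -f, gzip refuses the symlink and exits non-zero.
if gzip symlink.txt 2> err.log; then
    plain_status=0
else
    plain_status=$?
fi
cat err.log
echo "exit status without -c/-f: $plain_status"

# With -c, gzip follows the link and writes to stdout;
# the symlink itself is left untouched.
gzip -c symlink.txt > symlink.txt.gz
gunzip -c symlink.txt.gz
```

The first call should print the ELOOP message and leave both files as they were; the second should produce a regular symlink.txt.gz next to the still-intact link.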

Now, from the gzip(1) man page, the -c and -f flags are:

   -c --stdout --to-stdout
         Write output on standard output; keep original files unchanged. If there are
         several input files, the output consists of a sequence of independently
         compressed members. To obtain better compression, concatenate all input files
         before compressing them.

   -f --force
         Force compression or decompression even if the file has multiple links or the
         corresponding file already exists, or if the compressed data is read from or
         written to a terminal. If the input data is not in a format recognized by gzip,
         and if the option --stdout is also given, copy the input data without change to
         the standard output: let zcat behave as cat. If -f is not given, and when not
         running in the background, gzip prompts to verify whether an existing file
         should be overwritten.
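The "sequence of independently compressed members" wording matters if -c is used with several files in one call, because everything lands in a single stream. A small sketch of that effect, assuming GNU gzip; the file names a.txt, b.txt and both.gz are invented for the example.

```shell
#!/bin/sh
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf 'first\n'  > a.txt
printf 'second\n' > b.txt

# One gzip call, two inputs: the output holds two gzip members back to back.
gzip -c a.txt b.txt > both.gz

# Decompressing yields the inputs concatenated; the file boundary is gone.
joined=$(gunzip -c both.gz)
expected=$(cat a.txt b.txt)
printf '%s\n' "$joined"
```

This is why the loop in the question redirects each file to its own .gz instead of running a single gzip -c over the whole glob.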

Putting it all together and going back to the original question:

  • The first example fails because gzip, called without -c or -f, refuses to open a symlink and reports ELOOP, even though there is no actual link loop.
  • The second uses the -c flag, so gzip follows the link, reads the contents of the original file, and writes the compressed output to stdout; it succeeds and leaves the symlink in place.
  • A third scenario is using -f instead of -c. In this case, gzip does not complain when trying to compress a symlink, but it replaces the symlink with a regular .gz file, so upon decompression the result is a regular file rather than a link, as shown:
$ ls -l
total 4
-rw-rw-r-- 1 x86tux x86tux 13 Jun 16 13:10 realfile.txt
lrwxrwxrwx 1 x86tux x86tux 12 Jun 16 23:40 symlink.txt -> realfile.txt
$ gzip symlink.txt
gzip: symlink.txt: Too many levels of symbolic links
$ gzip -f symlink.txt
$ ls -l
total 8
-rw-rw-r-- 1 x86tux x86tux 13 Jun 16 13:10 realfile.txt
-rw-rw-r-- 1 x86tux x86tux 45 Jun 16 13:10 symlink.txt.gz
$ gunzip symlink.txt.gz
$ ls -l
total 8
-rw-rw-r-- 1 x86tux x86tux 13 Jun 16 13:10 realfile.txt
-rw-rw-r-- 1 x86tux x86tux 13 Jun 16 13:10 symlink.txt
$ md5sum *
618f486e0225d305d16d0648ed44b1eb  realfile.txt
618f486e0225d305d16d0648ed44b1eb  symlink.txt
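
For the original use case, where the symlinks should stay and a .gz should appear next to each of them, the -c approach can be wrapped in a small helper. This is only a sketch under assumptions: compress_links is a hypothetical function name, the demo directory layout is invented, and GNU gzip is assumed.

```shell
#!/bin/sh
# compress_links DIR (hypothetical helper): write NAME.fastq.gz next to
# every NAME.fastq in DIR, following symlinks and leaving them in place.
compress_links() {
    dir=$1
    for f in "$dir"/*.fastq; do
        # Skip the literal pattern when nothing matches, and dangling links.
        [ -e "$f" ] || continue
        gzip -c -- "$f" > "$f.gz"
    done
}

# Demo in a scratch directory with one real file and one symlink to it.
demo=$(mktemp -d)
mkdir "$demo/source" "$demo/links"
printf '@read1\nACGT\n+\nIIII\n' > "$demo/source/sample.fastq"
ln -s "$demo/source/sample.fastq" "$demo/links/sample.fastq"

compress_links "$demo/links"
ls -l "$demo/links"
```

Quoting "$f" and iterating over the glob directly avoids parsing ls output, so file names with spaces are handled. Unlike gzip -f, this keeps both the symlink and the uncompressed source file, so disk usage grows until the originals are removed by hand.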