Zip – Understanding the Zip Format’s External File Attribute

zip

This is a slightly exotic question, but there doesn't seem to be much information on the net about this. I just added an answer to a question about the zip format's external file attribute. As you can see from my answer, I conclude that only the second byte (of 4 bytes) is actually used for Unix. Apparently this contains enough information when unzipping to deduce whether the object is a file or a directory, and also has space for other permission and attribute information. My question is, how does this map to the usual Unix permissions? Do the usual Unix permissions (e.g. below) that ls gives fit into exactly one byte, and if so, can someone describe the layout or give a reference, please?

$ ls -la
total 36
drwxr-xr-x   3 faheem faheem  4096 Jun 10 01:11 .
drwxrwxrwt 136 root   root   28672 Jun 10 01:07 ..
-rw-r--r--   1 faheem faheem     0 Jun 10 01:07 a
drwxr-xr-x   2 faheem faheem  4096 Jun 10 01:07 b
lrwxrwxrwx   1 faheem faheem     1 Jun 10 01:11 c -> b

Let me make this more concrete by asking a specific question. Per the Trac patch quoted in my answer above, you can create a zip file with the snippet of Python below.

The 040755 << 16L value corresponds to the creation of an empty directory with the permissions drwxr-xr-x. (I tested it). I recognize 0755 corresponds to the rwxr-xr-x pattern, but what about the 04, and how does the whole value correspond to a byte? I also recognize << 16L corresponds to a bitwise left shift of 16 places, which would make it end up as the second from top byte.

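def makezip1():
    import zipfile
    # Python 3 spelling: octal literals need the 0o prefix and the L suffix
    # no longer exists, so 040755 << 16L is written 0o40755 << 16.
    with zipfile.ZipFile("foo.zip", mode="w") as z:
        zfi = zipfile.ZipInfo("foo/empty/")
        zfi.external_attr = 0o40755 << 16  # permissions drwxr-xr-x
        z.writestr(zfi, "")
        print(z.namelist())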
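
To check what actually ends up in the archive, the entry can be read back and its high 16 bits decoded. The following is only a sketch in Python 3: it writes to an in-memory buffer instead of foo.zip, and it sets create_system explicitly, which the original snippet does not do.

```python
import io
import stat
import zipfile

buf = io.BytesIO()  # in-memory stand-in for foo.zip (an assumption of this sketch)

with zipfile.ZipFile(buf, mode="w") as z:
    zfi = zipfile.ZipInfo("foo/empty/")
    zfi.create_system = 3              # 3 means "Unix" in the zip format
    zfi.external_attr = 0o40755 << 16  # drwxr-xr-x in the high 16 bits
    z.writestr(zfi, "")

# Reopen the same buffer for reading and fetch the entry's metadata.
with zipfile.ZipFile(buf) as z:
    info = z.getinfo("foo/empty/")

# The Unix mode lives in the top 16 bits of the 32-bit field.
mode = (info.external_attr >> 16) & 0xFFFF
print(oct(mode))
print(stat.filemode(mode))
```

stat.filemode renders the mode the same way ls -l does, which makes it easy to compare against the listing above.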

EDIT: On rereading this, I think that my conclusion that the Unix permissions only correspond to one byte may be incorrect, but I'll let the above stand for the present, since I'm not sure what the correct answer is.

EDIT2: I was indeed incorrect about the Unix values only corresponding to 1 byte. As @Random832 explained, it uses both of the top two bytes. Per @Random832's answer, we can construct the desired 040755 value from the tables he gives below. Namely:

__S_IFDIR + S_IRUSR + S_IWUSR + S_IXUSR + S_IRGRP + S_IXGRP + S_IROTH + S_IXOTH
0040000   + 0400    + 0200    + 0100    + 0040    + 0010    + 0004    + 0001
= 40755 

The addition here is in base 8.
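
The same sum can be reproduced with the constants in Python's stat module. This is a minimal Python 3 sketch; it assumes a platform where these constants have the traditional Unix values, and it uses bitwise OR, which gives the same result as addition here because no two flags share a bit.

```python
import stat

# Combine the directory type flag with rwxr-xr-x permissions.
mode = (stat.S_IFDIR
        | stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR
        | stat.S_IRGRP | stat.S_IXGRP
        | stat.S_IROTH | stat.S_IXOTH)

print(oct(mode))            # 0o40755
print(stat.filemode(mode))  # drwxr-xr-x

# Shift into the high 16 bits, as in the zip snippet above.
external_attr = mode << 16
print(hex(external_attr))   # 0x41ed0000
```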

Best Answer

0040000 is the traditional value of S_IFDIR, the file type flag representing a directory. The type uses the top 4 bits of the 16-bit st_mode value; 0100000 is the value for regular files.

The high 16 bits of the external file attributes seem to be used for OS-specific permissions. The Unix values are the same as on traditional Unix implementations. Other OSes use other values. Information about the formats used in a variety of different OSes can be found in the Info-ZIP source code (download it, or on Debian run apt-get source zip or apt-get source unzip). The relevant files are zipinfo.c in unzip, and the platform-specific files in zip.

These are conventionally defined in octal (base 8); this is represented in C and Python 2 by prefixing the number with a 0. Python 3 requires the prefix 0o instead, e.g. 0o40755.

These values can all be found in <sys/stat.h> (for example, the 4.4BSD version). They are not in the POSIX standard (which defines test macros instead), but originate from AT&T Unix and BSD. In GNU libc / Linux, the values themselves are defined as __S_IFDIR etc. in bits/stat.h, though the kernel header might be easier to read. The values are the same pretty much everywhere.

#define S_IFIFO  0010000  /* named pipe (fifo) */
#define S_IFCHR  0020000  /* character special */
#define S_IFDIR  0040000  /* directory */
#define S_IFBLK  0060000  /* block special */
#define S_IFREG  0100000  /* regular */
#define S_IFLNK  0120000  /* symbolic link */
#define S_IFSOCK 0140000  /* socket */
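
As an illustration of how the type is read from those top 4 bits, here is a small Python 3 sketch. The helper name file_type and the labels it returns are made up for this example; stat.S_IFMT and the S_IF* constants are from the standard stat module, and the sketch assumes they carry the values listed above.

```python
import stat

def file_type(mode):
    """Return a short label for the file type encoded in a Unix mode value."""
    names = {
        stat.S_IFIFO: "fifo",
        stat.S_IFCHR: "character special",
        stat.S_IFDIR: "directory",
        stat.S_IFBLK: "block special",
        stat.S_IFREG: "regular",
        stat.S_IFLNK: "symlink",
        stat.S_IFSOCK: "socket",
    }
    # S_IFMT masks off everything except the file type bits.
    return names.get(stat.S_IFMT(mode), "unknown")

print(file_type(0o040755))  # directory
print(file_type(0o100644))  # regular
print(file_type(0o120777))  # symlink
```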

And of course, the other 12 bits are for the permissions and setuid/setgid/sticky bits, the same as for chmod:

#define S_ISUID 0004000 /* set user id on execution */
#define S_ISGID 0002000 /* set group id on execution */
#define S_ISTXT 0001000 /* sticky bit */
#define S_IRWXU 0000700 /* RWX mask for owner */
#define S_IRUSR 0000400 /* R for owner */
#define S_IWUSR 0000200 /* W for owner */
#define S_IXUSR 0000100 /* X for owner */
#define S_IRWXG 0000070 /* RWX mask for group */
#define S_IRGRP 0000040 /* R for group */
#define S_IWGRP 0000020 /* W for group */
#define S_IXGRP 0000010 /* X for group */
#define S_IRWXO 0000007 /* RWX mask for other */
#define S_IROTH 0000004 /* R for other */
#define S_IWOTH 0000002 /* W for other */
#define S_IXOTH 0000001 /* X for other */
#define S_ISVTX 0001000 /* save swapped text even after use */
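
A rough Python 3 illustration of these 12 bits, using the standard stat module. The two mode values are examples picked for this sketch: a setuid executable, and a sticky world-writable directory like the ".." entry in the ls listing from the question.

```python
import stat

setuid_exe = 0o104755  # regular file, setuid, rwxr-xr-x
sticky_dir = 0o041777  # directory, sticky bit, rwxrwxrwx

# S_IMODE keeps only the low 12 bits (permissions plus setuid/setgid/sticky).
print(oct(stat.S_IMODE(setuid_exe)))  # 0o4755
print(oct(stat.S_IMODE(sticky_dir)))  # 0o1777

# Individual flags can be tested with a bitwise AND.
print(bool(setuid_exe & stat.S_ISUID))  # True
print(bool(sticky_dir & stat.S_ISVTX))  # True

# filemode shows the special bits the way ls -l does.
print(stat.filemode(setuid_exe))  # -rwsr-xr-x
print(stat.filemode(sticky_dir))  # drwxrwxrwt
```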

As a historical note, the reason 0100000 is for regular files instead of 0 is that in very early versions of unix, 0 was for 'small' files (these did not use indirect blocks in the filesystem) and the high bit of the mode flag was set for 'large' files which would use indirect blocks. The other two types using this bit were added in later unix-derived OSes, after the filesystem had changed.

So, to wrap up, the overall layout of the external file attributes field for Unix is

TTTTsstrwxrwxrwx0000000000ADVSHR
^^^^____________________________ file type as explained above
    ^^^_________________________ setuid, setgid, sticky
       ^^^^^^^^^________________ permissions
                ^^^^^^^^________ This is the "lower-middle byte" your post mentions
                        ^^^^^^^^ DOS attribute bits
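
The layout can be taken apart with a couple of shifts and masks. This Python 3 sketch uses a made-up helper name, split_external_attr, and an example value that combines the directory mode from the question with the DOS directory bit (the D in ADVSHR, 0x10). Whether a given zip tool sets that DOS bit is an assumption here, not something the diagram guarantees.

```python
import stat

def split_external_attr(external_attr):
    """Split the 32-bit field into (unix_mode, dos_attrs)."""
    unix_mode = (external_attr >> 16) & 0xFFFF  # file type + permission bits
    dos_attrs = external_attr & 0xFF            # low byte: 00ADVSHR
    return unix_mode, dos_attrs

# Example value: drwxr-xr-x in the high half, DOS directory bit in the low byte.
attr = (0o40755 << 16) | 0x10
unix_mode, dos_attrs = split_external_attr(attr)

print(oct(unix_mode))              # 0o40755
print(stat.S_ISDIR(unix_mode))     # True
print(stat.filemode(unix_mode))    # drwxr-xr-x
print(format(dos_attrs, "08b"))    # 00010000
```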