Opendir and readdir encoding strings behind the back

character-encoding directory-structure

(You can skip the details to the last couple of lines if you're able to answer the question 🙂 )

I'm on Ubuntu 12.04. I'm trying to resolve an old issue that I've posted about in the past (if you're curious: https://superuser.com/questions/339877/trouble-viewing-files-with-non-english-names-on-hard-disk/339895#339895). There is a known compatibility issue involving Linux, Mac, HFS+ and Korean-named files, and I spent all day today trying to finally find some kind of workaround.

Basically, I've mounted my HFS+ drive on Linux. Normal ls and cd have trouble accessing the files, because their names are in Korean. So I wrote a C program to try to access these files at the lowest level, so I can be more sure that nothing is happening behind my back:

#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR* dp;
    struct dirent *ep;
    const char* parent = "/media/external/Movies";
    dp = opendir( parent );
    if( dp != NULL )
    {
        while( (ep = readdir(dp)) != NULL )
        {
            // inode number, name as returned, entry type (4 means directory)
            printf( "%lu %s %X\t", (unsigned long)ep->d_ino, ep->d_name, (unsigned)ep->d_type );

            // now print out the filenames in hex, one byte at a time
            for( size_t i = 0; ep->d_name[i] != '\0'; i++ )
            {
                printf( "0x%X " , ep->d_name[i] & 0xff );
            }
            printf("\n");
        }
        closedir(dp);
    }
    else
    {
        perror("Couldn't open the directory! ");
    }
    return 0;
}

Here's a sample of the output I get for this:

433949 밀양 4 0xEB 0xB0 0x80 0xEC 0x96 0x91

413680 박쥐 4 0xEB 0xB0 0x95 0xEC 0xA5 0x90

434033 박하사탕 4 0xEB 0xB0 0x95 0xED 0x95 0x98 0xEC 0x82 0xAC 0xED 0x83 0x95

So on the surface, it looks like readdir has no problem viewing the directory entries. The inode numbers are there, the entries are correctly marked as directories (4 means directory), and it appears that the filenames are stored UTF-8 encoded, since those hexadecimal values are the correct UTF-8 bytes for the Korean filenames. But now if I try to opendir one of these directories (and I'll be giving the filename in hex to be extra careful that nothing's happening behind my back):

unsigned char new_dirname[] = {'/',0xEB,0xB0,0x80,0xEC,0x96,0x91,'\0'};
char final[ strlen(parent) + sizeof(new_dirname) ]; // sizeof counts the trailing '\0'
strcpy(final, parent);
strcat(final, (const char *)new_dirname);
dp = opendir( final ); // dp == NULL here!!!

It is not able to open the directory. This befuddles me because if readdir was just reporting the raw bytes of the file name in the directory entry, and opendir was just taking my given filename and matching it against the directory entries, then I would've thought there should be no problem in finding the inode and opening the directory. This seems to suggest that readdir is not being completely honest about the filenames.

Are the file names in the directory entries reported by readdir not what's actually on disk (i.e. are they being re-encoded)? If so, is there any way that I can either control how opendir and readdir encode names, or perhaps use some other system calls that work with raw bytes instead of encoding stuff behind my back? In general, I find it very confusing at what level encoding is happening and I'd appreciate any explanations or, better yet, a reference to understand this! Thanks!

Best Answer

opendir and readdir themselves work on bytes. They do not perform any reencoding.
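
One way to test this claim is to hand the name returned by readdir straight back to the kernel, relative to the parent directory, without building a path string at all. The following is a minimal sketch rather than a definitive tool: open_child_dir is a hypothetical helper name, and it assumes a system that provides the POSIX.1-2008 calls openat and fdopendir.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the subdirectory `name` of `parent`, passing
 * `name` through byte for byte (for example ep->d_name from readdir).
 * Returns NULL with errno set on failure.
 */
DIR *open_child_dir(const char *parent, const char *name)
{
    assert(parent != NULL && name != NULL);

    int parent_fd = open(parent, O_RDONLY | O_DIRECTORY);
    if (parent_fd < 0)
        return NULL;

    /* Look the name up relative to the parent; no string concatenation. */
    int child_fd = openat(parent_fd, name, O_RDONLY | O_DIRECTORY);
    int saved_errno = errno;
    close(parent_fd);
    if (child_fd < 0) {
        errno = saved_errno;   /* close() may have overwritten errno */
        return NULL;
    }

    /* On success the DIR stream owns child_fd; closedir() releases it. */
    DIR *dp = fdopendir(child_fd);
    if (dp == NULL) {
        saved_errno = errno;
        close(child_fd);
        errno = saved_errno;
    }
    return dp;
}
```

Inside the readdir loop from the question, this would be called as open_child_dir(parent, ep->d_name), followed by perror if it returns NULL. On a filesystem that treats names as opaque bytes the call should succeed for every directory entry, so a failure on the HFS+ mount would point at the filesystem driver rather than at opendir or readdir.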

Some filesystem drivers may impose constraints on the byte sequences. For example, HFS+ normalizes file names using a proprietary Unicode normalization scheme. I would expect the form returned by readdir to work when passed to opendir, however, so like the OP in the Ubuntu forum thread that jw013 mentioned, I suspect a bug in the HFS+ driver. It is not the only program that is tripped up by Hangul on HFS+. Even OS X seems to have trouble with Unicode normalization.
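
To make the normalization point concrete, here is a small sketch of the standard Hangul syllable decomposition from the Unicode Standard, which turns one precomposed syllable into its conjoining jamo. It assumes that the on-disk HFS+ form is decomposed and that the driver converts between the two forms; that is my assumption about where the mismatch could arise, not something the output in the question proves. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul syllable decomposition. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    V_COUNT = 21, T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,  /* 588 */
    S_COUNT = 19 * N_COUNT        /* 11172 syllables */
};

/* Encode one code point in U+0800..U+FFFF as three UTF-8 bytes. */
static size_t put_utf8_3(uint32_t cp, unsigned char *out)
{
    assert(cp >= 0x0800 && cp <= 0xFFFF);
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/*
 * Hypothetical helper: decompose one precomposed Hangul syllable into
 * conjoining jamo and write them as UTF-8. `out` needs room for 9 bytes.
 * Returns the number of bytes written, or 0 if `syllable` is outside
 * U+AC00..U+D7A3.
 */
size_t hangul_decompose_utf8(uint32_t syllable, unsigned char *out)
{
    if (syllable < S_BASE || syllable >= S_BASE + S_COUNT)
        return 0;

    uint32_t s_index = syllable - S_BASE;
    uint32_t l = L_BASE + s_index / N_COUNT;             /* leading consonant */
    uint32_t v = V_BASE + (s_index % N_COUNT) / T_COUNT; /* vowel */
    uint32_t t = T_BASE + s_index % T_COUNT;             /* trailing consonant */

    size_t n = 0;
    n += put_utf8_3(l, out + n);
    n += put_utf8_3(v, out + n);
    if (t != T_BASE)            /* T_BASE itself means "no trailing consonant" */
        n += put_utf8_3(t, out + n);
    return n;
}
```

For the first syllable in the question's output, 밀 (U+BC00, bytes 0xEB 0xB0 0x80), this yields the nine bytes 0xE1 0x84 0x86 0xE1 0x85 0xB5 0xE1 0x86 0xAF. The same visible name can therefore exist as two different byte strings, and a driver that converts in one direction for readdir and in the other for lookups has to get both conversions exactly right for the round trip to work.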
