How to make `file` output line break type and encoding for all file types

character-encoding, newlines, text

I often need to check the encoding and the line break type of a file.

The `file` utility seems to be a perfect fit. Unfortunately, it doesn't output the desired information for all file types; .php files are one case where it doesn't:

file extra.module.php

extra.module.php: PHP script text

But if I rename the file to .txt and remove the PHP starting tag, it outputs both.

file extra.module.txt

extra.module.txt: UTF-8 Unicode C++ program text, with CRLF line terminators

I already looked at the manpage and I can use the -i switch to force outputting the character encoding, but there seems to be no switch for the line break type.

Is there a simple solution for this without renaming/changing the file first?

Best Answer

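Following @don_crissti's comment, it turns out that passing an empty magic file to `file` keeps it from matching the file against a specific type such as "PHP script", so it falls back to its built-in text detection, which reports both the encoding and the line terminators.

So `file -m /dev/null extra.module.php` does the trick in this case and prints the desired output: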

extra.module.php: UTF-8 Unicode C++ program text, with CRLF line terminators
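If `file` is not available, the line break type can also be checked by scanning the file's bytes directly. The following is only a minimal sketch of that idea, not a description of how `file` works internally; the function name `line_terminators` and its return strings are made up for this illustration.

```cpp
#include <cstddef>
#include <string>

// Classify the line terminators found in a byte buffer.
// Returns "CRLF", "LF", "CR", "mixed", or "none" (hypothetical labels).
std::string line_terminators(const std::string& data) {
    bool crlf = false, lf = false, cr = false;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == '\r') {
            if (i + 1 < data.size() && data[i + 1] == '\n') {
                crlf = true;
                ++i;  // skip the '\n' that belongs to this CRLF pair
            } else {
                cr = true;  // a lone carriage return
            }
        } else if (data[i] == '\n') {
            lf = true;  // a lone line feed
        }
    }
    const int kinds = crlf + lf + cr;
    if (kinds == 0) return "none";
    if (kinds > 1) return "mixed";
    return crlf ? "CRLF" : (lf ? "LF" : "CR");
}
```

To use it on a real file, read the file in binary mode into a `std::string` first, so that no newline translation happens before the scan.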

Related Solutions

Linux Filesystems – Questions About Character Encoding

1. Is it possible to configure a Linux filesystem to use a fixed character encoding to store file names, regardless of the LANG/LC_ALL environment?

No, this is not possible: as you mention in your question, a UNIX file name is just a sequence of bytes; the kernel knows nothing about the encoding, which is entirely a user-space (i.e., application-level) concept.

In other words, the kernel knows nothing about LANG/LC_*, so it cannot translate.

2. Is it possible to let different file names refer to the same file?

You can have multiple directory entries referring to the same file; you can do that with hard links or symbolic links.

Be aware, however, that file names that are not valid in the current encoding (e.g., your GBK character string when you're working in a UTF-8 locale) will display badly, if at all.

3. Is it possible to patch the kernel to translate character encoding between file-system and current environment?

You cannot patch the kernel to do this (see 1.), but you could, in theory, patch the C library (e.g., glibc) to perform this translation, so that it always converts file names to UTF-8 when it calls the kernel and converts them back to the current encoding when it reads a file name from the kernel.

A simpler approach could be to write an overlay filesystem with FUSE, that just redirects any filesystem request to another location after converting the file name to/from UTF-8. Ideally you could mount this filesystem in ~/trans, and when an access is made to ~/trans/a/GBK/encoded/path then the FUSE filesystem really accesses /a/UTF-8/encoded/path.

However, both approaches share a problem: what do you do with files that already exist on your filesystem and whose names are not UTF-8 encoded? You cannot simply pass them through untranslated, because then you don't know how to convert them; and you cannot mangle them by translating invalid byte sequences to ?, because that could create name conflicts.
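The per-name translation step that either approach would need can be sketched with POSIX `iconv()`. This is only an illustration under assumptions: the function name `convert_name` is hypothetical, and a real glibc patch or FUSE layer would need much more than this. The exception thrown on an invalid sequence marks exactly the spot where the problem described above shows up.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert a file name from one encoding to another, e.g. from "GBK"
// to "UTF-8". Hypothetical helper; throws if the conversion is not
// supported or the name is not valid in the source encoding.
std::string convert_name(const std::string& name,
                         const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1) throw std::runtime_error("unsupported conversion");

    std::string in = name;  // iconv() wants a mutable input pointer
    char* inptr = &in[0];
    std::size_t inleft = in.size();
    std::string out;
    char buf[256];

    while (inleft > 0) {
        char* outptr = buf;
        std::size_t outleft = sizeof buf;
        std::size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
        const int err = errno;  // save before anything else can change it
        out.append(buf, sizeof buf - outleft);
        // E2BIG only means the output chunk is full; loop and continue.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or incomplete byte sequence");
        }
    }
    iconv_close(cd);
    return out;
}
```

Which encoding names are accepted depends on the iconv implementation installed on the system.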

Most common encoding for strings in C++ in Linux (and Unix?)

This is just a partial answer, since your question is fairly broad.

C++ defines an "execution character set" (in fact, two of them, a narrow and a wide one).

When your source file contains something like:

char s[] = "Hello";

Then the numeric byte values of the letters in the string literal are simply looked up according to the execution encoding. (The separate wide execution encoding applies to the numeric value assigned to wide character constants such as L'a'.)

All this happens as part of the initial reading of the source code file into the compilation process. Once inside, C++ characters are nothing more than bytes, with no attached semantics. (The type name char must be one of the most grievous misnomers in C-derived languages!)

There is a partial exception in C++11, where the literals u8"", u"" and U"" determine the resulting value of the string elements (i.e., the resulting values are globally unambiguous and platform-independent), but that does not affect how the input source code is interpreted.
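A small sketch of what "platform-independent" means here: the code units of U+00E9 (é) in the three Unicode literal forms. The character is written as a universal character name so that the source file's own encoding plays no role; all identifier names are chosen just for this example.

```cpp
#include <cstddef>

// U+00E9 in the three Unicode literal forms.
const auto& utf8_e_acute = u8"\u00e9";        // UTF-8: 0xC3 0xA9 + terminator
const char16_t utf16_e_acute[] = u"\u00e9";   // UTF-16: 0x00E9 + terminator
const char32_t utf32_e_acute[] = U"\u00e9";   // UTF-32: 0x000000E9 + terminator

// Numeric value of the i-th UTF-8 code unit. The cast works whether
// u8 literals have element type char (C++11 to C++17) or char8_t (C++20).
unsigned utf8_unit(std::size_t i) {
    return static_cast<unsigned char>(utf8_e_acute[i]);
}
```

These values are the same regardless of the execution character set the compiler is configured with.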

A good compiler should allow you to specify the source code encoding, so even if your friend on an EBCDIC machine sends you her program text, that shouldn't be a problem. GCC offers the following options:

-finput-charset: input character set, i.e. how the source code file is encoded
-fexec-charset: execution character set, i.e. how to encode string literals
-fwide-exec-charset: wide execution character set, i.e. how to encode wide string literals

GCC uses iconv() for the conversions, so any encoding supported by iconv() can be used for those options.
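One way to observe the effect of -fexec-charset is to look at how many bytes an ordinary narrow literal occupies. The sketch below assumes GCC; the function name is made up for the example, and the character is again given as a universal character name so that -finput-charset does not matter.

```cpp
#include <cstddef>

// Number of bytes the compiler stored for the narrow literal "é",
// not counting the terminating '\0'. The result depends on the
// execution character set chosen at compile time:
//   default (UTF-8 in GCC)         -> 2 bytes (0xC3 0xA9)
//   -fexec-charset=ISO-8859-1      -> 1 byte  (0xE9)
std::size_t narrow_e_acute_length() {
    return sizeof("\u00e9") - 1;
}
```

Compiling the same file twice, once with and once without `-fexec-charset=ISO-8859-1`, should make the difference visible.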

I wrote previously about some opaque facilities provided by the C++ standard to handle text encodings.

Example: take the above code, char s[] = "Hello";. Suppose the source file is ASCII (i.e. the input encoding is ASCII). Then the compiler reads 99, and interprets it as c, and so on. When it comes to the literal, it reads 72 and interprets it as H. It then stores in the array the byte value of H as determined by the execution encoding (again 72 if that is ASCII or UTF-8). When you write \xFF, the compiler reads 92 120 70 70, decodes it as \xFF, and writes 255 into the array.
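The stored values from this example can be inspected directly. The sketch below assumes an ASCII-compatible execution encoding (such as ASCII or UTF-8); the helper name `byte_values` is invented for illustration.

```cpp
#include <string>
#include <vector>

// Return the numeric value of each element of a narrow string,
// i.e. the bytes the compiler stored for it.
std::vector<int> byte_values(const std::string& s) {
    std::vector<int> out;
    for (unsigned char c : s) out.push_back(c);
    return out;
}

// The literals from the example above.
const char hello[] = "Hello";  // 72 101 108 108 111 under ASCII or UTF-8
const char high[] = "\xFF";    // the escape stores the single byte 255
```

Going through `unsigned char` matters here: with a signed `char`, the byte 255 would otherwise print as -1.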
