I need a script that checks all text files against project conventions and lists the offending ones. By conventions I mean, for example:
- UTF-8 encoding
- No trailing whitespace
- Newline at the end of file
- No non-ASCII characters
- LF for line endings
I do not want to reinvent the wheel. Maybe there is a tool that already does this. Do you know of one?
Best Answer
Detecting
UTF-8 encoding: The file command will usually give you the encoding. Note that it may report either 'us-ascii' or 'utf-8', depending on whether it finds any non-ASCII characters, so you'll need to accept both.
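
A minimal sketch of this check, assuming a file(1) implementation that supports the -b (brief) and --mime-encoding options. The exact command from the original answer is not preserved here, and the function name is made up for illustration:

```shell
# Succeed if file(1) reports the encoding as us-ascii or utf-8.
# Assumption: file(1) understands -b and --mime-encoding.
is_ascii_or_utf8() {
    case "$(file -b --mime-encoding -- "$1")" in
        us-ascii|utf-8) return 0 ;;
        *) return 1 ;;
    esac
}
```

Something like `is_ascii_or_utf8 README.txt || echo "README.txt: unexpected encoding"` would then report a violation.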
The following checks will mostly require you to pipe the output into wc -l (to count the number of output lines) and check whether it's 0 or not. Alternatively, the commands should usually return 0 if they found something, or 1 if not (in which case your requirements are fulfilled).
No trailing whitespace: That's a job for grep, I guess.
Newline at the end of file: If the last character according to hexdump or xxd is 0a, there is a newline, and it's fine. (Note that, unlike for the other points listed here, you want this check to find something.)
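
The two checks above could be sketched as follows. The function names are hypothetical, and the second one uses tail and od instead of hexdump or xxd, only because both are specified by POSIX. Treat it as one possible way to do it:

```shell
# Succeed (exit 0) if at least one line ends in a space or tab.
# A carriage return is not counted here; that belongs to the line-ending check.
has_trailing_whitespace() {
    grep -q '[[:blank:]]$' -- "$1"
}

# Succeed if the last byte of the file is a newline (hex 0a).
# An empty file has no last byte, so it fails this check.
ends_with_newline() {
    last_byte=$(tail -c 1 -- "$1" | od -An -tx1 | tr -d ' \n')
    [ "$last_byte" = "0a" ]
}
```

Replacing -q with -n in the grep call prints the offending lines with their line numbers, which is handy for a report.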
No non-ASCII characters: This is the same as "UTF-8 encoding", except maybe a little more strict.
If you really want to be sure there are only ASCII characters in a file (see @Anthon's answer), you'll probably need to search for any characters outside the ASCII range (0x00-0x7F). It's not very elegant, though.
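
As a sketch of such a search: the command from the original answer is not preserved here, so the following is an assumption. It deletes every ASCII byte with tr and checks whether anything is left. The function name is hypothetical:

```shell
# Succeed (exit 0) if the file contains at least one byte outside 0x00-0x7F.
# tr removes all ASCII bytes (octal 000-177); whatever remains is non-ASCII.
has_non_ascii() {
    remaining=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' ')
    [ "$remaining" -gt 0 ]
}
```

With a GNU grep built with PCRE support, `grep -nP '[^\x00-\x7F]' file` is a common way to also list the offending lines.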
LF for line endings: The file command without any options will tell you when a file uses CRLF line terminators. For a script, probably something like a grep for carriage returns could do.
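
A possible script-friendly version of this check, written under the assumption that any carriage return in a text file counts as a violation (the function name is invented for this example):

```shell
# Succeed (exit 0) if the file contains a carriage return (CR, 0x0d) anywhere.
# Command substitution only strips trailing newlines, so the CR survives.
has_carriage_return() {
    cr=$(printf '\r')
    grep -q "$cr" -- "$1"
}
```

This also flags lone CR characters, not just CRLF pairs, which seems reasonable when the convention is "LF only".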
Fixing
UTF-8 encoding: There is iconv(1). Essentially it takes a "from" encoding (-f), a "to" encoding (-t), and the file. The "to" encoding is probably utf-8, whereas the "from" encoding can be obtained using file as described at the top of my post.
No trailing whitespace: That's a job for sed, although I prefer the POSIX way (i.e. no -i switch), which means using printf + ex. Either way, the regex will look something like s/\s\+$//g. The POSIX-compliant way pipes the editing commands from printf into ex; the non-POSIX-compliant way uses sed -i.
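
Both ways of stripping trailing whitespace might look roughly like this. The original commands are not preserved here, so the function names and the exact patterns are assumptions. The ex variant uses [[:space:]]*$ so that the substitution matches on every line and never fails with "pattern not found":

```shell
# POSIX-style variant: printf feeds the editing commands to ex.
# %s/...// substitutes on every line, wq writes the file and quits.
strip_trailing_ws_ex() {
    printf '%s\n' '%s/[[:space:]]*$//' 'wq' | ex -s "$1"
}

# Non-POSIX variant: GNU sed edits in place with -i.
# \s and \+ are GNU extensions as well.
strip_trailing_ws_sed() {
    sed -i 's/\s\+$//' -- "$1"
}
```

Both functions modify the file in place, so it is worth trying them on a copy first.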
Newline at the end of file: Unix applications usually append a missing newline at the end of a file when they save it. To exploit that, as a bit of a hack, you can have an editor open the file, save it, and quit.
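
If you would rather not depend on an editor's save behaviour, a different technique is to look at the last byte yourself and append a newline only when it is missing. This is a sketch with a made-up function name, not the command from the original answer:

```shell
# Append a newline if the file is non-empty and does not already end in one.
# Command substitution strips a trailing newline, so the result is non-empty
# only when the last byte is something other than a newline.
ensure_final_newline() {
    if [ -n "$(tail -c 1 -- "$1")" ]; then
        printf '\n' >> "$1"
    fi
}
```

Running it twice on the same file is harmless, since the second call finds the newline already in place.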
No non-ASCII characters: See "UTF-8 encoding" above.
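
The iconv conversion mentioned under "UTF-8 encoding" could be wrapped like this. It is a sketch only: the function name is hypothetical, the source encoding is passed in by the caller, and the result goes through a temporary file created with mktemp (not POSIX, but widely available) because iconv does not edit in place:

```shell
# Usage: convert_to_utf8 FROM_ENCODING FILE
# Converts FILE to UTF-8; the file is left untouched if iconv fails.
convert_to_utf8() {
    conv_tmp=$(mktemp) || return 1
    if iconv -f "$1" -t utf-8 < "$2" > "$conv_tmp"; then
        cat "$conv_tmp" > "$2"   # overwrite contents, keep permissions
        conv_status=0
    else
        conv_status=1
    fi
    rm -f "$conv_tmp"
    return "$conv_status"
}
```

For example, `convert_to_utf8 iso-8859-1 notes.txt`, with the first argument taken from whatever file reported for that file.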
LF for line endings: There is dos2unix(1). It should do exactly what you need.
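
A small wrapper, sketched under the assumption that dos2unix may not be installed everywhere. The function name and the tr fallback are additions for illustration, not part of the original answer:

```shell
# Convert CRLF line endings to LF in place.
# Uses dos2unix when available, otherwise deletes every CR with tr.
crlf_to_lf() {
    if command -v dos2unix >/dev/null 2>&1; then
        dos2unix -q -- "$1"
    else
        lf_tmp=$(mktemp) || return 1
        tr -d '\r' < "$1" > "$lf_tmp" && cat "$lf_tmp" > "$1"
        lf_status=$?
        rm -f "$lf_tmp"
        return "$lf_status"
    fi
}
```

Note that the fallback removes every carriage return, including stray ones in the middle of a line.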