Shell Script – How to Check Files for Conventions

character-encoding, files, shell-script, utilities

I need a script that checks all text files in a project and lists those that violate the project's conventions. By conventions I mean, for example:

  • UTF-8 encoding
  • No trailing white spaces
  • Newline at the end of file
  • No non-ASCII chars
  • LF for line endings

I do not want to reinvent the wheel. Maybe there is already a tool that does this. Do you know of one?

Best Answer

Detecting

  • UTF-8 encoding: file will usually give you the encoding:

    file --brief --mime-encoding myfile.txt

    Note that it may report either 'us-ascii' or 'utf-8', depending on whether the file actually contains any non-ASCII UTF-8 characters, so you'll need to accept both.

For most of the following points you can pipe the output into wc -l (to count the number of lines of output) and check whether the count is 0 or not. Alternatively, the commands usually have an exit status of 0 if they found something, or 1 if not (in which case your requirement is fulfilled):
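As a sketch of the exit-status approach (the helper name and file paths here are just examples, not part of any tool):

```shell
# Hypothetical helper: exit status 0 means a violation was found.
# grep -q prints nothing and only sets the exit status.
has_trailing_ws() {
    grep -q '[[:space:]]$' "$1"
}

printf 'clean line\n'  > /tmp/clean.txt
printf 'dirty line \n' > /tmp/dirty.txt

has_trailing_ws /tmp/clean.txt && echo "clean.txt: trailing whitespace"
has_trailing_ws /tmp/dirty.txt && echo "dirty.txt: trailing whitespace"
# prints: dirty.txt: trailing whitespace
```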

  • No trailing white spaces: That's a job for grep, I guess. Note that \s and \+ are GNU extensions; with a strictly POSIX grep, use '[[:space:]]\{1,\}$' instead:

    grep -e '\s\+$' myfile.txt
  • Newline at the end of file: If the last character according to hexdump or xxd is 0a, there is a newline, and it's fine:

    xxd -ps myfile.txt | tail -n 1 | grep '0a$'

    (note that, unlike for the other points denoted here, you want this to find something)

  • No non-ASCII chars: This is the same as "UTF-8 encoding", except maybe a little more strict.

    If you really want to be sure there are only ASCII characters in a file (see @Anthon's answer), you'll probably need something like

    xxd -g1 myfile.txt | cut -c 10-56 | grep '[a-f89][a-f0-9]'

    This searches for any characters outside the ASCII range (0x00-0x7F). It's not very elegant, though.

  • LF for line endings: file without any options will tell you something like

    ASCII text, with CRLF line terminators

    For a script, something like the following should do (it looks for 0d, the carriage-return byte, in the hex dump):

    xxd -g1 myfile.txt | cut -c 10-56 | grep '0d'
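Taken together, the detection commands above can be wrapped into a small checker. This is only a sketch: the function name, the message format, and the us-ascii/utf-8 whitelist are my assumptions.

```shell
#!/bin/sh
# Sketch of a checker built from the commands above.
# Prints one line per violated convention; prints nothing for a clean file.
check_file() {
    f=$1
    enc=$(file --brief --mime-encoding "$f")
    case $enc in
        us-ascii|utf-8) ;;   # both count as UTF-8, as noted above
        *) echo "$f: encoding is $enc, not UTF-8" ;;
    esac
    grep -q '[[:space:]]$' "$f" && echo "$f: trailing whitespace"
    xxd -ps "$f" | tail -n 1 | grep -q '0a$' || echo "$f: no newline at end of file"
    xxd -g1 "$f" | cut -c 10-56 | grep -q '[a-f89][a-f0-9]' && echo "$f: non-ASCII bytes"
    xxd -g1 "$f" | cut -c 10-56 | grep -q '0d' && echo "$f: CR line endings"
}

for f in "$@"; do
    check_file "$f"
done
```

Run it as e.g. `sh check.sh *.txt`; a clean tree produces no output, which makes it easy to use in a pre-commit hook or CI step.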

Fixing

  • UTF-8 encoding: There is iconv(1). Essentially it takes a "from" encoding (-f), a "to" encoding (-t), and the file. The "to" encoding will be utf-8, whereas the "from" encoding can be obtained using file as described at the top of my post. Note that iconv writes the result to standard output, so redirect it to a new file:

    file_encoding="$(file --brief --mime-encoding myfile.txt)"
    iconv -f "$file_encoding" -t 'utf-8' myfile.txt > myfile.converted.txt
  • No trailing white spaces: That's a job for sed, although I prefer the POSIX way (i.e. no -i switch), which means using printf + ex. The regex will look something like s/[[:space:]]*$// (the \s\+ shorthand is a Vim/GNU extension and not strictly POSIX). For the POSIX-compliant way:

    printf '%%s/[[:space:]]*$//\nwq\n' | ex -s myfile.txt

    For the non-POSIX-compliant way:

    sed -i 's/[[:space:]]*$//' myfile.txt
  • Newline at the end of file: Unix applications usually append a missing newline at the end of a file when they save it. Exploiting that is a bit of a hack:

    printf "wq\n" | ex -s myfile.txt

    (this will actually just open, save, quit)

  • No non-ASCII chars: See "UTF-8 encoding" above.

  • LF for line endings: There is dos2unix(1). It should do exactly what you need.
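Likewise, the fixes can be combined into one in-place function. Again only a sketch: the function name and the .tmp suffix are my assumptions, and tr -d '\r' stands in for dos2unix in case the latter is not installed.

```shell
#!/bin/sh
# Sketch of an in-place fixer built from the commands above.
fix_file() {
    f=$1
    # Re-encode to UTF-8; iconv writes to stdout, hence the temporary file.
    enc=$(file --brief --mime-encoding "$f")
    iconv -f "$enc" -t 'utf-8' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    # CRLF -> LF (stand-in for dos2unix, which does the same job).
    tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    # Strip trailing whitespace and let ex append the final newline on save.
    printf '%%s/[[:space:]]*$//\nwq\n' | ex -s "$f"
}

printf 'some text \r\nlast line' > /tmp/sample.txt
fix_file /tmp/sample.txt
```

After the run, /tmp/sample.txt should contain "some text" and "last line" with LF endings, no trailing whitespace, and a newline at the end of the file.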