Shell Script – How to Check Files for Conventions

character-encoding, files, shell-script, utilities

I need a script that checks all text files in a project and lists those that violate the project's conventions. By conventions I mean, for example:

  • UTF-8 encoding
  • No trailing white spaces
  • Newline at the end of file
  • No non-ASCII chars
  • LF for line endings

I do not want to reinvent the wheel. Maybe there is already a tool that does this. Do you know of one?

Best Answer

Detecting

  • UTF-8 encoding: file will usually give you the encoding:

    file --brief --mime-encoding myfile.txt

    Note that it may report either 'us-ascii' or 'utf-8', depending on whether the file actually contains any non-ASCII UTF-8 characters, so you'll need to accept both.

For most of the following points you can pipe the output into wc -l (to count the number of lines of output) and check whether the count is 0 or not. Alternatively, the commands usually have an exit status of 0 if they found something, or 1 if not (in which case your requirement is fulfilled):
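As a sketch of the exit-status approach (the helper name and file paths here are just examples, not part of any tool):

```shell
# Hypothetical helper: exit status 0 means a violation was found.
# grep -q prints nothing and only sets the exit status.
has_trailing_ws() {
    grep -q '[[:space:]]$' "$1"
}

printf 'clean line\n'  > /tmp/clean.txt
printf 'dirty line \n' > /tmp/dirty.txt

has_trailing_ws /tmp/clean.txt && echo "clean.txt: trailing whitespace"
has_trailing_ws /tmp/dirty.txt && echo "dirty.txt: trailing whitespace"
# prints: dirty.txt: trailing whitespace
```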

  • No trailing white spaces: That's a job for grep, I guess. Note that \s and \+ are GNU extensions; with a strictly POSIX grep, use '[[:space:]]\{1,\}$' instead:

    grep -e '\s\+$' myfile.txt
  • Newline at the end of file: If the last character according to hexdump or xxd is 0a, there is a newline, and it's fine:

    xxd -ps myfile.txt | tail -n 1 | grep '0a$'

    (note that, unlike for the other points denoted here, you want this to find something)

  • No non-ASCII chars: This is the same as "UTF-8 encoding", except maybe a little more strict.

    If you really want to be sure there are only ASCII characters in a file (see @Anthon's answer), you'll probably need something like

    xxd -g1 myfile.txt | cut -c 10-56 | grep '[a-f89][a-f0-9]'

    This searches for any characters outside the ASCII range (0x00-0x7F). It's not very elegant, though.

  • LF for line endings: file without any options will tell you something like

    ASCII text, with CRLF line terminators

    For a script, something like the following should do (it looks for 0d, the carriage-return byte, in the hex dump):

    xxd -g1 myfile.txt | cut -c 10-56 | grep '0d'
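Taken together, the detection commands above can be wrapped into a small checker. This is only a sketch: the function name, the message format, and the us-ascii/utf-8 whitelist are my assumptions.

```shell
#!/bin/sh
# Sketch of a checker built from the commands above.
# Prints one line per violated convention; prints nothing for a clean file.
check_file() {
    f=$1
    enc=$(file --brief --mime-encoding "$f")
    case $enc in
        us-ascii|utf-8) ;;   # both count as UTF-8, as noted above
        *) echo "$f: encoding is $enc, not UTF-8" ;;
    esac
    grep -q '[[:space:]]$' "$f" && echo "$f: trailing whitespace"
    xxd -ps "$f" | tail -n 1 | grep -q '0a$' || echo "$f: no newline at end of file"
    xxd -g1 "$f" | cut -c 10-56 | grep -q '[a-f89][a-f0-9]' && echo "$f: non-ASCII bytes"
    xxd -g1 "$f" | cut -c 10-56 | grep -q '0d' && echo "$f: CR line endings"
}

for f in "$@"; do
    check_file "$f"
done
```

Run it as e.g. `sh check.sh *.txt`; a clean tree produces no output, which makes it easy to use in a pre-commit hook or CI step.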

Fixing

  • UTF-8 encoding: There is iconv(1). Essentially it takes a "from" encoding (-f), a "to" encoding (-t), and the file. The "to" encoding will be utf-8, whereas the "from" encoding can be obtained using file as described at the top of my post. Note that iconv writes the result to standard output, so redirect it to a new file:

    file_encoding="$(file --brief --mime-encoding myfile.txt)"
    iconv -f "$file_encoding" -t 'utf-8' myfile.txt > myfile.converted.txt
  • No trailing white spaces: That's a job for sed, although I prefer the POSIX way (i.e. no -i switch), which means using printf + ex. The regex will look something like s/[[:space:]]*$// (the \s\+ shorthand is a Vim/GNU extension and not strictly POSIX). For the POSIX-compliant way:

    printf '%%s/[[:space:]]*$//\nwq\n' | ex -s myfile.txt

    For the non-POSIX-compliant way:

    sed -i 's/[[:space:]]*$//' myfile.txt
  • Newline at the end of file: Unix applications usually append a missing newline at the end of a file when they save it. Exploiting that is a bit of a hack:

    printf "wq\n" | ex -s myfile.txt

    (this will actually just open, save, quit)

  • No non-ASCII chars: See "UTF-8 encoding" above.

  • LF for line endings: There is dos2unix(1). It should do exactly what you need.
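Likewise, the fixes can be combined into one in-place function. Again only a sketch: the function name and the .tmp suffix are my assumptions, and tr -d '\r' stands in for dos2unix in case the latter is not installed.

```shell
#!/bin/sh
# Sketch of an in-place fixer built from the commands above.
fix_file() {
    f=$1
    # Re-encode to UTF-8; iconv writes to stdout, hence the temporary file.
    enc=$(file --brief --mime-encoding "$f")
    iconv -f "$enc" -t 'utf-8' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    # CRLF -> LF (stand-in for dos2unix, which does the same job).
    tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    # Strip trailing whitespace and let ex append the final newline on save.
    printf '%%s/[[:space:]]*$//\nwq\n' | ex -s "$f"
}

printf 'some text \r\nlast line' > /tmp/sample.txt
fix_file /tmp/sample.txt
```

After the run, /tmp/sample.txt should contain "some text" and "last line" with LF endings, no trailing whitespace, and a newline at the end of the file.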