How to count the number of lines in a UTF-16LE/CR-LF/BOM file

The immediate thought is wc, but the next, not-so-immediate thought is: does *nix's wc -l only count *nix line endings (\x0a)? It seems so.

I've semi-wangled my way around it, but I feel there may/must be a simpler way than working on a hex-dump of the original.

Here is my version, but there is still a mysterious discrepancy in the tallies. wc reports 1 more 0a than the sum of this script's CRLF + 0a.

 file="nagaricb.nag"
 echo Report on CR and LF in UTF-16LE/CR-LF
 echo =====================================
 cat "$file" | # a useless comment, courtesy of cat 
   xxd -p -c 2 |
     sed -nr '
       /0a../{
           /0a00/!{
               i ‾‾`0a:   embedded in non-newline chars       
               b
           }
       }
       /0d../{
           /0d00/!{
               i ‾‾`0d:   embedded in non-newline chars       
               b
           }
       }
       /0a00/{
           i ‾‾`LF: found stray 0a00       
           b
        }
       /0d00/{
           N
           /0d00\n0a00/{
               i ‾‾`CRLF: found as normal newline pairs
               b
           }
           i ‾‾`CR: found stray 0d00
           # re-examine the line pulled in by N instead of dropping it
           D
        }' |
         sort |
           uniq -c
 echo "  ====="
 printf '  %s ‾‾`wc\n' $(<"$file" wc -l)

Output

Report on CR and LF in UTF-16LE/CR-LF
=====================================
    125 ‾‾`0a:   embedded in non-newline chars       
    407 ‾‾`0d:   embedded in non-newline chars       
  31826 ‾‾`CRLF: found as normal newline pairs
  =====
  31952 ‾‾`wc
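
The tallies above are off by one: 125 + 31826 = 31951, against 31952 from wc. One possible cause, not confirmed for this file, is a 0a byte sitting in the second (high) byte of a UTF-16LE code unit, for example a character in the U+0A00 block. The sed patterns only test the first byte of each pair, while wc -l counts every 0a byte. A minimal sketch to check for that, assuming a POSIX od and awk; the function name count_high_byte_lf is made up for illustration:

```shell
# Hypothetical helper: count 0a bytes in the second (high) byte of each
# 2-byte unit, which the patterns /0a../ and /0a00/ never look at.
count_high_byte_lf() {
    # od prints one hex field per byte; n is the 1-based byte index,
    # so even n means the high byte of a little-endian code unit.
    od -An -v -tx1 < "$1" |
        awk '{ for (i = 1; i <= NF; i++) { n++; if (n % 2 == 0 && $i == "0a") c++ } }
             END { print c + 0 }'
}
```

Called as count_high_byte_lf "$file", a result of 1 would account for the extra line that wc reports.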

Is there some more standard/simple way to do this?

 #! /usr/bin/env perl
 use strict;
 use warnings;
 while (my $file = shift @ARGV) {
     my $fh;
     if (!open($fh, '<:encoding(UTF-16)', $file)) {
         print STDERR "Failed to open [$file]: $!\n";
         next;
     }
     my $count = 0;
     $count++ while (<$fh>);
     print "$file: $count\n";
     close $fh;
 }
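
For a shell-only route, one common approach is to decode the file before counting, so that wc only ever sees real line feeds. A minimal sketch, assuming an iconv that accepts the encoding name UTF-16 and uses the BOM to pick the byte order (glibc's does); count_utf16_lines is a hypothetical name:

```shell
# Hypothetical helper: decode UTF-16 to UTF-8, then count line feeds.
count_utf16_lines() {
    # After decoding, 0a appears only as a genuine LF, never as half
    # of another character, so wc -l is safe to use.
    iconv -f UTF-16 -t UTF-8 < "$1" | wc -l
}
```

Every CR-LF pair still ends in a line feed after decoding, so the CRs do not affect the count. A final line with no terminating newline is not counted by wc -l, whereas the Perl loop above would count it.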
