The immediate thought is wc, but then the next, not-so-immediate thought is: is *nix's wc purely for *nix line endings (\x0a)? It seems so.
I've semi-wangled my way around it, but I feel there must be a simpler way than working on a hex dump of the original.
Here is my version, but there is still a mysterious discrepancy in the tallies: wc reports one more 0a than the sum of this script's CRLF and embedded-0a counts.
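For background, wc -l simply counts 0x0a bytes, which is why it half-works on UTF-16LE: every LF contributes a 0x0a low byte, but so does any other code unit whose low byte happens to be 0x0a (e.g. U+010A). A hand-built sample (the file name sample.bin is just for illustration) shows the byte-counting behaviour:

```shell
# Encode "one\r\n" as UTF-16LE by hand: each ASCII byte followed by 0x00.
printf 'o\0n\0e\0\r\0\n\0' > sample.bin
wc -l < sample.bin    # one 0x0a byte in the stream, so wc -l reports 1
```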
file="nagaricb.nag"
echo Report on CR and LF in UTF-16LE/CR-LF
echo =====================================
cat "$file" | # a useless comment, courtesy of cat
xxd -p -c 2 |
sed -nr '
/0a../{
/0a00/!{
i 0a: embedded in non-newline chars
b
}
}
/0d../{
/0d00/!{
i 0d: embedded in non-newline chars
b
}
}
/0a00/{
i LF: found stray 0a00
b
}
/0d00/{
N
/0d00\n0a00/{
i CRLF: found as normal newline pairs
b
}
i CR: found stray 0d00
}' |
sort |
uniq -c
echo " ====="
printf ' %s wc\n' $(<"$file" wc -l)
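As a side note on how the pipeline works: xxd -p -c 2 emits one two-byte UTF-16LE code unit per line, which is what lets the sed script test each code unit in isolation. For example:

```shell
# "o" CR LF in UTF-16LE, dumped one code unit per line
printf 'o\0\r\0\n\0' | xxd -p -c 2
# 6f00
# 0d00
# 0a00
```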
Output
Report on CR and LF in UTF-16LE/CR-LF
=====================================
125 0a: embedded in non-newline chars
407 0d: embedded in non-newline chars
31826 CRLF: found as normal newline pairs
=====
31952 wc
Is there some more standard/simple way to do this?
Best Answer
Here's a Perl script that opens the files given as command-line arguments in UTF-16 (endianness detected via the BOM) and counts the lines.
(It dies if the BOM is not understood.)
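The script itself isn't reproduced here; as a rough sketch of the same idea (sniff the BOM, decode, then count newlines in the decoded text) — in Python rather than Perl, with made-up helper names:

```python
import sys

# Map the two possible UTF-16 BOMs to the matching codec name.
BOMS = {b"\xff\xfe": "utf-16-le", b"\xfe\xff": "utf-16-be"}

def count_lines_utf16(path):
    with open(path, "rb") as fh:
        data = fh.read()
    codec = BOMS.get(data[:2])
    if codec is None:
        # Mirror the "dies if the BOM is not understood" behaviour.
        raise SystemExit(f"{path}: BOM not understood")
    text = data[2:].decode(codec)
    # Each CRLF contains exactly one LF, so counting "\n" in the decoded
    # text matches wc -l semantics on a single-byte-encoded file.
    return text.count("\n")

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(count_lines_utf16(name), name)
```

Counting in the decoded text is what sidesteps the false positives the sed pipeline has to filter out: a 0x0a byte inside a non-newline code unit never becomes a "\n" character.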