Why is uniq -c showing duplicates on sorted input?

sed sort uniq

I am trying to count how many times I use a certain version of a library on my computer.

For some reason, uniq -c is outputting duplicates even though the input is sorted and the sort order looks correct.

Any ideas or feedback?

Thanks for your time.

With uniq -c

Input:

rg --no-line-number --no-filename -g '*.csproj' "GitVersion.MsBuild" | sed -E '/GitVersion\.MsBuild" Version/!d;s/^\s\+//g;/<!/d;s/^.+(GitVersion.MsBuild)" Version="(.+)">/\1   \2/g' | sort -n | uniq -c

Output:

      3 GitVersion.MsBuild      5.10.1
      1 GitVersion.MsBuild      5.10.1
      3 GitVersion.MsBuild      5.10.3
     11 GitVersion.MsBuild      5.11.1
      5 GitVersion.MsBuild      5.11.1
     25 GitVersion.MsBuild      5.12.0
      2 GitVersion.MsBuild      5.12.0
      1 GitVersion.MsBuild      5.6.11
      2 GitVersion.MsBuild      5.7.0
      4 GitVersion.MsBuild      5.8.1

Without uniq -c

Input:

rg --no-line-number --no-filename -g '*.csproj' "GitVersion.MsBuild" | sed -E '/GitVersion\.MsBuild" Version/!d;s/^\s\+//g;/<!/d;s/^.+(GitVersion.MsBuild)" Version="(.+)">/\1   \2/g' | sort -n

Output:

GitVersion.MsBuild      5.10.1
GitVersion.MsBuild      5.10.1
GitVersion.MsBuild      5.10.1
GitVersion.MsBuild      5.10.1
GitVersion.MsBuild      5.10.3
GitVersion.MsBuild      5.10.3
GitVersion.MsBuild      5.10.3
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.11.1
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.12.0
GitVersion.MsBuild      5.6.11
GitVersion.MsBuild      5.7.0
GitVersion.MsBuild      5.7.0
GitVersion.MsBuild      5.8.1
GitVersion.MsBuild      5.8.1
GitVersion.MsBuild      5.8.1
GitVersion.MsBuild      5.8.1

I've updated my command to pipe to xxd as per @kos's suggestion. That helped in comparing.

rg --no-line-number --no-filename -g '*.csproj' "GitVersion.MsBuild" | sed -E '/GitVersion\.MsBuild" Version/!d;s/^\s\+//g;/<!/d;s/^.+(GitVersion.MsBuild)" Version="([0-9\.]+)">/\1     \2/g' | sort -n | uniq -c | xxd

That yielded (sorry for the screenshot, but it helps having the colors).
[screenshot of the xxd output]

I then revised the regex slightly. (Sorry all, I didn't take all the suggestions on board, since one tiny tweak made it work, but I do have to say I learnt a lot from this, including how to use xxd.)

I simply added .* after the >:

rg --no-line-number --no-filename -g '*.csproj' "GitVersion.MsBuild" | sed -E '/GitVersion\.MsBuild" Version/!d;s/^\s\+//g;/<!/d;s/^.+(GitVersion.MsBuild)" Version="([0-9\.]+)">.*$/\1  \2/g' | sort | uniq -c

And it now yields the correct (or satisfactory anyway) output:

      4 GitVersion.MsBuild      5.10.1
      3 GitVersion.MsBuild      5.10.3
     16 GitVersion.MsBuild      5.11.1
     27 GitVersion.MsBuild      5.12.0
      1 GitVersion.MsBuild      5.6.11
      2 GitVersion.MsBuild      5.7.0
      4 GitVersion.MsBuild      5.8.1

Thanks team!

Best Answer

uniq -c counts the lengths of sequences of consecutive lines that collate equally in the user's locale (for which strcoll(line1, line2) returns 0).

If you get:

      3 GitVersion.MsBuild      5.10.1
      1 GitVersion.MsBuild      5.10.1

with two seemingly identical lines appearing consecutively, that can only mean they are not identical (and do not collate equally).

The most likely explanation is that there are variations in invisible characters in there.

As these are Microsoft-related files, there are likely CR characters at the ends of the lines, though the difference could also be spaces or tabs, which also occur naturally in XML files.
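To illustrate the effect, here is a minimal sketch using made-up sample data (the sample function and its contents are hypothetical, not taken from the asker's files). It assumes a POSIX shell with sort, uniq, tr and sed, and forces the C locale so that comparisons are bytewise:

```shell
# Keep comparisons bytewise so the demo behaves the same in any locale.
export LC_ALL=C

# Hypothetical sample: three lines that look identical, two of which end in CR.
sample() {
  printf 'GitVersion.MsBuild 5.10.1\r\n'
  printf 'GitVersion.MsBuild 5.10.1\n'
  printf 'GitVersion.MsBuild 5.10.1\r\n'
}

# Looks like a duplicate: uniq -c reports two groups for the "same" text.
sample | sort | uniq -c

# sed's l command prints each line unambiguously: CR shows up as \r and
# the end of every line is marked with $.
sample | sort | uniq -c | sed -n l

# Stripping CR before sorting collapses the groups into one.
sample | tr -d '\r' | sort | uniq -c
```

sed -n l is used here as a portable stand-in for xxd or od -c; any of them should reveal the stray characters.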

Your code should likely be:

rg --no-line-number --no-filename -g '*.csproj' 'GitVersion\.MsBuild' |
  sed -nE '/<!/d
           s/^.*(GitVersion\.MsBuild)" Version="(.+)">.*/\1   \2/p' |
  sort -V |
  uniq -c

Where:

  • s/^\s\+//g removed as it doesn't serve any purpose (even if fixed to s/^\s+//).
  • sort -n (which is pointless as lines don't start with a number) replaced with sort -V (for version sort, a GNU extension).
  • .* is added after "> so anything after it is discarded including space, tab, CR or any other invisible characters.
  • the g is removed as that pattern can only match once.
  • sed -n in combination with the p flag of the s sed command is used to make sure only the lines with a match are printed.
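
The points above can be tried without any .csproj files. The following is a rough sketch in which sample_lines and count_versions are hypothetical helper names and the input lines are invented to resemble what rg might print; it assumes a sort that supports -V (GNU or compatible):

```shell
# Hypothetical stand-in for rg's output: two lines end in CR, one is
# commented out, and one does not mention the package at all.
sample_lines() {
  printf '%s\r\n' '    <PackageReference Include="GitVersion.MsBuild" Version="5.10.1">'
  printf '%s\n'   '    <PackageReference Include="GitVersion.MsBuild" Version="5.10.1">'
  printf '%s\r\n' '    <PackageReference Include="GitVersion.MsBuild" Version="5.6.11">'
  printf '%s\n'   '    <!-- <PackageReference Include="GitVersion.MsBuild" Version="5.8.1"> -->'
  printf '%s\n'   '    <PrivateAssets>all</PrivateAssets>'
}

# Same sed script as in the answer: drop commented lines, keep only the
# name and version, and discard whatever follows "> (including CR).
count_versions() {
  sed -nE '/<!/d
           s/^.*(GitVersion\.MsBuild)" Version="(.+)">.*/\1   \2/p' |
    sort -V |
    uniq -c
}

sample_lines | count_versions
```

With this input, the two 5.10.1 lines should be counted together despite the CR on one of them, and sort -V should place 5.6.11 before 5.10.1, whereas a plain bytewise sort would put 5.10.1 first.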

Or do the whole thing in rg:

rg --pcre2 \
   --no-line-number \
   --no-filename \
   --iglob='*.csproj' \
   --replace=$'$1\t$2' \
   --regexp='^(?!.*<!).*(GitVersion\.MsBuild)"\s+Version="(.*?)".*' |
  sort -V |
  uniq -c

Or pcregrep/pcre2grep:

pcre2grep -hr -o{1,2} --om-separator=$'\t' --include='(?i)\.csproj\z' \
          '^(?!.*<!).*(GitVersion\.MsBuild)"\s+Version="(.*?)"' . |
  sort -V |
  uniq -c

(using --iglob and (?i) as I'm told Microsoft systems tend not to care about case in file names)

If those are indeed XML files, you could also process them with XML aware utilities such as xq:

find . -iname '*.csproj' -type f -exec xq -r '
  ..|
  objects|
  select(."@Include"? == "GitVersion.MsBuild")|
  [."@Include", ."@Version"]|
  join("\t")' {} + |
  sort -V |
  uniq -c