Ubuntu – Identify duplicate lines in a file without deleting them

command-line sort

I have my references in a text file: a long list of entries, each with two (or more) fields.

The first field is the reference's URL; the second is the title, which may vary a bit depending on how the entry was made. The same goes for the third field, which may or may not be present.

I want to identify, but not remove, entries whose first field (the reference URL) is identical. I know about sort -k1,1 -u, but that will automatically (non-interactively) remove all but the first hit. Is there a way to just flag them so I can choose which to retain?

In the extract below, three lines share the same first field (http://unix.stackexchange.com/questions/49569/). I would like to keep line #2, because it has additional tags (sort, CLI), and delete lines #1 and #3:

http://unix.stackexchange.com/questions/49569/  unique-lines-based-on-the-first-field
http://unix.stackexchange.com/questions/49569/  Unique lines based on the first field   sort, CLI
http://unix.stackexchange.com/questions/49569/  Unique lines based on the first field

Is there a program to help identify such "duplicates"? Then I could clean up manually by deleting lines #1 and #3 myself.

Best Answer

If I understand your question, I think that you need something like:

for dup in $(sort -k1,1 -u file.txt | cut -d' ' -f1); do grep -n -- "$dup" file.txt; done

or:

for dup in $(cut -d " " -f1 file.txt | sort | uniq -d); do grep -n -- "$dup" file.txt; done

where file.txt is the file containing the data you are interested in.

The output shows the line numbers together with the matching lines: the first variant groups every line by its first field, while the second (note the sort, which uniq -d needs to spot non-adjacent duplicates) lists only the lines whose first field occurs two or more times, so you can choose which one to keep.
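
As a possible alternative (not part of the commands above), a single awk script can do the counting and the matching in one tool, which also avoids grep treating characters in the URL as regular-expression metacharacters. This is just a sketch, assuming the fields are whitespace-separated and the file is again called file.txt:

awk 'NR==FNR { count[$1]++; next } count[$1] > 1 { print FNR": "$0 }' file.txt file.txt

The first pass over file.txt counts how often each first field appears; the second pass prints the line number and the full line for every first field seen more than once.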