Ubuntu – Identify duplicate lines in a file without deleting them

command-line sort

I have my references in a text file: a long list of entries, each with two (or more) fields.

The first field is the reference's URL; the second is the title, which may vary a bit depending on how the entry was made. The same goes for the third field, which may or may not be present.

I want to identify, but not remove, entries whose first field (the reference URL) is identical. I know about sort -k1,1 -u, but that will automatically (non-interactively) remove all but the first hit. Is there a way to just flag them so I can choose which to retain?

In the extract below, three lines share the same first field (http://unix.stackexchange.com/questions/49569/). I would like to keep line #2, because it has additional tags (sort, CLI), and delete lines #1 and #3:

http://unix.stackexchange.com/questions/49569/  unique-lines-based-on-the-first-field
http://unix.stackexchange.com/questions/49569/  Unique lines based on the first field   sort, CLI
http://unix.stackexchange.com/questions/49569/  Unique lines based on the first field

Is there a program to help identify such "duplicates"? Then I could clean up manually by deleting lines #1 and #3 myself.

Best Answer

If I understand your question, I think that you need something like:

for dup in $(sort -k1,1 -u file.txt | cut -d' ' -f1); do grep -n -- "$dup" file.txt; done

or:

for dup in $(cut -d " " -f1 file.txt | sort | uniq -d); do grep -n -- "$dup" file.txt; done

where file.txt is the file containing the data you are interested in.

The output shows the line numbers together with the matching lines: the first variant groups every line by its first field, while the second (note the sort, which uniq -d needs to spot non-adjacent duplicates) lists only the lines whose first field occurs two or more times, so you can choose which one to keep.
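
As a possible alternative (not part of the commands above), a single awk script can do the counting and the matching in one tool, which also avoids grep treating characters in the URL as regular-expression metacharacters. This is just a sketch, assuming the fields are whitespace-separated and the file is again called file.txt:

awk 'NR==FNR { count[$1]++; next } count[$1] > 1 { print FNR": "$0 }' file.txt file.txt

The first pass over file.txt counts how often each first field appears; the second pass prints the line number and the full line for every first field seen more than once.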