Linux – Remove non-duplicate lines in Linux

awk, linux, text manipulation, uniq

How can I remove non-duplicate lines from a text file using any Linux program like sed, awk, or any other?

Example:

abc
bbc
abc
bbc
ccc
bbc

Result:

abc
bbc
abc
bbc
bbc

The second list has ccc removed because it didn't have any duplicate lines.

Is it also possible to remove lines that are non-duplicate AND lines that have only 2 duplicates, and keep only those lines that have more than 2 duplicates?

Best Answer

The solutions posted by others did not work for me on Debian Jessie: they keep only a single copy of each duplicated line, whereas my understanding of the OP is that all copies of the duplicated lines are to be kept. If I have understood the OP right, then ...

  1. The following command

    awk '!seen[$0]++' file
    

    keeps only the first copy of each line, i.e. it removes every duplicate occurrence (see the first example after this list).

  2. The following command

    awk 'seen[$0]++' file 
    

    outputs all the duplicates, but not the original copy: i.e., if a line appears n times, it outputs the line n-1 times.

  3. Then the command

    awk 'seen[$0]++' file > temp && awk '!seen[$0]++' temp > temp1 && cat temp1 >> temp
    

    solves your problem: temp now contains every copy of the duplicated lines and none of the lines that appear only once (see the second example after this list). The lines are not in the original order.

  4. If you want lines which have two or more duplicates, you can now iterate the above:

    awk 'seen[$0]++' file | awk 'seen[$0]++' > temp
    

    keeps n-2 copies of every line that appears n times with n>2 (i.e. has more than one duplicate). Now

    awk '!seen[$0]++' temp > temp1 
    

    writes a single copy of each line from the temp file into temp1, and you can now obtain what you wish (i.e. all copies of the lines with more than one duplicate) as follows (the last example after this list shows the result):

    cat temp1 >> temp; cat temp1 >> temp
    
  5. If you need to do this for lines which appear more than N times, the following command

      awk 'seen[$0]++ && seen[$0] > N' file 
    

    is simpler than chaining the command awk 'seen[$0]++' file N times. Note that N is a placeholder: substitute an actual number, or pass it in with awk's -v option (e.g. awk -v N=2).
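
For reference, here is how the commands from points 1 and 2 should behave on the sample input from the question (the printf line and the name file are only an illustrative way to recreate the test data; the expected output is noted in the comments):

    printf '%s\n' abc bbc abc bbc ccc bbc > file

    awk '!seen[$0]++' file    # prints abc, bbc, ccc - one copy of each distinct line
    awk 'seen[$0]++' file     # prints abc, bbc, bbc - only the n-1 extra copies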
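
Likewise, the combined commands from point 3, run on the same sample file, should leave all copies of the duplicated lines in temp and drop ccc (again, the file names are only examples):

    awk 'seen[$0]++' file > temp      # temp:  abc, bbc, bbc
    awk '!seen[$0]++' temp > temp1    # temp1: abc, bbc
    cat temp1 >> temp                 # temp:  abc, bbc, bbc, abc, bbc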

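Finally, a sketch of points 4 and 5 on the same sample, where only bbc appears three times; passing the threshold with awk's -v option (here -v N=2) is one way to make N concrete:

    awk 'seen[$0]++' file | awk 'seen[$0]++' > temp    # temp:  bbc  (the n-2 copies)
    awk '!seen[$0]++' temp > temp1                     # temp1: bbc
    cat temp1 >> temp; cat temp1 >> temp               # temp:  bbc, bbc, bbc

    awk -v N=2 'seen[$0]++ && seen[$0] > N' file       # prints bbc, same as chaining twice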