Linux – Remove non-duplicate lines in Linux

awk, linux, text manipulation, uniq

How can I remove non-duplicate lines from a text file using any Linux program such as sed, awk, or something else?





The second list has ccc removed because it had no duplicate lines.

Is it also possible to remove both the non-duplicate lines and the lines that have only 2 duplicates, leaving only the lines that have more than 2 duplicates?

Best Answer

The solutions posted by others do not work on my Debian Jessie: they keep a single copy of each duplicated line, whereas I understand the OP to want all copies of the duplicated lines kept. If I have understood the OP correctly, then:

  1. The following command

    awk '!seen[$0]++' file

    prints only the first occurrence of each line, i.e. it removes all the repeated copies.

  2. The following command

    awk 'seen[$0]++' file 

    outputs all the duplicates, but not the original copy: i.e., if a line appears n times, it outputs the line n-1 times.

  3. Then the command

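    awk 'seen[$0]++' file > temp && awk '!seen[$0]++' temp > temp1 && cat temp1 >> temp

    solves your problem: temp first receives n-1 copies of each duplicated line, then one more copy of each is appended from temp1, so unique lines never appear. The lines are not in the original order.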

  4. If you want lines which have two or more duplicates, you can now iterate the above:

    awk 'seen[$0]++' file | awk 'seen[$0]++' > temp

    keeps n-2 copies of each line that appears n>2 times. Now

    awk '!seen[$0]++' temp > temp1 

    leaves a single copy of each line from the temp file in temp1, and you can now obtain what you wish (i.e. all copies of only those lines that appear more than twice) as follows:

    cat temp1 >> temp; cat temp1 >> temp

  5. If you need to do this for lines which appear more than N times, the following command (shown here with N=2)

    awk -v N=2 'seen[$0]++ && seen[$0] > N' file

    is simpler than chaining N times the command awk 'seen[$0]++' file.
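
As an alternative to the temp-file steps above, the same filtering can be sketched with a single awk call that reads the file twice: the first pass counts each line, the second prints only lines whose count reaches a threshold. This is a different technique from the one in the answer, and it keeps the original line order. The helper name `keep_min_count`, the file `sample.txt` and its contents are assumptions made up for illustration.

```shell
# Hypothetical sample input, created in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' aaa bbb aaa ccc bbb aaa ddd ddd > "$tmpdir/sample.txt"

# keep_min_count MIN FILE (hypothetical helper):
# the file is given to awk twice; while NR == FNR (first pass) lines are
# only counted, on the second pass a line is printed if its total count
# is at least MIN.
keep_min_count() {
    awk -v min="$1" 'NR == FNR { count[$0]++; next } count[$0] >= min' "$2" "$2"
}

# Every copy of the lines that occur at least twice (unique lines dropped).
dups=$(keep_min_count 2 "$tmpdir/sample.txt")
printf '%s\n' "$dups"

# Every copy of the lines that occur at least three times.
triples=$(keep_min_count 3 "$tmpdir/sample.txt")
printf '%s\n' "$triples"

rm -r "$tmpdir"
```

With this sample, the first call drops only ccc, and the second keeps only the three aaa lines. Because the input is read twice, this sketch needs a regular file rather than a pipe.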
