Linux command to remove the duplicate lines but keep the first occurrence

command line, linux, string manipulation, Ubuntu

I have a text file. Each line contains a string. Some strings are repeated. I want to remove repetition but I want to keep the first occurrence. For example:

line1
line1
line2
line3
line4
line3
line5

Should be

line1
line2
line3
line4
line5

I tried: sort file1 | uniq -u > file2, but this did not help. It removed every repeated string entirely, whereas I want the first occurrence to remain. I do not need sorting. I just want to remove exact repetitions of a line while keeping everything else as it is.

Best Answer

If you allow sorting anyway, this will work:

sort | uniq

-u was the source of your trouble, because (from man 1 uniq):

-u, --unique
only print unique lines

while by default:

With no options, matching lines are merged to the first occurrence.
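
For reference, here is a small sketch comparing these variants on the sample from the question. The temporary file and the variable names are assumptions made only for illustration. The awk one-liner is a commonly used alternative that keeps the original order without sorting.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
printf '%s\n' line1 line1 line2 line3 line4 line3 line5 > "$input"

# sort | uniq: one copy of every distinct line (output is sorted).
sorted_unique=$(sort "$input" | uniq)

# sort | uniq -u: only the lines that occur exactly once.
only_singletons=$(sort "$input" | uniq -u)

# awk without sorting: print a line only the first time it is seen.
first_occurrences=$(awk '!seen[$0]++' "$input")

printf 'sort | uniq:\n%s\n\n' "$sorted_unique"
printf 'sort | uniq -u:\n%s\n\n' "$only_singletons"
printf 'awk, order kept:\n%s\n' "$first_occurrences"

rm -f "$input"
```

With this input, uniq -u drops line1 and line3 entirely, which matches the behaviour described in the question.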

A Few Shortcuts

(based on your comment update for setting $HOSTNAME)

$HOSTNAME

Two options to set that:

Set HOSTNAME explicitly:

HOSTNAME=$(hostname)

Or use command substitution, e.g. $(command), directly where the value is needed.

Either way it looks like the above: the shell runs the command first and substitutes its output.

$DATE

The date variable can be avoided just as easily:

$(hostname)_$(date +%Y%m%d).tar.gz \

$ man date lists the available format options; the format above produces YYYYmmdd.
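
As a minimal sketch of the two substitutions combined, assuming the goal is only to build the archive name (archive_name is a hypothetical variable, not part of the original command):

```shell
# Build an archive name of the form <hostname>_<YYYYmmdd>.tar.gz.
# archive_name is a hypothetical variable used only for this sketch.
archive_name="$(hostname)_$(date +%Y%m%d).tar.gz"
echo "$archive_name"
```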

Linux – Remove non-duplicate lines in Linux

The solutions posted by others do not work on my Debian Jessie: they keep a single copy of any duplicate line, while it is my understanding of the OP that all copies of the duplicate lines are to be kept. If I have understood the OP right, then ...

The following command
```
awk '!seen[$0]++' file
```
prints each distinct line once, i.e. it drops every repeat after the first occurrence.
The following command
```
awk 'seen[$0]++' file 
```
outputs all the duplicates, but not the original copy: i.e., if a line appears n times, it outputs the line n-1 times.
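
The two one-liners above can be checked side by side. This is only a sketch; the sample data and variable names are assumptions for illustration.

```shell
# Sample data: "a" appears 3 times, "b" twice, "c" once.
input=$(mktemp)
printf '%s\n' a b a c a b > "$input"

# seen[$0]++ evaluates to the count *before* incrementing:
# 0 (false) on the first occurrence, non-zero (true) afterwards.
firsts=$(awk '!seen[$0]++' "$input")   # first occurrence of each line
repeats=$(awk 'seen[$0]++' "$input")   # every occurrence after the first

printf 'first occurrences:\n%s\n' "$firsts"
printf 'later occurrences:\n%s\n' "$repeats"

rm -f "$input"
```

Here "a" shows up twice among the later occurrences and "b" once, in line with the n-1 rule.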
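Then the command
```
awk 'seen[$0]++' file > temp && awk '!seen[$0]++' temp > temp1 && cat temp1 >> temp
```
solves your problem: temp first receives n-1 copies of every line that appears n>1 times, and temp1 supplies the one remaining copy of each. The second awk must read temp rather than file, otherwise the lines that occur only once would be appended as well. The lines are not in the original order.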
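
If the original order matters, a two-pass awk is a possible alternative to the temp-file approach. This is a different technique from the one in this answer, sketched here on assumed sample data: the file is read twice, first to count each line and then to print the lines whose count exceeds one.

```shell
# Sample data: "a" appears 3 times, "b" twice, "c" once.
input=$(mktemp)
printf '%s\n' a b a c a b > "$input"

# Pass 1 (NR==FNR holds only while reading the first file argument): count lines.
# Pass 2 (the same file again): print every line whose count exceeds 1.
repeated=$(awk 'NR==FNR { count[$0]++; next } count[$0] > 1' "$input" "$input")
printf '%s\n' "$repeated"

rm -f "$input"
```

Only "c" is dropped; all copies of "a" and "b" stay in their original positions.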
If you want lines which have two or more duplicates, you can now iterate the above:
```
awk 'seen[$0]++' file | awk 'seen[$0]++' > temp
```
keeps n-2 copies of each line that appears n>2 times (i.e. has two or more duplicates). Now
```
awk '!seen[$0]++' temp > temp1 
```
leaves one copy of each line from the temp file in temp1, and you can now obtain what you wish (i.e. all n copies of only those lines that appear three or more times) as follows:
```
cat temp1 >> temp; cat temp1 >> temp
```
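If you need to do this for lines which appear more than N times, the following command (shown with N=3; pass your own value with -v)
```
awk -v N=3 'seen[$0]++ && seen[$0] > N' file
```
is simpler than chaining N times the command awk 'seen[$0]++' file. It prints n-N copies of each line that appears n>N times.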
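
A quick check of the threshold form against the chained form, as a sketch only. The sample data and the value N=2 are assumptions for illustration, and N is passed to awk with -v so that it is defined inside the program.

```shell
# Sample data: "x" appears 4 times, "y" twice, "z" once.
input=$(mktemp)
printf '%s\n' x y x z x y x > "$input"

N=2
# Print an occurrence only if it is at least the (N+1)-th occurrence of that line.
beyond_n=$(awk -v N="$N" 'seen[$0]++ && seen[$0] > N' "$input")

# The same result obtained by chaining the simple filter N (= 2) times.
chained=$(awk 'seen[$0]++' "$input" | awk 'seen[$0]++')

printf 'threshold form:\n%s\n' "$beyond_n"
printf 'chained form:\n%s\n' "$chained"

rm -f "$input"
```

With N=2, only "x" survives, and it is printed 4-2 = 2 times by both forms; "y", which appears exactly N times, is not printed.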
