Ubuntu – Command that will only print value once although it appears many times

bash, command line

I have a big txt file in which values repeat many times. Is there a command I can use that goes through the file and prints each value only once, no matter how many times it appears?

SO4
HOH
CL
BME
HOH
SO4
HOH
CL
BME
HOH
SO4
HOH
SO4
HOH
CL
BME
HOH
SO4
HOH
CL
BME
HOH
CL

So it should look something like this:

SO4
HOH
CL
BME

The thing is that I have a huge number of different values, so I can't do it manually like here.

Best Answer

You could use the command sort with the option --unique:

sort -u input-file

If you want to write the result to FILE instead of standard output, use the option --output=FILE (short form -o FILE):

sort -u input-file -o output-file
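
As a side note, sort reads all of its input before it opens the output file, so -o may name the input file itself, whereas a shell redirection such as `sort -u file > file` would truncate the file before sort reads it. A minimal sketch of the in-place form; the temporary directory and the file name values.txt are made up for this example:

```shell
# Hypothetical scratch directory and sample file, used only for this demo.
workdir=$(mktemp -d)
printf '%s\n' SO4 HOH CL HOH SO4 > "$workdir/values.txt"

# Deduplicate in place: -o may point at the same file that is being read.
sort -u -o "$workdir/values.txt" "$workdir/values.txt"

# Read the result back, then clean up the scratch directory.
result=$(cat "$workdir/values.txt")
rm -r "$workdir"

printf '%s\n' "$result"
```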

The command uniq could also be applied. It only removes identical lines that are consecutive, so the input must be sorted first (thanks to @RonJohn for this note):

sort input-file | uniq > output-file
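
To see why the sort step matters, here is a small sketch on made-up sample data in which the repeated values are never adjacent. The variable names are only for illustration:

```shell
# Sample data (hypothetical): duplicates exist, but never on adjacent lines.
sample=$(printf '%s\n' SO4 HOH CL HOH SO4)

# uniq alone only collapses adjacent repeats, so nothing is removed here.
unsorted_result=$(printf '%s\n' "$sample" | uniq)

# Sorting first puts identical lines next to each other, so uniq can drop them.
sorted_result=$(printf '%s\n' "$sample" | sort | uniq)

printf 'uniq alone:\n%s\n\n' "$unsorted_result"
printf 'sort | uniq:\n%s\n' "$sorted_result"
```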

I like the sort command for cases like this because of its simplicity, but if you work with large inputs the awk approach from John1024's answer can be considerably faster. Here is a time comparison between the mentioned approaches, applied to a file (based on the above example) with 20 million lines:

$ cat input-file | wc -l
20000000

$ TIMEFORMAT=%R
$ time sort -u input-file | wc -l
64
7.495

$ time sort input-file | uniq | wc -l
64
7.703

$ time awk '!a[$0]++' input-file | wc -l      # from John1024's answer
64
1.271

$ time datamash rmdup 1 < input-file | wc -l  # from αғsнιη's answer
64
0.770
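
Speed aside, the approaches also differ in output order: sort -u returns the unique lines sorted, while the awk one-liner keeps them in order of first appearance, which is the order shown in the question. A minimal sketch on a shortened version of the question's data (the array name seen and the shell variable names are arbitrary):

```shell
# Shortened version of the data from the question.
sample=$(printf '%s\n' SO4 HOH CL BME HOH SO4 HOH CL)

# seen[$0]++ evaluates to 0 the first time a line is met, so !seen[$0]++
# is true only then, and awk's default action prints the line.
first_seen=$(printf '%s\n' "$sample" | awk '!seen[$0]++')

# sort -u removes the same duplicates but emits the result in sorted order.
sorted_unique=$(printf '%s\n' "$sample" | sort -u)

printf 'awk, first-appearance order:\n%s\n\n' "$first_seen"
printf 'sort -u, sorted order:\n%s\n' "$sorted_unique"
```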

Another significant difference is the one mentioned by @Ruslan:

sort -u will only print the result once the input has ended, while this awk command will print each new result line on the fly (this may be more important for piped input than for a file).

Here is an illustration: the loop shown below generates 500 random combinations, each three characters long, of the letters A-D. These combinations are piped to awk or sort.

for i in {1..500}; do cat /dev/urandom | tr -dc A-D | head -c 3; echo; done
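
The same effect can be reproduced deterministically with a deliberately slow producer. This is only a sketch: the function slow_lines is invented for the demo, the fractional `sleep 0.2` assumes GNU sleep (as shipped with Ubuntu), and how promptly awk prints depends on the awk implementation and its buffering. Run in a terminal, the awk pipeline typically shows each new value as it arrives, while sort -u stays silent until the loop has finished.

```shell
# Hypothetical slow producer: emits one value every 0.2 s, with repeats.
slow_lines() {
    for value in SO4 HOH SO4 CL HOH BME; do
        printf '%s\n' "$value"
        sleep 0.2
    done
}

# awk can print each new value as soon as it has read it.
slow_lines | awk '!seen[$0]++'

# sort -u must read the whole input before it can print anything.
slow_lines | sort -u
```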

