Text Processing – Remove Unique Strings from a Text File

sort, text processing

Sorry guys, I had to edit my example because I didn't express my query properly.
Let's say I have the .txt file:

Happy sad
Happy sad
Happy sad
Sad happy
Happy sad
Happy sad
Mad sad
Mad happy
Mad happy

And I want to delete any line that is unique, leaving the file with:

Happy sad
Happy sad
Happy sad
Happy sad
Happy sad
Mad happy
Mad happy

I understand that sort can get rid of duplicates (sort file.txt | uniq), so is there any way to do the opposite in bash with a single command? Or would I need to write a while loop for it?
BTW, uniq -D file.txt > output.txt doesn't work.

Best Answer

Using awk:

$ awk 'seen[$0]++; seen[$0] == 2' file
Happy sad
Happy sad
Happy sad
Happy sad
Happy sad
Mad happy
Mad happy

This uses the text of each line as the key into the associative array seen. The first pattern, seen[$0]++, evaluates to the count before incrementing, so it is zero (false) the first time a line is seen and non-zero (true) on the second and subsequent occurrences, which prints the line. The second pattern, seen[$0] == 2, prints the line one extra time when it is seen for the second time; without this, you would miss one occurrence of each duplicated line.
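If you want every occurrence to keep its original position (the one-liner above prints the first occurrence late, as an extra copy of the second), a two-pass variant is a common alternative. A minimal sketch, assuming a POSIX awk and a file small enough to read twice; the first pass (NR==FNR) only counts lines, and the second prints any line whose total count exceeds one:

$ awk 'NR==FNR { count[$0]++; next } count[$0] > 1' file file
Happy sad
Happy sad
Happy sad
Happy sad
Happy sad
Mad happy
Mad happy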

The seen[$0]++ test is related to awk '!seen[$0]++', which is sometimes used to remove duplicates without sorting (see e.g. How does awk '!a[$0]++' work?).
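For comparison, here is that dedup idiom applied to the same input (just an illustration of the linked idiom, not part of the original question):

$ awk '!seen[$0]++' file
Happy sad
Sad happy
Mad sad
Mad happy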


To only get one copy of the duplicated lines:

awk 'seen[$0]++ == 1' file

or,

sort file | uniq -d
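
Note that uniq only compares adjacent lines, which is why the uniq -D attempt from the question appears not to work on unsorted input. After sorting, -D (print all duplicated lines; available in GNU uniq) reproduces the output of the first awk one-liner, just in sorted rather than original order:

$ sort file | uniq -D
Happy sad
Happy sad
Happy sad
Happy sad
Happy sad
Mad happy
Mad happy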