How to Remove Duplicate Lines in a Large Multi-GB Text File

My question is similar to this question but with a couple of different constraints:

  • I have a large \n-delimited wordlist, one word per line. File sizes
    range from 2 GB up to 10 GB.
  • I need to remove any duplicate lines.
  • The process may sort the list while removing duplicates, but it is not required to.
  • There is enough space on the partition to hold the resulting unique wordlist.

I have tried both of the following methods, but they fail with out-of-memory errors.

sort -u wordlist.lst > wordlist_unique.lst
awk '!seen[$0]++' wordlist.lst > wordlist_unique.lst
awk: (FILENAME=wordlist.lst FNR=43601815) fatal: assoc_lookup: bucket-ahname_str: can't allocate 10 bytes of memory (Cannot allocate memory)

What other approaches can I try?

Best Answer

Try using sort with the -o/--output=FILE option instead of redirecting the output. You might also try setting the buffer size with -S/--buffer-size=SIZE, and -s/--stable may help as well. The man page covers all of these options.

A full command that might work for what you're doing:

sort -us -o wordlist_unique.lst wordlist.lst
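
If that still exhausts memory, one variant worth trying is to cap sort's in-memory buffer with -S and point its temporary files at a partition with free space using -T. The buffer size and temporary directory below are only example values, not part of the original answer; adjust them to your system:

# Keep at most 1 GiB in memory and spill sorted runs to /data/tmp
# (an assumed path with enough free space), writing unique lines to the output file.
sort -u -S 1G -T /data/tmp -o wordlist_unique.lst wordlist.lst

GNU sort already performs an external merge sort when the input doesn't fit in the buffer, so limiting the buffer mainly trades speed for a smaller memory footprint.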

You might also want to read the following URL:

http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html

It explains sort more thoroughly than the man page does.
