How to Remove Duplicate Lines in a Large Multi-GB Text File

My question is similar to this question but with a couple of different constraints:

  • I have a large \n-delimited wordlist, one word per line. File sizes
    range from 2 GB up to 10 GB.
  • I need to remove any duplicate lines.
  • The process may sort the list while removing duplicates, but it is not required to.
  • There is enough space on the partition to hold the resulting unique wordlist.

I have tried both of the following methods, but they fail with out-of-memory errors.

sort -u wordlist.lst > wordlist_unique.lst
awk '!seen[$0]++' wordlist.lst > wordlist_unique.lst
awk: (FILENAME=wordlist.lst FNR=43601815) fatal: assoc_lookup: bucket-ahname_str: can't allocate 10 bytes of memory (Cannot allocate memory)

What other approaches can I try?

Best Answer

Try using sort with the -o/--output=FILE option instead of redirecting the output. You might also try setting the buffer size with -S/--buffer-size=SIZE, and -s/--stable may help as well. The man page covers all of these options.

A full command that might work for what you're doing:

sort -us -o wordlist_unique.lst wordlist.lst
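
If that still exhausts memory, one variant worth trying is to cap sort's in-memory buffer with -S and point its temporary files at a partition with free space using -T. The buffer size and temporary directory below are only example values, not part of the original answer; adjust them to your system:

# Keep at most 1 GiB in memory and spill sorted runs to /data/tmp
# (an assumed path with enough free space), writing unique lines to the output file.
sort -u -S 1G -T /data/tmp -o wordlist_unique.lst wordlist.lst

GNU sort already performs an external merge sort when the input doesn't fit in the buffer, so limiting the buffer mainly trades speed for a smaller memory footprint.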

You might also want to read the following URL:

http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html

It explains sort more thoroughly than the man page does.
