My question is similar to this question but with a couple of different constraints:
- I have a large \n-delimited wordlist, one word per line. File sizes range from 2 GB to as large as 10 GB.
- I need to remove any duplicate lines.
- The process may sort the list while removing the duplicates, but that is not required.
- There is enough space on the partition to hold the new unique wordlist that is written out.
I have tried both of the following, but they each fail with out-of-memory errors:
sort -u wordlist.lst > wordlist_unique.lst
awk '!seen[$0]++' wordlist.lst > wordlist_unique.lst
awk: (FILENAME=wordlist.lst FNR=43601815) fatal: assoc_lookup: bucket-ahname_str: can't allocate 10 bytes of memory (Cannot allocate memory)
What other approaches can I try?
Best Answer
Try using sort with the -o/--output=FILE option instead of redirecting the output. You might also try setting the buffer size with -S/--buffer-size=SIZE, and try -s/--stable as well. Read the man page; it covers all of the options mentioned here. A full command that might work for what you're doing is sketched below.
You might also want to read the following URL:
http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html
It explains sort more thoroughly than the man page does.