Remove duplicate lines while keeping the order of the lines

Tags: awk, perl, sed, text-processing

[root@server]# awk '!seen[$0]++' out.txt > cleaned
awk: (FILENAME=out.txt FNR=8547098) fatal error: internal error
Aborted
[root@server]#

The "server" has: 8 GB RAM + 16 GB swap, >300 GB free disk space, amd64, desktop CPU, Scientific Linux 6.6. Nothing else runs on it to generate load. awk aborts after a few seconds. out.txt is ~1.6 GB. GNU Awk 3.1.7.

Question: How can I remove the duplicate lines while keeping their order? Case matters too, e.g. "A" and "a" are two different lines and both must be kept. But "a" and "a" is a duplicate; only the first occurrence is needed.

The answer can use anything. If awk is not good for this, then perl/sed... And what could the problem be?
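To make the requirement concrete, here is the expected behavior on a tiny input (this is the same awk one-liner; it works fine at small scale, the abort only shows up on the large file):

```shell
# Case-sensitive dedup, keeping the first occurrence and the original order:
printf 'A\na\na\nA\n' | awk '!seen[$0]++'
# A
# a
```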

[root@server]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 61945
max locked memory       (kbytes, -l) 99999999
max memory size         (kbytes, -m) unlimited
open files                      (-n) 999999
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 99999999
cpu time               (seconds, -t) unlimited
max user processes              (-u) 61945
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[root@server]# 

Update: I tried this on a RHEL machine; it doesn't abort, but I didn't have time to wait for it to finish. Why does SL Linux differ from RHEL?

Update: I'm trying this on an Ubuntu 14 virtual guest... so far it works! It's not an ulimit problem: mawk 1.3.3

root@asdf-VirtualBox:~# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 51331
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 51331
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
root@asdf-VirtualBox:~# 

Best Answer

I doubt it will make a difference but, just in case, here's how to do the same thing in Perl:

perl -ne 'print if ++$k{$_}==1' out.txt

If the problem is keeping the unique lines in memory, that will have the same issue as the awk you tried. So, another approach could be:

cat -n out.txt | sort -k2 -k1n  | uniq -f1 | sort -nk1,1 | cut -f2-

How it works:

  1. On a GNU system, cat -n prepends the line number to each line, after some leading spaces and followed by a <tab> character. cat pipes this representation to sort.

  2. sort's -k2 option instructs it to consider only the characters from the second field to the end of the line when sorting, and sort splits fields by default on whitespace (here, cat's inserted spaces and <tab>).
    When followed by -k1n, sort considers the 2nd field first and then, in the case of identical -k2 fields, considers the 1st field, sorted numerically. So repeated lines are sorted together, in the order they appeared in the input.

  3. The results are piped to uniq, which is told to ignore the first field (-f1, again whitespace-separated). This yields a list of the unique lines in the original file, which is piped back to sort.
  4. This time sort sorts on the first field (cat's inserted line number) numerically, getting the sort order back to what it was in the original file and pipes these results to cut.
  5. Lastly, cut removes the line numbers that were inserted by cat. This is effected by cut printing only from the 2nd field through the end of the line (and cut's default delimiter is a <tab> character).

To illustrate:

$ cat file
bb
aa
bb
dd
cc
dd
aa
bb
cc
$ cat -n file | sort -k2 -k1n | uniq -f1 | sort -nk1,1 | cut -f2-
bb
aa
dd
cc
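The intermediate stages can be seen by running the pipeline piece by piece (output assumes GNU coreutils; the comments summarize each stage rather than showing exact whitespace):

```shell
# Stage by stage on a 3-line sample:
printf 'bb\naa\nbb\n' | cat -n
# cat -n numbers each line: "1<TAB>bb", "2<TAB>aa", "3<TAB>bb"
# (numbers are left-padded with spaces)
printf 'bb\naa\nbb\n' | cat -n | sort -k2 -k1n
# sorted by content, then by line number: aa(2), bb(1), bb(3)
printf 'bb\naa\nbb\n' | cat -n | sort -k2 -k1n | uniq -f1
# duplicates collapsed, keeping the earliest line number: aa(2), bb(1)
printf 'bb\naa\nbb\n' | cat -n | sort -k2 -k1n | uniq -f1 | sort -nk1,1 | cut -f2-
# original order restored, numbers stripped:
# bb
# aa
```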