Deduplication of lines in a large file

deduplication, large files, text processing

The size of the file is 962,120,335 bytes.

HP-UX ******B.11.31 U ia64 ****** unlimited-user license

hostname> what /usr/bin/awk
/usr/bin/awk:
         main.c $Date: 2009/02/17 15:25:17 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
         run.c $Date: 2009/02/17 15:25:20 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
         $Revision: @(#) awk R11.31_BL2010_0503_1 PATCH_11.31 PHCO_40052
hostname> what /usr/bin/sed
/usr/bin/sed:
         sed0.c $Date: 2008/04/23 11:11:11 $Revision: r11.31/1 PATCH_11.31 (PHCO_38263)
         $Revision: @(#) sed R11.31_BL2008_1022_2 PATCH_11.31 PHCO_38263
hostname> perl -v
    This is perl, v5.8.8 built for IA64.ARCHREV_0-thread-multi
hostname:> $ file /usr/bin/perl
/usr/bin/perl:  ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/awk
/usr/bin/awk:   ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/sed
/usr/bin/sed:   ELF-32 executable object file - IA64

There are no GNU tools here.
What are my options?

I have already looked at "How to remove duplicate lines in a large multi-GB textfile?" and the external merge sort article at http://en.wikipedia.org/wiki/External_sorting#External_merge_sort.

perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique

throws

Out of Memory!

The resulting 960 MB file was merged from the files whose sizes (in bytes) are listed below; the average size is about 50 MB:
22900038,
24313871,
25609082,
18059622,
23678631,
32136363,
49294631,
61348150,
85237944,
70492586,
79842339,
72655093,
73474145,
82539534,
65101428,
57240031,
79481673,
539293,
38175881

Question: How can I perform an external sort-merge and deduplicate this data? Or, failing that, how else can I deduplicate it?

Best Answer

It seems to me that the process you're following at the moment is this (sketched below the list), which fails with your out-of-memory error:

  1. Create several data files
  2. Concatenate them together
  3. Sort the result, discarding duplicate records (rows)
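
As a rough sketch (using hypothetical part* names for the 19 source files, which aren't named in the question), that process amounts to the following, and the dedup step has to remember every distinct line of the whole 960 MB file at once:

    cat part* > file.merge
    # %seen keeps one hash key per distinct line of the merged file;
    # on a ~960 MB input that hash is what exhausts memory
    perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique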

I think you should be able to perform the following process instead (see the shell sketch after the list):

  1. Create several data files
  2. Sort each one independently, discarding its duplicates (sort -u)
  3. Merge the resulting set of sorted data files, discarding duplicates (sort -m -u)
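
HP-UX sort(1) should accept -u and -m, since both are POSIX options, so a minimal sketch of that two-pass approach could look like the one below. The part* file names are an assumption (the real source file names aren't given), and sort needs enough scratch space in its temporary directory for intermediate files.

    # Assumes no *.sorted files exist yet from a previous run.
    # 1. Sort each source file on its own, discarding duplicates within it.
    #    Each file is only ~50 MB, so a per-file sort is cheap.
    for f in part*
    do
        sort -u "$f" > "$f.sorted" || exit 1
    done

    # 2. Merge the already-sorted files, again discarding duplicates across
    #    files. -m only merges, so it never holds or sorts 960 MB at once.
    sort -m -u part*.sorted > file.unique

Memory use stays bounded because step 1 works on one ~50 MB file at a time and step 2 only keeps the current line from each input file while merging.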