Deduplication of lines in a large file

deduplication, large files, text processing

The size of the file is 962,120,335 bytes.

HP-UX ******B.11.31 U ia64 ****** unlimited-user license

hostname> what /usr/bin/awk
/usr/bin/awk:
         main.c $Date: 2009/02/17 15:25:17 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
         run.c $Date: 2009/02/17 15:25:20 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
         $Revision: @(#) awk R11.31_BL2010_0503_1 PATCH_11.31 PHCO_40052
hostname> what /usr/bin/sed
/usr/bin/sed:
         sed0.c $Date: 2008/04/23 11:11:11 $Revision: r11.31/1 PATCH_11.31 (PHCO_38263)
         $Revision: @(#) sed R11.31_BL2008_1022_2 PATCH_11.31 PHCO_38263
hostname> perl -v
    This is perl, v5.8.8 built for IA64.ARCHREV_0-thread-multi
hostname:> $ file /usr/bin/perl
/usr/bin/perl:  ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/awk
/usr/bin/awk:   ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/sed
/usr/bin/sed:   ELF-32 executable object file - IA64

There are no GNU tools here.
What are my options?

I have already looked at "How to remove duplicate lines in a large multi-GB textfile?" and the external merge sort article at http://en.wikipedia.org/wiki/External_sorting#External_merge_sort.

perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique

throws

Out of Memory!

The resulting 960 MB file was merged from the files whose sizes (in bytes) are listed below; the average size is about 50 MB:
22900038,
24313871,
25609082,
18059622,
23678631,
32136363,
49294631,
61348150,
85237944,
70492586,
79842339,
72655093,
73474145,
82539534,
65101428,
57240031,
79481673,
539293,
38175881

Question: How can I perform an external sort-merge and deduplicate this data? Or, failing that, how else can I deduplicate it?

Best Answer

It seems to me that the process you're following at the moment is this (sketched below the list), which fails with your out-of-memory error:

  1. Create several data files
  2. Concatenate them together
  3. Sort the result, discarding duplicate records (rows)
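
As a rough sketch (using hypothetical part* names for the 19 source files, which aren't named in the question), that process amounts to the following, and the dedup step has to remember every distinct line of the whole 960 MB file at once:

    cat part* > file.merge
    # %seen keeps one hash key per distinct line of the merged file;
    # on a ~960 MB input that hash is what exhausts memory
    perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique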

I think you should be able to perform the following process instead (see the shell sketch after the list):

  1. Create several data files
  2. Sort each one independently, discarding its duplicates (sort -u)
  3. Merge the resulting set of sorted data files, discarding duplicates (sort -m -u)
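
HP-UX sort(1) should accept -u and -m, since both are POSIX options, so a minimal sketch of that two-pass approach could look like the one below. The part* file names are an assumption (the real source file names aren't given), and sort needs enough scratch space in its temporary directory for intermediate files.

    # Assumes no *.sorted files exist yet from a previous run.
    # 1. Sort each source file on its own, discarding duplicates within it.
    #    Each file is only ~50 MB, so a per-file sort is cheap.
    for f in part*
    do
        sort -u "$f" > "$f.sorted" || exit 1
    done

    # 2. Merge the already-sorted files, again discarding duplicates across
    #    files. -m only merges, so it never holds or sorts 960 MB at once.
    sort -m -u part*.sorted > file.unique

Memory use stays bounded because step 1 works on one ~50 MB file at a time and step 2 only keeps the current line from each input file while merging.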