The size of the file is 962,120,335 bytes.
HP-UX ******B.11.31 U ia64 ****** unlimited-user license
hostname> what /usr/bin/awk
/usr/bin/awk:
main.c $Date: 2009/02/17 15:25:17 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
run.c $Date: 2009/02/17 15:25:20 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
$Revision: @(#) awk R11.31_BL2010_0503_1 PATCH_11.31 PHCO_40052
hostname> what /usr/bin/sed
/usr/bin/sed:
sed0.c $Date: 2008/04/23 11:11:11 $Revision: r11.31/1 PATCH_11.31 (PHCO_38263)
$Revision: @(#) sed R11.31_BL2008_1022_2 PATCH_11.31 PHCO_38263
hostname> perl -v
This is perl, v5.8.8 built for IA64.ARCHREV_0-thread-multi
hostname> file /usr/bin/perl
/usr/bin/perl: ELF-32 executable object file - IA64
hostname> file /usr/bin/awk
/usr/bin/awk: ELF-32 executable object file - IA64
hostname> file /usr/bin/sed
/usr/bin/sed: ELF-32 executable object file - IA64
There are no GNU tools here.
What are my options?
I have already read "How to remove duplicate lines in a large multi-GB textfile?" and the Wikipedia article on external merge sort:
http://en.wikipedia.org/wiki/External_sorting#External_merge_sort
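That external merge sort can be built from the stock tools alone. A sketch, assuming a POSIX split(1) and sort(1) (both present on HP-UX); the chunk size and names are placeholders to tune for available memory:

```shell
# External merge sort with POSIX tools only (no GNU extensions assumed).
# 1. Split the big file into chunks small enough to sort in memory.
split -l 1000000 file.merge chunk.        # ~1M lines per chunk; tune to fit RAM
# 2. Sort each chunk on its own, dropping duplicates within the chunk.
for c in chunk.*; do
    sort -u "$c" > "$c.sorted" && rm "$c"
done
# 3. Merge the sorted chunks: -m merges already-sorted input without
#    re-sorting, and -u drops the duplicates that span chunks.
sort -m -u chunk.*.sorted > file.unique
rm chunk.*.sorted
```

The merge pass streams the chunks, so peak memory stays proportional to the number of chunks rather than the size of the data.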
The obvious one-liner

perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique

dies with

Out of Memory!

because the %seen hash has to hold every distinct line at once, and the 32-bit perl shown above cannot address enough memory for this data set.
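One caveat worth noting: the perl one-liner keeps the first occurrence of each line in its original position, while anything built on sort(1) reorders the output. If the original order matters, a decorate-sort-undecorate pipeline keeps memory bounded (sort spills to temporary files on its own) while preserving first-occurrence order. A sketch, assuming only the POSIX awk, sort, and cut on the box:

```shell
# Prefix each line with its line number, sort by content (ties broken by
# line number so the earliest copy sorts first), keep the first copy of
# each distinct line, restore the original order, then strip the numbers.
tab=$(printf '\t')
awk '{ printf "%d\t%s\n", NR, $0 }' file.merge |
  sort -t "$tab" -k2 -k1,1n |
  awk '{ line = substr($0, index($0, "\t") + 1) }
       NR == 1 || line != prev { print; prev = line }' |
  sort -t "$tab" -k1,1n |
  cut -f 2- > file.unique
```

The second awk compares everything after the first tab, so lines that themselves contain tabs are still compared whole.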
The resulting 962 MB file was merged from the 19 files whose sizes in bytes are listed below (about 50 MB on average):
22900038,
24313871,
25609082,
18059622,
23678631,
32136363,
49294631,
61348150,
85237944,
70492586,
79842339,
72655093,
73474145,
82539534,
65101428,
57240031,
79481673,
539293,
38175881
Question: how do I perform an external merge sort on this data and deduplicate it? Or, more simply, how do I deduplicate this data at all?
Best Answer
It seems to me that the process you're following at the moment is this, which fails with your out-of-memory error:

1) merge (concatenate) the source files into the single 960 MB file.merge
2) try to deduplicate the merged file in one pass, which forces every distinct line to be held in memory at once

I think you should be able to perform the following process instead:

1) sort -u each source file on its own, while each one is still small enough to sort comfortably
2) sort -m -u the sorted files together; -m merges input that is already sorted in a single streaming pass, and -u drops the duplicates that occur across files
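The two steps can be sketched as a short script; the part* filenames below stand in for your 19 real source files, and it assumes a POSIX sort(1), which HP-UX provides:

```shell
# Step 1: sort and deduplicate each ~50 MB source file individually;
# each one fits in memory, unlike the merged 960 MB file.
for f in part01 part02 part03; do     # substitute your real filenames
    sort -u "$f" > "$f.sorted"
done

# Step 2: merge the pre-sorted files in one streaming pass.
# -m merges without re-sorting, so memory use stays small;
# -u drops the duplicates that appear across different files.
sort -m -u *.sorted > file.unique
```

If sort runs out of temporary space in step 1, point it elsewhere with its -T (or TMPDIR) option where supported.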