Shell – Remove duplicates from basenames in two files

shell-script, text-processing

I have two long lists of filenames, and I want to merge them while excluding entries whose basenames are duplicated. For example, I have something like

../data_folderA/file_1 
../data_folderA/file_2
...
../data_folderA/file_n

in list A and

../data_folderB/file_1 
../data_folderB/file_2
../data_folderC/fffile_1 
../data_folderC/fffile_2
...
../data_folderC/fffile_n

in list B. I want to exclude the duplicated files file_1 and file_2 from the merged list. Is there any quick way to do this?

Best Answer

In awk you could do:

awk -F "/" '
     # remember the full line for each basename, and count how often that basename occurs
     { filenames[$NF] = $0 ; occurrence[$NF]++ }
 END { for (basename in filenames)
       {   if (occurrence[basename] == 1)
           {   print filenames[basename]
           }
       }
     } '  listA  listB
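
For instance, taking n = 2 for concreteness (so listA holds the two ../data_folderA lines and listB the four lines shown above), the only basenames that occur exactly once are fffile_1 and fffile_2, so the output would be (in whatever order awk happens to walk the array):

     ../data_folderC/fffile_1
     ../data_folderC/fffile_2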

There is of course a terser way to write this (a compact variant is sketched after the list below), and I deliberately over-use "{ ... }" blocks, but I hope the structure is clear:

  • We first fill a "filenames[]" array indexed by $NF (the LAST field of the line, using "/" as the separator, i.e. the basename).

  • We also count how many times each $NF was seen, via the occurrence[] array (if a basename appears more than once, filenames[$NF] holds only the latest matching line, and occurrence[$NF] > 1).

  • Then we print only those entries whose occurrence count is exactly 1.
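
For reference, the terser one-liner alluded to above could look like this; it is a sketch with shortened (hypothetical) array names f and n, but the logic is identical:

     awk -F/ '{ f[$NF] = $0; n[$NF]++ } END { for (b in f) if (n[b] == 1) print f[b] }' listA listB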
