Shell – Remove duplicates from basenames in two files

shell-script, text-processing

I have two long lists of filenames, and I want to merge them while excluding entries whose basenames are duplicated. For example, I have something like

../data_folderA/file_1 
../data_folderA/file_2
...
../data_folderA/file_n

in list A and

../data_folderB/file_1 
../data_folderB/file_2
../data_folderC/fffile_1 
../data_folderC/fffile_2
...
../data_folderC/fffile_n

in list B. I want to exclude the duplicated files file_1 and file_2 from the merged list. Is there any quick way to do this?

Best Answer

In awk you could do:

awk -F "/" '
     # remember the full line for each basename, and count how often that basename occurs
     { filenames[$NF] = $0 ; occurrence[$NF]++ }
 END { for (basename in filenames)
       {   if (occurrence[basename] == 1)
           {   print filenames[basename]
           }
       }
     } '  listA  listB
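
For instance, taking n = 2 for concreteness (so listA holds the two ../data_folderA lines and listB the four lines shown above), the only basenames that occur exactly once are fffile_1 and fffile_2, so the output would be (in whatever order awk happens to walk the array):

     ../data_folderC/fffile_1
     ../data_folderC/fffile_2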

There is of course a terser way to write this (a compact variant is sketched after the list below), and I deliberately over-use "{ ... }" blocks, but I hope the structure is clear:

  • We first fill a "filenames[]" array indexed by $NF (the LAST field of the line, using "/" as the separator, i.e. the basename).

  • We also count how many times each $NF was seen, via the occurrence[] array (if a basename appears more than once, filenames[$NF] holds only the latest matching line, and occurrence[$NF] > 1).

  • Then we print only those entries whose occurrence count is exactly 1.
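
For reference, the terser one-liner alluded to above could look like this; it is a sketch with shortened (hypothetical) array names f and n, but the logic is identical:

     awk -F/ '{ f[$NF] = $0; n[$NF]++ } END { for (b in f) if (n[b] == 1) print f[b] }' listA listB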
