CSV Duplicate Lines – Print Duplicate Lines Based on Fields 1 and 2

bash, csv, linux, shell-script, uniq

With the following command we can print the duplicate lines from a file (uniq only detects adjacent duplicates, so the input is sorted first):

sort file.txt | uniq -d

But how can we do this on a CSV file?

We need to print only the lines that are duplicates on fields 1 and 2 of the CSV file, ignoring field 3.

The field separator (FS) is ","

For example:

 spark2-thrift-sparkconf,spark.history.fs.logDirectory,{{spark_history_dir}}
 spark2-thrift-sparkconf,spark.history.fs.logDirectory,true
 spark2-thrift-sparkconf,spark.history.Log.logDirectory,true
 spark2-thrift-sparkconf,spark.history.DF.logDirectory,true

Expected results:

 spark2-thrift-sparkconf,spark.history.fs.logDirectory,{{spark_history_dir}}
 spark2-thrift-sparkconf,spark.history.fs.logDirectory,true

Second:

How can we exclude the duplicate lines from the CSV file? (I.e., delete only the lines that are duplicates on fields 1 and 2.)

Expected output:

 spark2-thrift-sparkconf,spark.history.Log.logDirectory,true
 spark2-thrift-sparkconf,spark.history.DF.logDirectory,true

Best Answer

$ awk -F, 'NR==FNR{a[$1,$2]++; next} a[$1,$2]>1' file.txt file.txt 
spark2-thrift-sparkconf,spark.history.fs.logDirectory,{{spark_history_dir}}
spark2-thrift-sparkconf,spark.history.fs.logDirectory,true

Two-file processing, using the same input file twice:

  • NR==FNR{a[$1,$2]++; next} on the first pass (NR==FNR holds only while the first file is being read), use the first two fields as the key and count each key's occurrences
  • a[$1,$2]>1 on the second pass, print a line only if its key's count is greater than 1
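
As an aside (a sketch, not part of the original answer), the same duplicates can be printed in a single pass by buffering the first line seen for each key and flushing it when the key recurs:

$ awk -F, '{
      k = $1 SUBSEP $2                          # same composite key as a[$1,$2]
      if (++cnt[k] == 1) { buf[k] = $0; next }  # first sighting: buffer the line
      if (cnt[k] == 2) print buf[k]             # second sighting: flush the buffered first line
      print                                     # print the current duplicate
  }' file.txt

This reads the file once, at the cost of holding one buffered line per distinct key in memory.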


For the opposite case (keeping only the lines that are unique on fields 1 and 2), it is a simple matter of changing the condition check:

$ awk -F, 'NR==FNR{a[$1,$2]++; next} a[$1,$2]==1' file.txt file.txt 
spark2-thrift-sparkconf,spark.history.Log.logDirectory,true
spark2-thrift-sparkconf,spark.history.DF.logDirectory,true
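
A single-pass variant exists here too (again a sketch, not from the original answer), but every line must be held until the end of input, since a line can only be declared unique once the whole file has been read:

$ awk -F, '{ cnt[$1,$2]++; key[NR] = $1 SUBSEP $2; line[NR] = $0 }
      END { for (i = 1; i <= NR; i++) if (cnt[key[i]] == 1) print line[i] }' file.txt

For large files the two-pass version is usually preferable, since it keeps only the per-key counts in memory rather than the entire file.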