How to remove duplicate lines in a CSV based on first field, and 1st n chars of 2nd field

csv, text processing

For a 3-column CSV file, list.csv, how would you remove subsequent duplicate rows where the 1st field matches and the first 3 chars of the 2nd field also match? Some rows will have a 2nd field with fewer than 3 chars.

list.csv:

12,12345,a
12,12345,b
123,12345,a
1234,12,b
1234,12345,a
567,567,a
567,56712,a
567,56734,a
567,6789,a

Expected output:

12,12345,a
123,12345,a
1234,12,b
1234,12345,a
567,567,a
567,6789,a

Best Answer

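sort should work as well: use field 1 and the first 3 characters of field 2 as the sort keys, and let -u keep one row per key. Note that the output comes out sorted by those keys rather than in the original file order.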

 sort -t, -k1,1 -k2.1,2.3 -u <list.csv
 12,12345,a
 123,12345,a
 1234,12,b
 1234,12345,a
 567,567,a
 567,6789,a
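
If the original row order has to be preserved, an awk filter is a common alternative to sort -u. The following is a minimal sketch, not part of the answer above: it recreates the question's list.csv so it can be run as-is, and the array name seen is an arbitrary choice.

```shell
# Recreate the question's list.csv so the example is self-contained.
cat > list.csv <<'EOF'
12,12345,a
12,12345,b
123,12345,a
1234,12,b
1234,12345,a
567,567,a
567,56712,a
567,56734,a
567,6789,a
EOF

# Key = field 1 plus the first 3 characters of field 2.
# seen[key]++ evaluates to 0 the first time a key appears, so
# !seen[key]++ is true only for the first row with that key,
# and awk's default action prints that row.
awk -F, '!seen[$1, substr($2, 1, 3)]++' list.csv
```

substr($2, 1, 3) simply returns the whole field when it is shorter than 3 characters, so rows such as 1234,12,b need no special handling. On this input the filter prints the same six rows as the expected output, in file order.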

Related Solutions

Text Processing – Compare 1st Column of 1st File and 2nd Column of 2nd File

An awk solution:

$ awk 'NR==FNR{a[$2]=$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9; next} 
              {
                if($1 in a){
                    print $0,a[$1]
                }
               }' file2 file1
UN          ID    St      M1    M2       SE    DOF  PV        PA            FC TID  X   E   GG7 J   O   
17127159    0   -5.9    297.3   765.7   0.22    4   0.003   0.00389231  2.57536 16657436    353.568 335.295 221.717 815.654 684.85  
17127163    2   -3.87   189.914 492.307 0.3548  4   0.0179  0.01795     2.59226 16657450    221.647 226.774 136.274 431.32  392.533

Explanation

Awk splits each input line into fields (at whitespace, by default), making the 1st field $1, the 2nd $2, etc. The special variable NR is the number of input lines read so far across all files, and FNR is the line number within the file currently being read. Therefore, when processing multiple files, the two are equal only while the first file is being read.

NR==FNR{a[$2]=$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9; next} : if we're reading the first file, save fields 3 through 9 (joined by tabs) as the value in the array a whose key is the 2nd field. Then, skip to the next line.
The next ensures that the rest of the script will not be run for the first file (file2) but only the second (file1).
if($1 in a){ print $0,a[$1] } : we're now in the second file (file1). If the first field exists as a key in the a array (if($1 in a)), then print the current line $0 and the value stored in a for $1: fields 3 through 9 from file2.
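
To see why the NR==FNR test singles out the first file, it can help to print both counters for every line. This is a small sketch; nr_demo1.txt and nr_demo2.txt are hypothetical files created only for the demonstration.

```shell
# Two tiny hypothetical input files, created only for this demonstration.
printf 'a\nb\n' > nr_demo1.txt
printf 'c\nd\n' > nr_demo2.txt

# For every line, print the file name, NR (lines read so far across
# all files) and FNR (lines read so far in the current file).
awk '{ print FILENAME, NR, FNR }' nr_demo1.txt nr_demo2.txt
# nr_demo1.txt 1 1
# nr_demo1.txt 2 2
# nr_demo2.txt 3 1
# nr_demo2.txt 4 2
```

The two counters agree only on the lines of the first file. One caveat: if the first file is empty, NR==FNR is also true while the second file is read, so the idiom assumes a non-empty first file.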

CSV Duplicate Lines – Print Duplicate Lines Based on Fields 1 and 2

$ awk -F, 'NR==FNR{a[$1,$2]++; next} a[$1,$2]>1' file.txt file.txt 
spark2-thrift-sparkconf,spark.history.fs.logDirectory,{{spark_history_dir}}
spark2-thrift-sparkconf,spark.history.fs.logDirectory,true

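This uses the two-file idiom with the same input file given twice, so the file is read in two passes:

NR==FNR{a[$1,$2]++; next} : on the first pass, count the occurrences of each key, using the first two fields as the key
a[$1,$2]>1 : on the second pass, print a line only if its key was counted more than once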

For the opposite case (lines whose key occurs only once), it is a simple matter of changing the condition:

$ awk -F, 'NR==FNR{a[$1,$2]++; next} a[$1,$2]==1' file.txt file.txt 
spark2-thrift-sparkconf,spark.history.Log.logDirectory,true
spark2-thrift-sparkconf,spark.history.DF.logDirectory,true
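
Since the input file.txt is not shown here, the two conditions can be tried on a small stand-in. This is a sketch on a hypothetical three-line file, dup_demo.csv, invented for illustration only.

```shell
# Hypothetical sample: the key (a,1) occurs twice, the key (b,2) once.
printf 'a,1,x\na,1,y\nb,2,z\n' > dup_demo.csv

# First pass counts each (field 1, field 2) key; the second pass
# prints only the lines whose key was counted more than once.
awk -F, 'NR==FNR{a[$1,$2]++; next} a[$1,$2]>1' dup_demo.csv dup_demo.csv
# a,1,x
# a,1,y

# Same counting pass, but keep only the lines whose key is unique.
awk -F, 'NR==FNR{a[$1,$2]++; next} a[$1,$2]==1' dup_demo.csv dup_demo.csv
# b,2,z
```

Reading the file twice keeps every duplicate, including the first occurrence, which a single pass cannot decide until it has seen the whole file.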
