Remove Duplicate Entries from CSV File – Text Processing Guide


I have a CSV file in which the same data is printed more than once. I tried sort with uniq, running sort myfile.csv | uniq -u, but myfile.csv did not change. I also tried sudo sort myfile.csv | uniq -u, with no difference.

Currently my CSV file looks like this:

a
a
a
b
b
c
c
c
c
c

I would like it to look like this:

a
b
c

Best Answer

myfile.csv is not changing because the -u option of uniq prints only lines that are not repeated. In this file every line is duplicated, so nothing is printed.

More importantly, the output would not be saved in myfile.csv anyway, because uniq just writes to stdout (by default, your console).

You would need to do something like this:

$ sort -u myfile.csv -o myfile.csv

The options mean:

-u - keep only one copy of each line
-o - write the output to this file instead of stdout

You should view man sort for more information.
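
To make the difference between these commands concrete, here is a minimal sketch that rebuilds the question's data in a scratch file and runs each one. The temporary directory and the variable names are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical scratch copy of the question's data; the path is an assumption.
dir=$(mktemp -d)
f="$dir/myfile.csv"
printf '%s\n' a a a b b c c c c c > "$f"

# uniq -u keeps only lines that occur exactly once, so this prints nothing.
only_unrepeated=$(sort "$f" | uniq -u)
echo "uniq -u printed: '$only_unrepeated'"

# Plain uniq collapses each run to one line, but writes to stdout only.
sort "$f" | uniq

# The file on disk still holds all ten lines at this point.
lines_before=$(wc -l < "$f")

# sort -u -o sorts, drops repeats, and writes the result back to the file.
sort -u -o "$f" "$f"
lines_after=$(wc -l < "$f")
echo "lines before: $lines_before, after: $lines_after"
cat "$f"
```

Redirecting with sort -u myfile.csv > myfile.csv is not a substitute for -o: the shell truncates the file before sort gets to read it.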

Related Solutions

Shell – How to count duplicated last columns without removing them

Storing a large file in memory can cause trouble. This approach is slightly better because it only stores the current run of matching lines, after sort has done the heavy lifting of putting the lines in order.

# Input must be sorted first, so that only the current run of matching lines is held in memory.
# When a non-matching line arrives, print the stored lines, each prefixed by the run's count.
# awk variables start out unset, so no explicit initialization is needed.
{ # S2, S3, S4 hold the field values of the previous line
  if($2 == S2 && $3 == S3 && $4 == S4) {
    # fields 2, 3, 4 match the previous line: store the line and increment the count
    line[count++] = $0;
  } else {
    # fields 2, 3, 4 differ: print the stored lines, prefixed by the count
    # (an index loop keeps input order; "for (i in line)" visits keys in unspecified order)
    for(i = 0; i < count; i++) {
      print count, line[i];
    }
    # reset counter and array
    count=0;
    delete line;
    # start a new run with this line
    line[count++] = $0;
  }

  # store field values to compare with the next line read
  S2 = $2; S3 = $3; S4 = $4;
}
END{ # at EOF the last run is still stored in the array; print it
    for(i = 0; i < count; i++) {
      print count, line[i];
    }
}

It is customary to save awk scripts in a file. With the script saved as script, you could use it along these lines:

$ sort -k2,4 file | awk -f script

3 ID-fred   4.0  6.0  42.0  
3 ID-jacob  4.0  6.0  42.0  
3 ID-tessa  4.0  6.0  42.0
2 ID-elsa   5.0  8.0  45.0  
2 ID-trudy  5.0  8.0  45.0  
1 ID-gerard 6.0  8.0  20.0
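
The whole pipeline can be tried end to end with the sketch below. The input file is an assumption, reconstructed from the sample output above with single-space separators and shuffled order. The awk program is a condensed variant of the script, not the original text: the helper name dump and the function count_groups are made up for this example. LC_ALL=C is set only to make the sort order predictable.

```shell
#!/bin/sh
# Hypothetical input, reconstructed from the sample output and shuffled.
dir=$(mktemp -d)
cat > "$dir/file" <<'EOF'
ID-gerard 6.0 8.0 20.0
ID-fred 4.0 6.0 42.0
ID-trudy 5.0 8.0 45.0
ID-jacob 4.0 6.0 42.0
ID-elsa 5.0 8.0 45.0
ID-tessa 4.0 6.0 42.0
EOF

# Condensed variant of the script: dump() prints the stored run, then resets it.
cat > "$dir/script" <<'EOF'
function dump(   i) {
  for (i = 0; i < count; i++) print count, line[i]
  count = 0
}
{
  # A change in fields 2-4 ends the current run.
  if (NR > 1 && !($2 == S2 && $3 == S3 && $4 == S4)) dump()
  line[count++] = $0
  S2 = $2; S3 = $3; S4 = $4
}
END { dump() }
EOF

# Sort on fields 2-4 so equal keys are adjacent, then count each run.
count_groups() {
  LC_ALL=C sort -k2,4 "$1" | awk -f "$dir/script"
}

count_groups "$dir/file"
```

If the columns are padded with a varying number of spaces, as in the output above, adding -b to sort may be needed so that leading blanks do not affect the key comparison.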

How to merge first two lines of a csv column-by-column

Try this

$ awk -F, 'NR<2{split(gensub(/Citty/,"City","g",$0),a,FS)}NR==2{for(b=2;b<=NF;b+=2){c=c a[b]" "$b","}print gensub(/,$/,"",1,c)}NR>2{print gensub(/(^,|" *",)/,"","g",$0)}' inp
Product Name,City Location,Price Per Unit
banana,CA,5.7
apple,FL,2.3
$

The same code is more readable when split across a few lines:

$ awk -F, '
> NR<2{split(gensub(/Citty/,"City","g",$0),a,FS)}
> NR==2{for(b=2;b<=NF;b+=2){c=c a[b]" "$b","}print gensub(/,$/,"",1,c)}
> NR>2{print gensub(/(^,|" *",)/,"","g",$0)}' inp
Product Name,City Location,Price Per Unit
banana,CA,5.7
apple,FL,2.3
$

On the 1st line, fix the Citty->City typo and split the line into the array a.

On the 2nd line, starting with the 2nd column and moving in 2-column increments, join the corresponding column from the 1st line with this column. Then strip the trailing , and print the result.

After the 2nd line, replace a leading , and any "<spaces>", with an empty string, then print the result.

Tested ok on GNU Awk 4.0.2
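
The input file inp is not shown above, so the sketch below guesses at one that is consistent with the code and the output; its exact contents are an assumption. Since gensub is specific to GNU Awk, this variant does the same three steps with the standard sub and gsub functions instead. It is an illustration, not the answer's original command.

```shell
#!/bin/sh
# Hypothetical input: header words in the even columns of lines 1 and 2,
# data rows with a leading comma and quoted-blank filler columns.
dir=$(mktemp -d)
cat > "$dir/inp" <<'EOF'
,Product,,Citty,,Price
,Name,,Location,,Per Unit
,banana," ",CA," ",5.7
,apple," ",FL," ",2.3
EOF

merge_header() {
  awk -F, '
    # Line 1: fix the typo, then keep the header words in array a.
    NR == 1 { gsub(/Citty/, "City"); split($0, a, ","); next }
    # Line 2: pair each even column with the matching word from line 1.
    NR == 2 {
      for (b = 2; b <= NF; b += 2) c = c a[b] " " $b ","
      sub(/,$/, "", c)   # strip the trailing comma
      print c
      next
    }
    # Data lines: drop the leading comma and the quoted-blank columns.
    { gsub(/^,|" *",/, ""); print }
  ' "$1"
}

merge_header "$dir/inp"
```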
