File Comparison – How to Find the Difference Between Two Big Files

awk, diff, grep, sed

I have test1.csv, which contains:

200,400,600,800
100,300,500,700
50,25,125,310

and test2.csv, which contains:

100,4,2,1,7
200,400,600,800
21,22,23,24,25
50,25,125,310
50,25,700,5

now

diff test2.csv test1.csv > result.csv

is different than

diff test1.csv test2.csv > result.csv

I don't know which order is correct, but I want something else anyway. Both of the commands above produce output like this:

> 100,4,2,1,7
2,3c3,5
< 100,300,500,700
< 50,25,125,310
\ No newline at end of file
---
> 21,22,23,24,25
> 50,25,125,310

I want to output only the lines that differ, so results.csv should look like this:

100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5

I tried diff -q and diff -s, but they didn't do the trick. Order doesn't matter; what matters is that I see only the differing lines, with no > or < markers and no leading blank space.

grep -FvF did the trick on smaller files, but not on big ones.

The first file contains more than 5 million lines; the second contains 1,300.

So results.csv should end up with ~4,998,700 lines.

I also tried grep -F -x -v -f which didn't work.

Best Answer

Sounds like a job for comm:

$ comm -3 <(sort test1.csv) <(sort test2.csv)
100,300,500,700
    100,4,2,1,7
    21,22,23,24,25
    50,25,700,5

As explained in man comm:

   -1     suppress column 1 (lines unique to FILE1)

   -2     suppress column 2 (lines unique to FILE2)

   -3     suppress column 3 (lines that appear in both files)
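
To see how the three flags combine, here is a small sketch that recreates the sample files from the question in a scratch directory and prints each column separately. The variable names (workdir, only_in_1, and so on) are just illustrative, and it uses sorted temporary files instead of process substitution so that it also runs in plain sh. LC_ALL=C is an assumption made here so that sort and comm agree on the same byte-wise ordering.

```shell
#!/bin/sh
# Recreate the two sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '200,400,600,800' '100,300,500,700' '50,25,125,310' > "$workdir/test1.csv"
printf '%s\n' '100,4,2,1,7' '200,400,600,800' '21,22,23,24,25' \
    '50,25,125,310' '50,25,700,5' > "$workdir/test2.csv"

# comm expects sorted input; use the same locale for sort and comm.
LC_ALL=C sort "$workdir/test1.csv" > "$workdir/sorted1.csv"
LC_ALL=C sort "$workdir/test2.csv" > "$workdir/sorted2.csv"

# Suppressing two columns leaves exactly one, printed without indentation.
only_in_1=$(LC_ALL=C comm -23 "$workdir/sorted1.csv" "$workdir/sorted2.csv")
only_in_2=$(LC_ALL=C comm -13 "$workdir/sorted1.csv" "$workdir/sorted2.csv")
in_both=$(LC_ALL=C comm -12 "$workdir/sorted1.csv" "$workdir/sorted2.csv")

printf 'only in test1.csv:\n%s\n\n' "$only_in_1"
printf 'only in test2.csv:\n%s\n\n' "$only_in_2"
printf 'in both files:\n%s\n' "$in_both"
```

Running comm -23 and comm -13 separately is handy when you need to know which file a line came from; comm -3 gives both sets in a single pass.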

So, the -3 means that only lines that are unique to one of the files will be printed. However, those lines are indented with tabs according to which file they were found in. To remove the tabs, use:

$ comm -3 <(sort test1.csv) <(sort test2.csv) | tr -d '\t'
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5

Note that comm expects both inputs to be sorted. The sample files above happen to produce the same result even unsorted, so in this case you can simplify the command to:

comm -3 test1.csv test2.csv | tr -d '\t' > difference.csv
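
For the multi-million-line file in the question, where sort order cannot be assumed, a more defensive variant might look like the sketch below. The symdiff function name and the temporary file names are hypothetical, not part of the original answer; the sketch sorts both inputs to temporary files under LC_ALL=C (an assumption, chosen so sort and comm use the same ordering) and then runs the same comm -3 | tr pipeline as above.

```shell
#!/bin/sh
# Hypothetical helper: write the lines unique to either file into an output file.
# Usage: symdiff FILE1 FILE2 OUTPUT
symdiff() {
    tmpdir=$(mktemp -d) || return 1
    # Sort each input once; sort and comm must agree on the ordering.
    LC_ALL=C sort "$1" > "$tmpdir/a.sorted"
    LC_ALL=C sort "$2" > "$tmpdir/b.sorted"
    # -3 drops the lines common to both files; tr removes comm's tab indentation.
    LC_ALL=C comm -3 "$tmpdir/a.sorted" "$tmpdir/b.sorted" | tr -d '\t' > "$3"
    rm -rf "$tmpdir"
}

# Demo with the sample data from the question.
demo=$(mktemp -d)
printf '%s\n' '200,400,600,800' '100,300,500,700' '50,25,125,310' > "$demo/test1.csv"
printf '%s\n' '100,4,2,1,7' '200,400,600,800' '21,22,23,24,25' \
    '50,25,125,310' '50,25,700,5' > "$demo/test2.csv"

symdiff "$demo/test1.csv" "$demo/test2.csv" "$demo/results.csv"
cat "$demo/results.csv"
```

Keep in mind that tr -d '\t' deletes every tab, so this only suits data that contains no tabs of its own, as is the case for these comma-separated files.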