Python – Selecting lines in a file that do not contain the value in the other file

awkcsvgreppythontext processing

I have two files. One is a CSV of data with 60490 lines. Each line is a set of values such as a customer name, service dates, etc.

One of these values in the first file is a VIN value.

I have a second large file that contains a list of 92809 VINs.

I need to find a way to delete all lines in the first file that have a VIN listed in the second file.

I've tried the grep below. This does do what I expect it to do but it's also extremely slow and is killed by the OS after outputting about 50 lines.

$ grep -v -f vinlist data.csv > output.csv

What is the fastest way to achieve this? Everything I've found while researching seems to require fully sorted data with lines in both files being identical or takes an extremely long time/gets killed by OS

data.csv

123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,1C4NJPBB4DD122174,2014-01-20  
123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,1GMDV33179D147281,2014-01-20  
123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,1FUYDCYB7WP879651,2014-01-20  
123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,1FM5K8D8XFGA82149,2014-01-20  
123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,5TDBT48A72S003496,2014-01-20  

vinlist:

JF1VA1E6XH9812361  
1HGCP26369A103521  
3N1CN7AP0CL810631  
5XYZK3AB7BG089758  
1FM5K8D8XFGA82149  
4S3BMBG61C3019520  
1FTNE24LX4HA22330  
1N4AL3AP8FC420210  
2GTEC19C491123429  
3N1CN7AP5FL944233  

Best Answer

Since we know where the vin is in the line, we do not need to do a generic search. Instead we can read the vins into a data structure that is indexed to allow fast look ups.

Using Python you can do:

Code:

# read the vins into a set to allow fast lookup
with open('file3', 'rU') as f:
    vins = {vin.strip() for vin in f.readlines()}

# go through the data file one line at a time
with open('file2', 'rU') as f:
    for line in f.readlines():

        # get the vin in the line
        vin = line.split(',')[8]

        # if the vin is not in our set, print out the line
        if vin not in vins:
            print(line.strip())

Results:

123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,1C4NJPBB4DD122174,2014-01-20
123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,1GMDV33179D147281,2014-01-20
123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,1FUYDCYB7WP879651,2014-01-20
123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,5TDBT48A72S003496,2014-01-20
Related Question