Ubuntu – Remove duplicates from two files and merge the unique lines

command line, text processing

I have two big text files, checksums_1.txt and checksums_2.txt. I want to parse these files, remove the lines that are duplicated between them, and merge the unique lines into one file.

Each line in both files has the following structure:

size, md5, path

Example: Checksums_1.txt

9565, a4fs614as6fas4fa4s46fsaf1, /mnt/app/1tier/2tier/filename.exe
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/app/1tier/2tier/filename2.exe

Example: Checksums_2.txt

9565, a4fs614as6fas4fa4s46fsaf1, /mnt/temp/1tier/2tier/filename.exe
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/temp/1tier/2tier/filename2.exe
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/temp/1tier/2tier/newfile.exe

The part used to compare checksums_1.txt and checksums_2.txt is whatever comes after the mount point (/mnt/app/ or /mnt/temp/). In other words, everything from the start of each line up to the end of the mount point is ignored.

The data in checksums_1.txt is more important, so if a duplicate is found, the line from checksums_1.txt must be the one written to the merged file.

Part of Checksums_1.txt

1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index

Part of Checksums_2.txt

1058,b8203a236b4f1531616318284202c9e6,/mnt/temp/Certificados/ca.crt
3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial 
2694,8a815adefde4fa0c263e74832b15de64,/mnt/temp/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/temp/Certificados/ca.db.index

Example of the merged file

1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt 
3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial 
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index

Best Answer

Assuming both files are not huge, the Python script below will do the job.

How it works

Both files are read by the script. Each line in file_1 (the file that has precedence) is split at the mount point you set for that file in the head section of the script (in your example /mnt/app/). The part after the mount point is the string that identifies the line.
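As a small illustration of that split (the sample lines are taken from the question; the variable names are only for this sketch), the piece after the mount point becomes the string used for comparison:

```python
# Sample line from checksums_1.txt in the question
line = "1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt"
m_point = "/mnt/app"

# split() cuts the line at the mount point; the last piece is the part
# of the path that both files are expected to share
key = line.split(m_point)[-1]
print(key)  # /Certificados/ca.crt

# The corresponding line of checksums_2.txt contains the same piece,
# which is how a duplicate is recognised
other = "1058,b8203a236b4f1531616318284202c9e6,/mnt/temp/Certificados/ca.crt"
print(key in other)  # True
```

Note that this is a substring test, so a key such as /Certificados/ca.crt would also match a longer path like /Certificados/ca.crt.bak.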

Subsequently, the lines of file_1 are written to the output file (the merged file). At the same time, any line of file_2 that contains the identifying string (the section after the mount point) is removed from the list of lines. Finally, the remaining lines of file_2 (those with no duplicate in file_1) are written to the output file as well. The result:

file_1:

1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index

file_2:

1058,b8203a236b4f1531616318284202c9e6,/mnt/temp/Certificados/ca.crt
3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial
2694,8a815adefde4fa0c263e74832b15de64,/mnt/temp/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/temp/Certificados/ca.db.index

merged:

1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index
3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial

The script

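#!/usr/bin/env python3
# --- set the paths to file_1, file_2 and the merged file,
# --- and the mount point used in file_1, below
f1 = "/path/to/file_1"
f2 = "/path/to/file_2"
merged = "/path/to/merged_file"
m_point = "/mnt/app"
# ---

# pair each line of file_1 with the part of its path after the mount point
with open(f1) as src:
    lines1 = [(l, l.split(m_point)[-1]) for l in src.read().splitlines()]
with open(f2) as src:
    lines2 = src.read().splitlines()

# "w" overwrites an existing merged file, so a second run does not append twice
with open(merged, "w") as out:
    for line, key in lines1:
        # file_1 has precedence: its line is always kept
        out.write(line + "\n")
        # drop every line of file_2 that contains the same trailing path
        lines2 = [l2 for l2 in lines2 if key not in l2]
    # whatever is left in file_2 has no counterpart in file_1
    for line in lines2:
        out.write(line + "\n")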
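Since the question mentions big files, a variant that does not rescan file_2 for every line of file_1 may be worth considering. What follows is only a sketch under assumptions that are not part of the original answer: every line has exactly the form size,md5,path with no blank lines, the mount point of each file is known, and the helper names path_key and merge_lines are made up for illustration. It compares the path after the mount point exactly, using a set, instead of testing for a substring.

```python
def path_key(line, m_point):
    """Return the path after the mount point (third comma-separated field)."""
    # maxsplit=2 keeps any commas inside the path intact
    path = line.split(",", 2)[2].strip()
    # drop the mount point prefix if present, otherwise keep the path as is
    return path[len(m_point):] if path.startswith(m_point) else path


def merge_lines(lines1, lines2, m_point1, m_point2):
    """Keep all of lines1, then the lines of lines2 whose key is not in lines1."""
    seen = {path_key(l, m_point1) for l in lines1}
    extra = [l for l in lines2 if path_key(l, m_point2) not in seen]
    return lines1 + extra


# Sample data taken from the question
file_1 = [
    "1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt",
    "2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem",
    "136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index",
]
file_2 = [
    "1058,b8203a236b4f1531616318284202c9e6,/mnt/temp/Certificados/ca.crt",
    "3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial",
    "2694,8a815adefde4fa0c263e74832b15de64,/mnt/temp/Certificados/ca.db.certs/01.pem",
    "136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/temp/Certificados/ca.db.index",
]

result = merge_lines(file_1, file_2, "/mnt/app", "/mnt/temp")
for line in result:
    print(line)
```

With the sample data, the three lines of file_1 come first, followed by the single ca.db.serial line from file_2. Reading the real files into lists and writing the result back out would work the same way as in the script above.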

How to use

  1. Copy the script into an empty file, save it as merge.py
  2. In the head section of the script, set the paths to f1 (file_1), f2 (file_2) and the merged file, and set the mount point as it appears in file_1.
  3. Run it by the command:

    python3 /path/to/merge.py
    

Edit

Or a tiny bit shorter:

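#!/usr/bin/env python3
# --- set the paths to file_1, file_2 and the merged file,
# --- and the mount point used in file_1, below
f1 = "/path/to/file_1"; m_point = "/mnt/app"; f2 = "/path/to/file_2"
merged = "/path/to/merged_file"
# ---
def lines(f):
    # read a file into a list of lines and close it again
    with open(f) as src:
        return src.read().splitlines()

lines1 = lines(f1); lines2 = lines(f2)
# the part after the mount point identifies each line of file_1
checks = [l.split(m_point)[-1] for l in lines1]
# keep only the lines of file_2 that contain none of those strings
lines2 = [l for l in lines2 if not any(c in l for c in checks)]
# "w" overwrites an existing merged file instead of appending to it
with open(merged, "w") as out:
    out.writelines(item + "\n" for item in lines1 + lines2)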