This is what awk was designed for:
```
$ awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' file2 file1
abc|123|BNY|apple|
cab|234|cyx|orange|
```
Explanation
`-F'|'`: sets the field separator to `|`.

`NR==FNR`: `NR` is the current input line number and `FNR` the current file's line number. The two are equal only while the first file is being read.

`c[$1$2]++; next`: if this is the first file, save the first two fields in the `c` array, then skip to the next line so that this action is applied only to the first file.

`c[$1$2]>0`: this condition is only reached for the second file (the `next` above skips it for the first), so we check whether fields 1 and 2 of the current line have already been seen (`c[$1$2]>0`) and, if they have, print the line. In `awk`, the default action is to print the line, so if `c[$1$2]>0` is true, the line is printed.
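To see the idiom end to end, here is a minimal, self-contained sketch; the file names and contents are made up for the demo and merely mimic the sample output above:

```shell
# Hypothetical sample data (not the asker's real files).
cat > file2 <<'EOF'
abc|123|x|
cab|234|y|
EOF
cat > file1 <<'EOF'
abc|123|BNY|apple|
xyz|999|foo|pear|
cab|234|cyx|orange|
EOF
# Pass 1 (file2, NR==FNR): remember each line's first two fields.
# Pass 2 (file1): print only lines whose first two fields were remembered.
awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' file2 file1
```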
Alternatively, since you tagged with Perl:
```
perl -e 'open(A, "file2"); while(<A>){/.+?\|[^|]+/ && $k{$&}++};
while(<>){/.+?\|[^|]+/ && do{print if defined($k{$&})}}' file1
```
Explanation
The first line opens `file2`, reads everything up to the 2nd `|` (`.+?\|[^|]+`) and saves that (`$&` is the result of the last match operator) in the `%k` hash.

The second line processes `file1`, uses the same regex to extract the first two columns, and prints the line if those columns are defined in the `%k` hash.
Both of the above approaches need to hold the first two columns of `file2` in memory. That shouldn't be a problem if you only have a few hundred thousand lines but, if it is, you could do something like:

```
cut -d'|' -f 1,2 file2 | while read pat; do grep "^$pat" file1; done
```

But that will be slower.
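With invented sample files, the low-memory loop looks like this. One caveat worth hedging: `grep` treats `$pat` as a regular expression, so fields containing regex metacharacters would need escaping first:

```shell
# Hypothetical sample data (not the asker's real files).
cat > file2 <<'EOF'
abc|123|x|
cab|234|y|
EOF
cat > file1 <<'EOF'
abc|123|BNY|apple|
xyz|999|foo|pear|
cab|234|cyx|orange|
EOF
# For each "field1|field2" prefix from file2, print matching file1 lines.
cut -d'|' -f 1,2 file2 | while read pat; do grep "^$pat" file1; done
```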
The elegance may come from the correct regex. Instead of changing every `\r` to a `\n` (`s/\r/\n/g`) you can convert every line terminator (`\r\n`, `\r`, or `\n`) to the delimiter you want (in GNU sed, as few sed implementations understand `\r`, and not all understand `-E`):

```
sed -E 's/\r\n|\r|\n/; /g'
```
Or, if you also want to remove empty lines, replace any run of such line terminators:

```
sed -E 's/[\r\n]+/; /g'
```
That will only work if we can get all the line terminators into the pattern space, which means slurping the whole file into memory in order to edit them. So, you can use the simpler (one command, for GNU sed):

```
sed -zE 's/[\r\n]+/; /g; s/; $/\n/' "$filepathvar"
```
The `-z` takes NUL bytes as line terminators, effectively getting all the `\r`s and `\n`s into the pattern space.

The `s/[\r\n]+/; /g` converts every type of line delimiter to the string you want.

The `s/; $/\n/` converts the (last) trailing delimiter to an actual newline.
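A minimal sketch of the `-z` approach (GNU sed required; the file name and its mixed line endings are invented for the demo):

```shell
# A file with mixed CRLF, CR and LF terminators (hypothetical sample).
printf 'one\r\ntwo\rthree\n' > sample.txt
# Slurp the whole file (a text file has no NULs), join every run of
# terminators with "; ", then turn the trailing "; " back into a newline.
sed -zE 's/[\r\n]+/; /g; s/; $/\n/' sample.txt
```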
Notes
The `-z` sed option means to use the zero byte (0x00) as the line delimiter. The use of that delimiter started as a need of `find` to be able to process filenames containing newlines (`-print0`), matching the `xargs` `-0` option. That meant that some other tools were also modified to process zero-delimited strings.

It is a non-POSIX option that breaks files at NULs instead of newlines. POSIX text files must contain no NUL (zero) bytes, so using that option means, in practice, capturing the whole file in memory before processing it.

Breaking files on NULs means that newline characters end up being editable in sed's pattern space. If the file happens to contain some NUL bytes, the idea still works correctly for newlines, as they still end up editable within each chunk of the file.
The `-z` option was added by GNU sed. The AT&T sed (on which POSIX was based) did not have such an option (and still doesn't); some BSD seds also still don't.
An alternative to the `-z` option is to capture the whole file in memory by other means. That can be done POSIXly in a couple of ways:

```
sed 'H;1h;$!d'   # capture whole file in hold space.
sed ':a;N;$!ba'  # capture whole file in pattern space.
```
Having all the newlines (except the last one) in the pattern space makes it possible to edit them:

```
sed -Ee 'H;1h;$!d;x' -e 's/(\r\n|\r|\n)/; /g'
```

With older seds it is also required to use the longer and more explicit `(\r\n|\r|\n)+` instead of `[\r\n]+`, because such seds don't understand `\r` or `\n` inside bracket expressions (`[]`).
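A minimal sketch of the hold-space variant, kept to plain LF endings so it stays portable (file name and contents are invented):

```shell
# Three LF-terminated lines (hypothetical sample).
printf 'a\nb\nc\n' > sample.txt
# H;1h;$!d  : accumulate the whole file in the hold space;
# x         : swap it into the pattern space on the last line;
# s/\n/; /g : then every embedded newline becomes "; ".
sed -e 'H;1h;$!d;x' -e 's/\n/; /g' sample.txt
```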
Line oriented
A solution that works one line at a time (a lone `\r` is also a valid line terminator in this solution), which means there is no need to keep the whole file in memory, is possible with GNU awk:

```
awk -vRS='[\r\n]+' 'NR>1{printf "; "}{printf "%s", $0}END{print ""}' file
```

It must be GNU awk because of the regex record separator `[\r\n]+`; in other awks, the record separator must be a single byte.
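The record-separator approach, sketched on the same kind of invented mixed-endings file (requires an awk with regex `RS` support, such as GNU awk):

```shell
# Mixed CRLF, CR and LF terminators (hypothetical sample).
printf 'one\r\ntwo\rthree\n' > sample.txt
# Each run of \r/\n ends a record; join the records with "; ".
awk -vRS='[\r\n]+' 'NR>1{printf "; "}{printf "%s", $0}END{print ""}' sample.txt
```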