Shell Script – How to Concatenate Multiple Files with Same Header

shell-script, text processing

I have multiple files that share the same header but contain different data below it. I need to concatenate all of them, keeping the header from the first file only and dropping it from the others, since the headers are identical.

For example:
file1.txt

<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
A
B 
C

file2.txt

<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
D
E 
F

I need the output to be:

<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
A
B
C
D
E 
F

I could write a script in R, but I need to do this in the shell.

Best Answer

If you know how to do it in R, then by all means do it in R. With classical Unix tools, this is most naturally done in awk.

awk '
    # At the top of every file except the first, skip the leading <header> lines.
    FNR==1 && NR!=1 { while (/^<header>/) if ((getline) <= 0) exit }
    1 {print}
' file*.txt >all.txt

The first rule of the awk script fires on the first line of a file (FNR==1) unless that line is also the first line across all files (NR==1), that is, at the top of every file but the first. Its action loops for as long as the current line matches the regexp ^<header>; each getline discards the current line and reads the next one, so the leading header lines are skipped. If getline runs out of input (for example, when the last file contains nothing but header lines), the script exits instead of looping forever. The second rule, whose pattern 1 is always true, prints every line that was not skipped.
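The same idea can also be written without getline, by tracking whether awk is still inside a file's leading header block. The following is only a sketch under the assumptions of the question (header lines start with <header> and sit at the top of each file). The function name concat_keep_first_header and the variable in_header are made up for illustration.

```shell
#!/bin/sh
# concat_keep_first_header FILE...
# Hypothetical helper (the name is not from the original answer).
# Prints the leading "<header>" lines of the first non-empty file only,
# followed by the non-header lines of every file, in argument order.
concat_keep_first_header() {
    awk '
        # A new file starts: assume we are inside its header block.
        FNR == 1 { in_header = 1 }

        # Header line at the top of a file: print it only while reading
        # the first file (NR == FNR), then move on to the next line.
        in_header && /^<header>/ { if (NR == FNR) print; next }

        # Any other line ends the header block and is printed as-is.
        { in_header = 0; print }
    ' "$@"
}
```

After defining the function, it would be called as: concat_keep_first_header file*.txt >all.txt. Because the header state is reset on FNR==1 rather than carried by getline, a line that happens to start with <header> further down in the data is printed unchanged.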
