Shell – Merge some tab-delimited files

awkjoin;shell-scripttext processing

I have 100 files with 57,816 rows each.I want to do an outer join of these files by merging on basis of the common column.

I am using R programming for this but this is very slow.

fileList <- list.files(, pattern=".txt")
pm_list=lapply(fileList, read.table)
merged_pm=merge_all(pm_list, by = "gene_short_name")

Is there any quick way to do this in bash?
The other method that I can use is SQL but I have to create 100 tables first and then load them up before join,which is not very efficient way to do.

The number of rows are equal in each file, thats why I want to merge based on common column and can't use cbind from R as the constants in the common column are a bit up and down in different files and not present at same location in each file.Below are two sample files.I want to join based on 'gene_short_name'

gene_short_name FPKM56

MT-TF   0.90
MT-TV   0
MT-RNR1 310.015
MT-TL1  0
MT-TM   0

File 2 is below:

gene_short_name FPKM53

MT-TF   0
MT-TV   0.344
MT-TM   0.10
MT-TL1  0
MT-RNR1 0
MT-ND2  158.332

Best Answer

The following script ought to do an outer join on column (field) 1 of all the tab-delimited files passed as arguments. It uses the join command, which does an outer join on sorted files, 2 files at a time.

It will join every line in the files, including the header lines. If you want the headers to be excluded, change the two sort commands to something that produces a sorted file that omits them.

#!/bin/sh
if test $# -lt 2
then
    echo usage: gjoin file1 file2 ...
    exit 1
fi
sort -t $'\t' -k 1 "$1" > result
shift
for f in "$@"
do
    sort -t $'\t' -k 1 "$f" > temp
    join -1 1 -2 1 -t $'\t' result temp > newresult
    mv newresult result
done
cat result
rm result temp

If you have an older shell, $'\t' will not be replaced by a tab, so you'll need to use 'TAB', where you put a literal tab between the quotes.

Optimizations are possible if, instead of /bin/sh, you can use a modern shell such as bash or ksh; for instance, the lines

sort -t $'\t' -k 1 "$f" > temp
join -1 1 -2 1 -t $'\t' result temp > newresult

can be replaced by

join -1 1 -2 1 -t $'\t' result <(sort -t $'\t' -k 1 "$f") > newresult
Related Question