Shell – Merge some tab-delimited files

I have 100 files with 57,816 rows each.I want to do an outer join of these files by merging on basis of the common column.

I am using R programming for this but this is very slow.

fileList <- list.files(, pattern=".txt")
pm_list=lapply(fileList, read.table)
merged_pm=merge_all(pm_list, by = "gene_short_name")

Is there any quick way to do this in bash?
The other method that I can use is SQL but I have to create 100 tables first and then load them up before join,which is not very efficient way to do.

The number of rows are equal in each file, thats why I want to merge based on common column and can't use cbind from R as the constants in the common column are a bit up and down in different files and not present at same location in each file.Below are two sample files.I want to join based on 'gene_short_name'

gene_short_name FPKM56

MT-TF   0.90
MT-TV   0
MT-RNR1 310.015
MT-TL1  0
MT-TM   0

File 2 is below:

gene_short_name FPKM53

MT-TF   0
MT-TV   0.344
MT-TM   0.10
MT-TL1  0
MT-RNR1 0
MT-ND2  158.332

#!/bin/sh if test $# -lt 2 then echo usage: gjoin file1 file2 ... exit 1 fi sort -t $'\t' -k 1 "$1" > result shift for f in "$@" do sort -t $'\t' -k 1 "$f" > temp join -1 1 -2 1 -t $'\t' result temp > newresult mv newresult result done cat result rm result temp

Best Answer

The following script ought to do an outer join on column (field) 1 of all the tab-delimited files passed as arguments. It uses the join command, which does an outer join on sorted files, 2 files at a time.

It will join every line in the files, including the header lines. If you want the headers to be excluded, change the two sort commands to something that produces a sorted file that omits them.

If you have an older shell, $'\t' will not be replaced by a tab, so you'll need to use 'TAB', where you put a literal tab between the quotes.

Optimizations are possible if, instead of /bin/sh, you can use a modern shell such as bash or ksh; for instance, the lines

sort -t $'\t' -k 1 "$f" > temp
join -1 1 -2 1 -t $'\t' result temp > newresult

can be replaced by

join -1 1 -2 1 -t $'\t' result <(sort -t $'\t' -k 1 "$f") > newresult

Best Answer

Related Solutions

Related Question