I have 100 files with 57,816 rows each. I want to do an outer join of these files, merging on a common column.
I am using R for this, but it is very slow:
library(reshape)  # provides merge_all
fileList <- list.files(pattern = "\\.txt$")
pm_list <- lapply(fileList, read.table, header = TRUE)
merged_pm <- merge_all(pm_list, by = "gene_short_name")
Is there any quick way to do this in bash?
The other method I could use is SQL, but I would have to create 100 tables and load them all before joining, which is not very efficient.
The number of rows is the same in each file, but I can't simply use cbind in R: the values in the common column are not in the same order or position in every file, which is why I need a merge on that column. Below are two sample files. I want to join them on 'gene_short_name'.
gene_short_name FPKM56
MT-TF 0.90
MT-TV 0
MT-RNR1 310.015
MT-TL1 0
MT-TM 0
File 2 is below:
gene_short_name FPKM53
MT-TF 0
MT-TV 0.344
MT-TM 0.10
MT-TL1 0
MT-RNR1 0
MT-ND2 158.332
Best Answer
The following script ought to do an outer join on column (field) 1 of all the tab-delimited files passed as arguments. It uses the join command, which does an outer join on sorted files, 2 files at a time.
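The script itself is not shown in this copy of the answer; what follows is only a minimal sketch of the approach it describes (pairwise sort + join on field 1), not the original. The function name outer_join_all, the NA fill value, and the reliance on GNU join's -o auto are my assumptions:

```shell
# Sketch only: outer join of any number of tab-delimited files on field 1,
# folding them into the result two at a time with sort + join.
# Assumes GNU join (for "-o auto"); "NA" as the fill value is a choice.
outer_join_all() {
    export LC_ALL=C        # sort and join must agree on collation order
    TAB=$(printf '\t')     # POSIX-safe tab ($'\t' needs a modern shell)
    out=$(mktemp)
    sort -t "$TAB" -k1,1 "$1" > "$out"
    shift
    for f in "$@"; do      # fold each remaining file into the running result
        new=$(mktemp)
        sort -t "$TAB" -k1,1 "$f" |
            join -t "$TAB" -a1 -a2 -e NA -o auto "$out" - > "$new"
        mv "$new" "$out"
    done
    cat "$out"
    rm -f "$out"
}
```

Called as outer_join_all *.txt, it prints the merged table on standard output; a row missing from some file gets NA in that file's column.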
It will join every line in the files, including the header lines. If you want the headers excluded, change the two sort commands to something that produces a sorted file without them. If you have an older shell, $'\t' will not be replaced by a tab, so you'll need to use 'TAB', where you put a literal tab between the quotes. Optimizations are possible if, instead of /bin/sh, you can use a modern shell such as bash or ksh.
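The specific replacement lines are not shown in this copy. One optimization bash does allow (an assumption about the intent here) is process substitution, which lets join read the sorted streams directly instead of going through temporary files. A sketch for two files, again assuming GNU join:

```shell
#!/bin/bash
# Sketch: outer-join two tab-delimited files on field 1 using bash
# process substitution instead of intermediate sorted temp files.
# Assumes GNU join (for "-o auto"); "NA" as the fill value is a choice.
join_sorted() {
    LC_ALL=C join -t $'\t' -a1 -a2 -e NA -o auto \
        <(LC_ALL=C sort -t $'\t' -k1,1 "$1") \
        <(LC_ALL=C sort -t $'\t' -k1,1 "$2")
}
```

Note that $'\t' and <(...) are both bash/ksh features, which is exactly why the script above them sticks to /bin/sh-safe constructs.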