Match first column of file a with paragraphs of file b

text processing

I have 2 files. The first, fileA looks like

TCONS_00000066  XLOC_000030     -       u       q1:XLOC_000030|TCONS_00000066|0|0.000000|0.000000|0.000000|0.000000|-
TCONS_00000130  XLOC_000057     -       u       q1:XLOC_000057|TCONS_00000130|0|0.000000|0.000000|0.000000|0.000000|-
TCONS_00000395  XLOC_000206     -       u       q1:XLOC_000204|TCONS_00000393|0|0.000000|0.000000|0.000000|0.000000|-

FileB looks like:

>TCONS_00000001 gene=XLOC_000001
AGATGAGCTGGTGGGGATGCTCTAAGAGAACGAGAGAAGCACAGAGCAGATAAACCACACCCACAGGCAC
CACCGTCCTTGTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAG
ATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAG
TTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAAACCCTTGAACCAAGGAGTCCTCGCTG
AGGAAGCTTTGGATCCACGACGCAGCTATGGCCTCCCCGCCCACCAGGCCGCCAGCCACAACCAGCTGAC
TAGGTCGCATGCATCATCAGATTTCAATCTCCCTTCGTTCCCTGTCCCTAATCCAATACCAATAGGGAGC
AATCAGCTGCTCCTCGACGGCGAGGGAGATGTCGTCGGCCGCGGGCCAAGACAACGGAGATACCGCTGGG
GACTACATCAAGTGGATGTGCGGCGCCGGTGGCCGTGCGGGCGGCGCCATGGCCAACCTCCAGCGCGGCG
TTGGCTCCCTCGTCCGTGACATTGGCGACCCCTGCCTCAACCCATCCCCCGTTAAGGGGAGCAAAATGCT
CAAACCGGAAAAATGGCACACATGTTTTGATAATGATGGAAAGGTCATAGGTTTCCGTAAAGCCCTAAAA
TTCATTGTCTTAGGGGGTGTGGATCCCACTATTCGAGCTGAAGTTTGGGAATTTCTTCTTGGCTGCTATG
CCTTGAGTAGTACCTCAGAGTATAGGAGGAAACTAAGAGCTGTTAGAAGGGAAAAATATCAAATTTTAGT
TAGACAGTGCCAGAGCATGCACCCAAGCATTGGTACAGGTGAGCTTGCTTACGCTGTTGGATCAAAGCTA

Now, fileA contains selected transcript numbers in the first column and fileB contains sequences of all transcripts. I want to scan fileB for the first column of fileA and print the trailing sequences of matching transcripts along with the transcript number.

Best Answer

sed -n 's|^\(TCONS_[0-9]*\) .*|\
         /^>\1 /,/\\n>/P|p' fileA |
sed -e '$!N' -f - -e D fileB >outfile

...that should accomplish what you're after. For every line in fileA that begins with the string TCONS_ and any digits followed by a space character the first sed will print a line like:

/^>TCONS_000001 /,/\n>/P

...where the 000001 is whatever the generating line's numeric sequence is.

The second sed is given three scripts - all of which it applies to its named input file - fileB.

The first is $!N which instructs it to append the Next input line to pattern space for every line !but the $last.
The next is stdin - -f - - which the first sed constructs for it as noted.
- Each line the first sed prints at the second is a 2-address range//,// instructing the second sed to Print up to the first \newline in pattern space for every line falling between the two addresses.
The last script is just D which instructs sed to Delete up to the first occurring \newline in pattern space and try the three scripts again.

The result is that a match for any range the first sed scripts for the second begins when the block heading is at the ^head of pattern space - an iteration after it is pulled in w/ Next and the previous line is Deleted - and ends when the next block heading first appears and is still trailing pattern space as delimited by a \newline character. Because the second sed never Prints further than that delimiter, sed slides through fileB on a one-line lookahead - printing all blocks headed by strings found in column 1 of fileA and nothing else.

ANOTHER (more complicated/efficient) METHOD

Polished up:

also_fasta()( IFS='
';  get_match_lines(){
        tr   -s  '[:space:]'    \\n  |
        grep "$1" | grep -nFf - "$3"
    }
    get_script(){
        set \\ '\1,$p;n\' '1,\1b\1\' \
               '/^>/!b\1|p' '$c\' q
        sed -n "s|\([^:]*\).*|:\1$*"
    }
    get_match_lines "$@" < "$2"      |
    get_script | sed -nf - "$3"
)

If you define that function in your shell and call it like:

also_fasta 'TCONS_[0-9]*' fileA fileB

...it should do the trick.

I came up with that after doing some tests. While the first definitely works, it is also definitely slow for large inputs.

After copying all of the [ACGT]* lines in your example to my clipboard, I created sample data like:

seq -w -s " gene=XLOC_dummy
$(        xsel -bo | sed 's/ *$//')
>TCONS_"  0 512 99999999 >/tmp/temp

The above uses seq to write blocks like:

>TCONS_99993600 gene=XLOC_dummy
AGATGAGCTGGTGGGGATGCTCTAAGAGAACGAGAGAAGCACAGAGCAGATAAACCACACCCACAGGCAC
CACCGTCCTTGTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAG
ATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAG
TTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAAACCCTTGAACCAAGGAGTCCTCGCTG
AGGAAGCTTTGGATCCACGACGCAGCTATGGCCTCCCCGCCCACCAGGCCGCCAGCCACAACCAGCTGAC
TAGGTCGCATGCATCATCAGATTTCAATCTCCCTTCGTTCCCTGTCCCTAATCCAATACCAATAGGGAGC
AATCAGCTGCTCCTCGACGGCGAGGGAGATGTCGTCGGCCGCGGGCCAAGACAACGGAGATACCGCTGGG
GACTACATCAAGTGGATGTGCGGCGCCGGTGGCCGTGCGGGCGGCGCCATGGCCAACCTCCAGCGCGGCG
TTGGCTCCCTCGTCCGTGACATTGGCGACCCCTGCCTCAACCCATCCCCCGTTAAGGGGAGCAAAATGCT
CAAACCGGAAAAATGGCACACATGTTTTGATAATGATGGAAAGGTCATAGGTTTCCGTAAAGCCCTAAAA
TTCATTGTCTTAGGGGGTGTGGATCCCACTATTCGAGCTGAAGTTTGGGAATTTCTTCTTGGCTGCTATG
CCTTGAGTAGTACCTCAGAGTATAGGAGGAAACTAAGAGCTGTTAGAAGGGAAAAATATCAAATTTTAGT
TAGACAGTGCCAGAGCATGCACCCAAGCATTGGTACAGGTGAGCTTGCTTACGCTGTTGGATCAAAGCTA

... to the file /tmp/temp where the numeric portion of the >TCONS_[0-9]* string is incremented from 00000512 through 99999999 at an interval of 512 per block. The whole file came to about 185mbs give or take.

I then did:

grep -F 00\  </tmp/temp | cut -d\> -f2 >/tmp/tempA

...for the pattern file (which just narrows the selection to any TCONS_[0-9]*00<space> matches).

The line counts for both files were:

wc -l /tmp/temp*
  2734369 /tmp/temp
     7813 /tmp/tempA
  2742182 total

While I didn't run the two seds against input of even that size - the largest I tried with the sed | sed suggestion was a data file a tithe of that size - and a pattern file selected on 2\ - which does come close to this, though.

Anyway, I thought about what was going on and I realized that with a pattern file approaching 8000 regexps then every time a line is Deleted it's got to work it's way back through checking all of those regexps that came before - over and over again. This is probably not optimally done.

At least in my generated input though I did have one thing going for me - it was more or less sequential. Following that thread I realized it didn't even have to be sequential if I could work by line number rather than regexp - so I turned to grep.

The command I ran (and the basis for my function above) is:

sed  -n 's|^\(TCONS_[0-9]*\) .*|>\1 |p
'    < /tmp/tempA              |
grep -nFf - /tmp/temp          |
sed  -n 's|\([^:]*\):.*|:\1\
        \1,$p;n;1,\1b\1\
        /^>/!b\1|p;$c\' -e q   |
sed  -nf - /tmp/temp

If you use this you should substitute in fileB for /tmp/temp and fileA for /tmp/tempA.

grep -F works with Fixed strings rather than regexp - and is far faster (especially considering we're working w/ head-of-line matches) - and the rest basically ignores all of grep's results excepting the line numbers it returns at the head of each. The sed in the middle there massages grep's output in such a way that the sed which winds up processing the data file proper will never backtrack its script even once - it will progress through its script file as it progresses through its input.

The sed in the middle writes something like the following at the last sed for every line grep prints:

:2732801                     #define branch label :LINENO
2732801,$p                   #if LINENO thru last line print
n                            #overwrite current line w/ next line
1,2732801b2732801            #if first line thru LINENO branch to :LINENO
/^>/!b2732801                #if !not /^>/ branch to :LINENO

So for every one of grep's matches sed implements two tiny little read loops: the first is a noop in which sed overwrites the current line with the next until it has incremented the current line number to grep's next match. The second is a print loop in which sed continually prints the current line then overwrites it with the next until a line matches /^>/. In this way sed processes its script in lockstep with its infile - never advancing any further in the script than the next matching line number will allow.

This outperforms the other script by several orders of magnitude. It processes the 185mbs like...

( also_fasta 'TCONS_[0-9]*' /tmp/tempA /tmp/temp; ) \
   3.93s user 0.39s system 100% cpu 4.281 total

...where the other handled input 10% of that size in...

( sed -n 's|^\(TCONS_[0-9]*\) .*|/>\1 /,/\\n>/P|p' /tmp/tempA  |sed -e  -f... ) \
108.58s user 0.04s system 100% cpu 1:48.56 total

More specifically, the line counts from the other script's input were:

wc -l /tmp/temp*
  273435 /tmp/temp
     782 /tmp/tempA
  274217 total

Best Answer

ANOTHER (more complicated/efficient) METHOD

Related Solutions

Identifying genes from a list of genes

Related Question