I have 2 files. The first, fileA, looks like:
TCONS_00000066 XLOC_000030 - u q1:XLOC_000030|TCONS_00000066|0|0.000000|0.000000|0.000000|0.000000|-
TCONS_00000130 XLOC_000057 - u q1:XLOC_000057|TCONS_00000130|0|0.000000|0.000000|0.000000|0.000000|-
TCONS_00000395 XLOC_000206 - u q1:XLOC_000204|TCONS_00000393|0|0.000000|0.000000|0.000000|0.000000|-
The second, fileB, looks like:
>TCONS_00000001 gene=XLOC_000001
AGATGAGCTGGTGGGGATGCTCTAAGAGAACGAGAGAAGCACAGAGCAGATAAACCACACCCACAGGCAC
CACCGTCCTTGTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAG
ATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAG
TTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAAACCCTTGAACCAAGGAGTCCTCGCTG
AGGAAGCTTTGGATCCACGACGCAGCTATGGCCTCCCCGCCCACCAGGCCGCCAGCCACAACCAGCTGAC
TAGGTCGCATGCATCATCAGATTTCAATCTCCCTTCGTTCCCTGTCCCTAATCCAATACCAATAGGGAGC
AATCAGCTGCTCCTCGACGGCGAGGGAGATGTCGTCGGCCGCGGGCCAAGACAACGGAGATACCGCTGGG
GACTACATCAAGTGGATGTGCGGCGCCGGTGGCCGTGCGGGCGGCGCCATGGCCAACCTCCAGCGCGGCG
TTGGCTCCCTCGTCCGTGACATTGGCGACCCCTGCCTCAACCCATCCCCCGTTAAGGGGAGCAAAATGCT
CAAACCGGAAAAATGGCACACATGTTTTGATAATGATGGAAAGGTCATAGGTTTCCGTAAAGCCCTAAAA
TTCATTGTCTTAGGGGGTGTGGATCCCACTATTCGAGCTGAAGTTTGGGAATTTCTTCTTGGCTGCTATG
CCTTGAGTAGTACCTCAGAGTATAGGAGGAAACTAAGAGCTGTTAGAAGGGAAAAATATCAAATTTTAGT
TAGACAGTGCCAGAGCATGCACCCAAGCATTGGTACAGGTGAGCTTGCTTACGCTGTTGGATCAAAGCTA
Now, fileA contains selected transcript numbers in the first column and fileB contains sequences of all transcripts. I want to scan fileB for the first column of fileA and print the trailing sequences of matching transcripts along with the transcript number.
Best Answer
...that should accomplish what you're after. For every line in fileA that begins with the string TCONS_ and any digits followed by a space character, the first sed will print a line like:

...where the 000001 is whatever the generating line's numeric sequence is.

The second sed is given three scripts, all of which it applies to its named input file, fileB.

The first is $!N, which instructs it to append the Next input line to pattern space for every line but (!) the last ($).

The next is stdin (-f -), which the first sed constructs for it as noted. What the first sed prints at the second is a 2-address range (//,//) instructing the second sed to Print (P) up to the first newline (\n) in pattern space for every line falling between the two addresses.

The last script is just D, which instructs sed to Delete up to the first occurring newline (\n) in pattern space and try the three scripts again.

The result is that a match for any range the first sed scripts for the second begins when the block heading is at the head (^) of pattern space, an iteration after it is pulled in with Next and the previous line is Deleted, and ends when the next block heading first appears and is still trailing pattern space as delimited by a newline (\n) character. Because the second sed never Prints further than that delimiter, it slides through fileB on a one-line lookahead, printing all blocks headed by strings found in column 1 of fileA and nothing else.

ANOTHER (more complicated/efficient) METHOD
Polished up:
If you define that function in your shell and call it like:
...it should do the trick.
I came up with that after doing some tests. While the first definitely works, it is also definitely slow for large inputs.
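For comparison, the first (sed | sed) approach can be sketched roughly like this. It is only a minimal sketch under assumptions: it assumes the first sed emits one range per fileA line of the form /^>ID /,/\n>/P, which matches the description above, and it reads that generated script with -f - as described (tested reasoning is for GNU sed). The toy fileA and fileB, the tmp directory and the result variable are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical toy data standing in for fileA and fileB.
tmp=$(mktemp -d)
cat > "$tmp/fileA" <<'EOF'
TCONS_00000002 XLOC_000002 - u q1:XLOC_000002|TCONS_00000002|0|0.000000|-
TCONS_00000003 XLOC_000003 - u q1:XLOC_000003|TCONS_00000003|0|0.000000|-
EOF
cat > "$tmp/fileB" <<'EOF'
>TCONS_00000001 gene=XLOC_000001
AGATGAGC
TTGGATTG
>TCONS_00000002 gene=XLOC_000002
CACCGTCC
>TCONS_00000003 gene=XLOC_000003
ATGATCGA
AGGAAGCT
>TCONS_00000004 gene=XLOC_000004
TAGGTCGC
EOF

result=$(
  # First sed: turn each "TCONS_<digits> ..." line of fileA into a range
  # command such as  /^>TCONS_00000002 /,/\n>/P  (assumed script shape).
  sed -n 's|^\(TCONS_[0-9]*\) .*|/^>\1 /,/\\n>/P|p' "$tmp/fileA" |
  # Second sed: $!N keeps a one-line lookahead, the generated ranges (read
  # from stdin via -f -) Print the head line of pattern space while a range
  # is open, and D drops that head line and restarts the script.
  sed -e '$!N' -f - -e D "$tmp/fileB"
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

On the toy input this prints only the TCONS_00000002 and TCONS_00000003 blocks. Note that every D restarts the whole script, so each input line is checked against every generated range, which is where the slowness on large pattern files comes from.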
After copying all of the [ACGT]* lines in your example to my clipboard, I created sample data like:

The above uses seq to write blocks like:

...to the file /tmp/temp, where the numeric portion of the >TCONS_[0-9]* string is incremented from 00000512 through 99999999 at an interval of 512 per block. The whole file came to about 185MB, give or take.

I then did:
...for the pattern file (which just narrows the selection to any TCONS_[0-9]*00<space> matches).

The line counts for both files were:
I didn't run the two seds against input of even that size. The largest I tried with the sed | sed suggestion was a data file a tenth of that size, with a pattern file selected on 2\ (which does come close to this, though).

Anyway, I thought about what was going on and realized that with a pattern file approaching 8000 regexps, every time a line is Deleted sed has to work its way back through all of the regexps that came before, over and over again. This is probably not optimally done.

At least in my generated input, though, I did have one thing going for me: it was more or less sequential. Following that thread, I realized it didn't even have to be sequential if I could work by line number rather than regexp, so I turned to grep.

The command I ran (and the basis for my function above) is:
If you use this you should substitute in fileB for /tmp/temp and fileA for /tmp/tempA.

grep -F works with Fixed strings rather than regexps and is far faster (especially considering we're working with head-of-line matches), and the rest basically ignores all of grep's results except the line numbers it returns at the head of each. The sed in the middle massages grep's output in such a way that the sed which winds up processing the data file proper will never backtrack its script even once: it progresses through its script file as it progresses through its input.

The sed in the middle writes something like the following at the last sed for every line grep prints:

So for every one of grep's matches sed implements two tiny little read loops. The first is a no-op in which sed overwrites the current line with the next until it has incremented the current line number to grep's next match. The second is a print loop in which sed continually prints the current line then overwrites it with the next until a line matches /^>/. In this way sed processes its script in lockstep with its infile, never advancing any further in the script than the next matching line number will allow.

This outperforms the other script by several orders of magnitude. It processes the 185MB like...
...where the other handled input 10% of that size in...
More specifically, the line counts from the other script's input were:
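To make the line-number approach concrete, here is a minimal sketch of a grep | sed | sed pipeline with the two read loops described above. It is written from that description under assumptions and is not necessarily the exact command: the label names (sN, pN), the awk step that builds the fixed-string patterns, the trailing q, the toy fileA and fileB, and the tmp and result names are all hypothetical. It relies on -f - to read the generated script from stdin, which GNU sed accepts.

```shell
#!/bin/sh
# Hypothetical toy data standing in for fileA and fileB.
tmp=$(mktemp -d)
cat > "$tmp/fileA" <<'EOF'
TCONS_00000002 XLOC_000002 - u q1:XLOC_000002|TCONS_00000002|0|0.000000|-
TCONS_00000003 XLOC_000003 - u q1:XLOC_000003|TCONS_00000003|0|0.000000|-
EOF
cat > "$tmp/fileB" <<'EOF'
>TCONS_00000001 gene=XLOC_000001
AGATGAGC
TTGGATTG
>TCONS_00000002 gene=XLOC_000002
CACCGTCC
>TCONS_00000003 gene=XLOC_000003
ATGATCGA
AGGAAGCT
>TCONS_00000004 gene=XLOC_000004
TAGGTCGC
EOF

# Fixed strings of the form ">ID " (the ">" and trailing space stop an ID
# from matching as a prefix of a longer one).
awk '{ print ">" $1 " " }' "$tmp/fileA" > "$tmp/patterns"

result=$(
  {
    # grep -n prints "LINENO:header"; only LINENO is kept.
    grep -nFf "$tmp/patterns" "$tmp/fileB" |
    # For each LINENO emit two loops:
    #   :sN  skip loop  - read on with n until the current line number is N
    #   :pN  print loop - p, then n, until the new line starts with ">"
    sed -e 's/:.*//' -e 's|.*|:s&\
&!{\
n\
bs&\
}\
:p&\
p\
n\
/^>/!bp&|'
    # Stop reading the data file once the last block has been printed.
    echo q
  } | sed -n -f - "$tmp/fileB"
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because grep reports matches in ascending line order, the last sed only ever moves forward through both its script and its input, so each input line is examined once instead of once per pattern.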