I have a file containing line-separated text:
GCAACACGGTGGGAGCACGTCAACAAGGAGTAATTCTTCAAGACCGTTCCAAAAACAGCATGCAAGAGCG
GTCGAGCCTAGTCCATCAGCAAATGCCGTTTCCAGCAATGCAAAGAGAACGGGAAGGTATCAGTTCACCG
GTGACTGCCATTACTGTGGACAAAAAGGGCACATGAAGAGAGACTGTGACAAGCTAAAGGCAGATGTAGC
From this, I want to extract characters 10 to 80, so:
TGGGAGCACGTCAACAAGGAGTAATTCTTCAAGACCGTTCCAAAAACAGCATGCAAGAGCG
GTCGAGCCT
I have found how to count the characters in a file:
wc -m file
and how to extract a given range of characters from each line:
awk '{print substr($0,2,6)}' file
but I cannot find a way to get the characters 10 to 80.
Newlines do not count as characters.
Any ideas?
Yes, this is DNA, from a full genome. I have extracted this bit of DNA from a fasta file containing different scaffolds (10 and 11 in this case) using
awk '/scaffold_10\>/{p=1;next} /scaffold_11/{p=0;exit} p'
Ultimately, I would like to have a simple command to get characters 100 to 800 (or something like that) from that specified scaffold.
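One possible sketch for that scaffold case (the miniature fasta file below is hypothetical, standing in for the full genome, and the header match is anchored exactly rather than using the question's `\>` word boundary): isolate the scaffold's sequence lines, join them, then cut a character range.

```shell
# Hypothetical miniature fasta standing in for the full genome:
cat > genome.fa <<'EOF'
>scaffold_10
GCAACACGGT
GGGAGCACGT
>scaffold_11
TTTTTTTTTT
EOF

# Print scaffold_10's sequence lines (stop at the next header),
# join them into one string, then take a character range;
# for the real data this would be cut -c100-800.
awk '/^>scaffold_10$/{p=1;next} /^>/{p=0} p' genome.fa | tr -d '\n' | cut -c5-12
# prints CACGGTGG
```

Here `cut -c` counts characters in the joined sequence, so newlines and the header never enter the count.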
EDIT: Question continues here: use gff2fasta instead of a bash script to get parts of DNA sequences out of a full genome
Best Answer
I wonder how the line feed in the file should be handled. Does that count as a character or not?
If we should simply start at byte 10 and print 71 bytes (A, C, T, G and linefeeds all counted), then Sato Katsura's solution is the fastest (here assuming GNU dd or compatible for status=none; with other implementations, replace it with 2> /dev/null, though that would also hide any error messages):
dd if=file bs=1 skip=9 count=71 status=none
If the line feeds should not count, filter them out first with tr -d '\n':
tr -d '\n' < file | dd bs=1 skip=9 count=71 status=none
If the Fasta header should be skipped as well:
grep -v '^[;>]' file | tr -d '\n' | dd bs=1 skip=9 count=71 status=none
where grep -v '^[;>]' file skips all lines that start with ; or >.
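A portable variant of the tr approach, using head/tail byte offsets instead of the GNU-specific dd options mentioned above (a sketch; the demo file just reproduces the question's three lines):

```shell
# Recreate the question's input:
printf '%s\n' \
  'GCAACACGGTGGGAGCACGTCAACAAGGAGTAATTCTTCAAGACCGTTCCAAAAACAGCATGCAAGAGCG' \
  'GTCGAGCCTAGTCCATCAGCAAATGCCGTTTCCAGCAATGCAAAGAGAACGGGAAGGTATCAGTTCACCG' \
  'GTGACTGCCATTACTGTGGACAAAAAGGGCACATGAAGAGAGACTGTGACAAGCTAAAGGCAGATGTAGC' \
  > file

# Drop newlines, keep the first 80 characters, then start at character 10:
tr -d '\n' < file | head -c 80 | tail -c +10
```

This prints the 71 characters 10 through 80 of the joined sequence, the same range the dd pipeline targets.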