Unix – Get Characters 10 to 80 in a File

awk · text processing · wc

I have a file containing line-separated text:

GCAACACGGTGGGAGCACGTCAACAAGGAGTAATTCTTCAAGACCGTTCCAAAAACAGCATGCAAGAGCG
GTCGAGCCTAGTCCATCAGCAAATGCCGTTTCCAGCAATGCAAAGAGAACGGGAAGGTATCAGTTCACCG
GTGACTGCCATTACTGTGGACAAAAAGGGCACATGAAGAGAGACTGTGACAAGCTAAAGGCAGATGTAGC

From this, I want to extract characters 10 to 80, so:

TGGGAGCACGTCAACAAGGAGTAATTCTTCAAGACCGTTCCAAAAACAGCATGCAAGAGCG
GTCGAGCCT

I have found how to count the characters in a file:

  wc -m file
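
(On the three sample lines above this reports 213 characters: 210 bases plus 3 line feeds, since wc -m counts the newlines too.)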

and how to extract a given number of characters from each line (here substr($0,2,6) takes 6 characters starting at position 2):

 awk '{print substr($0,2,6)}' file

but I cannot find a way to get the characters 10 to 80.

Newlines do not count as characters.

Any ideas?

Yes, this is DNA, from a full genome. I extracted this bit of DNA from a FASTA file containing different scaffolds (10 and 11 in this case) using

 awk '/scaffold_10\>/{p=1;next} /scaffold_11/{p=0;exit} p'
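
(Here p works as a flag: it is set when the scaffold_10 header line is seen, that line itself is skipped by next, printing stops and the script exits at the scaffold_11 header, and the bare p pattern prints every line in between.)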

Ultimately, I would like to have a simple command to get characters 100 to 800 (or something like that) from that specified scaffold.

EDIT: Question continues here: use gff2fasta instead of a bash script to get parts of DNA sequences out of a full genome

Best Answer

I wonder how the line feed in the file should be handled. Does that count as a character or not?

If we should simply start at byte 10 and print 71 bytes (A, C, T, G and the line feed), then Sato Katsura's solution is the fastest (status=none assumes GNU dd or compatible; with other implementations, replace it with 2> /dev/null, though that would also hide any error messages):

 dd if=file bs=1 count=71 skip=9 status=none
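
On the three sample lines above this prints exactly the two lines shown in the question: 61 characters, the embedded line feed, then 9 more characters, 71 bytes in all.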

If the line feeds should not count, filter them out with tr -d '\n':

 tr -d '\n' < file | dd bs=1 count=70 skip=9 status=none
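
Note that count drops to 70 here: with the line feeds removed these are characters 10 through 79 of the joined sequence, printed as a single unbroken line (dd adds no trailing newline).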

If a FASTA header should be skipped as well, it becomes:

 grep -v '^[;>]' file | tr -d '\n' | dd bs=1 count=70 skip=9 status=none

grep -v '^[;>]' file skips all lines that start with ; or >, i.e. FASTA header and comment lines.
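
Putting the pieces together for the ultimate goal, here is a minimal sketch (genome.fa and the scaffold names are placeholders; skip is the start position minus one and count the number of characters wanted, so skip=99 count=701 selects characters 100 through 800 inclusive). The awk part already drops the header lines, so no extra grep is needed:

 awk '/scaffold_10\>/{p=1;next} /scaffold_11/{p=0;exit} p' genome.fa |
   tr -d '\n' |
   dd bs=1 count=701 skip=99 status=none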
