I have a file containing line-separated text:
GCAACACGGTGGGAGCACGTCAACAAGGAGTAATTCTTCAAGACCGTTCCAAAAACAGCATGCAAGAGCG
GTCGAGCCTAGTCCATCAGCAAATGCCGTTTCCAGCAATGCAAAGAGAACGGGAAGGTATCAGTTCACCG
GTGACTGCCATTACTGTGGACAAAAAGGGCACATGAAGAGAGACTGTGACAAGCTAAAGGCAGATGTAGC
From this, I want to extract characters 10 to 80, so:
TGGGAGCACGTCAACAAGGAGTAATTCTTCAAGACCGTTCCAAAAACAGCATGCAAGAGCG
GTCGAGCCT
I have found how to count the characters in a file:
wc -m file
and how to extract a given range of characters from each line:
awk '{print substr($0,2,6)}' file
but I cannot find a way to get the characters 10 to 80.
Newlines do not count as characters.
Any ideas?
Yes, this is DNA, from a full genome. I have extracted this bit of DNA from a fasta file containing different scaffolds (10 and 11 in this case) using
awk '/scaffold_10\>/{p=1;next} /scaffold_11/{p=0;exit} p'
Ultimately, I would like to have a simple command to get characters 100 to 800 (or something like that) from that specified scaffold.
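One possible sketch for that scaffold case (the miniature fasta file below is hypothetical, standing in for the full genome, and the header match is anchored exactly rather than using the question's `\>` word boundary): isolate the scaffold's sequence lines, join them, then cut a character range.

```shell
# Hypothetical miniature fasta standing in for the full genome:
cat > genome.fa <<'EOF'
>scaffold_10
GCAACACGGT
GGGAGCACGT
>scaffold_11
TTTTTTTTTT
EOF

# Print scaffold_10's sequence lines (stop at the next header),
# join them into one string, then take a character range;
# for the real data this would be cut -c100-800.
awk '/^>scaffold_10$/{p=1;next} /^>/{p=0} p' genome.fa | tr -d '\n' | cut -c5-12
# prints CACGGTGG
```

Here `cut -c` counts characters in the joined sequence, so newlines and the header never enter the count.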
EDIT: Question continues here: use gff2fasta instead of a bash script to get parts of DNA sequences out of a full genome
Best Answer
I wonder how the line feed in the file should be handled. Does that count as a character or not?
If we should simply start at byte 10 and print 71 bytes (A, C, T, G and linefeeds all counted), then Sato Katsura's solution is the fastest (here assuming GNU dd or compatible for status=none; with other implementations, replace it with 2> /dev/null, though that would also hide any error messages):
dd if=file bs=1 skip=9 count=71 status=none
If the line feeds should not count, filter them out first with tr -d '\n':
tr -d '\n' < file | dd bs=1 skip=9 count=71 status=none
If the Fasta header should be skipped as well:
grep -v '^[;>]' file | tr -d '\n' | dd bs=1 skip=9 count=71 status=none
where grep -v '^[;>]' file skips all lines that start with ; or >.
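A portable variant of the tr approach, using head/tail byte offsets instead of the GNU-specific dd options mentioned above (a sketch; the demo file just reproduces the question's three lines):

```shell
# Recreate the question's input:
printf '%s\n' \
  'GCAACACGGTGGGAGCACGTCAACAAGGAGTAATTCTTCAAGACCGTTCCAAAAACAGCATGCAAGAGCG' \
  'GTCGAGCCTAGTCCATCAGCAAATGCCGTTTCCAGCAATGCAAAGAGAACGGGAAGGTATCAGTTCACCG' \
  'GTGACTGCCATTACTGTGGACAAAAAGGGCACATGAAGAGAGACTGTGACAAGCTAAAGGCAGATGTAGC' \
  > file

# Drop newlines, keep the first 80 characters, then start at character 10:
tr -d '\n' < file | head -c 80 | tail -c +10
```

This prints the 71 characters 10 through 80 of the joined sequence, the same range the dd pipeline targets.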