My question follows on from this one: file(1) and magic(5): describing other formats.
I want to describe a FASTA sequence (http://en.wikipedia.org/wiki/FASTA_format).
It could be a DNA sequence (with only ATGC):
>header
ATGCTAGCATAGCATCGATGCTGTAGCTACGTAGCTACGTCTACG
A 'magic' pattern would be
>.*\n[ATGC]*
Or it could be a PROTEIN sequence (ACDEFGHIKLMNPQRSTVWYBZX, which contains ATGC too):
>header
AHITKLMNPQRGHIKLMNPQRC
A 'magic' pattern would be
>.*\n[ACDEFGHIKLMNPQRSTVWYBZX]*
But whenever I use those regular expressions, file tells me that it's a protein because it matches the 2nd regex. Is there a way to prioritize a result? Something like "don't try any other pattern if that one matches"?
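The ambiguity described above can be reproduced outside of file. The following is only a sketch: it transcribes the two patterns into Python's re module, which is not the regex engine that file uses, and the names DNA, PROTEIN and matching_patterns are made up for the illustration.

```python
import re

# The two patterns from the question, transcribed to Python's re syntax.
# Assumption: Python's engine stands in for the one used by file(1).
DNA = re.compile(r">.*\n[ATGC]*")
PROTEIN = re.compile(r">.*\n[ACDEFGHIKLMNPQRSTVWYBZX]*")

dna_record = ">header\nATGCTAGCATAGCATCGATGCTGTAGCTACGTAGCTACGTCTACG\n"
protein_record = ">header\nAHITKLMNPQRGHIKLMNPQRC\n"


def matching_patterns(text):
    """Return the names of every pattern that matches at the start of text."""
    names = []
    if DNA.match(text):
        names.append("DNA")
    if PROTEIN.match(text):
        names.append("PROTEIN")
    return names


# Both records satisfy both patterns, so the patterns alone cannot
# tell the two kinds of sequence apart.
print(matching_patterns(dna_record))
print(matching_patterns(protein_record))
```

Both patterns match both records, so whichever description is tried first wins.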
Best Answer
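You can set priorities using a "strength" value. From magic(5): an optional strength can be supplied on a separate line that refers to the current magic description, in the form
!:strength OP VALUE
where OP is one of +, -, * or / and VALUE is a constant between 0 and 255. The constant is applied, using the given operand, to the currently computed default magic strength.
To lower the priority of the PROTEIN description, append this line:
!:strength -N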
...where N is big enough to take it below the score of the DNA description.

The "currently computed default magic strength" of a test isn't immediately obvious, but you can use the --list flag to show them all. Alternatively, read the source: the function responsible is apprentice_magic_strength. It's calculated from the first test of the entry, so if you want to give one type precedence over another, having identical first lines is helpful. (That way, N only needs to be 1.)

One other problem: your regexps aren't strict enough. The * quantifier can match zero characters, so the pattern is found at the start of every line, whether protein, DNA or other. To tighten it up, confirm that the whole line consists only of the permitted characters: \n[ATGC]+$, or \n[ATGC]{num,}$ (where num is the length of the shortest sequence you expect to see).
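As a rough illustration of the two fixes together, here is a sketch in Python rather than in magic syntax. It assumes Python's re module behaves closely enough to file's regex engine for this purpose. The RULES table, its strength numbers and the classify function are hypothetical; they only imitate the idea that the description with the higher strength is tried first.

```python
import re

# Tightened patterns: the whole sequence line must consist of permitted
# letters. re.MULTILINE lets $ match at the end of that line.
DNA = re.compile(r">.*\n[ATGC]+$", re.MULTILINE)
PROTEIN = re.compile(r">.*\n[ACDEFGHIKLMNPQRSTVWYBZX]+$", re.MULTILINE)

# Hypothetical strengths: PROTEIN sits 1 below DNA, mimicking the effect
# of lowering its strength in the magic file.
RULES = [
    ("PROTEIN", 1, PROTEIN),
    ("DNA", 2, DNA),
]


def classify(text):
    """Try the rules from highest to lowest strength; first match wins."""
    for name, _strength, pattern in sorted(RULES, key=lambda r: r[1], reverse=True):
        if pattern.match(text):
            return name
    return "other"


print(classify(">header\nATGCTAGCATAGCATCGATGC\n"))
print(classify(">header\nAHITKLMNPQRGHIKLMNPQRC\n"))
print(classify(">header\n12345\n"))
```

With the + quantifier and the $ anchor, a protein line no longer satisfies the DNA pattern, and an empty or non-sequence line satisfies neither. The strength ordering then only has to settle the case where a pure ATGC line satisfies both.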