file(1) and magic(5): prioritizing a result

file-command, regular expression

My question follows on from this one: file(1) and magic(5): describing other formats.

I want to describe a FASTA sequence (http://en.wikipedia.org/wiki/FASTA_format).

It could be a DNA sequence (containing only the letters ATGC):

>header
ATGCTAGCATAGCATCGATGCTGTAGCTACGTAGCTACGTCTACG

A 'magic' pattern would be

>.*\n[ATGC]*

Or it could be a PROTEIN sequence (using the letters ACDEFGHIKLMNPQRSTVWYBZX, which include ATGC too):

>header
AHITKLMNPQRGHIKLMNPQRC

A 'magic' pattern would be

>.*\n[ACDEFGHIKLMNPQRSTVWYBZX]*

But whenever I use those regular expressions, file tells me that the input is a protein, because a DNA sequence also matches the second regex. Is there a way to prioritize a result, something like "don't try any other pattern if this one matches"?

Best Answer

You can set priorities using a "strength" value. From magic(5):

An optional strength can be supplied on a separate line which refers to the current magic description using the following format:

    !:strength OP VALUE

The operand OP can be: +, -, *, or / and VALUE is a constant between 0 and 255. This constant is applied using the specified operand to the currently computed default magic strength.

To lower the priority of the PROTEIN description, append this line:

!:strength - N

...where N is big enough to take it below the score of the DNA description.

The "currently computed default magic strength" of a test isn't immediately obvious, but you can use file's --list flag to show the strength of every pattern. Alternatively, read the source: the function responsible is apprentice_magic_strength. The strength is calculated from the first test of the entry, so if you want to give one type precedence over another, it helps if both entries have identical first lines. (That way their default strengths are equal, and N only needs to be 1.)
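The effect of the modifier on ordering can be sketched with a toy model. This is not file's actual implementation: the base strength of 100 is a made-up placeholder (real values come from --list), the helper name adjusted_strength is hypothetical, and the model only assumes that stronger entries are tried first.

```python
import operator

# Hypothetical default strength shared by both entries, since their first
# test lines are identical. Real values are shown by `file --list`.
BASE_STRENGTH = 100

# The four operators that magic(5) allows in `!:strength OP VALUE`.
# Integer division is an assumption of this sketch.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,
}


def adjusted_strength(base, op=None, value=0):
    """Apply an optional `!:strength OP VALUE` modifier to a default strength."""
    if op is None:
        return base
    return OPS[op](base, value)


entries = [
    ("PROTEIN", adjusted_strength(BASE_STRENGTH, "-", 1)),  # !:strength - 1
    ("DNA", adjusted_strength(BASE_STRENGTH)),              # no modifier
]

# Toy model: entries are tried in descending order of strength.
ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
print([name for name, _ in ordered])  # ['DNA', 'PROTEIN']
```

With equal defaults, subtracting 1 is enough to make the DNA entry rank ahead of the PROTEIN entry.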

One other problem: your regexps aren't strict enough. * can match zero characters, so [ATGC]* succeeds at the start of every line, whether it holds protein, DNA or anything else. To tighten it up, confirm that the whole line consists only of the permitted characters: \n[ATGC]+$, or \n[ATGC]{num,}$ (where num is the length of the shortest sequence line you expect to see). The entries then become:

0       string  =>header
>&0      regex   \n[ATGC]+$     DNA

0       string  =>header
>&0      regex   \n[ACDEFGHIKLMNPQRSTVWYBZX]+$  PROTEIN
!:strength - 1
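The regex behaviour can be checked outside of file. The sketch below uses Python's re module purely as a stand-in for the regex engine file uses, and assumes that re.MULTILINE approximates how $ anchors at the end of a line there; the sample sequences and the matches helper are made up for illustration.

```python
import re

# Sample inputs modelled on the question's examples.
dna = ">header\nATGCTAGCATAGCATCGATGC\n"
protein = ">header\nAHITKLMNPQRGHIKLMNPQRC\n"

# The original, loose DNA pattern and the stricter line-anchored patterns.
loose_dna = re.compile(r">.*\n[ATGC]*")
strict_dna = re.compile(r"\n[ATGC]+$", re.MULTILINE)
strict_protein = re.compile(r"\n[ACDEFGHIKLMNPQRSTVWYBZX]+$", re.MULTILINE)


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


# The loose pattern accepts a protein: [ATGC]* is satisfied by one "A"
# (or by nothing at all).
print(matches(loose_dna, protein))    # True

# The strict pattern requires the whole line to be ATGC.
print(matches(strict_dna, protein))   # False
print(matches(strict_dna, dna))       # True

# A DNA line is still a valid protein line, so the regex alone cannot
# separate the two types.
print(matches(strict_protein, dna))   # True
```

The last result is why the strength adjustment is still needed: tightening the regexes stops DNA from matching proteins, but not the other way round.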