PostgreSQL – Biological Sequences of UniProt in PostgreSQL

postgresql

What is the best way to store UniProt biological sequences in PostgreSQL?

Data Details

  • We pull in 12 million sequences from UniProt – this number is likely to double every 3-10 months.
  • The length of a sequence can vary from 10 to 50 billion characters
  • Less than 1% of the sequences are longer than 10 thousand characters
    • Would it improve performance to store the longer sequences separately?
  • A sequence can be of either Protein or DNA alphabet
    • The DNA alphabet has 5 characters (A, T, C, G, or -).
    • The Protein alphabet will have around 30 characters.
    • We don't mind storing the sequences of the two different alphabets in different columns or even different tables. Would that help?
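If splitting by alphabet does help, a minimal sketch might look like the following (table and column names are illustrative, not from the question; the CHECK constraint assumes the 5-character DNA alphabet above, and the 27-character seguid assumes the base64 SHA-1 hash with its padding stripped, as in the function further down):

```sql
-- Illustrative sketch: one table per alphabet.
CREATE TABLE dna_sequence (
    id       bigserial PRIMARY KEY,
    seguid   char(27) NOT NULL UNIQUE,          -- SEGUID of the sequence
    sequence text     NOT NULL
             CHECK (sequence ~ '^[ATCG-]+$')    -- 5-character DNA alphabet
);

CREATE TABLE protein_sequence (
    id       bigserial PRIMARY KEY,
    seguid   char(27) NOT NULL UNIQUE,
    sequence text     NOT NULL                  -- ~30-character protein alphabet
);
```

Whether this beats a single table with an alphabet discriminator column is something to benchmark against your access patterns.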

Data Access Details

To answer Jeremiah Peschka's comment:

  • Protein and DNA sequences would be accessed at different times
  • Would not need to search within the sequence (that's done outside of db)
  • We would either access single rows at a time or pull out sets of rows by ID. We would not need to scan rows. All sequences are referenced by other tables – several biologically and chronologically meaningful hierarchies exist in the database.
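Pulling sets of rows by ID can then stay entirely on the primary-key index, for example (table name as in the illustrative sketch, IDs made up):

```sql
-- Fetch a batch of sequences by primary key; planner uses the PK index.
SELECT id, sequence
FROM   protein_sequence
WHERE  id = ANY (ARRAY[42, 77, 1234]);
```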

Backwards Compatibility

It would be nice to be able to keep applying the following hashing function (SEGUID – SEquence Globally Unique IDentifier) to the sequences.

CREATE OR REPLACE FUNCTION gfam.get_seguid(p_sequence character varying)
  RETURNS character varying AS
$BODY$
declare
  result varchar := null;
  x integer;
begin

  -- SEGUID: base64-encoded SHA-1 digest of the sequence
  -- (gfam.digest wraps pgcrypto's digest()).
  select encode(gfam.digest(p_sequence, 'sha1'), 'base64')
  into   result;

  -- Strip the trailing base64 padding character, per the SEGUID convention.
  x := length(result);
  if substring(result from x for 1) = '=' then
     result := substring(result from 1 for x - 1);
  end if;

  return result;

end;
$BODY$
  LANGUAGE plpgsql IMMUTABLE
  COST 100;
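Assuming gfam.digest is just a wrapper around pgcrypto's digest(), the same value can also be computed inline; rtrim removes the single '=' that base64-encoding a 20-byte SHA-1 digest always produces:

```sql
-- Equivalent one-liner, assuming the pgcrypto extension is installed.
SELECT rtrim(encode(digest(sequence, 'sha1'), 'base64'), '=') AS seguid
FROM   protein_sequence;   -- table name is illustrative
```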

Best Answer

Exploring the functions at PostBio, it looks like they have a couple of ways of encoding. However, given that those extensions are optimized for searching, they make multiple references to simply using the text data type.

According to the documentation:

Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB.
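The compression and out-of-line storage described above (TOAST) can also be tuned per column. A hedged sketch, with the table name purely illustrative:

```sql
-- EXTERNAL stores long values out-of-line but uncompressed, which can speed
-- up substring access; the default EXTENDED compresses first, then moves the
-- value out of line.
ALTER TABLE protein_sequence
    ALTER COLUMN sequence SET STORAGE EXTERNAL;
```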

Therefore, putting the table into its own very large tablespace on dedicated hardware should be sufficient for your performance goals. If 1 GB is too small for your data, the int_interval from PostBio should provide excellent performance:

A sequence feature corresponds to a triplet (id, orient, ii) where id is a sequence identifier (possibly the primary key for a sequence table), orient is a boolean indicating if the feature is in the same or contrary orientation of the sequence, and ii is the int_interval representing the feature as a subsequence.
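The (id, orient, ii) triplet described above maps naturally onto a table like this (a sketch only; it assumes the PostBio extension is installed so the int_interval type exists, and the referenced table name is illustrative):

```sql
-- One row per sequence feature: which sequence, which strand, which subrange.
CREATE TABLE sequence_feature (
    seq_id bigint       NOT NULL REFERENCES protein_sequence (id),
    orient boolean      NOT NULL,  -- same (true) or contrary (false) orientation
    ii     int_interval NOT NULL   -- the feature as a subsequence interval
);
```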

Hashing the sequence with SHA-1 looks to be a very painful way of making a GUID, considering the potential lengths of the sequences.

If the different sequences are unrelated, store them in different tablespaces on different disks for maximum performance.
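Concretely, that could look like the following (paths, tablespace names, and table names are all illustrative):

```sql
-- Put each alphabet's table on its own disk.
CREATE TABLESPACE seq_dna  LOCATION '/mnt/disk1/pgdata';
CREATE TABLESPACE seq_prot LOCATION '/mnt/disk2/pgdata';

ALTER TABLE dna_sequence     SET TABLESPACE seq_dna;
ALTER TABLE protein_sequence SET TABLESPACE seq_prot;
```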