What is the best way to store UniProt biological sequences in PostreSQL?
Data Details
- We pull in 12 million sequences from UniProt – this number is likely to double every 3-10 month.
- The length of a sequence can vary from 10 to 50 billion characters
- Less than 1% of the sequences are longer than 10 thousand characters
- Would it improve performance to store the longer sequences separately?
- A sequence can be of either Protein or DNA alphabet
- The DNA alphabet has 5 characters (A, T, C, G, or -).
- The Protein alphabet will have around 30 characters.
- We don't mind storing the sequences of the two different alphabets in different columns or even different tables. Would that help?
Data Access Details
To answer Jeremiah Peschka's comment:
- Protein and DNA sequences would be accessed at different times
- Would not need to search within the sequence (that's done outside of db)
- Would ether access single rows at a time or pull out sets of rows by IDs. We would not need to scan rows. All sequences are referenced by other tables – several biologically and chronologically meaningful hierarchies exist in the database.
Backwards Compatibility
It would be nice to be able to continue to be able to apply the following hashing function (SEGUID – SEquence Globally Unique IDentifier) to the sequences.
CREATE OR REPLACE FUNCTION gfam.get_seguid(p_sequence character varying)
RETURNS character varying AS
$BODY$
declare
result varchar := null;
x integer;
begin
select encode(gfam.digest(p_sequence, 'sha1'), 'base64')
into result;
x := length(result);
if substring(result from x for 1) = '=' then
result := substring( result from 1 for x-1 );
end if;
return result;
end;
$BODY$
LANGUAGE 'plpgsql' VOLATILE
COST 100;
Best Answer
Exploring the functions at PostBio it looks like they have a couple of ways of encoding. However, given that those extensions are optimized for searching, they make multiple references to simply using the
text
data type.According to the documentation:
Therefore, by putting the table into its own very large tablespace on dedicated hardware should be sufficient for your performance goals. If 1 GB is too small for your data, the int_interval from ProtBio should provide excellent performance:
Encoding the sequence in sha1 looks to be a very painful way of making a GUID, considering the potential lengths of the sequence.
If the different sequences are unrelated, store them on different tablespaces on different disks for maximum performance.