PostgreSQL – Biological Sequences of UniProt in PostgreSQL

postgresql

What is the best way to store UniProt biological sequences in PostgreSQL?

Data Details

  • We pull in 12 million sequences from UniProt – this number is likely to double every 3-10 months.
  • The length of a sequence can vary from 10 to 50 billion characters
  • Less than 1% of the sequences are longer than 10 thousand characters
    • Would it improve performance to store the longer sequences separately?
  • A sequence can be of either Protein or DNA alphabet
    • The DNA alphabet has 5 characters (A, T, C, G, or -).
    • The Protein alphabet will have around 30 characters.
    • We don't mind storing the sequences of the two different alphabets in different columns or even different tables. Would that help?
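If splitting by alphabet does help, a minimal sketch might look like the following (table and column names are illustrative, not from the question; the CHECK constraint assumes the 5-character DNA alphabet above, and the 27-character seguid assumes the base64 SHA-1 hash with its padding stripped, as in the function further down):

```sql
-- Illustrative sketch: one table per alphabet.
CREATE TABLE dna_sequence (
    id       bigserial PRIMARY KEY,
    seguid   char(27) NOT NULL UNIQUE,          -- SEGUID of the sequence
    sequence text     NOT NULL
             CHECK (sequence ~ '^[ATCG-]+$')    -- 5-character DNA alphabet
);

CREATE TABLE protein_sequence (
    id       bigserial PRIMARY KEY,
    seguid   char(27) NOT NULL UNIQUE,
    sequence text     NOT NULL                  -- ~30-character protein alphabet
);
```

Whether this beats a single table with an alphabet discriminator column is something to benchmark against your access patterns.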

Data Access Details

To answer Jeremiah Peschka's comment:

  • Protein and DNA sequences would be accessed at different times
  • Would not need to search within the sequence (that's done outside of db)
  • We would either access single rows at a time or pull out sets of rows by ID. We would not need to scan rows. All sequences are referenced by other tables – several biologically and chronologically meaningful hierarchies exist in the database.
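Pulling sets of rows by ID can then stay entirely on the primary-key index, for example (table name as in the illustrative sketch, IDs made up):

```sql
-- Fetch a batch of sequences by primary key; planner uses the PK index.
SELECT id, sequence
FROM   protein_sequence
WHERE  id = ANY (ARRAY[42, 77, 1234]);
```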

Backwards Compatibility

It would be nice to be able to keep applying the following hashing function (SEGUID – SEquence Globally Unique IDentifier) to the sequences.

CREATE OR REPLACE FUNCTION gfam.get_seguid(p_sequence character varying)
  RETURNS character varying AS
$BODY$
declare
  result varchar := null;
  x integer;
begin

  -- SEGUID: base64-encoded SHA-1 digest of the sequence
  -- (gfam.digest wraps pgcrypto's digest()).
  select encode(gfam.digest(p_sequence, 'sha1'), 'base64')
  into   result;

  -- Strip the trailing base64 padding character, per the SEGUID convention.
  x := length(result);
  if substring(result from x for 1) = '=' then
     result := substring(result from 1 for x - 1);
  end if;

  return result;

end;
$BODY$
  LANGUAGE plpgsql IMMUTABLE
  COST 100;
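Assuming gfam.digest is just a wrapper around pgcrypto's digest(), the same value can also be computed inline; rtrim removes the single '=' that base64-encoding a 20-byte SHA-1 digest always produces:

```sql
-- Equivalent one-liner, assuming the pgcrypto extension is installed.
SELECT rtrim(encode(digest(sequence, 'sha1'), 'base64'), '=') AS seguid
FROM   protein_sequence;   -- table name is illustrative
```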

Best Answer

Exploring the functions at PostBio, it looks like they have a couple of ways of encoding. However, given that those extensions are optimized for searching, they make multiple references to simply using the text data type.

According to the documentation:

Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB.
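The compression and out-of-line storage described above (TOAST) can also be tuned per column. A hedged sketch, with the table name purely illustrative:

```sql
-- EXTERNAL stores long values out-of-line but uncompressed, which can speed
-- up substring access; the default EXTENDED compresses first, then moves the
-- value out of line.
ALTER TABLE protein_sequence
    ALTER COLUMN sequence SET STORAGE EXTERNAL;
```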

Therefore, putting the table into its own very large tablespace on dedicated hardware should be sufficient for your performance goals. If 1 GB is too small for your data, the int_interval from PostBio should provide excellent performance:

A sequence feature corresponds to a triplet (id, orient, ii) where id is a sequence identifier (possibly the primary key for a sequence table), orient is a boolean indicating if the feature is in the same or contrary orientation of the sequence, and ii is the int_interval representing the feature as a subsequence.
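The (id, orient, ii) triplet described above maps naturally onto a table like this (a sketch only; it assumes the PostBio extension is installed so the int_interval type exists, and the referenced table name is illustrative):

```sql
-- One row per sequence feature: which sequence, which strand, which subrange.
CREATE TABLE sequence_feature (
    seq_id bigint       NOT NULL REFERENCES protein_sequence (id),
    orient boolean      NOT NULL,  -- same (true) or contrary (false) orientation
    ii     int_interval NOT NULL   -- the feature as a subsequence interval
);
```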

Hashing the sequence with SHA-1 looks to be a very painful way of making a GUID, considering the potential lengths of the sequences.

If the different sequences are unrelated, store them in different tablespaces on different disks for maximum performance.
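Concretely, that could look like the following (paths, tablespace names, and table names are all illustrative):

```sql
-- Put each alphabet's table on its own disk.
CREATE TABLESPACE seq_dna  LOCATION '/mnt/disk1/pgdata';
CREATE TABLESPACE seq_prot LOCATION '/mnt/disk2/pgdata';

ALTER TABLE dna_sequence     SET TABLESPACE seq_dna;
ALTER TABLE protein_sequence SET TABLESPACE seq_prot;
```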