Exploring the functions in PostBio, it looks like they have a couple of ways of encoding. However, given that those extensions are optimized for searching, they make multiple references to simply using the text data type.
According to the documentation:
Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB.
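You can see that compression at work by comparing a value's logical length with its stored size. A minimal sketch, assuming a hypothetical table genome with a text column seq:

```sql
-- octet_length() reports the logical size in bytes;
-- pg_column_size() reports the actual stored (possibly compressed) size.
SELECT octet_length(seq)   AS raw_bytes,
       pg_column_size(seq) AS stored_bytes
FROM   genome;
```

Highly repetitive data such as an ACGT alphabet tends to compress well, so stored_bytes can come out far below raw_bytes.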
Therefore, putting the table into its own very large tablespace on dedicated hardware should be sufficient for your performance goals. If 1 GB is too small for your data, the int_interval from PostBio should provide excellent performance:
A sequence feature corresponds to a triplet (id, orient, ii) where id is a sequence identifier (possibly the primary key for a sequence table), orient is a boolean indicating if the feature is in the same or contrary orientation of the sequence, and ii is the int_interval representing the feature as a subsequence.
Encoding the sequence as a SHA-1 hash looks to be a very painful way of making a GUID, considering the potential lengths of the sequences.
If the different sequences are unrelated, store them on different tablespaces on different disks for maximum performance.
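A minimal sketch of that layout (the paths and names here are assumptions; the directories must already exist, be empty, and be owned by the postgres user, and CREATE TABLESPACE requires superuser privileges):

```sql
CREATE TABLESPACE seq_disk1 LOCATION '/disk1/pgdata';
CREATE TABLESPACE seq_disk2 LOCATION '/disk2/pgdata';

-- Unrelated sequence sets go to different disks.
CREATE TABLE sequences_a (id bigint PRIMARY KEY, seq text) TABLESPACE seq_disk1;
CREATE TABLE sequences_b (id bigint PRIMARY KEY, seq text) TABLESPACE seq_disk2;
```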
If you are looking for significant performance improvements over dnoeth's answer, consider using a native C function and creating the appropriate operator.
Here is an example for int4 arrays. (A generic array variant and the corresponding SQL script).
#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(_int_contains_sequence);
PG_FUNCTION_INFO_V1(_int_sequence_contained);

Datum
_int_sequence_contained(PG_FUNCTION_ARGS)
{
    /* Swap the arguments and reuse the containment check. */
    return DirectFunctionCall2(_int_contains_sequence,
                               PG_GETARG_DATUM(1),
                               PG_GETARG_DATUM(0));
}

Datum
_int_contains_sequence(PG_FUNCTION_ARGS)
{
    ArrayType *a = PG_GETARG_ARRAYTYPE_P(0);
    ArrayType *b = PG_GETARG_ARRAYTYPE_P(1);
    int        na, nb;
    int32     *pa, *pb;
    int        i, j;

    /* Assumes one-dimensional int4 arrays without NULL elements. */
    na = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
    nb = ArrayGetNItems(ARR_NDIM(b), ARR_DIMS(b));
    pa = (int32 *) ARR_DATA_PTR(a);
    pb = (int32 *) ARR_DATA_PTR(b);

    /*
     * The naive searching algorithm. Replace it with a better one
     * (e.g. Knuth-Morris-Pratt) if your arrays are large.
     */
    for (i = 0; i <= na - nb; ++i)
    {
        for (j = 0; j < nb; ++j)
            if (pa[i + j] != pb[j])
                break;
        if (j == nb)
            PG_RETURN_BOOL(true);
    }
    PG_RETURN_BOOL(false);
}
CREATE FUNCTION _int_contains_sequence(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE FUNCTION _int_sequence_contained(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE OPERATOR @@> (
LEFTARG = _int4,
RIGHTARG = _int4,
PROCEDURE = _int_contains_sequence,
COMMUTATOR = '<@@',
RESTRICT = contsel,
JOIN = contjoinsel
);
CREATE OPERATOR <@@ (
LEFTARG = _int4,
RIGHTARG = _int4,
PROCEDURE = _int_sequence_contained,
COMMUTATOR = '@@>',
RESTRICT = contsel,
JOIN = contjoinsel
);
Now you can filter rows like this:
SELECT * FROM sequences WHERE sequence @@> '{12, 742, 225, 547}';
I conducted a little experiment to see how much faster this solution is.
CREATE TEMPORARY TABLE sequences AS
SELECT array_agg((random() * 10)::int4) AS sequence, g1 AS id
FROM generate_series(1, 100000) g1
CROSS JOIN generate_series(1, 30) g2
GROUP BY g1;
EXPLAIN ANALYZE SELECT * FROM sequences
WHERE translate(cast(sequence as text), '{}',',,')
LIKE '%' || translate(cast('{1,2,3,4}'as text), '{}',',,') || '%'
"Seq Scan on sequences (cost=0.00..7869.42 rows=28 width=36) (actual time=2.487..334.318 rows=251 loops=1)"
" Filter: (translate((sequence)::text, '{}'::text, ',,'::text) ~~ '%,1,2,3,4,%'::text)"
" Rows Removed by Filter: 99749"
"Planning time: 0.104 ms"
"Execution time: 334.365 ms"
EXPLAIN ANALYZE SELECT * FROM sequences WHERE sequence @@> '{1,2,3,4}'
"Seq Scan on sequences (cost=0.00..5752.01 rows=282 width=36) (actual time=0.178..20.792 rows=251 loops=1)"
" Filter: (sequence @@> '{1,2,3,4}'::integer[])"
" Rows Removed by Filter: 99749"
"Planning time: 0.091 ms"
"Execution time: 20.859 ms"
So, it is about 16 times faster. If that is not enough, you can add support for GIN or GiST indexes, but that is a much more difficult task.
Best Answer
I am quite certain that no table would be dropped if you remove views, sequences and functions with the CASCADE option. But parts of a table may vanish:
Indexes, triggers, rules, column defaults, check constraints and (in PostgreSQL v12 and later) generated columns may vanish if you drop the functions and operators used in them.
Column defaults will vanish with the sequences used in them.
Dropping views should not affect tables at all.
Disclaimer: I have probably forgotten one or two cases.
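For example, dropping a function with CASCADE silently takes a dependent expression index with it, while the table survives. A minimal sketch (all names are made up):

```sql
CREATE FUNCTION norm(text) RETURNS text
    AS $$ SELECT lower($1) $$ LANGUAGE sql IMMUTABLE;

CREATE TABLE seqs (seq text);
CREATE INDEX seqs_norm_idx ON seqs (norm(seq));

DROP FUNCTION norm(text) CASCADE;  -- drops seqs_norm_idx; seqs remains
```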
Why don't you just selectively dump the parts of the database that you need with the -l and -L options of pg_restore?
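A minimal sketch of that workflow (the dump file and database names are made up; the dump must be in a non-plain-text format such as custom or directory):

```shell
# Write the archive's table of contents to an editable list file.
pg_restore -l mydb.dump > contents.list

# Edit contents.list: comment out unwanted items with a leading ';'.

# Restore only the items that remain uncommented in the list.
pg_restore -L contents.list -d mydb mydb.dump
```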