Exploring the functions in PostBio, it looks like they have a couple of ways of encoding. However, given that those extensions are optimized for searching, they make multiple references to simply using the text data type.
According to the documentation:
Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB.
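You can see that compression at work by comparing a value's logical length with its stored size. A minimal sketch, assuming a hypothetical table genome with a text column seq:

```sql
-- octet_length() reports the logical size in bytes;
-- pg_column_size() reports the actual stored (possibly compressed) size.
SELECT octet_length(seq)   AS raw_bytes,
       pg_column_size(seq) AS stored_bytes
FROM   genome;
```

Highly repetitive data such as an ACGT alphabet tends to compress well, so stored_bytes can come out far below raw_bytes.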
Therefore, putting the table into its own very large tablespace on dedicated hardware should be sufficient for your performance goals. If 1 GB is too small for your data, the int_interval from PostBio should provide excellent performance:
A sequence feature corresponds to a triplet (id, orient, ii) where id is a sequence identifier (possibly the primary key for a sequence table), orient is a boolean indicating if the feature is in the same or contrary orientation of the sequence, and ii is the int_interval representing the feature as a subsequence.
Encoding the sequence as a SHA-1 hash looks to be a very painful way of making a GUID, considering the potential lengths of the sequences.
If the different sequences are unrelated, store them on different tablespaces on different disks for maximum performance.
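A minimal sketch of that layout (the paths and names here are assumptions; the directories must already exist, be empty, and be owned by the postgres user, and CREATE TABLESPACE requires superuser privileges):

```sql
CREATE TABLESPACE seq_disk1 LOCATION '/disk1/pgdata';
CREATE TABLESPACE seq_disk2 LOCATION '/disk2/pgdata';

-- Unrelated sequence sets go to different disks.
CREATE TABLE sequences_a (id bigint PRIMARY KEY, seq text) TABLESPACE seq_disk1;
CREATE TABLE sequences_b (id bigint PRIMARY KEY, seq text) TABLESPACE seq_disk2;
```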
If you are looking for significant performance improvements over dnoeth's answer, consider using a native C function and creating the appropriate operator.
Here is an example for int4 arrays. (A generic array variant and the corresponding SQL script).
#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(_int_contains_sequence);
PG_FUNCTION_INFO_V1(_int_sequence_contained);

Datum
_int_sequence_contained(PG_FUNCTION_ARGS)
{
    /* Swap the arguments and reuse the containment check. */
    return DirectFunctionCall2(_int_contains_sequence,
                               PG_GETARG_DATUM(1),
                               PG_GETARG_DATUM(0));
}

Datum
_int_contains_sequence(PG_FUNCTION_ARGS)
{
    ArrayType *a = PG_GETARG_ARRAYTYPE_P(0);
    ArrayType *b = PG_GETARG_ARRAYTYPE_P(1);
    int        na, nb;
    int32     *pa, *pb;
    int        i, j;

    /* Assumes one-dimensional int4 arrays without NULL elements. */
    na = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
    nb = ArrayGetNItems(ARR_NDIM(b), ARR_DIMS(b));
    pa = (int32 *) ARR_DATA_PTR(a);
    pb = (int32 *) ARR_DATA_PTR(b);

    /*
     * The naive searching algorithm. Replace it with a better one
     * (e.g. Knuth-Morris-Pratt) if your arrays are large.
     */
    for (i = 0; i <= na - nb; ++i)
    {
        for (j = 0; j < nb; ++j)
            if (pa[i + j] != pb[j])
                break;
        if (j == nb)
            PG_RETURN_BOOL(true);
    }
    PG_RETURN_BOOL(false);
}
CREATE FUNCTION _int_contains_sequence(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE FUNCTION _int_sequence_contained(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE OPERATOR @@> (
LEFTARG = _int4,
RIGHTARG = _int4,
PROCEDURE = _int_contains_sequence,
COMMUTATOR = '<@@',
RESTRICT = contsel,
JOIN = contjoinsel
);
CREATE OPERATOR <@@ (
LEFTARG = _int4,
RIGHTARG = _int4,
PROCEDURE = _int_sequence_contained,
COMMUTATOR = '@@>',
RESTRICT = contsel,
JOIN = contjoinsel
);
Now you can filter rows like this:
SELECT * FROM sequences WHERE sequence @@> '{12, 742, 225, 547}';
I conducted a little experiment to see how much faster this solution is.
CREATE TEMPORARY TABLE sequences AS
SELECT array_agg((random() * 10)::int4) AS sequence, g1 AS id
FROM generate_series(1, 100000) g1
CROSS JOIN generate_series(1, 30) g2
GROUP BY g1;
EXPLAIN ANALYZE SELECT * FROM sequences
WHERE translate(cast(sequence as text), '{}',',,')
LIKE '%' || translate(cast('{1,2,3,4}'as text), '{}',',,') || '%'
"Seq Scan on sequences (cost=0.00..7869.42 rows=28 width=36) (actual time=2.487..334.318 rows=251 loops=1)"
" Filter: (translate((sequence)::text, '{}'::text, ',,'::text) ~~ '%,1,2,3,4,%'::text)"
" Rows Removed by Filter: 99749"
"Planning time: 0.104 ms"
"Execution time: 334.365 ms"
EXPLAIN ANALYZE SELECT * FROM sequences WHERE sequence @@> '{1,2,3,4}'
"Seq Scan on sequences (cost=0.00..5752.01 rows=282 width=36) (actual time=0.178..20.792 rows=251 loops=1)"
" Filter: (sequence @@> '{1,2,3,4}'::integer[])"
" Rows Removed by Filter: 99749"
"Planning time: 0.091 ms"
"Execution time: 20.859 ms"
So, it is about 16 times faster. If that is not enough, you can add support for GIN or GiST indexes, but that is a much more difficult task.
Best Answer
I am quite certain that no table would be dropped if you remove views, sequences and functions with the CASCADE option. But parts of a table may vanish:
Indexes, triggers, rules, column defaults, check constraints and (in PostgreSQL v12 and later) generated columns may vanish if you drop the functions and operators used in them.
Column defaults will vanish with the sequences used in them.
Dropping views should not affect tables at all.
Disclaimer: I have probably forgotten one or two cases.
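For example, dropping a function with CASCADE silently takes a dependent expression index with it, while the table survives. A minimal sketch (all names are made up):

```sql
CREATE FUNCTION norm(text) RETURNS text
    AS $$ SELECT lower($1) $$ LANGUAGE sql IMMUTABLE;

CREATE TABLE seqs (seq text);
CREATE INDEX seqs_norm_idx ON seqs (norm(seq));

DROP FUNCTION norm(text) CASCADE;  -- drops seqs_norm_idx; seqs remains
```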
Why don't you just selectively dump the parts of the database that you need with the -l and -L options of pg_restore?
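A minimal sketch of that workflow (the dump file and database names are made up; the dump must be in a non-plain-text format such as custom or directory):

```shell
# Write the archive's table of contents to an editable list file.
pg_restore -l mydb.dump > contents.list

# Edit contents.list: comment out unwanted items with a leading ';'.

# Restore only the items that remain uncommented in the list.
pg_restore -L contents.list -d mydb mydb.dump
```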