If you are looking for significant performance improvements to dnoeth's answer, consider using a native C-function and creating the appropriate operator.
Here is an example for int4 arrays. (A generic array variant and the corresponding SQL script).
Datum
_int_sequence_contained(PG_FUNCTION_ARGS)
{
return DirectFunctionCall2(_int_contains_sequence,
PG_GETARG_DATUM(1),
PG_GETARG_DATUM(0));
}
Datum
_int_contains_sequence(PG_FUNCTION_ARGS)
{
ArrayType *a = PG_GETARG_ARRAYTYPE_P(0);
ArrayType *b = PG_GETARG_ARRAYTYPE_P(1);
int na, nb;
int32 *pa, *pb;
int i, j;
na = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
nb = ArrayGetNItems(ARR_NDIM(b), ARR_DIMS(b));
pa = (int32 *) ARR_DATA_PTR(a);
pb = (int32 *) ARR_DATA_PTR(b);
/* The naive searching algorithm. Replace it with a better one if your arrays are quite large. */
for (i = 0; i <= na - nb; ++i)
{
for (j = 0; j < nb; ++j)
if (pa[i + j] != pb[j])
break;
if (j == nb)
PG_RETURN_BOOL(true);
}
PG_RETURN_BOOL(false);
}
CREATE FUNCTION _int_contains_sequence(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE FUNCTION _int_sequence_contained(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE OPERATOR @@> (
LEFTARG = _int4,
RIGHTARG = _int4,
PROCEDURE = _int_contains_sequence,
COMMUTATOR = '<@@',
RESTRICT = contsel,
JOIN = contjoinsel
);
CREATE OPERATOR <@@ (
LEFTARG = _int4,
RIGHTARG = _int4,
PROCEDURE = _int_sequence_contained,
COMMUTATOR = '@@>',
RESTRICT = contsel,
JOIN = contjoinsel
);
Now you can filter rows like this.
SELECT * FROM sequences WHERE sequence @@> '{12, 742, 225, 547}'
I have conducted a little experiment to find how much faster this solution is.
CREATE TEMPORARY TABLE sequences AS
SELECT array_agg((random() * 10)::int4) AS sequence, g1 AS id
FROM generate_series(1, 100000) g1
CROSS JOIN generate_series(1, 30) g2
GROUP BY g1;
EXPLAIN ANALYZE SELECT * FROM sequences
WHERE translate(cast(sequence as text), '{}',',,')
LIKE '%' || translate(cast('{1,2,3,4}'as text), '{}',',,') || '%'
"Seq Scan on sequences (cost=0.00..7869.42 rows=28 width=36) (actual time=2.487..334.318 rows=251 loops=1)"
" Filter: (translate((sequence)::text, '{}'::text, ',,'::text) ~~ '%,1,2,3,4,%'::text)"
" Rows Removed by Filter: 99749"
"Planning time: 0.104 ms"
"Execution time: 334.365 ms"
EXPLAIN ANALYZE SELECT * FROM sequences WHERE sequence @@> '{1,2,3,4}'
"Seq Scan on sequences (cost=0.00..5752.01 rows=282 width=36) (actual time=0.178..20.792 rows=251 loops=1)"
" Filter: (sequence @@> '{1,2,3,4}'::integer[])"
" Rows Removed by Filter: 99749"
"Planning time: 0.091 ms"
"Execution time: 20.859 ms"
So, it is about 16 times faster. If it is not enough, you can add support for GIN or GiST indexes, but this will be much more difficult task.
Best Answer
Walkthrough
Identify restarts by comparing current sequence to previous sequence (
LAG
).Do "running counts" (similar to "Running totals") of restarts (
is_restart
).Rows that belongs to the same group will have the same count (AKA
restart_id
).The "Order by" in
COUNT
impliesrange between unbounded preceding and current row
Group by
log_id
andrestart_id
and aggregate sequence