Postgresql – generating all substrings (n-grams) for bit-strings

functionspostgresqlset-returning-functions

I followed SELECT all substrings (n-grams) of length n? to get a function for generating all n-grams for text types. Works great. I figured out how to cast my bit-strings to text, and the function from that link works.
However I need it to be as fast as possible, thus I wanted a bit-string native implementation, and I came up with this:

CREATE OR REPLACE FUNCTION public.ngrams(in input bit varying, in len integer)
 RETURNS SETOF bit
 LANGUAGE sql
AS $function$
SELECT cast((input << alpha) as bit(4))
FROM GENERATE_SERIES(0, LENGTH($1)-($2+1), 1) alpha;
$function$

This works, for N=4. However when I tried to change it to $2 or len I would get:

ERROR:  22P02: invalid input syntax for type integer: "len"
LINE 5: SELECT cast((input << alpha) as bit(len))
                                        ^

How come I can use the named parameter from the function declaration for input but not for len? Or better yet, how can I feed a variable into that spot?

Best Answer

You can solve the error by using the bit varying datatype. That being said, I don't actually know if this is the behavior you're looking for - it looks weird to me and doesn't match what the text version does.

testdb=# CREATE OR REPLACE FUNCTION public.ngrams(in input bit varying, in len integer)
 RETURNS SETOF bit
 LANGUAGE sql
AS $function$
SELECT cast((input << alpha) as bit varying)
FROM GENERATE_SERIES(0, LENGTH($1)-($2+1), 1) alpha;
$function$
;
CREATE FUNCTION
testdb=# CREATE OR REPLACE FUNCTION ngrams(varchar, integer) RETURNS SETOF TEXT AS $$
SELECT SUBSTRING($1 FROM n FOR $2)::TEXT
FROM GENERATE_SERIES(1, LENGTH($1)-($2-1), 1) n;
$$ LANGUAGE SQL;
CREATE FUNCTION
testdb=# with inputs as (select '10111' as x, c as y from generate_series(2, 3) c)
select x, y, ngrams(cast(x as text), y) ngrams_txt from inputs;
   x   | y | ngrams_txt 
-------+---+------------
 10111 | 2 | 10
 10111 | 2 | 01
 10111 | 2 | 11
 10111 | 2 | 11
 10111 | 3 | 101
 10111 | 3 | 011
 10111 | 3 | 111
(7 rows)

testdb=# with inputs as (select '10111' as x, c as y from generate_series(2, 3) c)
select x, y, ngrams(cast(x as bit varying), y) ngrams_bit from inputs;
   x   | y | ngrams_bit 
-------+---+------------
 10111 | 2 | 10111
 10111 | 2 | 01110
 10111 | 2 | 11100
 10111 | 3 | 10111
 10111 | 3 | 01110
(5 rows)