Postgresql – How to get lexemes array from Postgresql ts_vector, sorted by lexems positional information

arrayfull-text-searchpostgresql

Postgresql full text search to_tsvector() function returns a sorted array of lexemes. I need exactly the same function, but returning unsorted array of lexemes. Is there any such postgresql function already available or I need to do it myself? How to do it?

I need it for Jaro-Winkler string similarity function, for which order of strings/words is important…I need tsvector only because of language support for lexemization. Actually, I just need to leximize text, e.g.

select array_to_string(tsvector_to_array(to_tsvector('english', 'I just want to lexemize this string and keep the original order of lexemes...')), ' ');

This gives me leximized text which is OK, except lexemes are ordered alphabetically instead their original relative order in original text…

Best Answer

This function seems to work, Please, suggest if there is a better way.

CREATE OR REPLACE FUNCTION public.tokenize_orig(
    p_string text,
    p_dictionary regconfig)
  RETURNS text[] AS
$BODY$ 
/* This function turns documents into array of lexemes, keeping original order of lexemes. */ 

select array_agg(lexeme) 
from 
    (
    select (arr).lexeme as lexeme 
    from 
        (select unnest(tsv) arr 
        from to_tsvector($2, $1) as tsv) arr 
        where array_length((arr).positions, 1) > 0 
        order by (arr).positions
    ) as qry

$BODY$
  LANGUAGE sql IMMUTABLE STRICT
  COST 100;