PostgreSQL GIN Index – Partial Match from TSVECTOR Column

full-text-search, pattern matching, postgresql

I would like to get results with this query:

SELECT * FROM (
  SELECT id, subject
  FROM mailboxes
  WHERE tsv @@ plainto_tsquery('avail')
) AS t1 ORDER by id DESC;

This works and returns rows whose tsv contains Available. But if I use avai (dropping lable), it finds nothing.

Do all queries have to be in the dictionary? Can't we just query such letters? I have a database that contains e-mail bodies (content), and I would like to make searching fast because it grows every second. Currently I am using

... WHERE content ~* 'letters'

Best Answer

Do all queries have to be in the dictionary?

No. Because only word stems (according to the used text search configuration) are in the index to begin with. But more importantly:

No. Because, on top of that, Full Text Search is also capable of prefix matching.

This would work:

SELECT id, subject
FROM   mailboxes
WHERE  tsv @@ to_tsquery('simple', 'avail:*')
ORDER  BY id DESC;

Note 3 things:

1. Use to_tsquery(), not plainto_tsquery(), in this case because (quoting the manual):

   ... plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input

2. Use the 'simple' text search configuration to generate the tsquery, since you obviously want to take the word 'avail' as is and not apply stemming.
3. Append :* to make it a prefix search, i.e. find all lexemes starting with 'avail'.

Important: This is a prefix search on lexemes (word stems) in the document. A regular expression match without wildcards (content ~* 'avail') is not exactly the same! The latter is not left-anchored (to the start of lexemes) and would also find 'FOOavail' etc.
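
To make the difference concrete, here is a small sketch comparing the two predicates side by side. The two sample strings and the column aliases are made up for illustration; they are not from the question:

```sql
-- Hypothetical sample strings, for illustration only.
SELECT t
     , to_tsvector('simple', t) @@ to_tsquery('simple', 'avail:*') AS prefix_match
     , t ~* 'avail'                                                AS regex_match
FROM  (
   VALUES
     ('Available now')  -- has a lexeme starting with 'avail'
   , ('FOOavail')       -- contains 'avail', but not at the start of a lexeme
   ) sub(t);
```

The first row should satisfy both predicates. The second should only satisfy the regular expression, because its single lexeme 'fooavail' does not start with 'avail'.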

It's unclear whether you want the behavior outlined in your query or the equivalent of the added regular expression. Trigram indexes (pg_trgm) like @Evan already suggested are the right tool for that. There are many related questions on dba.SE, try a search.

Overview:

Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
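
If the unanchored behavior of the regular expression is what you want, a trigram index could be set up roughly like this. A minimal sketch, assuming the pg_trgm extension can be installed and reusing the table and column names from the question; the index name is made up:

```sql
-- Requires the additional module pg_trgm (once per database).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical index name; table and column are taken from the question.
CREATE INDEX mailboxes_content_trgm_idx
ON     mailboxes USING gin (content gin_trgm_ops);

-- The original, unanchored predicate can then be supported by the index:
SELECT id, subject
FROM   mailboxes
WHERE  content ~* 'avail'
ORDER  BY id DESC;
```

Trigram indexes work best with patterns of at least three characters. Shorter patterns yield no usable trigrams, so the index cannot narrow down the search for them.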

Demo

SELECT *
FROM (
   VALUES
     (1, 'Zend has no framework')
   , (2, 'Zend Framework')
   ) sub(id, t), to_tsvector(t) AS tsv
WHERE tsv @@ to_tsquery('zend <-> fram:*');

 id |       t        |          tsv
----+----------------+------------------------
  2 | Zend Framework | 'framework':2 'zend':1

Related answer (see chapter "Different approach to optimize search"):

How can I generate all trailing substrings following a delimeter?

Emails?

Since you mentioned emails, be aware that the text search parser identifies emails and does not split them into separate words / lexemes. Consider:

SELECT ts_debug('english', 'xangr@some.domain.com');

(email,"Email address",xangr@some.domain.com,{simple},simple,{xangr@some.domain.com})

I would replace the separators @ and . in your emails with space (' ') to index contained words.

Also, since you are dealing with names in emails, not with English (or some other language) words, I would use the 'simple' text search configuration to disable stemming and other language features:

Build the tsvector column with:

SELECT to_tsvector('simple', translate('joe.xangr@some.domain.com', '@.', '  ')) AS tsv;
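
To keep such a column up to date and indexed, something along these lines might work. This is only a sketch, assuming PostgreSQL 12 or later (for generated columns) and the mailboxes table with its content column from the question; the column name tsv_simple and the index name are made up:

```sql
-- Hypothetical column name; assumes Postgres 12+ for GENERATED ... STORED.
ALTER TABLE mailboxes
  ADD COLUMN tsv_simple tsvector
  GENERATED ALWAYS AS (to_tsvector('simple', translate(content, '@.', '  '))) STORED;

-- GIN index to support the @@ operator on the new column.
CREATE INDEX mailboxes_tsv_simple_gin_idx
ON     mailboxes USING gin (tsv_simple);

-- Prefix search against the new column:
SELECT id, subject
FROM   mailboxes
WHERE  tsv_simple @@ to_tsquery('simple', 'xang:*')
ORDER  BY id DESC;
```

Note that translate() here replaces every @ and . in the whole body, not just those inside email addresses. On versions before 12, a trigger would have to maintain the column instead.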

Related Solutions

Sql-server – SQL Server Fulltext search against big amount of search terms over some period of time

Execution plan XML would be useful here but as a shot in the dark:

Daft as it may sound, try pushing either the [id] or date predicate into the CONTAINSTABLE query. See SQL Server 2005 Full-Text Queries on Large Catalogs: Lessons Learned - Consider embedding filter conditions as keywords in the indexed text.
I don't remember where but I recall reading an article or blog post some time ago that flagged big OR full-text queries as problematic. IIRC the suggested hack/workaround was to issue a UNION query instead.

For your example, the UNION for 2) would be along the lines of:

INSERT
    @processed_rules
    (
    rule_id
    , story_id
    )
SELECT
    @rule_id
  , s.[key]
FROM
   CONTAINSTABLE(Stories, (header, body), '"term one"') AS s

UNION

SELECT
    @rule_id
  , s.[key]
FROM
   CONTAINSTABLE(Stories, (header, body), '"term two"') AS s

UNION

SELECT
    @rule_id
  , s.[key]
FROM
   CONTAINSTABLE(Stories, (header, body), '"term three"') AS s

Postgresql – Formulating a Join Query for PostgreSQL

Assuming (video_id, user_id) is unique in user_video, a plain LEFT JOIN would do the job:

SELECT v.*, uv.video_id IS NOT NULL AS has_bought
FROM   video v
LEFT   JOIN user_video uv ON uv.video_id = v.video_id
                         AND uv.user_id = $current_user_id;

$current_user_id being the ID of the current user.

If (video_id, user_id) is not unique, you could add GROUP BY:

SELECT v.*, count(uv.video_id) > 0 AS has_bought
FROM   video v
LEFT   JOIN user_video uv ON uv.video_id = v.video_id
                         AND uv.user_id = $current_user_id
GROUP  BY v.video_id;

count() doesn't count NULL values.
Or use an EXISTS semi-join to avoid duplicates:

SELECT v.*, EXISTS (SELECT 1 FROM user_video uv 
                    WHERE  uv.video_id = v.video_id
                    AND    uv.user_id = $current_user_id) AS has_bought
FROM   video v;

Or use the PostgreSQL-specific DISTINCT ON, which is particularly handy if you need the result in a certain sort order anyway:

SELECT DISTINCT ON (v.video_id)
       v.*, uv.video_id IS NOT NULL AS has_bought
FROM   video v
LEFT   JOIN user_video uv ON uv.video_id = v.video_id
                         AND uv.user_id = $current_user_id
ORDER  BY v.video_id, uv.video_id;  -- NULL sorts last