Postgresql – When might a tsvector field pay for itself

full-text-searchpostgresql

I've been experimenting with tsvector indexes for Full Text Search, and see that it's a common practice to generate an store a vector in a column of type tsvector. We're on Postgres 11.4, but I've already seen this very practice used as an example for PG 12 generated columns. (Simpler than using a trigger for the same purpose.)

My question is, what is the benefit? I tried an expression GIN index on the tsvector of a text field, and a GIN index on the stored tsvector. With around 8M rows locally, I couldn't measure any meaningful speed difference. Given that storing the vector as a column and an index takes more space, I'm curious if there are obvious cases where the extra cost is justified. For example, when you've got a lot more roles.

Note: We're storing the text in the database, so this isn't one of those setups where you index an external page/document/etc without absobing the source text into the database.

Best Answer

If you use proximity searching (like "phraseto_tsquery" for example), with a functional index it will have to re-parse each match candidate document to a tsvector and check it for the right sequence and spacing of words. This can be quite slow, especially if the number of candidates is much higher than the number of final results. If the tsvector is stored, it can just read it and not reparse the document, this is much faster. I think other advanced features like "ts_headline" may be in the same situation--but I haven't tested them.

Even if you just use "@@", I think that if the bitmap of the number of results doesn't fit in "work_mem", then it will also need to reparse the documents in order to recheck candidate matching blocks which "overflowed". Of course in that case increasing "work_mem" would be likely be a better option than adding a column.

For what it is worth, if you use RUM rather than GIN, it will solve the the problem with phraseto_tsquery on functional indexes.