Postgresql – Best index for similarity function

full-text-searchindexpattern matchingpostgresqlpostgresql-9.3

So I have this table with 6.2 millions records and I have to perform search queries with similarity for one for the column.
The queries can be:

 SELECT  "lca_test".* FROM "lca_test"
 WHERE (similarity(job_title, 'sales executive') > 0.6)
 AND worksite_city = 'los angeles' 
 ORDER BY salary ASC LIMIT 50 OFFSET 0

More conditions can be added in the where(year = X, worksite_state = N, status = 'certified', visa_class = Z).

Running some of those queries can take a really long time, over 30seconds. Sometimes more than a minutes.

EXPLAIN ANALYZE of the previously mentioned query gives me this:

Limit  (cost=0.43..42523.04 rows=50 width=254) (actual time=9070.268..33487.734 rows=2 loops=1)
->  Index Scan using index_lca_test_on_salary on lca_test  (cost=0.43..23922368.16 rows=28129 width=254) (actual time=9070.265..33487.727 rows=2 loops=1)
>>>> Filter: (((worksite_city)::text = 'los angeles'::text) AND (similarity((job_title)::text, 'sales executive'::text) > 0.6::double precision))
>>>> Rows Removed by Filter: 6330130 Total runtime: 33487.802 ms
Total runtime: 33487.802 ms

I can't figure out how I should index my column to make it blazing fast.

EDIT:
Here is the postgres version:

PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit

Here is the table definition:

                                                         Table "public.lca_test"
         Column         |       Type        |                       Modifiers                       | Storage  | Stats target | Description
------------------------+-------------------+-------------------------------------------------------+----------+--------------+-------------
 id                     | integer           | not null default nextval('lca_test_id_seq'::regclass) | plain    |              |
 raw_id                 | integer           |                                                       | plain    |              |
 year                   | integer           |                                                       | plain    |              |
 company_id             | integer           |                                                       | plain    |              |
 visa_class             | character varying |                                                       | extended |              |
 employement_start_date | character varying |                                                       | extended |              |
 employement_end_date   | character varying |                                                       | extended |              |
 employer_name          | character varying |                                                       | extended |              |
 employer_address1      | character varying |                                                       | extended |              |
 employer_address2      | character varying |                                                       | extended |              |
 employer_city          | character varying |                                                       | extended |              |
 employer_state         | character varying |                                                       | extended |              |
 employer_postal_code   | character varying |                                                       | extended |              |
 employer_phone         | character varying |                                                       | extended |              |
 employer_phone_ext     | character varying |                                                       | extended |              |
 job_title              | character varying |                                                       | extended |              |
 soc_code               | character varying |                                                       | extended |              |
 naic_code              | character varying |                                                       | extended |              |
 prevailing_wage        | character varying |                                                       | extended |              |
 pw_unit_of_pay         | character varying |                                                       | extended |              |
 wage_unit_of_pay       | character varying |                                                       | extended |              |
 worksite_city          | character varying |                                                       | extended |              |
 worksite_state         | character varying |                                                       | extended |              |
 worksite_postal_code   | character varying |                                                       | extended |              |
 total_workers          | integer           |                                                       | plain    |              |
 case_status            | character varying |                                                       | extended |              |
 case_no                | character varying |                                                       | extended |              |
 salary                 | real              |                                                       | plain    |              |
 salary_max             | real              |                                                       | plain    |              |
 prevailing_wage_second | real              |                                                       | plain    |              |
 lawyer_id              | integer           |                                                       | plain    |              |
 citizenship            | character varying |                                                       | extended |              |
 class_of_admission     | character varying |                                                       | extended |              |
Indexes:
    "lca_test_pkey" PRIMARY KEY, btree (id)
    "index_lca_test_on_id_and_salary" btree (id, salary)
    "index_lca_test_on_id_and_salary_and_year" btree (id, salary, year)
    "index_lca_test_on_id_and_salary_and_year_and_wage_unit_of_pay" btree (id, salary, year, wage_unit_of_pay)
    "index_lca_test_on_id_and_visa_class" btree (id, visa_class)
    "index_lca_test_on_id_and_worksite_state" btree (id, worksite_state)
    "index_lca_test_on_lawyer_id" btree (lawyer_id)
    "index_lca_test_on_lawyer_id_and_company_id" btree (lawyer_id, company_id)
    "index_lca_test_on_raw_id_and_visa_and_pw_second" btree (raw_id, visa_class, prevailing_wage_second)
    "index_lca_test_on_raw_id_and_visa_class" btree (raw_id, visa_class)
    "index_lca_test_on_salary" btree (salary)
    "index_lca_test_on_visa_class" btree (visa_class)
    "index_lca_test_on_wage_unit_of_pay" btree (wage_unit_of_pay)
    "index_lca_test_on_worksite_state" btree (worksite_state)
    "index_lca_test_on_year_and_company_id" btree (year, company_id)
    "index_lca_test_on_year_and_company_id_and_case_status" btree (year, company_id, case_status)
    "index_lcas_job_title_trigram" gin (job_title gin_trgm_ops)
    "lca_test_company_id" btree (company_id)
    "lca_test_employer_name" btree (employer_name)
    "lca_test_id" btree (id)
    "lca_test_on_year_and_companyid_and_wage_unit_and_salary" btree (year, company_id, wage_unit_of_pay, salary)
Foreign-key constraints:
    "fk_rails_8a90090fe0" FOREIGN KEY (lawyer_id) REFERENCES lawyers(id)
Has OIDs: no

Best Answer

You forgot to mention that you installed the additional module pg_trgm, which provides the similarity() function.

Similarity operator `%`

First of all, whatever else you do, use the similarity operator % instead of the expression (similarity(job_title, 'sales executive') > 0.6). Much cheaper. And index support is bound to operators in Postgres, not to functions.

To get the desired minimum similarity of 0.6, run:

SELECT set_limit(0.6);

The setting stays for the rest of your session unless reset to something else. Check with:

SELECT show_limit();

This is a bit clumsy, but great for performance.

Simple case

If you just wanted the best matches in column job_title for the string 'sales executive' then this would be a simple case of "nearest neighbor" search and could be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):

CREATE INDEX trgm_idx ON lcas USING gist (job_title gist_trgm_ops);

To also include an equality condition on worksite_city you would need the additional module btree_gist. Run (once per DB):

CREATE EXTENSION btree_gist;

Then:

CREATE INDEX lcas_trgm_gist_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops);

Query:

SELECT set_limit(0.6);  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY (job_title <-> 'sales executive')
LIMIT  50;

<-> being the "distance" operator:

one minus the similarity() value.

Postgres can also combine two separate indexes, a plain btree index on worksite_city, and a separate GiST index on job_title, but the multicolumn index should be fastest - if you combine the two columns like this in queries regularly.

Your case

However, your query sorts by salary, not by distance / similarity, which changes the nature of the game completely. Now we can use both GIN and GiST index, and GIN will be faster (even more so in Postgres 9.4 which has largely improved GIN indexes - hint!)

Similar story for the additional equality check on worksite_city: install the additional module btree_gin. Run (once per DB):

CREATE EXTENSION btree_gin;

Then:

CREATE INDEX lcas_trgm_gin_idx ON lcas USING gin (worksite_city, job_title gin_trgm_ops);

Query:

SELECT set_limit(0.6);  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY salary 
LIMIT  50 -- OFFSET 0

Again, this should also work (less efficiently) with the simpler index you already have ("index_lcas_job_title_trigram"), possibly in combination with other indexes. The best solution depends on the complete picture.

Asides

You have a lot of indexes. Are you sure they are all in use and pay their maintenance cost?

You have some dubious data types:

employement_start_date | character varying
employement_end_date   | character varying

Seems like those should be date. Etc.

Assessment

In your last query, the bitmap index scan looking for 'hat' produces 307 hits.
Postgres then runs a bitmap heap scan to filter merchants similar enough ( similarity(...) > 0.2), producing 12 rows. Your test is with 30K rows, so your real life query will produce around 300 times as many hits, 90k / 3.5k for the test case at hand. An additional index on merchant will help.

Advice

I suggest you create an additional trigram index for the similarity search. Be sure to read the chapter in the manual about trigram index support. We need the additional module pg_trgminstalled (like you obviously have).

For your first request:

How can I search for a query like 'WALMART BAGS' which will first return me product BAG with merchant WALMART and then BAGS from other merchants.

I suggest this query using the similarity operator %:

-- SELECT set_limit(0.2)  -- Adjust similarity operator only if needed

SELECT *
FROM   products
WHERE  to_tsvector('english', product) @@ to_tsquery('bag')
AND    merchant % 'walmart'
ORDER  BY merchant <-> 'walmart'
--    LIMIT  n; -- possibly limit to top n results

Again, you can choose between GiST and GIN, but this time GiST carries a decisive advantage:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.

Therefore, I suggest this index:

CREATE INDEX prod_merchant_trgm_idx ON products USING gist (merchant gist_trgm_ops);

As for your second request:

Can I have both GIN and GIST index working for me?

Yes, you can. It would hardly make sense to have both types for the same (combination of) column(s), but Postgres can combine GiST and GIN indices very well in the same query. I quote the excellent manual yet again, on Combining Multiple Indexes:

To combine multiple indexes, the system scans each needed index and prepares a bitmap in memory giving the locations of table rows that are reported as matching that index's conditions. The bitmaps are then ANDed and ORed together as needed by the query. Finally, the actual table rows are visited and returned. The table rows are visited in physical order, because that is how the bitmap is laid out; this means that any ordering of the original indexes is lost, and so a separate sort step will be needed if the query has an ORDER BY clause. For this reason, and because each additional index scan adds extra time, the planner will sometimes choose to use a simple index scan even though additional indexes are available that could have been used as well.

Postgresql – Slow fulltext search due to wildly inaccurate row estimates

This can be improved in a thousand and one ways, then it should be a matter of milliseconds.

Better Queries

This is just your query reformatted with aliases and some noise removed to clear the fog:

SELECT count(DISTINCT t.id)
FROM   tickets      t
JOIN   transactions tr ON tr.objectid = t.id
JOIN   attachments  a  ON a.transactionid = tr.id
WHERE  t.status <> 'deleted'
AND    t.type = 'ticket'
AND    t.effectiveid = t.id
AND    tr.objecttype = 'RT::Ticket'
AND    a.contentindex @@ plainto_tsquery('frobnicate');

Most of the problem with your query lies in the first two tables tickets and transactions, which are missing from the question. I'm filling in with educated guesses.

t.status, t.objecttype and tr.objecttype should probably not be text, but enum or possibly some very small value referencing a look-up table.

`EXISTS` semi-join

Assuming tickets.id is the primary key, this rewritten form should be much cheaper:

SELECT count(*)
FROM   tickets t
WHERE  status <> 'deleted'
AND    type = 'ticket'
AND    effectiveid = id
AND    EXISTS (
   SELECT 1
   FROM   transactions tr
   JOIN   attachments  a ON a.transactionid = tr.id
   WHERE  tr.objectid = t.id
   AND    tr.objecttype = 'RT::Ticket'
   AND    a.contentindex @@ plainto_tsquery('frobnicate')
   );

Instead of multiplying rows with two 1:n joins, only to collapse multiple matches in the end with count(DISTINCT id), use an EXISTS semi-join, which can stop looking further as soon as the first match is found and at the same time obsoletes the final DISTINCT step. Per documentation:

The subquery will generally only be executed long enough to determine whether at least one row is returned, not all the way to completion.

Effectiveness depends on how many transactions per ticket and attachments per transaction there are.

Determine order of joins with `join_collapse_limit`

If you know that your search term for attachments.contentindex is very selective - more selective than other conditions in the query (which is probably the case for 'frobnicate', but not for 'problem'), you can force the sequence of joins. The query planner can hardly judge selectiveness of particular words, except for the most common ones. Per documentation:

join_collapse_limit (integer)

[...]
Because the query planner does not always choose the optimal join order, advanced users can elect to temporarily set this variable to 1, and then specify the join order they desire explicitly.

Use SET LOCAL for the purpose to only set it for the current transaction.

BEGIN;
SET LOCAL join_collapse_limit = 1;

SELECT count(DISTINCT t.id)
FROM   attachments  a                              -- 1st
JOIN   transactions tr ON tr.id = a.transactionid  -- 2nd
JOIN   tickets      t  ON t.id = tr.objectid       -- 3rd
WHERE  t.status <> 'deleted'
AND    t.type = 'ticket'
AND    t.effectiveid = t.id
AND    tr.objecttype = 'RT::Ticket'
AND    a.contentindex @@ plainto_tsquery('frobnicate');

ROLLBACK; -- or COMMIT;

The order of WHERE conditions is always irrelevant. Only the order of joins is relevant here.

Or use a CTE like @jjanes explains in "Option 2". for a similar effect.

Indexes

B-tree indexes

Take all conditions on tickets that are used identically with most queries and create a partial index on tickets:

CREATE INDEX tickets_partial_idx
ON tickets(id)
WHERE  status <> 'deleted'
AND    type = 'ticket'
AND    effectiveid = id;

If one of the conditions is variable, drop it from the WHERE condition and prepend the column as index column instead.

Another one on transactions:

CREATE INDEX transactions_partial_idx
ON transactions(objecttype, objectid, id)

The third column is just to enable index-only scans.

Also, since you have this composite index with two integer columns on attachments:

"attachments3" btree (parent, transactionid)

This additional index is a complete waste, delete it:

"attachments1" btree (parent)

Details:

Is a composite index also good for queries on the first field?

GIN index

Add transactionid to your GIN index to make it a lot more effective. This may be another silver bullet, because it potentially allows index-only scans, eliminating visits to the big table completely.
You need additional operator classes provided by the additional module btree_gin. Detailed instructions:

Inner join using an array column

"contentindex_idx" gin (transactionid, contentindex)

4 bytes from an integer column don't make the index much bigger. Also, fortunately for you, GIN indexes are different from B-tree indexes in a crucial aspect. Per documentation:

A multicolumn GIN index can be used with query conditions that involve any subset of the index's columns. Unlike B-tree or GiST, index search effectiveness is the same regardless of which index column(s) the query conditions use.

Bold emphasis mine. So you just need the one (big and somewhat costly) GIN index.

Table definition

Move the integer not null columns to the front. This has a couple of minor positive effects on storage and performance. Saves 4 - 8 bytes per row in this case.

                      Table "public.attachments"
         Column      |            Type             |         Modifiers
    -----------------+-----------------------------+------------------------------
     id              | integer                     | not null default nextval('...
     transactionid   | integer                     | not null
     parent          | integer                     | not null default 0
     creator         | integer                     | not null default 0  -- !
     created         | timestamp                   |                     -- !
     messageid       | character varying(160)      |
     subject         | character varying(255)      |
     filename        | character varying(255)      |
     contenttype     | character varying(80)       |
     contentencoding | character varying(80)       |
     content         | text                        |
     headers         | text                        |
     contentindex    | tsvector                    |