Postgresql – Best index for similarity function

full-text-searchindexpattern matchingpostgresqlpostgresql-9.3

So I have this table with 6.2 millions records and I have to perform search queries with similarity for one for the column.
The queries can be:

 SELECT  "lca_test".* FROM "lca_test"
 WHERE (similarity(job_title, 'sales executive') > 0.6)
 AND worksite_city = 'los angeles' 
 ORDER BY salary ASC LIMIT 50 OFFSET 0

More conditions can be added in the where(year = X, worksite_state = N, status = 'certified', visa_class = Z).

Running some of those queries can take a really long time, over 30seconds. Sometimes more than a minutes.

EXPLAIN ANALYZE of the previously mentioned query gives me this:

Limit  (cost=0.43..42523.04 rows=50 width=254) (actual time=9070.268..33487.734 rows=2 loops=1)
->  Index Scan using index_lca_test_on_salary on lca_test  (cost=0.43..23922368.16 rows=28129 width=254) (actual time=9070.265..33487.727 rows=2 loops=1)
>>>> Filter: (((worksite_city)::text = 'los angeles'::text) AND (similarity((job_title)::text, 'sales executive'::text) > 0.6::double precision))
>>>> Rows Removed by Filter: 6330130 Total runtime: 33487.802 ms
Total runtime: 33487.802 ms

I can't figure out how I should index my column to make it blazing fast.

EDIT:
Here is the postgres version:

PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit

Here is the table definition:

                                                         Table "public.lca_test"
         Column         |       Type        |                       Modifiers                       | Storage  | Stats target | Description
------------------------+-------------------+-------------------------------------------------------+----------+--------------+-------------
 id                     | integer           | not null default nextval('lca_test_id_seq'::regclass) | plain    |              |
 raw_id                 | integer           |                                                       | plain    |              |
 year                   | integer           |                                                       | plain    |              |
 company_id             | integer           |                                                       | plain    |              |
 visa_class             | character varying |                                                       | extended |              |
 employement_start_date | character varying |                                                       | extended |              |
 employement_end_date   | character varying |                                                       | extended |              |
 employer_name          | character varying |                                                       | extended |              |
 employer_address1      | character varying |                                                       | extended |              |
 employer_address2      | character varying |                                                       | extended |              |
 employer_city          | character varying |                                                       | extended |              |
 employer_state         | character varying |                                                       | extended |              |
 employer_postal_code   | character varying |                                                       | extended |              |
 employer_phone         | character varying |                                                       | extended |              |
 employer_phone_ext     | character varying |                                                       | extended |              |
 job_title              | character varying |                                                       | extended |              |
 soc_code               | character varying |                                                       | extended |              |
 naic_code              | character varying |                                                       | extended |              |
 prevailing_wage        | character varying |                                                       | extended |              |
 pw_unit_of_pay         | character varying |                                                       | extended |              |
 wage_unit_of_pay       | character varying |                                                       | extended |              |
 worksite_city          | character varying |                                                       | extended |              |
 worksite_state         | character varying |                                                       | extended |              |
 worksite_postal_code   | character varying |                                                       | extended |              |
 total_workers          | integer           |                                                       | plain    |              |
 case_status            | character varying |                                                       | extended |              |
 case_no                | character varying |                                                       | extended |              |
 salary                 | real              |                                                       | plain    |              |
 salary_max             | real              |                                                       | plain    |              |
 prevailing_wage_second | real              |                                                       | plain    |              |
 lawyer_id              | integer           |                                                       | plain    |              |
 citizenship            | character varying |                                                       | extended |              |
 class_of_admission     | character varying |                                                       | extended |              |
Indexes:
    "lca_test_pkey" PRIMARY KEY, btree (id)
    "index_lca_test_on_id_and_salary" btree (id, salary)
    "index_lca_test_on_id_and_salary_and_year" btree (id, salary, year)
    "index_lca_test_on_id_and_salary_and_year_and_wage_unit_of_pay" btree (id, salary, year, wage_unit_of_pay)
    "index_lca_test_on_id_and_visa_class" btree (id, visa_class)
    "index_lca_test_on_id_and_worksite_state" btree (id, worksite_state)
    "index_lca_test_on_lawyer_id" btree (lawyer_id)
    "index_lca_test_on_lawyer_id_and_company_id" btree (lawyer_id, company_id)
    "index_lca_test_on_raw_id_and_visa_and_pw_second" btree (raw_id, visa_class, prevailing_wage_second)
    "index_lca_test_on_raw_id_and_visa_class" btree (raw_id, visa_class)
    "index_lca_test_on_salary" btree (salary)
    "index_lca_test_on_visa_class" btree (visa_class)
    "index_lca_test_on_wage_unit_of_pay" btree (wage_unit_of_pay)
    "index_lca_test_on_worksite_state" btree (worksite_state)
    "index_lca_test_on_year_and_company_id" btree (year, company_id)
    "index_lca_test_on_year_and_company_id_and_case_status" btree (year, company_id, case_status)
    "index_lcas_job_title_trigram" gin (job_title gin_trgm_ops)
    "lca_test_company_id" btree (company_id)
    "lca_test_employer_name" btree (employer_name)
    "lca_test_id" btree (id)
    "lca_test_on_year_and_companyid_and_wage_unit_and_salary" btree (year, company_id, wage_unit_of_pay, salary)
Foreign-key constraints:
    "fk_rails_8a90090fe0" FOREIGN KEY (lawyer_id) REFERENCES lawyers(id)
Has OIDs: no

Best Answer

You forgot to mention that you installed the additional module pg_trgm, which provides the similarity() function.

Similarity operator %

First of all, whatever else you do, use the similarity operator % instead of the expression (similarity(job_title, 'sales executive') > 0.6). Much cheaper. And index support is bound to operators in Postgres, not to functions.

To get the desired minimum similarity of 0.6, run:

SELECT set_limit(0.6);

The setting stays for the rest of your session unless reset to something else. Check with:

SELECT show_limit();

This is a bit clumsy, but great for performance.

Simple case

If you just wanted the best matches in column job_title for the string 'sales executive' then this would be a simple case of "nearest neighbor" search and could be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):

CREATE INDEX trgm_idx ON lcas USING gist (job_title gist_trgm_ops);

To also include an equality condition on worksite_city you would need the additional module btree_gist. Run (once per DB):

CREATE EXTENSION btree_gist;

Then:

CREATE INDEX lcas_trgm_gist_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops);

Query:

SELECT set_limit(0.6);  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY (job_title <-> 'sales executive')
LIMIT  50;

<-> being the "distance" operator:

one minus the similarity() value.

Postgres can also combine two separate indexes, a plain btree index on worksite_city, and a separate GiST index on job_title, but the multicolumn index should be fastest - if you combine the two columns like this in queries regularly.

Your case

However, your query sorts by salary, not by distance / similarity, which changes the nature of the game completely. Now we can use both GIN and GiST index, and GIN will be faster (even more so in Postgres 9.4 which has largely improved GIN indexes - hint!)

Similar story for the additional equality check on worksite_city: install the additional module btree_gin. Run (once per DB):

CREATE EXTENSION btree_gin;

Then:

CREATE INDEX lcas_trgm_gin_idx ON lcas USING gin (worksite_city, job_title gin_trgm_ops);

Query:

SELECT set_limit(0.6);  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY salary 
LIMIT  50 -- OFFSET 0

Again, this should also work (less efficiently) with the simpler index you already have ("index_lcas_job_title_trigram"), possibly in combination with other indexes. The best solution depends on the complete picture.

Asides

  • You have a lot of indexes. Are you sure they are all in use and pay their maintenance cost?

  • You have some dubious data types:

    employement_start_date | character varying
    employement_end_date   | character varying
    

    Seems like those should be date. Etc.

Related answers: