So I have this table with 6.2 millions records and I have to perform search queries with similarity for one for the column.
The queries can be:
SELECT "lca_test".* FROM "lca_test"
WHERE (similarity(job_title, 'sales executive') > 0.6)
AND worksite_city = 'los angeles'
ORDER BY salary ASC LIMIT 50 OFFSET 0
More conditions can be added in the where(year = X, worksite_state = N, status = 'certified', visa_class = Z).
Running some of those queries can take a really long time, over 30seconds. Sometimes more than a minutes.
EXPLAIN ANALYZE
of the previously mentioned query gives me this:
Limit (cost=0.43..42523.04 rows=50 width=254) (actual time=9070.268..33487.734 rows=2 loops=1) -> Index Scan using index_lca_test_on_salary on lca_test (cost=0.43..23922368.16 rows=28129 width=254) (actual time=9070.265..33487.727 rows=2 loops=1) >>>> Filter: (((worksite_city)::text = 'los angeles'::text) AND (similarity((job_title)::text, 'sales executive'::text) > 0.6::double precision)) >>>> Rows Removed by Filter: 6330130 Total runtime: 33487.802 ms Total runtime: 33487.802 ms
I can't figure out how I should index my column to make it blazing fast.
EDIT:
Here is the postgres version:
PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
Here is the table definition:
Table "public.lca_test"
Column | Type | Modifiers | Storage | Stats target | Description
------------------------+-------------------+-------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('lca_test_id_seq'::regclass) | plain | |
raw_id | integer | | plain | |
year | integer | | plain | |
company_id | integer | | plain | |
visa_class | character varying | | extended | |
employement_start_date | character varying | | extended | |
employement_end_date | character varying | | extended | |
employer_name | character varying | | extended | |
employer_address1 | character varying | | extended | |
employer_address2 | character varying | | extended | |
employer_city | character varying | | extended | |
employer_state | character varying | | extended | |
employer_postal_code | character varying | | extended | |
employer_phone | character varying | | extended | |
employer_phone_ext | character varying | | extended | |
job_title | character varying | | extended | |
soc_code | character varying | | extended | |
naic_code | character varying | | extended | |
prevailing_wage | character varying | | extended | |
pw_unit_of_pay | character varying | | extended | |
wage_unit_of_pay | character varying | | extended | |
worksite_city | character varying | | extended | |
worksite_state | character varying | | extended | |
worksite_postal_code | character varying | | extended | |
total_workers | integer | | plain | |
case_status | character varying | | extended | |
case_no | character varying | | extended | |
salary | real | | plain | |
salary_max | real | | plain | |
prevailing_wage_second | real | | plain | |
lawyer_id | integer | | plain | |
citizenship | character varying | | extended | |
class_of_admission | character varying | | extended | |
Indexes:
"lca_test_pkey" PRIMARY KEY, btree (id)
"index_lca_test_on_id_and_salary" btree (id, salary)
"index_lca_test_on_id_and_salary_and_year" btree (id, salary, year)
"index_lca_test_on_id_and_salary_and_year_and_wage_unit_of_pay" btree (id, salary, year, wage_unit_of_pay)
"index_lca_test_on_id_and_visa_class" btree (id, visa_class)
"index_lca_test_on_id_and_worksite_state" btree (id, worksite_state)
"index_lca_test_on_lawyer_id" btree (lawyer_id)
"index_lca_test_on_lawyer_id_and_company_id" btree (lawyer_id, company_id)
"index_lca_test_on_raw_id_and_visa_and_pw_second" btree (raw_id, visa_class, prevailing_wage_second)
"index_lca_test_on_raw_id_and_visa_class" btree (raw_id, visa_class)
"index_lca_test_on_salary" btree (salary)
"index_lca_test_on_visa_class" btree (visa_class)
"index_lca_test_on_wage_unit_of_pay" btree (wage_unit_of_pay)
"index_lca_test_on_worksite_state" btree (worksite_state)
"index_lca_test_on_year_and_company_id" btree (year, company_id)
"index_lca_test_on_year_and_company_id_and_case_status" btree (year, company_id, case_status)
"index_lcas_job_title_trigram" gin (job_title gin_trgm_ops)
"lca_test_company_id" btree (company_id)
"lca_test_employer_name" btree (employer_name)
"lca_test_id" btree (id)
"lca_test_on_year_and_companyid_and_wage_unit_and_salary" btree (year, company_id, wage_unit_of_pay, salary)
Foreign-key constraints:
"fk_rails_8a90090fe0" FOREIGN KEY (lawyer_id) REFERENCES lawyers(id)
Has OIDs: no
Best Answer
You forgot to mention that you installed the additional module
pg_trgm
, which provides thesimilarity()
function.Similarity operator
%
First of all, whatever else you do, use the similarity operator
%
instead of the expression(similarity(job_title, 'sales executive') > 0.6)
. Much cheaper. And index support is bound to operators in Postgres, not to functions.To get the desired minimum similarity of
0.6
, run:The setting stays for the rest of your session unless reset to something else. Check with:
This is a bit clumsy, but great for performance.
Simple case
If you just wanted the best matches in column
job_title
for the string 'sales executive' then this would be a simple case of "nearest neighbor" search and could be solved with a GiST index using the trigram operator classgist_trgm_ops
(but not with a GIN index):To also include an equality condition on
worksite_city
you would need the additional modulebtree_gist
. Run (once per DB):Then:
Query:
<->
being the "distance" operator:Postgres can also combine two separate indexes, a plain btree index on
worksite_city
, and a separate GiST index onjob_title
, but the multicolumn index should be fastest - if you combine the two columns like this in queries regularly.Your case
However, your query sorts by
salary
, not by distance / similarity, which changes the nature of the game completely. Now we can use both GIN and GiST index, and GIN will be faster (even more so in Postgres 9.4 which has largely improved GIN indexes - hint!)Similar story for the additional equality check on
worksite_city
: install the additional modulebtree_gin
. Run (once per DB):Then:
Query:
Again, this should also work (less efficiently) with the simpler index you already have (
"index_lcas_job_title_trigram"
), possibly in combination with other indexes. The best solution depends on the complete picture.Asides
You have a lot of indexes. Are you sure they are all in use and pay their maintenance cost?
You have some dubious data types:
Seems like those should be
date
. Etc.Related answers: