Postgresql – Autogenerate indices on postgresql or speed up queries

index-tuningpostgresql

I have a problem that seems to come back from time to time on a client database. The current table that is causing most of the problems has around 200,000 records. I'd guess it's not technically a big amount of data.

But for some reasons, simple queries can take up as much as 30s to execute.

The only way I found to speed up the queries was to create an index manually on the fields in the query and run vacuum analyze.

That said, every time a user adds a custom filter as a default search, it seems to be unable to use existing indexes and fall back to the worst case scenario which is scanning sequentially every records.

I was wondering if there was a configuration or tool that can analyze slow queries and generate the proper indexes when required. I remember I once used a tool for mssql that would start logging queries made and then generate the proper indexes for the database according to its usage to speed up queries.

That said, I was wondering if there was something inherently wrong with the database configuration.

200,000 records in 30sec is around 6666 records per second. I'd expect to be able to process much more data than that on a 2.6ghz Xeon dualcore. Considering that it's not really busy doing anything else.

--------------------------------------------------------------------------------------
 Limit (actual time=31539.516..31539.525 rows=80 loops=1)
   ->  Sort (actual time=31539.516..31539.521 rows=80 loops=1)
         Sort Key: priority DESC, date_action, id DESC
         Sort Method: top-N heapsort  Memory: 28kB
         ->  Seq Scan on crm_lead (actual time=0.006..31443.145 rows=173024 loops=1)
               Filter: (active AND (((type)::text = 'lead'::text) OR (type IS NULL)))
               Rows Removed by Filter: 8158
 Planning time: 0.362 ms
 Execution time: 31539.552 ms

After the indices the same query goes down to 114ms.

                                           Table "public.crm_lead"
         Column         |            Type             |                       Modifiers                       
------------------------+-----------------------------+-------------------------------------------------------
 id                     | integer                     | not null default nextval('crm_lead_id_seq'::regclass)
 date_closed            | timestamp without time zone | 
 create_date            | timestamp without time zone | 
 probability            | double precision            | 
 message_last_post      | timestamp without time zone | 
 color                  | integer                     | 
 date_last_stage_update | timestamp without time zone | 
 date_action_last       | timestamp without time zone | 
 campaign_id            | integer                     | 
 day_close              | double precision            | 
 write_uid              | integer                     | 
 team_id                | integer                     | 
 day_open               | double precision            | 
 contact_name           | character varying           | 
 partner_id             | integer                     | 
 date_action_next       | timestamp without time zone | 
 city                   | character varying           | 
 date_conversion        | timestamp without time zone | 
 opt_out                | boolean                     | 
 date_open              | timestamp without time zone | 
 title                  | integer                     | 
 partner_name           | character varying           | 
 planned_revenue        | double precision            | 
 country_id             | integer                     | 
 company_id             | integer                     | 
 priority               | character varying           | 
 next_activity_id       | integer                     | 
 email_cc               | text                        | 
 type                   | character varying           | not null
 function               | character varying           | 
 fax                    | character varying           | 
 zip                    | character varying           | 
 description            | text                        | 
 create_uid             | integer                     | 
 street2                | character varying           | 
 title_action           | character varying           | 
 phone                  | character varying           | 
 lost_reason            | integer                     | 
 write_date             | timestamp without time zone | 
 state_id               | integer                     | 
 active                 | boolean                     | 
 user_id                | integer                     | 
 date_action            | date                        | 
 name                   | character varying           | not null
 stage_id               | integer                     | 
 medium_id              | integer                     | 
 date_deadline          | date                        | 
 mobile                 | character varying           | 
 street                 | character varying           | 
 source_id              | integer                     | 
 email_from             | character varying           | 
 message_bounce         | integer                     | 
 referred               | character varying           | 

The query:

SELECT "crm_lead".id FROM "crm_lead" WHERE (("crm_lead"."active" = True)  AND  (("crm_lead"."type" = 'lead')  OR  "crm_lead"."type" is Null )) ORDER BY "crm_lead"."priority" DESC,"crm_lead"."date_action" ,"crm_lead"."id" DESC  limit 80;

The OS in question is Ubuntu with Postgresql 9.5. The configuration is the default config and nothing was changed as far as I know.

Here's the result with a query using a field not part of the indexes:

explain (analyze, buffers) SELECT "crm_lead".id FROM "crm_lead" WHERE (("crm_lead"."active" = True)  AND  (("crm_lead"."type" = 'lead')  OR  "crm_lead"."type" is Null )) ORDER BY "crm_lead"."priority" DESC,"crm_lead"."date_action" ,"crm_lead"."id" DESC  limit 80;

Results

                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=21899.78..21899.98 rows=80 width=10) (actual time=35422.061..35422.073 rows=80 loops=1)
   Buffers: shared hit=9 read=13262
   I/O Timings: read=35056.998
   ->  Sort  (cost=21899.78..22334.98 rows=174080 width=10) (actual time=35422.059..35422.063 rows=80 loops=1)
         Sort Key: priority DESC, date_action, id DESC
         Sort Method: top-N heapsort  Memory: 28kB
         Buffers: shared hit=9 read=13262
         I/O Timings: read=35056.998
         ->  Seq Scan on crm_lead  (cost=0.00..15526.77 rows=174080 width=10) (actual time=2.224..35314.539 rows=173278 loops=1)
               Filter: (active AND (((type)::text = 'lead'::text) OR (title IS NULL)))
               Rows Removed by Filter: 7904
               Buffers: shared read=13262
               I/O Timings: read=35056.998
 Planning time: 148.101 ms
 Execution time: 35422.119 ms
(15 rows)

Size:

SELECT pg_size_pretty( pg_total_relation_size('crm_lead'));
 pg_size_pretty 
----------------
 204 MB

Best Answer

This part of your execution plan tells the story:

    ->  Seq Scan on crm_lead...
           Buffers: shared read=13262
           I/O Timings: read=35056.998

By turning on track_io_timing, you see that it took 35 seconds to read the data from disk, which is essentially the entire time it took the query to run (therefor, CPU doesn't matter as it is not the limiting factor). During this time, it read 13,262 buffers of 8k each, or 103 MB. That is < 3MB per second. That is absolutely abysmal for a sequential read on any reasonably modern hard drive. So something is pathologically wrong with your disk subsystem. Maybe the disk is failing so it keeps re-reading sectors in order to scrape the data off of them. Maybe you have a virtualization layer which is buggy or is intentionally throttling you, or is just vastly overloaded by other users invisibly sharing access to the same disk. Or maybe an antivirus.

If you could get adequate disk performance, you might not need extra indexes to get adequate query performance. Of course your system should also be able to cache the entire 103MB in RAM, but apparently is not. Perhaps your hardware is desperately inadequate on several fronts.