PostgreSQL SELECT MAX() performance

performance | postgresql | query-performance

I have a very simple function:

CREATE OR REPLACE FUNCTION myscheme.get_last_timecode(ptable_ids integer[], ptimecode bigint DEFAULT NULL::bigint)
  RETURNS bigint AS
$BODY$
  SELECT MAX(timecode) AS timecode
  FROM myscheme.event
  WHERE table_id = ANY($1) AND ($2 IS NULL OR timecode <= $2);
$BODY$
  LANGUAGE sql STABLE
  COST 100;

Row counts:

SELECT COUNT(*) FROM myscheme.event WHERE table_id = 1; -- 120
SELECT COUNT(*) FROM myscheme.event WHERE table_id = 2; -- 18
SELECT COUNT(*) FROM myscheme.event WHERE table_id = 3; -- 839795

The results are quite unexpected:

EXPLAIN ANALYZE SELECT myscheme.get_last_timecode(ARRAY[1], NULL) -- Total runtime: 212.552 ms
EXPLAIN ANALYZE SELECT myscheme.get_last_timecode(ARRAY[2], NULL) -- Total runtime: 213.713 ms
EXPLAIN ANALYZE SELECT myscheme.get_last_timecode(ARRAY[3], NULL) -- Total runtime: 0.186 ms (the fastest!)

When I use a plain query, the execution time is normal:

EXPLAIN ANALYZE SELECT MAX(timecode) AS timecode FROM myscheme.event
  WHERE table_id = ANY(ARRAY[1]) AND (NULL IS NULL OR timecode <= NULL);

Aggregate  (cost=10.18..10.19 rows=1 width=8) (actual time=0.079..0.079 rows=1 loops=1)
  ->  Index Scan using event_table_id_index on event  (cost=0.42..10.05 rows=51 width=8) (actual time=0.013..0.067 rows=120 loops=1)
        Index Cond: (table_id = ANY ('{1}'::integer[]))
Total runtime: 0.101 ms

but it uses a different execution plan for table_id = 3:

EXPLAIN ANALYZE SELECT MAX(timecode) AS timecode FROM myscheme.event
  WHERE table_id = ANY(ARRAY[3]) AND (NULL IS NULL OR timecode <= NULL);

Result  (cost=0.47..0.48 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.42..0.47 rows=1 width=8) (actual time=0.015..0.016 rows=1 loops=1)
          ->  Index Scan Backward using event_timecode_key on event  (cost=0.42..35873.11 rows=845301 width=8) (actual time=0.015..0.015 rows=1 loops=1)
                Index Cond: (timecode IS NOT NULL)
                Filter: (table_id = ANY ('{3}'::integer[]))
Total runtime: 0.038 ms

Can anybody explain to me how to create a function (or an index) whose execution time does not depend on the amount of data?

SELECT version();
PostgreSQL 9.3.14 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit

And the table definition:

CREATE TABLE myscheme.event
(
  id bigserial NOT NULL,
  table_id integer,
  deal_id bigint,
  type_id integer NOT NULL,
  timecode bigserial NOT NULL,
  created timestamp with time zone NOT NULL DEFAULT now(),
  parent_id bigint,
  prev_id bigint,
  CONSTRAINT event_pkey PRIMARY KEY (id),
  CONSTRAINT event_deal_id_fkey FOREIGN KEY (deal_id)
      REFERENCES myscheme.deal (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT event_parent_id_fkey FOREIGN KEY (parent_id)
      REFERENCES myscheme.event (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT event_prev_id_fkey FOREIGN KEY (prev_id)
      REFERENCES myscheme.event (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT event_table_id_fkey FOREIGN KEY (table_id)
      REFERENCES myscheme.tables (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT event_type_id_fkey FOREIGN KEY (type_id)
      REFERENCES myscheme.event_type (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT event_timecode_key UNIQUE (timecode)
)
WITH (
  OIDS=FALSE
);

CREATE INDEX event_deal_id_index ON myscheme.event USING btree (deal_id);
CREATE INDEX event_table_id_index ON myscheme.event  USING btree (table_id);
CREATE INDEX event_type_id_index ON myscheme.event USING btree (type_id);

Best Answer

If you look at the statistics, the execution plans make perfect sense. The table has around 840k tuples, almost all of them with table_id = 3. So if you're looking for MAX(timecode), scanning the timecode index backwards makes perfect sense for table_id = 3: almost every row you hit has that id.
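To make that concrete, the backward index scan in your second plan behaves roughly like the query below. This is just an illustrative sketch of what the planner turns the MAX() into, not a query from your setup:

-- Roughly what "Index Scan Backward using event_timecode_key" does:
-- walk the timecode index from the highest value downwards and stop at
-- the first row that passes the filter. With table_id = 3 that is almost
-- always the very first index entry visited, hence the ~0.04 ms runtime.
SELECT timecode
FROM myscheme.event
WHERE timecode IS NOT NULL
  AND table_id = ANY('{3}'::integer[])
ORDER BY timecode DESC
LIMIT 1;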

But if you're looking for table_id = 1, you have only 120 tuples out of 840k, so your best bet is to fetch just those 120 tuples (via the table_id index) and look for the MAX(timecode) among them.

It's like looking for a needle in a haystack: finding hay is much easier than finding the needle.

I hope this makes some sense of what the optimizer planned.

By the way, you could try building a composite index on (table_id, timecode). I'm not near a database to check, but I think PG could use it to skip to the max timecode per table_id directly; worth a shot :) A sketch follows.
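A minimal sketch of that composite index (the index name is just an example, not from your schema):

-- Composite index: entries are ordered by table_id, then timecode,
-- so the highest timecode for any given table_id sits at the end of
-- that table_id's range in the index.
CREATE INDEX event_table_id_timecode_index
  ON myscheme.event USING btree (table_id, timecode);

With that in place, MAX(timecode) for a single table_id should be answerable by reading the last index entry in that table_id's range, so the runtime should no longer depend on how many rows that table_id has.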

Regards, Jony