PostgreSQL – query discrepancies on a very large Postgres table

index | performance | postgresql

I have a fairly large table (25m entries) which stores tracking data. I am not a Postgres admin by trade, so this may be a simple problem.

These two queries, although similar, are taking massively different amounts of time, and I am not sure why.

SELECT MAX(logdatetime) FROM tlog WHERE pointid = 'and.andrew' (5100ms)
SELECT MAX(logdatetime) FROM tlog WHERE pointid = '359710040139645' (41ms)

logdatetime is indexed, and I ran a VACUUM and REINDEX on the table when I noticed the discrepancy, but it hasn't resolved the issue.
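In case it helps, this is roughly what I ran (the pg_indexes query is just the standard way to list which indexes exist on the table):

VACUUM tlog;
REINDEX TABLE tlog;

-- list the indexes currently defined on tlog
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'tlog';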

Here is the EXPLAIN ANALYZE output for each of the queries:

--- and.andrew
Result  (cost=35.20..35.21 rows=1 width=0) (actual time=1924.811..1924.811 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.00..35.20 rows=1 width=8) (actual time=1924.804..1924.804 rows=1 loops=1)
          ->  Index Scan Backward using logdatetime on tlog  (cost=0.00..2434879.14 rows=69168 width=8) (actual time=1924.800..1924.800 rows=1 loops=1)
                Index Cond: (logdatetime IS NOT NULL)
                Filter: ((pointid)::text = 'and.andrew'::text)
Total runtime: 1924.853 ms

--- 359710040139645
Result  (cost=1.44..1.45 rows=1 width=0) (actual time=32.525..32.525 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.00..1.44 rows=1 width=8) (actual time=32.518..32.518 rows=1 loops=1)
          ->  Index Scan Backward using logdatetime on tlog  (cost=0.00..2434879.14 rows=1688018 width=8) (actual time=32.513..32.513 rows=1 loops=1)
                Index Cond: (logdatetime IS NOT NULL)
                Filter: ((pointid)::text = '359710040139645'::text)
Total runtime: 32.568 ms

The column pointid is defined as character varying (50), in case that matters.

Best Answer

The output from EXPLAIN ANALYZE is very helpful for understanding this. In particular, note these two lines:

->  Index Scan Backward using logdatetime on tlog
    (cost=0.00..2434879.14 rows=69168 width=8)
    (actual time=1924.800..1924.800 rows=1 loops=1)

->  Index Scan Backward using logdatetime on tlog
    (cost=0.00..2434879.14 rows=1688018 width=8)
    (actual time=32.513..32.513 rows=1 loops=1)

I reformatted the lines; note the estimated row counts (rows=69168 vs rows=1688018) and the actual times.

Since you only seem to have an index on logdatetime, that's what the query planner is going to use. Because you only want a single row (MAX(logdatetime)), the planner thinks it's a good idea to start from the back of the index (Index Scan Backward) and scan until the first row with a matching pointid pops up.
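Conceptually, the plan is doing something like the following (a sketch of the strategy, not the literal rewrite PostgreSQL performs internally):

SELECT logdatetime
FROM   tlog
WHERE  pointid = 'and.andrew'   -- filter checked for every row the scan visits
AND    logdatetime IS NOT NULL
ORDER  BY logdatetime DESC      -- walk the logdatetime index from newest to oldest
LIMIT  1;                       -- stop at the first row that passes the filter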

This strategy works out just nicely for pointid = '359710040139645', but not so much for pointid = 'and.andrew'. Rows for '359710040139645' are plentiful (an estimated 1,688,018 of the 25m), so the backward scan hits one almost immediately. Rows for 'and.andrew' are far rarer (an estimated 69,168), and the maximum logdatetime for that pointid is evidently rather old, so a lot of non-matching rows have to be read and thrown away before the first hit (which is the youngest matching row).

Still, that is a lot better than a full table scan, where you would need to read all 25m rows to identify the matching ones and then find the max among those. (Try deleting the index and rerunning the query. Be prepared for a long wait.)
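If you want to verify how unevenly the two values are distributed, a simple check would be (it has to read a fair amount of the table without an index on pointid, so expect it to take a while on 25m rows):

SELECT pointid, count(*) AS rows_per_point
FROM   tlog
WHERE  pointid IN ('and.andrew', '359710040139645')
GROUP  BY pointid;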

If you can afford additional indexes (some cost to write operations and considerable disk space for 25m rows), try adding an index on pointid as well, or better yet, this multi-column index:

CREATE INDEX tlog_pointid_logdatetime ON tlog (pointid, logdatetime DESC);

This would make your query fly.
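Once the index exists, rerunning the analysis should confirm it: the plan for the slow query should switch from the long Index Scan Backward on logdatetime to a short scan on the new index.

EXPLAIN ANALYZE
SELECT MAX(logdatetime) FROM tlog WHERE pointid = 'and.andrew';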

If you don't want indexes to slow down your writes, you might create this index just for the purpose of running a couple of queries and then delete it again. Index creation is pretty fast in PostgreSQL, so this may pay off even for just a few queries.
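A minimal sketch of that throwaway-index approach (the CONCURRENTLY option is my addition here; it avoids blocking concurrent writes while the index is built, at the cost of a slower build):

-- build the index without locking out writes on tlog
CREATE INDEX CONCURRENTLY tlog_pointid_logdatetime ON tlog (pointid, logdatetime DESC);

-- ... run the MAX(logdatetime) queries ...

-- drop it again so it doesn't tax future writes
DROP INDEX tlog_pointid_logdatetime;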