PostgreSQL planner overestimates rows in a self-join query on a large table: how to optimize?

Tags: optimization, performance, postgresql, postgresql-11, query-performance

  • PostgreSQL 11.5
  • 4 cores, 32 GB RAM
  • 250 GB SSD

Not using the default configuration; it has been tuned to fit the system's memory and core count.

Table + index size: 14 GB

I'm working on a query that self-joins a table via a range condition. The table currently has 122 million rows and will keep growing. It fits entirely in memory, and no disk I/O is reported while the query runs.

Table definitions:

                                          Table "api.foo"
   Column    |       Type       | Collation | Nullable |                     Default
-------------+------------------+-----------+----------+-------------------------------------------------
 foo_id  | integer              |           | not null | nextval('api.foo_foo_id_seq'::regclass)
 bar_id  | integer              |           | not null |
 tx          | double precision |           | not null |
Indexes:
    "foo_pkey" PRIMARY KEY, btree (foo_id)
    "foo_bar_id_tx_idx" btree (bar_id, tx)
    "foo_tx_idx" btree (tx) INCLUDE (bar_id)
Foreign-key constraints:
    "foo_bar_id_fkey" FOREIGN KEY (bar_id) REFERENCES api.bar(bar_id) ON DELETE CASCADE

The query in question

 select f1.bar_id as bar1_id, f2.bar_id as bar2_id, count(f2.tx) as matched_tx_count,
    (
      count(f2.tx) /
      sqrt(
          (
            f1c.count ^ 2 +
            (select count(bar_id) from api.foo where bar_id = f2.bar_id) ^ 2
          ) / 2.0
      )
    ) as score
from
(select count(bar_id) from api.foo where bar_id = 2) f1c,
api.foo f1
join api.foo f2 ON f2.tx between f1.tx - public.tx_tol(f1.tx, 15) AND f1.tx + public.tx_tol(f1.tx, 15)
where f1.bar_id = 2
group by f1.bar_id, f2.bar_id, f1c.count;

For completeness' sake, the query uses a helper function (tx_tol) that we use frequently in the database. Here is its definition:

  CREATE OR REPLACE FUNCTION public.tx_tol(
      tx double precision,
      f2mtol double precision)
      RETURNS double precision
      LANGUAGE 'sql'
      COST 100
      IMMUTABLE STRICT
  AS $BODY$
      SELECT f2mtol * tx / 1000000
  $BODY$
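
To illustrate the tolerance the function produces: the half-width of the join range is f2mtol millionths of tx, so for a row with tx = 200 and the f2mtol value of 15 used in the query, the arithmetic works out like this (a worked example, not output from the real system):

    select public.tx_tol(200.0, 15);  -- 15 * 200 / 1000000 = 0.003
    -- so the BETWEEN range for that row is 199.997 .. 200.003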

A sample of what the data in the table looks like:

 foo_id | bar_id |        tx
--------+--------+------------------
      1 |     94 | 187.108249295439
      2 |     94 | 197.126736011514
      3 |     94 | 197.141687194005
      4 |     83 | 204.135479695667
      5 |     83 | 204.155339708171
      6 |     83 | 225.123088580666
      7 |     83 | 317.216454470099
      8 |     71 | 410.226810690235
      9 |     71 | 410.241306155745
     10 |     71 | 428.163179075099

Things to note about this data: due to its nature, the tx column should be unique. There are no duplicate values in that column, and again, there are 122 million rows and counting.

The statistics target for the tx column has been increased to 10000, but I don't think this helps much, since all the values are unique; I could be wrong, though.

Of course, VACUUM ANALYZE has been run.
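
For reference, the statistics target was raised with the standard per-column setting and the table re-analyzed, along these lines (10000 being the value mentioned above):

    -- Raise the per-column statistics target for tx, then re-gather statistics.
    ALTER TABLE api.foo ALTER COLUMN tx SET STATISTICS 10000;
    ANALYZE api.foo;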

The entry point of this query is a chosen bar_id. The query takes all of that bar_id's rows, builds a bounding range around each row's tx value, finds other rows that fall within that range, and reports back the other bar_ids that match, along with their scores.

Essentially, another bar_id may have N rows, but perhaps only 4 of them fall within our bounding range. We therefore calculate a score from the total row counts of both bar_ids and the count of matched rows, and end up with a value between 0 and 1, where 1 is a 100% match.

Example results

 bar1_id | bar2_id | matched_tx_count |                score
---------+---------+------------------+---------------------
       2 |       1 |                3 |   0.0523463761530987
       2 |       2 |               13 |                    1
       2 |       4 |                4 |    0.229039333725547
       2 |       5 |                8 |    0.263038379688572
       2 |       9 |                4 |    0.125925710004112

So we can see that our id of interest is 2. It matched against itself with a score of 1, which is correct: id 2 has a total of 13 rows in this table, and all 13 rows matched against itself within the bounding range, so the score is 1. This shows that the score arithmetic explained above is working.
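
As a sanity check of the formula, the self-match can be worked out by hand; bar 2 has 13 rows and all 13 of them matched, so the score works out to 1:

    select 13 / sqrt((13 ^ 2 + 13 ^ 2) / 2.0) as score;  -- 13 / sqrt(169) = 13 / 13 = 1
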
Here is the EXPLAIN (ANALYZE) output:


 GroupAggregate  (cost=134704606.22..820108973.55 rows=120395600 width=32) (actual time=446.620..8087.029 rows=403181 loops=1)
   Output: f1.bar_id, f2.bar_id, count(f2.tx), ((count(f2.tx))::double precision / sqrt((((((count(foo.bar_id)))::double precision ^ '2'::double precision) + (((SubPlan 1))::double precision ^ '2'::double precision)) / '2'::double precision))), (count(foo.bar_id))
   Group Key: f1.bar_id, f2.bar_id, (count(foo.bar_id))
   ->  Sort  (cost=134704606.22..136368781.44 rows=665670089 width=24) (actual time=446.530..675.656 rows=485559 loops=1)
         Output: f1.bar_id, f2.bar_id, (count(foo.bar_id)), f2.tx
         Sort Key: f2.bar_id, (count(foo.bar_id))
         Sort Method: quicksort  Memory: 50223kB
         ->  Nested Loop  (cost=6.72..23498443.66 rows=665670089 width=24) (actual time=0.082..199.911 rows=485559 loops=1)
               Output: f1.bar_id, f2.bar_id, (count(foo.bar_id)), f2.tx
               ->  Nested Loop  (cost=6.13..11.53 rows=50 width=20) (actual time=0.045..0.075 rows=13 loops=1)
                     Output: (count(foo.bar_id)), f1.bar_id, f1.tx
                     ->  Aggregate  (cost=5.57..5.58 rows=1 width=8) (actual time=0.040..0.041 rows=1 loops=1)
                           Output: count(foo.bar_id)
                           ->  Index Only Scan using foo_bar_id_tx_idx on api.foo  (cost=0.57..5.44 rows=50 width=4) (actual time=0.032..0.034 rows=13 loops=1)
                                 Output: foo.bar_id, foo.tx
                                 Index Cond: (foo.bar_id = 2)
                                 Heap Fetches: 0
                     ->  Index Only Scan using foo_bar_id_tx_idx on api.foo f1  (cost=0.57..5.44 rows=50 width=12) (actual time=0.004..0.020 rows=13 loops=1)
                           Output: f1.bar_id, f1.tx
                           Index Cond: (f1.bar_id = 2)
                           Heap Fetches: 0
               ->  Index Only Scan using foo_tx_idx on api.foo f2  (cost=0.58..336834.62 rows=13313402 width=12) (actual time=0.021..11.343 rows=37351 loops=13)
                     Output: f2.tx, f2.bar_id
                     Index Cond: ((f2.tx >= (f1.tx - (('15'::double precision * f1.tx) / '1000000'::double precision))) AND (f2.tx <= (f1.tx + (('15'::double precision * f1.tx) / '1000000'::double precision))))
                     Heap Fetches: 0
   SubPlan 1
     ->  Aggregate  (cost=5.57..5.58 rows=1 width=8) (actual time=0.017..0.018 rows=1 loops=403181)
           Output: count(foo_1.bar_id)
           ->  Index Only Scan using foo_bar_id_tx_idx on api.foo foo_1  (cost=0.57..5.44 rows=50 width=4) (actual time=0.004..0.012 rows=73 loops=403181)
                 Output: foo_1.bar_id, foo_1.tx
                 Index Cond: (foo_1.bar_id = f2.bar_id)
                 Heap Fetches: 0
 Planning Time: 0.384 ms
 Execution Time: 8124.724 ms
(34 rows)

The execution time varies, from 5 seconds all the way to 30+ seconds, depending on how many rows get returned: some runs return 200k rows, others 1.6M and up.

One thing I'm curious about is that the query planner drastically overestimates the row count. I'm not 100% sure whether this is something to be concerned about, or whether it is due to the nature of the data and can be ignored because adequate statistics simply cannot be gathered. I've had estimates shoot into the billions even when only a couple million rows are returned.

My main questions are: is this query a good way of going about it, and are there any suggestions for improving its performance?

Would partitioning by tx range be a way to increase performance further, since it could lower the number of rows that have to be examined?

Maybe this is the best we're going to get, but I'm not sure whether there are better ways to optimize it. We can work with this for now, but I'm hoping there are other suggestions I can try in order to squeeze out as much performance as possible.

Best Answer

You are fundamentally doing a lot of work, and doing a lot of work takes a lot of time. It is not clear you can do much better here.

One possibility is to pre-compute the counts for the whole table:

select bar_id, count(bar_id) from api.foo group by bar_id

and join to that rather than running the correlated subselect inside the sqrt().
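
A sketch of what that rewrite could look like (untested), computing the per-bar_id counts once in a CTE and joining them in by bar_id; a materialized view or temporary table holding the counts would work just as well:

    -- In PostgreSQL 11 the CTE is materialized once, which is the point here:
    -- the per-bar_id counts are computed a single time instead of once per group.
    with counts as (
        select bar_id, count(bar_id) as cnt
        from api.foo
        group by bar_id
    )
    select f1.bar_id as bar1_id, f2.bar_id as bar2_id, count(f2.tx) as matched_tx_count,
           count(f2.tx) / sqrt((c1.cnt ^ 2 + c2.cnt ^ 2) / 2.0) as score
    from api.foo f1
    join api.foo f2
      on f2.tx between f1.tx - public.tx_tol(f1.tx, 15)
                   and f1.tx + public.tx_tol(f1.tx, 15)
    join counts c1 on c1.bar_id = f1.bar_id
    join counts c2 on c2.bar_id = f2.bar_id
    where f1.bar_id = 2
    group by f1.bar_id, f2.bar_id, c1.cnt, c2.cnt;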

I see no reason to think that partitioning would help in this case.

The bad estimates are unlikely to be a problem, because it is not clear what other plan would be chosen if the estimates were correct. The fundamental problem there is that PostgreSQL doesn't know how tight a range is indicated by your BETWEEN clause.