PostgreSQL – Fast Count After Join on Containment Expression

Tags: database-design, index-tuning, join, performance, postgresql, postgresql-performance

I have three tables in a PostgreSQL database that I'm querying via a view and some joins.

CREATE TABLE network_info (
  network         CIDR          NOT NULL,
  some_info       TEXT          NULL,
  PRIMARY KEY (network)
);

CREATE TABLE ipaddr_info (
  ipaddr          INET          NOT NULL,
  some_info       INT           NULL,
  PRIMARY KEY (ipaddr, some_info)
);

CREATE TABLE ipaddrs (
  addr            INET          NOT NULL,
  PRIMARY KEY (addr)
);

CREATE VIEW ipaddr_summary AS
SELECT DISTINCT
  i.addr                  AS ip_address,
  a.some_info             AS network_info,
  COUNT(b.ipaddr)         AS ip_info_count
FROM ipaddrs AS i
LEFT JOIN network_info AS a
  ON (i.addr << a.network)
LEFT JOIN ipaddr_info AS b
  ON (i.addr = b.ipaddr)
GROUP BY i.addr, a.some_info
;

All of the tables have ~150K rows right now, and it takes a really long time (~3 hours) to run SELECT * FROM ipaddr_summary; on an Intel Pentium 4 2.8 GHz dual core with 2 GB of memory running PostgreSQL 9.3.

Is there a way I can restructure or optimize this particular schema or view to make the query faster, or is it a hardware issue? I'm going to spin up a beefy VM in the cloud and test, but wanted to see if there's a way to optimize without just throwing more hardware at it.

Best Answer

There might be hardware issues, too - how should we know? But there are certainly issues with the query.

First of all, remove DISTINCT from your VIEW definition. GROUP BY already returns one row per (addr, some_info) combination, so DISTINCT does nothing at all except complicate and slow things down. Related answer on SO with explanation:

PostgreSQL - Slow query joining on a VIEW

Arriving at this (cleaned up) query:

SELECT i.addr      AS ip_address
     , a.some_info AS network_info
     , COUNT(b.ipaddr) AS ip_info_count
FROM   ipaddrs           i
LEFT   JOIN ipaddr_info  b  ON i.addr = b.ipaddr
LEFT   JOIN network_info a  ON i.addr << a.network
GROUP  BY 1,2;

Next, the query looks suspiciously like it's going very wrong. Two uncorrelated joins can multiply rows:

Two SQL LEFT JOINS produce incorrect result
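
To make the multiplication concrete, here is a minimal, self-contained sketch using made-up sample rows (the addresses and values are assumptions, not data from the question). One address with two rows in ipaddr_info and two enclosing networks comes back as 2 x 2 = 4 joined rows:

```sql
-- Hypothetical sample data, inlined as CTEs so nothing has to be created.
-- The CTE names shadow the real tables for the duration of this query only.
WITH ipaddrs(addr) AS (
   VALUES ('10.0.0.1'::inet)
   )
, ipaddr_info(ipaddr, some_info) AS (
   VALUES ('10.0.0.1'::inet, 1)
        , ('10.0.0.1'::inet, 2)
   )
, network_info(network, some_info) AS (
   VALUES ('10.0.0.0/8'::cidr , 'big net')
        , ('10.0.0.0/24'::cidr, 'small net')
   )
SELECT count(*) AS joined_rows   -- 4: each ipaddr_info row pairs with each matching network
FROM   ipaddrs           i
LEFT   JOIN ipaddr_info  b ON i.addr = b.ipaddr
LEFT   JOIN network_info a ON i.addr << a.network;
```

With n matching info rows and m matching networks per address, the intermediate result holds n x m rows for that address before GROUP BY collapses them.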

With 150k rows in each table, there is potential for a huge (unintended) Cartesian product. My educated guess is that you really want this:

SELECT addr        AS ip_address
     , a.some_info AS network_info
     , b.ip_info_count
FROM   ipaddrs i
LEFT   JOIN (
   SELECT  ipaddr AS addr, count(*) AS ip_info_count
   FROM    ipaddr_info
   GROUP   BY 1
  ) b USING (addr)
LEFT   JOIN network_info a ON i.addr << a.network;

I suspect that GROUP BY is also not needed in the outer SELECT now. Besides fixing the count, this might also be faster by several orders of magnitude, avoiding the proxy cross-join.

You could first join to ipaddrs (especially if you have predicates filtering rows from it) and then aggregate, or first aggregate in the subquery as displayed. How useful the displayed variant is largely depends on data distribution: aggregating first pays off when there are few distinct ipaddr values with many rows each. Details:

Slow queries related to subqueries using aggregation
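
As a sketch of the other direction (filter ipaddrs first, then count only for the surviving rows), a correlated subquery can stand in for the pre-aggregated derived table. The WHERE predicate is a made-up example filter, not something from the question. One difference in behavior to be aware of: count(*) in a correlated subquery returns 0 for addresses without rows in ipaddr_info, while the LEFT JOIN variant returns NULL.

```sql
SELECT i.addr      AS ip_address
     , a.some_info AS network_info
     , (SELECT count(*)
        FROM   ipaddr_info b
        WHERE  b.ipaddr = i.addr) AS ip_info_count  -- 0, not NULL, when nothing matches
FROM   ipaddrs i
LEFT   JOIN network_info a ON i.addr << a.network
WHERE  i.addr << '10.0.0.0/8'::inet;                -- hypothetical filter on ipaddrs
```

The lookup on ipaddr_info can use the primary key index, since ipaddr is its leading column.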

Finally, you need index support. Equality between ipaddr and addr is covered by the default btree indexes of the PRIMARY KEY. The query on the whole table is probably using a sequential scan anyway.

For the "is contained by" operator << you'll need a GIN or GiST index. The best option would be the new inet_ops GiST index operator class in Postgres 9.4 (supports data types inet and cidr):

CREATE INDEX ON network_info USING gist (network inet_ops);
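
To check whether the index is actually picked up, one could look at the plan for a single lookup. This is only a sketch: the address literal is made up, and the planner may still prefer a sequential scan on a small table.

```sql
-- Assumes the GiST index above exists (Postgres 9.4+).
EXPLAIN (ANALYZE, BUFFERS)
SELECT some_info
FROM   network_info
WHERE  network >> '10.1.2.3'::inet;   -- hypothetical address
```

An Index Scan or Bitmap Index Scan node on the new index in the output indicates that the index is being used.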

Not sure if the index can be used in a plain INNER (or OUTER) join. Can't test right now. Maybe you need correlated subqueries or a LATERAL join to utilize the index:

SELECT addr AS ip_address
     , a.network_info
     , b.ip_info_count
FROM   ipaddrs i
LEFT   JOIN (
   SELECT  ipaddr AS addr, count(*) AS ip_info_count
   FROM    ipaddr_info
   GROUP   BY 1
  ) b USING (addr)
LEFT   JOIN LATERAL (
   SELECT some_info AS network_info
   FROM   network_info
   WHERE  network >> i.addr
   ) a ON TRUE;
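
If several networks in network_info can contain the same address (nested prefixes), the LATERAL subquery above returns one row per match. Assuming you only want the most specific network, which the question does not state, a sketch could sort by prefix length and keep the first row:

```sql
SELECT addr AS ip_address
     , a.network_info
     , b.ip_info_count
FROM   ipaddrs i
LEFT   JOIN (
   SELECT ipaddr AS addr, count(*) AS ip_info_count
   FROM   ipaddr_info
   GROUP  BY 1
   ) b USING (addr)
LEFT   JOIN LATERAL (
   SELECT some_info AS network_info
   FROM   network_info
   WHERE  network >> i.addr
   ORDER  BY masklen(network) DESC   -- longest prefix first (assumed preference)
   LIMIT  1                          -- at most one network per address
   ) a ON TRUE;
```

With LIMIT 1 in place, the result has exactly one row per address in ipaddrs.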

Advice for indexing in older versions:

PostgreSQL field data type for IPv4 addresses

Related Solutions

Postgresql – Postgres multiple joins slow query, how to store default child record

You write:

Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.

Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?

From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?

Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.

If this query is your bottleneck, I would suggest you implement a materialized view.

After you provided more information I find that my diagnosis pretty much holds.

The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.

If you go the route with a materialized view like I proposed, you could add a serial number without gaps to the table, plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.
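
A minimal sketch of that setup, assuming Postgres 9.3+ (for CREATE MATERIALIZED VIEW) and hypothetical tables customer and site with the columns shown; none of these table or column names come from the question. On older versions, a plain table filled with CREATE TABLE ... AS serves the same purpose.

```sql
-- Hypothetical schema: customer(customer_id, name), site(site_id, customer_id, site_name)
CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY c.name, c.customer_id, s.site_id) AS mv_id  -- gapless serial number
     , c.customer_id
     , c.name
     , s.site_name
FROM   customer  c
LEFT   JOIN site s USING (customer_id);

-- Index to support range lookups on the serial number.
CREATE UNIQUE INDEX ON materialized_view (mv_id);

-- Re-run whenever the underlying data changes:
REFRESH MATERIALIZED VIEW materialized_view;
```

The ORDER BY list ends in unique columns so that the numbering is deterministic across refreshes.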

Then you can query:

SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;

This will perform very fast. LIMIT / OFFSET cannot compete: it needs to compute the whole table before it can sort and pick 100 rows.

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server, you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:

Total runtime: 0.269 ms
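
For a quick way to see the difference, EXPLAIN ANALYZE can be run on any statement; the one below is a throwaway example that needs no tables. The server-side time appears at the bottom of the output (labeled Total runtime in the versions discussed here; newer releases label it as execution time instead).

```sql
-- Executes the query on the server and reports timing,
-- without shipping the result rows to the client.
EXPLAIN ANALYZE
SELECT count(*)
FROM   generate_series(1, 100000) AS g;
```

Comparing this figure with the time shown in the pgAdmin status line gives a rough idea of how much is spent on transfer and rendering.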

Postgresql subquery speed much slower than individual queries

It seems you are running into a weakness of the query planner: the best index is sometimes not used for joining tables. I had a similar problem here:
Algorithm for finding the longest prefix (Chapter "Failed attempt with text_pattern_ops")

In Postgres 9.3 you could try this version with LEFT JOIN LATERAL:

SELECT *
FROM  (
    SELECT coord
    FROM   taduler.postal_code
    WHERE  postal_code = 'T1K0T4'
    LIMIT  1
    ) pc
LEFT JOIN LATERAL (
    SELECT *
    FROM   public.timezones tz
    WHERE  ST_Intersects(pc.coord, tz.geom)
   ) tz ON TRUE;

Something similar worked for @ypercube's solution in this related answer.
LATERAL requires Postgres 9.3+, though.

In PostgreSQL 9.1, it might help to encapsulate the first query in a CTE, but I doubt it. (I don't have a PostGIS installation here to test.):

WITH pc AS (
    SELECT coord
    FROM   taduler.postal_code
    WHERE  postal_code = 'T1K0T4'
    LIMIT  1
    )
SELECT *
FROM   pc
JOIN   public.timezones tz ON ST_Intersects(pc.coord, tz.geom);
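
The CTE idea relies on the CTE being evaluated separately from the outer query. That was always the case in the versions discussed here, but Postgres 12 and later may fold a simple CTE into the outer query. As a sketch for those newer versions only, the MATERIALIZED keyword asks for the old behavior explicitly; whether it helps this particular query is untested.

```sql
WITH pc AS MATERIALIZED (   -- keyword available in Postgres 12+
    SELECT coord
    FROM   taduler.postal_code
    WHERE  postal_code = 'T1K0T4'
    LIMIT  1
    )
SELECT *
FROM   pc
JOIN   public.timezones tz ON ST_Intersects(pc.coord, tz.geom);
```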

A plpgsql function to encapsulate two separate queries should certainly do the trick:

CREATE OR REPLACE FUNCTION f_get_tz(_pc text)
  RETURNS SETOF public.timezones AS
$func$
DECLARE
   _coord taduler.postal_code.coord%TYPE;  -- "geom" is not a data type; copy the column's type instead
BEGIN

SELECT coord
INTO  _coord
FROM   taduler.postal_code
WHERE  postal_code = _pc
LIMIT  1;

RETURN QUERY
SELECT *
FROM   public.timezones tz
WHERE  ST_Intersects(_coord, tz.geom);

END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_get_tz('T1K0T4');
