PostgreSQL – Fast Count After Join on Containment Expression

database-designindex-tuningjoin;performancepostgresqlpostgresql-performance

I have three tables in a PostgreSQL database that I'm querying via a view and some joins.

CREATE TABLE network_info (
  network         CIDR          NOT NULL,
  some_info       TEXT          NULL,
  PRIMARY KEY (network)
);

CREATE TABLE ipaddr_info (
  ipaddr          INET          NOT NULL,
  some_info       INT           NULL,
  PRIMARY KEY (ipaddr, some_info)
);

CREATE TABLE ipaddrs (
  addr            INET          NOT NULL,
  PRIMARY KEY (addr)
);

CREATE VIEW ipaddr_summary AS
SELECT DISTINCT
  i.addr                  AS ip_address,
  a.some_info             AS network_info,
  COUNT(b.ipaddr)         AS ip_info_count
FROM ipaddrs AS i
LEFT JOIN network_info AS a
  ON (i.addr << a.network)
LEFT JOIN ipaddr_info AS b
  ON (i.addr = b.ipaddr)
GROUP BY i.addr, a.some_info
;

All of the tables have ~150K rows right now, and it takes a really long time (~3 hours) to run SELECT * from ipaddr_summary; on an Intel Pentium 4 2.8GHz dual core with 2G of memory running PostgreSQL 9.3.

Is there a way I can restructure or optimize this particular schema or view to make the query faster, or is it a hardware issue? I'm going to spin up a beefy VM in the cloud and test, but wanted to see if there's a way to optimize w/out just throwing more hardware at it.

Best Answer

There might be hardware issues, too - how should we know? But there are certainly issues with the query.

First of all, remove DISTINCT from your VIEW definition. It's doing nothing at all (but complicating and slowing things down). Related answer on SO with explanation:

Arriving at this (cleaned up) query:

SELECT i.addr      AS ip_address
     , a.some_info AS network_info
     , COUNT(b.ipaddr) AS ip_info_count
FROM   ipaddrs           i
LEFT   JOIN ipaddr_info  b  ON i.addr = b.ipaddr
LEFT   JOIN network_info a  ON i.addr << a.network
GROUP  BY 1,2;

Next, the query looks suspiciously like it's going very wrong. Two uncorrelated joins can multiply rows:

With 150k rows in each table, there is potential for a huge (unintended) Cartesian product. My educated guess, you really want this:

SELECT addr        AS ip_address
     , a.some_info AS network_info
     , b.ip_info_count
FROM   ipaddrs i
LEFT   JOIN (
   SELECT  ipaddr AS addr, count(*) AS ip_info_count
   FROM    ipaddr_info
   GROUP   BY 1
  ) b USING (addr)
LEFT   JOIN network_info a ON i.addr << a.network

I suspect that GROUP BY is also not needed in the outer SELECT now. Besides fixing the count, this might also be faster by several orders of magnitude, avoiding the proxy cross-join.

You could first join to ipaddrs (especially if you have predicates filtering rows from it) and then aggregate, or first aggregate in the subquery like displayed. Usefulness of this variant largely depends on data distribution. It's great for few ipaddr with big counts. Details:

Finally, you need index support. Equality between ipaddr and addr is covered by the default btree indexes of the PRIMARY KEY. The query on the whole table is probably using a sequential scan anyway.

For the "is contained by" operator << operator you'll need a GIN or GiST index. The best option would be the new inet_ops GiST index operator class in Postgres 9.4 (supports data types inet and cidr):

CREATE INDEX ON network_info USING gist (network inet_ops);

Not sure if the index can be used in a plain INNER (or OUTER) join. Can't test right now. Maybe you need correlated subqueries or a LATERAL join to utilize the index:

SELECT addr AS ip_address
     , a.network_info
     , b.ip_info_count
FROM   ipaddrs i
LEFT   JOIN (
   SELECT  ipaddr AS addr, count(*) AS ip_info_count
   FROM    ipaddr_info
   GROUP   BY 1
  ) b USING (addr)
LEFT   JOIN LATERAL (
   SELECT some_info AS network_info
   FROM   network_info
   WHERE  network >> i.addr
   ) a ON TRUE;

Advice for indexing in older versions: