You write:
Each customer can have multiple sites, but only one should be
displayed in this list.
Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site
is to be picked.
Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?
From the numbers I see in your EXPLAIN, indexes will give you nothing for this query; sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning? Or maybe the query itself is faster and "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.
If this query is your bottleneck, I would suggest you implement a materialized view.
After you provided more information, I find that my diagnosis pretty much holds.
The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected: the slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.
If you go the route with a materialized view like I proposed, you could add a serial number without gaps plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.
Then you can query:
SELECT *
FROM materialized_view
WHERE mv_id >= 2700
AND mv_id < 2800;
This will perform very fast. LIMIT / OFFSET cannot compete: that has to compute the whole table before it can sort and pick 100 rows.
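The setup for this might look as follows. This is only a sketch: the view name matches the query above, but the inner query and the sort column customer_name stand in for your actual query and sort criteria. CREATE MATERIALIZED VIEW requires Postgres 9.3+; on 9.1/9.2 you would maintain a plain table yourself.

```sql
-- Sketch: "customer_list" and "customer_name" are placeholders
-- for your actual query and sort criteria.
CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY customer_name) AS mv_id, c.*
FROM   customer_list c;  -- your full query here

-- btree index to make the range scan on mv_id fast
CREATE UNIQUE INDEX materialized_view_mv_id_idx ON materialized_view (mv_id);
```

With that in place, the WHERE mv_id >= 2700 AND mv_id < 2800 query above is a plain index range scan fetching exactly 100 rows.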
pgAdmin timing
When you execute a query from the query tool, the message pane shows something like:
Total query runtime: 62 ms.
And the status line shows the same time. I quote pgAdmin help about that:
The status line will show how long the last query took to complete. If
a dataset was returned, not only the elapsed time for server execution
is displayed, but also the time to retrieve the data from the server
to the Data Output page.
If you want to see the time spent on the server, you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:
Total runtime: 0.269 ms
Seems you are running into a weakness of the query planner: the best index is sometimes not used for joining tables. I had a similar problem here:
Algorithm for finding the longest prefix (chapter "Failed attempt with text_pattern_ops")
In Postgres 9.3 you could try this version with LEFT JOIN LATERAL:
SELECT *
FROM (
SELECT coord
FROM taduler.postal_code
WHERE postal_code = 'T1K0T4'
LIMIT 1
) pc
LEFT JOIN LATERAL (
SELECT *
FROM public.timezones tz
WHERE ST_Intersects(pc.coord, tz.geom)
) tz ON TRUE;
Something similar worked for @ypercube's solution in this related answer. LATERAL requires Postgres 9.3+, though.
In PostgreSQL 9.1, it might help to encapsulate the first query in a CTE, but I doubt it. (I don't have a PostGIS installation here to test.)
WITH pc AS (
SELECT coord
FROM taduler.postal_code
WHERE postal_code = 'T1K0T4'
LIMIT 1
)
SELECT *
FROM pc
JOIN public.timezones tz ON ST_Intersects(pc.coord, tz.geom);
A plpgsql function to encapsulate two separate queries should certainly do the trick:
CREATE OR REPLACE FUNCTION f_get_tz(_pc text)
RETURNS SETOF public.timezones AS
$func$
DECLARE
_coord geometry;
BEGIN
SELECT coord
INTO _coord
FROM taduler.postal_code
WHERE postal_code = _pc
LIMIT 1;
RETURN QUERY
SELECT *
FROM public.timezones tz
WHERE ST_Intersects(_coord, tz.geom);
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_get_tz('T1K0T4');
Best Answer
There might be hardware issues, too - how should we know? But there are certainly issues with the query.
First of all, remove DISTINCT from your VIEW definition. It is doing nothing at all (but complicating and slowing things down); there is a related answer on SO with an explanation. Removing it leaves a cleaned-up query.
Next, the query looks suspiciously like it is going very wrong: two uncorrelated joins can multiply rows. With 150k rows in each table, there is potential for a huge (unintended) Cartesian product. My educated guess is that you really want both joins correlated to the base table's rows.
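To see why uncorrelated joins are dangerous, here is a toy demo (with made-up inline data, not your tables): joining twice on the same key without correlating the two joins to each other multiplies the matching rows.

```sql
-- Two rows match in each joined set, so the result has 2 x 2 = 4 rows.
-- Any count() over this result is inflated by the cross-multiplication.
SELECT *
FROM  (VALUES (1)) AS t(id)
JOIN  (VALUES (1, 'a'), (1, 'b')) AS x(id, v) ON x.id = t.id
JOIN  (VALUES (1, 'c'), (1, 'd')) AS y(id, w) ON y.id = t.id;
```

With 150k rows per table the same effect can blow the intermediate result up to billions of rows.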
I suspect that GROUP BY is also not needed in the outer SELECT now. Besides fixing the count, this might also be faster by several orders of magnitude, avoiding the proxy cross-join. You could first join to ipaddrs (especially if you have predicates filtering rows from it) and then aggregate, or first aggregate in a subquery. The usefulness of this variant largely depends on data distribution: it is great for few ipaddr values
with big counts.
Finally, you need index support. Equality between ipaddr and addr is covered by the default btree indexes of the PRIMARY KEY; the query on the whole table is probably using a sequential scan anyway. For the "is contained by" operator << you need a GIN or GiST index. The best option is the new inet_ops GiST index operator class in Postgres 9.4, which supports the data types inet and cidr.
I am not sure whether the index can be used in a plain INNER (or OUTER) join; I can't test right now. Maybe you need correlated subqueries or a LATERAL join to utilize the index. For older Postgres versions there is separate advice on indexing.
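Putting the last two points together, a sketch under assumed names (a table tbl with an inet column addr, and ipaddrs with an inet column net; all identifiers here are hypothetical):

```sql
-- Postgres 9.4+: GiST index with the inet_ops operator class.
CREATE INDEX ipaddrs_net_gist_idx ON ipaddrs USING gist (net inet_ops);

-- LATERAL forces one lookup per row of tbl, where the planner
-- can use the GiST index for the "is contained by" operator <<.
SELECT t.*, i.*
FROM   tbl t
LEFT   JOIN LATERAL (
   SELECT *
   FROM   ipaddrs i
   WHERE  t.addr << i.net
   ) i ON true;
```

This mirrors the LEFT JOIN LATERAL pattern shown earlier and, like it, requires Postgres 9.3+ for LATERAL (9.4 for inet_ops).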