You write:
Each customer can have multiple sites, but only one should be
displayed in this list.
Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.
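If you do want to pick exactly one row per customer, DISTINCT ON is the simplest tool in Postgres. A minimal sketch; the table and column names (customer, site, customer_id, created_at) are assumptions, since the actual schema isn't shown:

```sql
-- Pick the newest site per customer (assumed columns; adapt to your schema).
SELECT DISTINCT ON (s.customer_id)
       c.customer_name, s.site_name
FROM   customer c
JOIN   site     s ON s.customer_id = c.customer_id
ORDER  BY s.customer_id, s.created_at DESC;
```

DISTINCT ON keeps the first row per customer_id in the given sort order, so the ORDER BY defines which site "wins".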
Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?
From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?
Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.
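To separate server time from transfer time, EXPLAIN ANALYZE actually runs the query and reports real timings. A sketch, using the same assumed table names as above:

```sql
-- Unlike plain EXPLAIN (estimates only), EXPLAIN ANALYZE executes the query
-- and shows actual row counts and timing per plan node, plus a total runtime
-- measured on the server, i.e. excluding network transfer to the client.
EXPLAIN ANALYZE
SELECT c.customer_name, s.site_name
FROM   customer c
JOIN   site s ON s.customer_id = c.customer_id;
```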
If this query is your bottleneck, I would suggest you implement a materialized view.
After you provided more information, I find that my diagnosis pretty much holds.
The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus SSH encoding/decoding, possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.
If you go the route with a materialized view like I proposed, you could add a serial number without gaps to the table, plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.
Then you can query:
SELECT *
FROM materialized_view
WHERE mv_id >= 2700
AND mv_id < 2800;
This will perform very fast. LIMIT / OFFSET cannot compete; that approach needs to compute the whole result before it can sort and pick 100 rows.
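Putting the pieces together, the materialized view could be created like this. This is only a sketch: the base query, the sort criterion (customer_name) and the view name are placeholders for your actual ones:

```sql
-- Materialize the result with a gapless row number for fast range paging.
CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY sub.customer_name) AS mv_id
     , sub.*
FROM  (
   SELECT c.customer_name, s.site_name   -- your original query here
   FROM   customer c
   JOIN   site s ON s.customer_id = c.customer_id
   ) sub;

-- The index makes "WHERE mv_id >= x AND mv_id < y" an index range scan.
CREATE UNIQUE INDEX materialized_view_mv_id_idx ON materialized_view (mv_id);
```

Remember that a materialized view is a snapshot; run REFRESH MATERIALIZED VIEW when the underlying data changes.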
pgAdmin timing
When you execute a query from the query tool, the message pane shows something like:
Total query runtime: 62 ms.
And the status line shows the same time. I quote pgAdmin help about that:
The status line will show how long the last query took to complete. If
a dataset was returned, not only the elapsed time for server execution
is displayed, but also the time to retrieve the data from the server
to the Data Output page.
If you want to see the time on the server, you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:
Total runtime: 0.269 ms
In a GiST index, the order of columns has a different significance than in a B-tree index. Per documentation:
A multicolumn GiST index can be used with query conditions that
involve any subset of the index's columns. Conditions on additional
columns restrict the entries returned by the index, but the condition
on the first column is the most important one for determining how much
of the index needs to be scanned. A GiST index will be relatively
ineffective if its first column has only a few distinct values, even
if there are many distinct values in additional columns.
In short: put the most selective columns first.
Your EXPLAIN output shows that the condition on pid is more selective (rows=7836) than the one on outline (rows=63112). If that can be generalized (a single example may be misleading), I suggest this alternative:
CREATE INDEX inventory_compound_idx ON portal.inventory USING gist (pid, outline);
If most of your (important) queries include conditions on both columns, a multicolumn index may serve you well. Otherwise, separate single-column indexes may be better overall.
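Note that a plain integer column like pid can only take part in a GiST index with the additional module btree_gist installed. A query shape that could use the proposed index, with made-up coordinate values:

```sql
-- Required once per database for integer columns in a GiST index:
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Both conditions can use the index; the leading, more selective pid
-- determines how much of the index needs to be scanned.
SELECT *
FROM   portal.inventory
WHERE  pid = 42
AND    outline && ST_MakeEnvelope(-74.1, 40.5, -73.7, 40.9, 4326)::geography;
```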
Table layout
This is an educated guess since I don't have complete information.
Don't use oid as a column name. It's easy to confuse with the system column oid.
Don't use the name date for a timestamp column. Or rather: don't use the name date for any column; don't use names of basic types as identifiers at all. It can lead to confusing mistakes and error messages.
Create a lookup table for types and only put a small integer type_id into the big table. Pack fixed-length types tightly so as not to waste space to padding. Details.
I prefer the type text (or varchar without length limit) over varchar(n). Details.
For example:
CREATE TABLE portal.inventory (
inventory_id bigint PRIMARY KEY
,type_id integer NOT NULL REFERENCES inv_type(type_id)
,pid integer NOT NULL
,size bigint NOT NULL
,ts timestamp NOT NULL
,outline geography(Polygon,4326)
,product_name text
,path text
);
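The example references a lookup table inv_type, which could look like this (a sketch; the name and type_id come from the REFERENCES clause above, the text column is an assumption):

```sql
CREATE TABLE inv_type (
  type_id   integer PRIMARY KEY
 ,type_name text NOT NULL UNIQUE
);
```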
Best Answer
If it is hanging forever, without significant CPU or I/O activity, it is waiting on a lock. The first thing I would try here is checking pg_stat_activity to see if something else is running that might conflict. Maybe someone else has an open transaction in another window that holds conflicting locks? Then I would check pg_locks.
Then I would look at pg_terminate_backend() on the offending process and try again.
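A diagnostic query along these lines can reveal the blocked session; a sketch for modern Postgres versions (9.2+, where pg_stat_activity has pid and query columns):

```sql
-- Show ungranted lock requests together with the waiting session's query:
SELECT a.pid, a.query, l.locktype, l.mode, l.granted
FROM   pg_locks l
JOIN   pg_stat_activity a ON a.pid = l.pid
WHERE  NOT l.granted;

-- If a session must go, cancel its query or terminate it (pid from above):
-- SELECT pg_cancel_backend(<pid>);
-- SELECT pg_terminate_backend(<pid>);
```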
However, CREATE TABLE is a weird one. I can't think, off-hand, of any case where this should be blocked. I wonder what kinds of locks are required for it and under what circumstances they can conflict.
Edit: It occurs to me that the locking could be lower-level than Postgres-level locks and involve semaphores against shared memory segments. But if that's the case, your best bet is to take this up with Amazon, since it would be more of an OS issue.