You write:
Each customer can have multiple sites, but only one should be
displayed in this list.
Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.
Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan), so hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?
From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?
Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.
If this query is your bottleneck, I would suggest you implement a materialized view.
After you provided more information, I find that my diagnosis pretty much holds.
The query itself needs 27 ms, so not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.
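A minimal sketch of what that could look like as a plain query; the table and column names here are placeholders, not taken from your schema:

SELECT c.customer_name, s.site_name
FROM   customer c
JOIN   site s ON s.customer_id = c.customer_id
ORDER  BY c.customer_name
LIMIT  100 OFFSET 0;  -- increase the offset to fetch the next page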
If you go the route with a materialized view as I proposed, you could add a serial number without gaps to the table, plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.
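A rough sketch of how that could look, assuming a PostgreSQL version with CREATE MATERIALIZED VIEW (on older versions you would build an ordinary table the same way); the table, column and sort names are placeholders, not taken from your schema:

CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY c.customer_name) AS mv_id,  -- <your sort criteria here>
       c.customer_name,
       s.site_name
FROM   customer c
JOIN   site s ON s.customer_id = c.customer_id;

CREATE UNIQUE INDEX materialized_view_mv_id_idx ON materialized_view (mv_id);

-- refresh whenever the base data changes
REFRESH MATERIALIZED VIEW materialized_view;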
Then you can query:
SELECT *
FROM materialized_view
WHERE mv_id >= 2700
AND mv_id < 2800;
This will perform very fast. LIMIT / OFFSET cannot compete; those have to compute the whole table before they can sort and pick 100 rows.
pgAdmin timing
When you execute a query from the query tool, the message pane shows something like:
Total query runtime: 62 ms.
And the status line shows the same time. I quote pgAdmin help about that:
The status line will show how long the last query took to complete. If
a dataset was returned, not only the elapsed time for server execution
is displayed, but also the time to retrieve the data from the server
to the Data Output page.
If you want to see the time on the server, you need to use SQL EXPLAIN ANALYZE, or the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:
Total runtime: 0.269 ms
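If you prefer plain SQL over the shortcut, prefixing the statement does the same; the query below is only a stand-in for yours:

EXPLAIN ANALYZE
SELECT c.customer_name, s.site_name
FROM   customer c
JOIN   site s ON s.customer_id = c.customer_id;

The total runtime reported there is measured on the server, so it excludes the network transfer and rendering mentioned above.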
There are a few problems with your tables. I'll try to address the foreign keys first, since your question asked about them :)
But before that, we should realize that the two sets of tables (the first three you created and the second set, which you created after dropping the first set) are the same. Of course, the definition of Table3 in your second attempt has syntax and logical errors, but the basic idea is:
CREATE TABLE table3 (
"ID" bigint NOT NULL DEFAULT '0',
"DataID" bigint DEFAULT NULL,
"Address" numeric(20) DEFAULT NULL,
"Data" bigint DEFAULT NULL,
PRIMARY KEY ("ID"),
FOREIGN KEY ("DataID") REFERENCES Table1("DataID") on delete cascade on update cascade,
FOREIGN KEY ("Address") REFERENCES Table2("Address") on delete cascade on update cascade
);
This definition tells PostgreSQL roughly the following: "Create a table with four columns; one will be the primary key (PK), the others can be NULL. If a new row is inserted, check DataID and Address: if they contain a non-NULL value (say 27856), then check Table1 for DataID and Table2 for Address. If there is no such value in those tables, then return an error." This last point is what you've seen first:
ERROR: insert or update on table "Table3" violates foreign key constraint
"Table3_DataID_fkey" DETAIL: Key (DataID)=(27856) is not present in table "Table1".
It's that simple: if there is no row in Table1 where DataID = 27856, then you can't insert that row into Table3.
If you need that row, you should first insert a row into Table1 with DataID = 27856, and only then try to insert into Table3. If this is not what you want, please describe in a few sentences what you want to achieve, and we can help with a good design.
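A minimal sketch of that order of operations; Table3's columns are taken from the definition above, but treat Table1's column list as a placeholder, since I don't know its full definition:

-- 1. insert the parent row first, so the referenced key exists
INSERT INTO Table1 ("DataID") VALUES (27856);

-- 2. only then insert the child row that points to it
INSERT INTO Table3 ("ID", "DataID", "Address", "Data")
VALUES (1, 27856, NULL, NULL);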
And now about the other problems.
You define your PKs as
CREATE TABLE all_your_tables (
first_column NOT NULL DEFAULT '0',
[...]
PRIMARY KEY ("ID"),
A primary key means that all the items in it are different from each other, that is, the values are UNIQUE. If you give a static DEFAULT (like '0') to a UNIQUE column, you will experience bad surprises all the time. This is what you got in your third error message.
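To see why, take the table3 definition above and insert two rows without specifying "ID":

-- both rows silently fall back to the DEFAULT value 0 for "ID"
INSERT INTO table3 ("Data") VALUES (1);
INSERT INTO table3 ("Data") VALUES (2);
-- ERROR: duplicate key value violates unique constraint "table3_pkey"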
Furthermore, '0' means a text string, not a number (bigint or numeric in your case). Simply use 0 instead (or don't use a default at all, as I wrote above).
And a last point (I may be wrong here): in Table2, your Address field is set to numeric(20). At the same time, it is the PK of the table. The column name and the data type suggest that this address can change in the future. If this is true, then it is a very bad choice for a PK. Think about the following scenario: you have an address '1234567890454', which has a child row in Table3 like
ID    DataID    Address         Data
123   3216547   1234567890454   654897564134569
Now that address happens to change to something else. How do you make your child row in Table3 follow its parent to the new address? (There are solutions for this, but they can cause much confusion.) If this is your case, add an ID column to your table that will not contain any information from the real world; it will simply serve as an identification value (that is, an ID) for an address.
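One possible shape for that, just as a sketch (the AddressID name is my own, not from your schema):

CREATE TABLE table2 (
    "AddressID" bigserial PRIMARY KEY,       -- surrogate key, no real-world meaning
    "Address"   numeric(20) NOT NULL UNIQUE  -- the real-world value, free to change
);

-- Table3 would then reference the surrogate key instead of the address itself:
-- FOREIGN KEY ("AddressID") REFERENCES table2("AddressID")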
(For similar relations, a simple hierarchy is more sensible... (id, parent).)
Apart from that, the last foreign key only slows down the DML queries a bit.