You write:
Each customer can have multiple sites, but only one should be
displayed in this list.
Yet your query retrieves all rows. That would be a point to optimize. You also do not define which site should be picked.
Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan), so hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?
From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?
Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.
If this query is your bottleneck, I would suggest you implement a materialized view.
After you provided more information, I find that my diagnosis pretty much holds.
The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.
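For illustration, a sketch of such a limited fetch; the table, column and sort names here are assumptions, not taken from your query:

-- sketch only: customer, site and customer_name are placeholder names
SELECT c.customer_name, s.site_name
FROM customer c
JOIN site s ON s.customer_id = c.customer_id
ORDER BY c.customer_name
LIMIT 100 OFFSET 2700;  -- fetch only the page currently being displayed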
If you go the route with a materialized view like I proposed, you could add a gapless serial number to the table, plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.
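A minimal sketch of that materialized view; again, the table, column and sort names are placeholders, only mv_id and the general shape matter:

CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY c.customer_name) AS mv_id,  -- gapless serial number
       c.customer_id,
       c.customer_name,
       s.site_name
FROM customer c
JOIN site s ON s.customer_id = c.customer_id;

CREATE UNIQUE INDEX ON materialized_view (mv_id);

-- refresh whenever the underlying data has changed enough to matter
REFRESH MATERIALIZED VIEW materialized_view;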
Then you can query:
SELECT *
FROM materialized_view
WHERE mv_id >= 2700
AND mv_id < 2800;
This will perform very fast. LIMIT / OFFSET cannot compete: it has to compute and sort the whole result before it can skip ahead and pick 100 rows.
pgAdmin timing
When you execute a query from the query tool, the message pane shows something like:
Total query runtime: 62 ms.
And the status line shows the same time. I quote pgAdmin help about that:
The status line will show how long the last query took to complete. If
a dataset was returned, not only the elapsed time for server execution
is displayed, but also the time to retrieve the data from the server
to the Data Output page.
If you want to see the time on the server you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:
Total runtime: 0.269 ms
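For example, reusing the range query from above, you would run:

EXPLAIN ANALYZE
SELECT *
FROM materialized_view
WHERE mv_id >= 2700
AND mv_id < 2800;

That number is the server-side execution time only, without the transfer of the result set to the client (newer PostgreSQL versions label it "Execution Time" instead of "Total runtime").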
The physical storage for rows is described in the docs in Database Page Layout. The column contents for the same row are all stored in the same disk page, with the notable exception of TOAST'ed contents (too large to fit in a page). Contents are extracted sequentially within each row, as explained:
To read the data you need to examine each attribute in turn. First
check whether the field is NULL according to the null bitmap. If it
is, go to the next. Then make sure you have the right alignment. If
the field is a fixed width field, then all the bytes are simply
placed.
In the simplest case (no TOAST'ed columns), postgres will fetch the entire row even if only a few columns are needed. So in this case the answer is yes: having more columns may have a clear adverse impact through wasted buffer cache, particularly if the column contents are large while still under the TOAST threshold.
Now the TOAST case: when an individual field exceeds ~2kB, the engine stores the field contents in a separate physical table. It also comes into play when the entire row doesn't fit into a page (8kB by default): some of the fields are then moved to TOAST storage. The doc says:
If it's a variable length field (attlen = -1) then it's a bit
more complicated. All variable-length data types share the common
header structure struct varlena, which includes the total length of
the stored value and some flag bits. Depending on the flags, the data
can be either inline or in a TOAST table; it might be compressed, too.
TOAST'ed contents are not fetched when they're not explicitly needed, so their effect on the total number of pages to fetch is small (a few bytes per column). This explains the results in @dezso's answer.
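If you want to see how much of a table actually lives in TOAST storage, a query along these lines helps (the table name tbl is just a placeholder):

SELECT pg_size_pretty(pg_relation_size(oid)) AS main_size,
       pg_size_pretty(pg_relation_size(NULLIF(reltoastrelid, 0))) AS toast_size
FROM pg_class
WHERE relname = 'tbl';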
As for writes, each UPDATE writes a complete new row version with all its columns, no matter which columns are changed. So having more columns is obviously more costly for writes.
Best Answer
Your considerations are sound, but all of this is already taken care of automatically by PostgreSQL:
- A NULL value takes up no space at all in a PostgreSQL table row.
- A varchar will only occupy as much space as the value actually has bytes. The length limit (type modifier) does not waste any space.
- For rows that are really wide (2000 bytes and above), the varchars are first compressed and then, if the result is still too large, stored externally in a TOAST table. If you SELECT from such a row, the toasted values are not read unless those columns are actually requested. But if the size limit for a row is 850 bytes, that is too small to get into that range.
- You can check the estimated average row size in bytes by looking at the width column in the output of EXPLAIN SELECT * FROM tablename.
- PostgreSQL v12 introduced the toast_tuple_target storage parameter, which enables you to reduce the limit for TOASTing data. If you tune that and rewrite the table with VACUUM (FULL), PostgreSQL can store the data the way you want to, as sketched below. Then you can see if that actually gets you better performance.
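A sketch of that tuning, with tablename as a placeholder:

ALTER TABLE tablename SET (toast_tuple_target = 256);  -- lower the TOAST threshold (valid from 128 bytes up)
VACUUM (FULL) tablename;  -- rewrite the table so the new setting applies to existing rows

-- afterwards, check the estimated average row width again
EXPLAIN SELECT * FROM tablename;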