You write:
Each customer can have multiple sites, but only one should be
displayed in this list.
Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.
Either way, it does not matter much here. Your EXPLAIN output shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you run ANALYZE on your tables before running EXPLAIN?
From the numbers I see in your EXPLAIN output, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?
Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.
If this query is your bottleneck, I would suggest you implement a materialized view.
After you provided more information I find that my diagnosis pretty much holds.
The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should retrieve only 100 rows; that would solve most of it, even if it means executing the whole query every time.
If you go the route with a materialized view like I proposed, you could add a serial number without gaps to the table, plus an index on it - by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.
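Neither the base query nor the sort order is given here, so the following sketch uses placeholder names (customer and its name column) - adapt it to your actual query:

```sql
-- placeholder base query and sort order; adapt to your actual setup
CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY c.name) AS mv_id  -- your sort criteria here
     , c.*
FROM   customer c;

CREATE INDEX materialized_view_mv_id_idx ON materialized_view (mv_id);

REFRESH MATERIALIZED VIEW materialized_view;  -- repeat at intervals of your choosing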
Then you can query:
SELECT *
FROM materialized_view
WHERE mv_id >= 2700
AND mv_id < 2800;
This will perform very fast. LIMIT / OFFSET cannot compete; that needs to compute the whole table before it can sort and pick 100 rows.
pgAdmin timing
When you execute a query from the query tool, the message pane shows something like:
Total query runtime: 62 ms.
And the status line shows the same time. I quote pgAdmin help about that:
The status line will show how long the last query took to complete. If
a dataset was returned, not only the elapsed time for server execution
is displayed, but also the time to retrieve the data from the server
to the Data Output page.
If you want to see the time on the server you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output you get something like this:
Total runtime: 0.269 ms
In a search, I would like to get all the rows that exactly match the
bit string.
Use a B-Tree index, the default type. I don't see a case for a GIN index here.
Up to 1000 bits result in up to 133 bytes (or slightly more) storage size on disk for a bit varying type.
SELECT pg_column_size(repeat('1', 1000)::varbit);  -- 133
Not that much. A plain B-Tree index should do. But maybe the column is big enough that the following tricks improve performance.
If a small part of the bitstring column is distinctive enough to narrow your search down to few hits, an index on an expression might give you better performance, because the smaller index can fit into RAM and is faster to process all around. Don't bother for small tables, where the overhead would eat the benefit. But it could make a big difference for big tables.
Example
Given table:
CREATE TABLE tbl(id serial PRIMARY KEY, b_col varbit);
If the first 10 bits are enough to narrow a search down to a few hits, you could create an index on the expression b_col::bit(10). Casting to bit(n) truncates the bitstring to n bits.
CREATE INDEX tbl_b_col10_idx ON tbl ((b_col::bit(10)));
Extra parentheses are required for the cast operator in an index definition.
Then, instead of the query
SELECT * FROM tbl WHERE b_col = '1111011110111101'::varbit; -- 16 bit
you would use:
SELECT *
FROM tbl
WHERE b_col::bit(10) = '1111011110111101'::varbit::bit(10) -- utilize index
AND b_col = '1111011110111101'::varbit; -- filter to exact match
Be aware that shorter values are padded with 0's to the right (least significant bits) when cast to bit(n).
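A quick demonstration of the padding, and of why the second predicate in the query above is needed:

```sql
SELECT B'101'::bit(10);                  -- 1010000000: zero-padded on the right
SELECT B'101'::bit(10) = B'1010000000';  -- true: distinct values collide after the cast
```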
In a real world application this starts to make sense with several 100s of bits. Test for the turning point.
Optimize further
Since most installations operate with a MAXALIGN of 8 bytes (64-bit OS), your index size is the same for any data not exceeding 8 bytes. Effectively, per row:
4 bytes item identifier
8 bytes for the index tuple header (or 23 + 1 byte for heap tuples)
? actual space for data
? padding to the nearest multiple of 8 bytes
Plus some minor overhead per page and index / table. Details in the manual or in this related answer on SO.
Therefore, you should be able to further optimize the above approach. Take the first 64 bits (or the last, or whatever is most distinctive and works for you), cast them to bigint and build an index on this expression.
CREATE INDEX tbl_b_col64_idx ON tbl ((b_col::bit(64)::bigint));
I cast twice (b_col::bit(64)::bigint) because there is no cast defined between varbit and bigint. Details in this related answer on SO.
Effectively, this is just a very fast and simple hash function, where the hash value also allows looking up ranges of values. Depending on exact requirements you could go one step further and use any IMMUTABLE hash function - like md5(). Details in the answer linked above.
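For illustration, an md5() variant could look like this - a sketch, not tested against your data (and note that a hash index loses the ability to look up ranges):

```sql
-- index on the md5 hash of the bit string's text representation
CREATE INDEX tbl_b_col_md5_idx ON tbl ((md5(b_col::text)));

SELECT *
FROM   tbl
WHERE  md5(b_col::text) = md5('1111011110111101'::varbit::text)  -- utilize index
AND    b_col = '1111011110111101'::varbit;  -- filter to exact match
```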
The query to go along with that:
SELECT *
FROM tbl
WHERE b_col::bit(64)::bigint = '1111011110111101'::varbit::bit(64)::bigint -- utilize index
AND b_col = '1111011110111101'::varbit; -- narrow down to exact match
The resulting index should be just as big as the one in the first example, but queries should be considerably faster for three reasons:
The index typically returns much fewer hits (64 bit of information vs. 10 bit)
Postgres can work with integer arithmetic, which should be faster, even for a plain = operation. (Didn't test to verify that.)
The type bigint has no overhead like varbit - 5 or 8 bytes. (In my installation, 5 bytes for up to 960 bit, 8 bytes for more.)
Effectively, to keep the index at its minimum size, you can only pack 24 bit into a varbit index - compared to 64 bit of information for a bigint index.
CLUSTER
In such a case CLUSTER should improve performance:
CLUSTER tbl USING tbl_b_col10_idx;
It's a one-time operation and has to be repeated at intervals of your choosing. Be sure to read the manual on CLUSTER if you want to use that. Or consider the alternative pg_repack.
If the first 64 bits of your values are unique most of the time, CLUSTER will barely help, since the index scan returns a single row in most cases. If not, CLUSTER will help a lot. Consequently, the effect will be far greater for the first example with the less optimized index.
Best Answer
The optimal DB design always depends on the complete picture.
Generally, there is hardly anything faster than a plain btree index for your simple query. Introducing json or jsonb or even a plain array type in combination with a GIN index will most likely make it slower. With your original table, a multicolumn index with the right sort order should be a game changer for your common query.
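Assuming table action with columns receiver and time (matching the existing indexes receiver_idx and actions_time_idx), such a multicolumn index might look like this - treat the names as placeholders:

```sql
-- hypothetical: adapt table and column names to your actual schema
CREATE INDEX action_receiver_time_idx ON action (receiver, time DESC);
```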
This way, Postgres can just pick the top 100 rows from the index directly. Super fast.
Your current indexes receiver_idx and actions_time_idx may lose their purpose. Next to the perfect index, storage size is an important factor for big tables, so avoiding duplication may be the right idea. But that can be achieved in various ways. Have you considered good old normalization yet?
Also note the changed order of columns in table action, which saves a couple of bytes per row - and that adds up to a couple of GB for billions of rows.
Your common query changes slightly to:
Drawback: it's much harder to make your common query fast now.
The quick (and slightly dirty) fix: include the time column in table receiver_action redundantly (or move it there), create a covering index on it, and adapt your common query accordingly. INCLUDE requires Postgres 11 or later.
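A sketch under assumed names - receiver, action_id and time in table receiver_action are placeholders, not taken from the original schema:

```sql
-- covering index; INCLUDE requires Postgres 11 or later
CREATE INDEX receiver_action_time_idx
ON receiver_action (receiver, time DESC) INCLUDE (action_id);

-- common query, can be served by an index-only scan
SELECT action_id, time
FROM   receiver_action
WHERE  receiver = 123  -- placeholder value
ORDER  BY time DESC
LIMIT  100;
```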
Depending on the exact story behind "one set of data may create 3 separate rows", more may be possible - even 3 separate columns in table action instead of the n:m implementation, and an expression GIN index ... But that's going in too deep. I tap out here.