For Postgres 9.1 or later:
CREATE INDEX idx_time_limits_ts_inverse
ON time_limits (id_phi, start_date_time, end_date_time DESC);
In most cases the sort order of an index is hardly relevant. Postgres can scan backwards practically as fast. But for range queries on multiple columns it can make a huge difference.
Consider your query:
SELECT *
FROM time_limits
WHERE id_phi = 0
AND start_date_time <= '2010-08-08 00:00'
AND end_date_time >= '2010-08-08 00:05';
Sort order of the first column id_phi in the index is irrelevant. Since it's checked for equality (=), it should come first. You got that right.
Postgres can jump to id_phi = 0 in next to no time and consider the following two columns of the matching index. These are queried with range conditions of inverted sort order (<=, >=). In my index, qualifying rows come first. Should be the fastest possible way with a B-tree index¹:
- You want start_date_time <= something: the index has the earliest timestamp first.
  - If it qualifies, also check column 3. Recurse until the first row fails to qualify (super fast).
- You want end_date_time >= something: the index has the latest timestamp first.
  - If it qualifies, keep fetching rows until the first one doesn't (super fast). Continue with the next value for column 2 ...
Postgres can either scan forward or backward. The way you had the index, it has to read all rows matching on the first two columns, then filter on the third. Be sure to read the chapter Indexes and ORDER BY in the manual. It fits your question pretty well.
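To verify which index the planner actually picks, inspect the plan with EXPLAIN. This is just the standard way to check index usage; the output depends entirely on your data:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   time_limits
WHERE  id_phi = 0
AND    start_date_time <= '2010-08-08 00:00'
AND    end_date_time   >= '2010-08-08 00:05';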
How many rows match on the first two columns? Only a few with a start_date_time close to the start of the time range of the table. But almost all rows with id_phi = 0 at the chronological end of the table! So performance deteriorates with later start times.
Planner estimates
The planner estimates rows=62682 for your example query. Of those, none qualify (rows=0). You might get better estimates if you increase the statistics target for the table. For 2,000,000 rows ...
ALTER TABLE time_limits ALTER start_date_time SET STATISTICS 1000;
ALTER TABLE time_limits ALTER end_date_time SET STATISTICS 1000;
... might pay. Or even higher. I guess you don't need that for id_phi (only few distinct values, evenly distributed), but for the timestamps (lots of distinct values, unevenly distributed). I also don't think it matters much with the improved index.
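Note that a changed statistics target only takes effect when ANALYZE is next run on the table:

ANALYZE time_limits;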
CLUSTER / pg_repack / pg_squeeze
If you want it faster still, you could streamline the physical order of rows in your table. If you can afford to lock your table exclusively (at off hours, for instance), rewrite your table and order rows according to the index with CLUSTER:
CLUSTER time_limits USING idx_time_limits_ts_inverse;
Or consider pg_repack or the later pg_squeeze, which can do the same without an exclusive lock on the table.
Either way, the effect is that fewer blocks need to be read from the table and everything is pre-sorted. It's a one-time effect that deteriorates over time, as writes on the table fragment the physical sort order.
GiST index in Postgres 9.2+
¹ With Postgres 9.2+ there is another, possibly faster option: a GiST index for a range column.
There are built-in range types for timestamp and timestamp with time zone: tsrange, tstzrange. A btree index is typically faster for an additional integer column like id_phi. Smaller and cheaper to maintain, too. But the query will probably still be faster overall with the combined index.
Change your table definition or use an expression index.
For the multicolumn GiST index at hand you also need the additional module btree_gist installed (once per database), which provides the operator classes to include an integer.
The trifecta! A multicolumn functional GiST index:
CREATE EXTENSION IF NOT EXISTS btree_gist; -- if not installed, yet
CREATE INDEX idx_time_limits_funky ON time_limits USING gist
(id_phi, tsrange(start_date_time, end_date_time, '[]'));
Use the "contains range" operator @>
in your query now:
SELECT *
FROM time_limits
WHERE id_phi = 0
AND tsrange(start_date_time, end_date_time, '[]')
@>  tsrange('2010-08-08 00:00', '2010-08-08 00:05', '[]');
SP-GiST index in Postgres 9.3+
An SP-GiST index might be even faster for this kind of query - except that, quoting the manual:
Currently, only the B-tree, GiST, GIN, and BRIN index types support multicolumn indexes.
Still true in Postgres 12.
You would have to combine an spgist index on just (tsrange(...)) with a second btree index on (id_phi). With the added overhead, I'm not sure this can compete.
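A sketch of that combination (index names are made up; SP-GiST has supported range types since Postgres 9.3):

CREATE INDEX time_limits_tsrange_spgist_idx ON time_limits USING spgist  -- hypothetical name
       (tsrange(start_date_time, end_date_time, '[]'));
CREATE INDEX time_limits_id_phi_idx ON time_limits (id_phi);             -- hypothetical name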
There is a related answer with a benchmark for just a tsrange column.
get the most frequently appearing user_code where the month is specified

Since user_code is the primary key, that question would be nonsense. There can never be more than one. I assume you meant invite_code?
Just add a WHERE clause. And since the column can be NULL, also consider excluding NULL values:
SELECT invite_code, COUNT(*) AS counted
FROM invite_table
WHERE month = 'May' -- or whatever is stored in your varchar(3) column
AND invite_code IS NOT NULL -- exclude NULL
GROUP BY invite_code
ORDER BY counted DESC, invite_code -- to break ties in deterministic fashion
LIMIT 10;
Month, date, timestamp?
A month column as varchar(3) doesn't seem very useful if there can be data for more than a single year. I would use data type date for it. You can format that with to_char() any way you like for presentation. Like:
SELECT to_char(date '2017-12-01', 'Mon'); -- 'Dec'
The column could look like this (also addressing your comment):
...
, inserted_at date DEFAULT CURRENT_DATE
...
The default value is entered when the column is omitted in an INSERT statement.
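For example, omitting inserted_at in the INSERT (assuming the remaining columns of invite_table allow this):

INSERT INTO invite_table (invite_code, month)
VALUES ('abc123', 'May');  -- inserted_at defaults to CURRENT_DATE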
Or, if really only the month is relevant:
... DEFAULT date_trunc('month', now())::date
Or store the complete timestamptz (8 bytes, that's what I would probably do):
...
, inserted_at timestamptz DEFAULT now()
...
Read about column default values and date/time functions in the manual.
And be aware that date and timestamp depend on your current time zone setting.
Best Answer
The data type uuid is perfectly suited for the task. It only occupies 16 bytes, as opposed to 37 bytes in RAM for the varchar or text representation. (Or 33 bytes on disk, but the odd number would require padding in many cases to make it 40 bytes effectively.) And the uuid type has some more advantages. Example:
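A minimal sketch with made-up table and column names; md5() returns 32 hex digits, which is valid input for uuid:

CREATE TABLE big_text (
   md5_uuid  uuid PRIMARY KEY  -- md5 of long_text, stored as uuid
 , long_text text NOT NULL
);

INSERT INTO big_text (md5_uuid, long_text)
VALUES (md5('... the long text ...')::uuid, '... the long text ...');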
You might consider other (cheaper) hashing functions if you don't need the cryptographic component of md5, but I would go with md5 for your use case (mostly read-only).
A word of warning: for your case (immutable once written) a functionally dependent (pseudo-natural) PK is fine. But the same would be a pain where updates on text are possible. Think of correcting a typo: the PK and all depending indexes, FK columns in "dozens of other tables" and other references would have to change as well. Table and index bloat, locking issues, slow updates, lost references, ...

If text can change in normal operation, a surrogate PK would be a better choice. I suggest a bigserial column (range -9223372036854775808 to +9223372036854775807 - that's nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six something billion distinct values) for billions of rows. Might be a good idea in any case: 8 instead of 16 bytes for dozens of FK columns and indexes! Or a random UUID for much bigger cardinalities or distributed systems. You can always store said md5 (as uuid) additionally to find rows in the main table from the original text quickly.
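A lookup by original text might then look like this (reusing the hypothetical names from the sketch above):

SELECT long_text
FROM   big_text
WHERE  md5_uuid = md5('... the long text ...')::uuid;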
To address @Daniel's comment: If you prefer a representation without hyphens, remove the hyphens for display:
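For example (md5_uuid being the hypothetical uuid column from the sketch above):

SELECT replace(md5_uuid::text, '-', '') AS uuid_no_hyphens
FROM   big_text;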
But I wouldn't bother. The default representation is just fine. And the problem's really not the representation here.
If other parties should have a different approach and throw strings without hyphens into the mix, that's no problem, either. Postgres accepts several reasonable text representations as input for a uuid; the manual lists the accepted input forms.

What's more, the md5() function returns text; you would use decode() to convert to bytea, and the default output representation of that is the bytea hex format, not the original 32-character text. You would have to encode() again to get the original text representation.
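A round trip for illustration, assuming the default bytea_output = 'hex' setting:

SELECT decode(md5('some long text'), 'hex');                 -- bytea, displayed as \x...
SELECT encode(decode(md5('some long text'), 'hex'), 'hex');  -- back to the 32-character text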
To top it off, values stored as bytea would occupy 20 bytes in RAM (and 17 bytes on disk, 24 with padding) due to the internal varlena overhead, which is particularly unfavorable for the size and performance of simple indexes.

Everything works in favor of a uuid here.