How to get the row with the latest value in a `TIMESTAMPTZ` column? Is an index needed? Does an index change the strategy? Would behavior vary by database (I'm using Postgres 9.4)?
My app records data from a data feed. Another process endlessly queries to get the latest, freshest entry. Older data may occasionally arrive from secondary sources, so the most recently inserted rows are usually, but not necessarily, the freshest data.
I am using this kind of SQL, where `when_` is a `TIMESTAMP WITH TIME ZONE` column:

```sql
SELECT *
FROM   my_table_
ORDER  BY when_ DESC
LIMIT  1;
```
This code works (if there are no NULL values in the data!). But with possibly a couple million rows, and a query every 10 seconds, I'm concerned about performance.
Without any index on the `when_` column, does this statement require a full scan of all rows?
Does adding an index change the performance? Will Postgres automatically scan the index to locate the most recent row, or must I do something to make an index scan happen?
With an index on the `when_` column, should I change this SQL to use some other query approach or strategy?
Is there some other way to gather freshly inserted rows? I'm using UUID rather than SERIAL types for my primary key, and may federate data amongst multiple database instances, so that rules out checking for ever-increasing integer numbers.
Best Answer
Basic answers
Since you select a couple of big columns, an index-only scan is probably not a viable option.
While the column isn't defined `NOT NULL`, add `NULLS LAST` to the sort order to make it work in any case, even with `NULL` values. Ideally, use the clause in the corresponding index as well.

Yes. Without an index, there is no other option left. (Well, there is also table partitioning, where an index on key column(s) is not strictly required, and it could assist with partition pruning. But you would typically have an index on key columns there, too.)
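A sketch of the `NULLS LAST` variant, with a matching index (the index name here is an assumption):

```sql
-- Query that tolerates NULL in when_: NULLs sort last,
-- so the latest non-NULL timestamp wins.
SELECT *
FROM   my_table_
ORDER  BY when_ DESC NULLS LAST
LIMIT  1;

-- Matching index, so the sort order of index and query line up
-- and Postgres can read just the first index entry:
CREATE INDEX my_table_when_idx ON my_table_ (when_ DESC NULLS LAST);
```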
Basically, this is the perfect query. There are options in combination with advanced indexing:
Advanced technique
Assuming a `NOT NULL` column. Else, add `NULLS LAST` to index and queries as suggested above.

You have a constant influx of rows with later `when_`. Assuming the latest `when_` constantly increases and never (or rarely) decreases (latest rows deleted / updated), you can use a very small partial index.

Basic implementation:
Run your query once to retrieve the latest `when_`, subtract a safe margin (to be safe against losing the latest rows), and create an `IMMUTABLE` function based on it. Basically a "fake global constant". (Add `PARALLEL SAFE` only in Postgres 9.6 or later.) Then create a partial index excluding older rows.
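A sketch of both steps, with assumed names (`f_min_when` for the function, a placeholder timestamp standing in for your retrieved value minus the margin, and `my_table_when_recent_idx` for the index):

```sql
-- "Fake global constant": IMMUTABLE, so it is allowed in an index predicate.
-- Replace the literal with your retrieved latest when_ minus a safe margin.
CREATE OR REPLACE FUNCTION f_min_when()
  RETURNS timestamptz
  LANGUAGE sql IMMUTABLE AS
$$SELECT timestamptz '2015-01-01 00:00+00'$$;

-- Partial index covering only recent rows; tiny compared to a
-- full index over millions of historical rows.
CREATE INDEX my_table_when_recent_idx ON my_table_ (when_ DESC)
WHERE  when_ > f_min_when();
```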
With millions of rows, the difference in size can be dramatic. And this only makes sense with a much smaller index. Just half the size or something would not cut it. Index access itself is not slowed much by a bigger index. It's mostly the sheer size of the index, which needs to be read and cached. (And possibly avoiding additional index writes, but hardly in your case.)
Use the function in all related queries. Include the same `WHERE` condition (even if logically redundant) to convince the query planner that the index is applicable.

The size of the index grows with new (later) entries. Recreate the function with a later timestamp and `REINDEX` from time to time, with no or little concurrent access. Only reindex after a relevant number of rows has been added; a couple of thousand entries won't matter much, since we are doing this to cut off millions. The beauty of it: queries don't change.
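For the simple query, the variant using the function might look like this (a sketch; `f_min_when` is the assumed function name from above):

```sql
SELECT *
FROM   my_table_
WHERE  when_ > f_min_when()  -- logically redundant, but lets the planner use the partial index
ORDER  BY when_ DESC
LIMIT  1;
```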
Implementation with a function to update the partial index automatically:
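One possible sketch of such a maintenance function. All names and the one-hour margin are assumptions; the function redefines the "fake constant" with a fresh lower bound, then rebuilds the index. Run it during a period of little concurrent access, since plain `REINDEX` takes a lock on the index:

```sql
CREATE OR REPLACE FUNCTION f_min_when_update(_margin interval = '1 hour')
  RETURNS void
  LANGUAGE plpgsql AS
$func$
BEGIN
   -- Redefine the "fake global constant" with a fresh lower bound:
   -- current maximum when_ minus the safety margin.
   EXECUTE format('
      CREATE OR REPLACE FUNCTION f_min_when()
        RETURNS timestamptz
        LANGUAGE sql IMMUTABLE AS
      $$SELECT timestamptz %L$$'
    , (SELECT max(when_) - _margin FROM my_table_));

   -- Rebuild the partial index so its predicate reflects the new bound.
   REINDEX INDEX my_table_when_recent_idx;
END
$func$;

-- Call occasionally, e.g. from a scheduled job:
-- SELECT f_min_when_update();
```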