Postgresql – Row with latest timestamp

indexindex-tuningperformancepostgresqlquery-performance

How to get the row with the latest value in a TIMESTAMPZ column? Is an index needed? Does an index change the strategy? Would behavior vary by database (I'm using Postgres 9.4)?

My app records data from a data feed. Another process endlessly queries to get the latest freshest entry. Older data may appear on occasion from secondary sources. So the most recently inserted rows are usually, but not necessarily, the freshest data.

I am using this kind of SQL where when_ is a TIMESTAMP WITH TIME ZONE column:

SELECT *
FROM my_table_ 
ORDER BY when_ DESC
LIMIT 1
;

This code works (if no NULL values in data!). But with a possible couple million rows, and a query every 10 seconds, I'm concerned about performance.

Without any index on when_ column, does this statement require a full scan of all rows?

Does adding an index change the performance? Will Postgres automatically scan the index to locate the most recent row, or must I do something to make an index scan happen?

With an index on when_ column, should I change this SQL to use some other approach/strategy of query?

Is there some other way to gather freshly inserted rows? I'm using UUID rather than SERIAL types for my primary key, and may federate data amongst multiple database instances, so that rules out checking for ever-increasing integer numbers.

Best Answer

Basic answers

Since you select a couple of big columns an index-only scan is probably not a viable option.

This code works (if no NULL values in data!)

While the column isn't defined NOT NULL, add NULLS LAST to the sort order to make it work in any case, even with NULL values. Ideally, use the clause in the corresponding index as well:

SELECT <some big columns>
FROM   my_table_ 
ORDER  BY when_ DESC NULLS LAST
LIMIT  1;

PostgreSQL sort by datetime asc, null first?

Without any index on when_ column, does this statement require a full scan of all rows?

Yes. Without index, there is no other option left. (Well, there is also table partitioning where an index on key columns(s) is not strictly required, and it could assist with partition pruning. But you would typically have an index on key columns there, too.)

With an index on when_ column, should I change this SQL to use some other approach/strategy of query?

Basically, this is the perfect query. There are options in combination with advanced indexing:

Advanced technique

Assuming a NOT NULL column. Else, add NULLS LAST to index and queries as suggested above.

You have a constant influx of rows with later when_. Assuming the latest _when constantly increases and never (or rarely) decreases (latest rows deleted / updated), you can use a very small partial index.

Basic implementation:

Run your query once to retrieve the latest when_, subtract a safe margin (to be safe against losing the latest rows) and create an IMMUTABLE function based on it. Basically a "fake global constant":
```
CREATE OR REPLACE FUNCTION f_when_cutoff()
  RETURNS timestamptz LANGUAGE sql COST 1 IMMUTABLE PARALLEL SAFE AS
$$SELECT timestamptz '2015-07-25 01:00+02'$$;
```
PARALLEL SAFE only in Postgres 9.6 or later.
Create a partial index excluding older rows:
```
CREATE INDEX my_table_when_idx ON my_table_ (when_ DESC)
WHERE when_ > f_when_cutoff();
```
With millions of rows, the difference in size can be dramatic. And this only makes sense with a much smaller index. Just half the size or something would not cut it. Index access itself is not slowed much by a bigger index. It's mostly the sheer size of the index, which needs to be read and cached. (And possibly avoiding additional index writes, but hardly in your case.)
Use the function in all related queries. Include the same WHERE condition (even if logically redundant) to convince the query planner the index is applicable. For the simple query:
```
SELECT <some big columns>
FROM   my_table_ 
WHERE  when_ > f_when_cutoff()
ORDER  BY when_ DESC
LIMIT  1;
```

The size of the index grows with new (later) entries. Recreate the function with a later timestamp and REINDEX from time to time with no or little concurrent access. Only reindex after a relevant number of rows has been added. A couple of thousand entries won't matter much. We are doing this to cut off millions.
The beauty of it: queries don't change.

Implementation with function to update the partial index automatically:

Get latest child per parent from big table - query is too slow

More general advice:

Index optimization with dates

Related Solutions

Sql-server – Help with indexing narrow table with many millions of rows

Based on your question, I would index the Timestamp column with the clustered index. And to make the index unique, just make sure to include the identity column in the index definition:

... PRIMARY KEY CLUSTERED ([Timestamp], [Id])

If query performance for queries on Exchange_Id is still an issue after that, you can also add a non-clustered index that looks something like this:

UNIQUE INDEX ([Exchange_Id], [Timestamp], [Id])

.. but if you do, consider including any columns that the query may need, in order to create what's known as a covering index.

UNIQUE INDEX ([Exchange_Id], [Timestamp], [Id]) INCLUDE ([Type], [Amount], [Price])

Remember that there's a disk space issue involved as well, as you stated that your table contains a large number of rows. The clustered index will not change the amount of disk space your table consumes either way, but adding a non-clustered index will allocate extra space. If you INCLUDE all the columns from your table, like I did in the example, the non-clustered index will roughly take up as much space as the rest of the table does.

The best query to get the latest row version based on time

There is a 3rd option, using window functions, either max():

with latest as 
(select *,
        max(inserted_day) over (partition by hostname)
          as max_inserted_day 
 from age_data
)
select *              
from latest l
where inserted_day = max_inserted_day ;

or rank():

with latest as 
(select *,
        rank() over (partition by hostname order by inserted_day desc)
          as rnk
 from age_data
)
select *              
from latest l
where rnk = 1 ;

If there are ties (rows with same maximum inserted_day for a hostname), all the above queries (and yours) will return all the tied rows. If you want to resolve ties (and return only 1 or only 2 rows), you can replace rank() with row_number() and change the order by clause inside the over according to the requirements.

Regarding efficiency, an index on (hostname, inserted_day) will help, in all the queries.