One idea is to count using an expression:
SELECT u.id, u.login, u.last_login_time
     , COUNT(CASE WHEN COALESCE(c.message_last_read_id_user1, 0)
                     < COALESCE(c.message_last_write_id_user2, 0)
                   AND m.id >  COALESCE(c.message_last_read_id_user1, 0)
                   AND m.id <= COALESCE(c.message_last_write_id_user2, 0)
                   AND m.poster_id = c.user2_id
                  THEN 1
             END) AS unread_user1
     , COUNT(CASE WHEN COALESCE(c.message_last_read_id_user2, 0)
                     < COALESCE(c.message_last_write_id_user1, 0)
                   AND c.user1_id > 0
                   AND m.id >  COALESCE(c.message_last_read_id_user2, 0)
                   AND m.id <= COALESCE(c.message_last_write_id_user1, 0)
                   AND m.poster_id <> c.user2_id
                  THEN 1
             END) AS unread_user2
FROM   "user" u  -- "user" is a reserved word in Postgres, so quote it
JOIN   conversation c ON c.user1_id = u.id
JOIN   message m ON m.conversation_id = c.id
WHERE  u.last_login_time >= 1452504418
GROUP  BY u.id, u.login, u.last_login_time;
I might have missed some parts when rewriting, but hopefully you will get some ideas.
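Aside: on Postgres 9.4 or later, the aggregate FILTER clause is a cleaner way to express such conditional counts than COUNT(CASE ...). As a sketch, the first aggregate above rewritten with it (same columns and logic):

COUNT(*) FILTER (WHERE COALESCE(c.message_last_read_id_user1, 0)
                     < COALESCE(c.message_last_write_id_user2, 0)
                   AND m.id >  COALESCE(c.message_last_read_id_user1, 0)
                   AND m.id <= COALESCE(c.message_last_write_id_user2, 0)
                   AND m.poster_id = c.user2_id) AS unread_user1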
I used this simpler test setup instead:
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- required for a GiST index on plain int / timestamptz columns

CREATE TABLE sampledata AS
SELECT row_number() OVER ()::int AS row_number
     , extract(hour FROM ts)::int AS hour
     , ts
FROM   generate_series (timestamptz '2004-03-07'
                      , timestamptz '2004-03-11'
                      , interval '1 second') ts;  -- instead of millisecond

CREATE INDEX idx_hourts_btree ON sampledata USING btree (hour, ts);
CREATE INDEX idx_idts_gist    ON sampledata USING gist  (row_number, ts);
ANALYZE sampledata;
345601 rows instead of 345M are enough to test the effect in question.
Your primary question:
Why the different planning?
The answer: because of the lower cost estimate for the GiST index in the 2nd case. In my setup:
EXPLAIN ANALYZE
SELECT * FROM sampledata
WHERE  hour = 18
AND    ts BETWEEN '2004-03-08 18:30:00' AND '2004-03-08 18:30:00';
Index Scan using idx_idts_gist on sampledata (cost=0.28..5.30 rows=1 width=16) ...
The plan after deleting the GiST index (DROP INDEX idx_idts_gist;):
Index Scan using idx_hourts_btree on sampledata (cost=0.42..5.45 rows=1 width=16) ...
Note the lower total cost estimate (5.30 vs. 5.45). So the GiST index wins. Postgres has no option but to rely on its estimates to decide on the "best" query plan. Even though the btree index turns out to be faster for this particular case, Postgres can't know that before executing the query.
Estimates are based on table statistics, cost settings, and the intricate details of the query planner that computes estimates from these numbers. If table statistics, cost settings, or the modeled characteristics of certain operations (index scans on different index types, in this particular case) are inaccurate, the estimate is off to some degree.
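If you want to look at the inputs the planner works with, both the collected statistics and the relevant cost settings are directly visible. For the test table above:

SELECT attname, n_distinct, correlation
FROM   pg_stats
WHERE  tablename = 'sampledata';

SHOW random_page_cost;  -- default 4
SHOW seq_page_cost;     -- default 1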
Query & indexes
All that aside, your index and query have several flaws.
Don't use a timestamp literal for a timestamptz value. It makes you depend on the current time zone setting, which can have unexpected effects like different results in different sessions. Provide timestamptz literals!
SELECT * FROM sampledata
WHERE ts BETWEEN '2004-03-08 18:30:00+01' -- with your intended time zone offset
AND '2004-03-08 18:30:00+01';
Or, more explicit:
SELECT * FROM sampledata
WHERE ts BETWEEN timestamptz '2004-03-08 18:30:00+01'
AND timestamptz '2004-03-08 18:30:00+01';
Also note that hour = 18 is gone. See below. The explicit data type cast is redundant, since ts is known to be type timestamptz and the untyped literal is coerced to match it.
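To see the time zone dependency in action: the same literal without an offset resolves to a different point in time per session setting, while an explicit offset is unambiguous:

SET timezone = 'Europe/Vienna';
SELECT timestamptz '2004-03-08 18:30:00';    -- 2004-03-08 18:30:00+01

SET timezone = 'UTC';
SELECT timestamptz '2004-03-08 18:30:00';    -- 2004-03-08 18:30:00+00 - a different instant!

SELECT timestamptz '2004-03-08 18:30:00+01'; -- same instant in every session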
Also, the query and indexes are inefficient in several respects:
- The completely redundant predicate hour = 18 in the WHERE clause only seems to make sense while working with the sub-optimal index on (hour, ts). Remove it.
- The column hour is not helping your queries in the index idx_hourts_btree. It's completely redundant, just bloat. Remove it.
- The column row_number in idx_idts_gist is of no use for the given queries, either. Remove it.
These indexes would be cheaper and faster:
CREATE INDEX idx_ts_btree ON sampledata USING btree (ts);
CREATE INDEX idx_ts_gist ON sampledata USING gist (ts);
Combined with the query above.
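To verify which index the planner picks in the slimmed-down setup, run the rewritten query under EXPLAIN ANALYZE (actual plan and timings depend on your data and settings):

EXPLAIN ANALYZE
SELECT * FROM sampledata
WHERE  ts BETWEEN timestamptz '2004-03-08 18:30:00+01'
              AND timestamptz '2004-03-08 18:30:00+01';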
The completely redundant column hour has no conceivable use at all. Just drop it, thereby making the table smaller and everything a bit faster.
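Dropping it is a one-liner. Note that indexes containing the column (like the original idx_hourts_btree) are dropped along with it, and the space is only reclaimed gradually, or immediately with a table rewrite like VACUUM FULL:

ALTER TABLE sampledata DROP COLUMN hour;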
Aside: since you seem to be operating on large cardinalities, a BRIN index might be an interesting alternative, if your data is mostly sorted physically:
CREATE INDEX idx_ts_brin ON sampledata USING brin (ts);
Or maybe with fewer pages per range:
CREATE INDEX idx_ts_brin ON sampledata USING brin (ts) WITH (pages_per_range = 8);
Read performance will be faster with a btree index if you have sufficient memory and your index is cached most of the time. The major benefit of the BRIN index is the extremely small index size. Test and consider the complete situation.
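To put numbers on the size difference, compare the footprint of the indexes directly (names as created above):

SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM   pg_class
WHERE  relname IN ('idx_ts_btree', 'idx_ts_gist', 'idx_ts_brin');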
Addressing update 2018-03-14
Increasing the statistics target like you demonstrate does have an effect on cost estimates in my tests - just not for the special case of BETWEEN with a single value as lower and upper bound (y BETWEEN x AND x). The Postgres query planner is no AI, nor does it try to be: estimations have to be simple and fast.
(Aside: only statistics for id and ts matter, not for rand.)
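For reference, this is the kind of tuning meant here: raise the per-column target (default 100, maximum 10000) and re-collect statistics. Shown for the test table; apply it to the columns of your actual table:

ALTER TABLE sampledata ALTER COLUMN ts SET STATISTICS 1000;
ANALYZE sampledata;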
The query planner obviously underestimates the cost of the GiST index for this special case, which has a different characteristic than an actual range. I can't blame the Postgres developers that nobody paid attention to this special case, since y BETWEEN x AND x does not make a whole lot of sense to begin with. Normally, one would simplify it to y = x, for which Postgres switches to using the btree index in my tests.
You can run a different query with y = x for the special case. You have to check user input anyway, since the upper bound cannot be below the lower bound for BETWEEN.
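The rewrite for the special case is then simply (shown against the test table):

SELECT * FROM sampledata
WHERE  ts = timestamptz '2004-03-08 18:30:00+01';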
You could also create the much more efficient GiST index on just (ts) I mentioned above, which delivers performance close to the btree index in my tests.
You might report a bug and see if any developer bites. But first update your Postgres version to at least 9.5.12 to avoid noise.
And you did.
Turns out this has been improved in /devel to be released with Postgres 11. But the change won't be back-patched to older versions.
Best Answer
Caching by Postgres or the operating system isn't going to help much when Postgres doesn't know which of the millions of records in your table fulfill your query's conditions. The only way to make sure this performs consistently as you expect is to use indexes.
An index on the combination of deviceNumber and loadedtime is what you need in your case. In your question you note that a query can be relatively fast (like a second for recently inserted data), so I strongly suspect you already have an index on at least one of these columns.
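A minimal sketch of such an index, assuming you filter with equality on deviceNumber and a range on loadedtime; the table name readings is a placeholder for your actual table:

CREATE INDEX idx_device_loadedtime ON readings (deviceNumber, loadedtime);  -- equality column first, range column second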