Postgresql – How to deal with seldom used big database and postgresql

azure-vmperformancepostgresqlpostgresql-performance

I loaded in PostgreSQL (9.3) OpenStreetMap data for whole europe (400gb). Then I installed a geocoding api (nominatim) that queries the database with gis queries.

My problem is :

This database is queried a few times a day, but because postgres loads
the data in its buffer on demand, my first query on a particular gps
point is always slow, and I do only one query per GPS point, so it's
always slow (like 60sec slow against 100ms when the data is buffered)

What could I do here ?

I'm using a A6 Azure instance (4 Cores, 28 GB memory, HDD).
Server is Ubuntu 14.04.4 LTS
Swapping is enabled
There is index on the geometry columns queried.
The shared buffer is 10GB
work mem is 256MB
maintenance work mem is 4GB

Best Answer

depending on your usage, you could try to partition the data to reduce the scanned data (at 400gb even your indexes are large and heavy to read...) I think it's your best option as working with 400gb tables is heavy even on a very optimized queries... You could also pre-warm the cache with the index data , by issuing some of your queries on frequent gps points or by using something like pg_prewarm

What you didn't ask

.. but, going out on a limb here, what are probably looking for.

The EXPLAIN output reveals that you filter the top 20 rows: (t.rank <= 20). Your presented query does not show that. The query actually matching your EXPLAIN output would be:

SELECT * FROM (
   SELECT player_id
        , rank() OVER (ORDER BY value DESC, player_id ASC) AS rank
   FROM   player_stat
   WHERE  stat_id = 1
   ) t
WHERE t.rank <= 20;

Which can be improved dramatically:

SELECT row_number() OVER (ORDER BY value DESC, player_id ASC) AS rank
     , player_id
FROM   player_stat
WHERE  stat_id = 1
ORDER  BY value DESC, player_id
LIMIT  20;

Explanation

The important part for performance is the LIMIT clause in combination with ORDER BY matching the index: now the query reads exactly 20 rows from the top to the index, where it had to read 10000000 in your original version. We only use player_id and value, so we can still have an index-only scan. The rest is peanuts.
That's all due to the sequence of events in a SELECT query: window functions are applied before LIMIT. Only if the sort order agrees, we don't have to consider the rest of the applicable 10000000 rows.
We can use LIMIT 20 because the top 20 ranks are guaranteed to span no more than 20 rows. The PK on (player_id, stat_id) guarantees unique player_id per stat_id and since that is included in the ORDER BY, each rank is only assigned once - which also means we can use the slightly cheaper row_number() instead.

PostgreSQL – Eliminating Duplicates and Optimizing Performance for Big Count and Small LIMIT

Query

Your question updates revealed that your query can be improved radically:

SELECT m.*
     , (SELECT keyfield  -- correlated subquery
        FROM   extractkeyfield 
        WHERE  messageid = m.messageid
        LIMIT  1)    AS keyfield
     , s.description AS sendername
     , r.description AS receivername
FROM  (
   SELECT messagetime    -- no DISTINCT necessary
        , messageid
        , clientid
        , message_type
        , contenturl
        , nativecontenturl
        , receiverid
        , status
        , nativeisa
        , nativegs
        , isa
        , gs
        , originalfilename
        , duplicatekeyfield
        , duplicatenativeid
        , count(*) OVER () AS full_count
   FROM   message
   WHERE  messagetime >= timestamp '2015-01-01 03:40:50'  -- incl. lower bound
   AND    messagetime <  timestamp '2016-04-01 03:35:10'  -- excl. upper bound
   ORDER  BY messagetime DESC NULLS LAST
   LIMIT  100            -- count and limit before joining more tables
   ) m
LEFT   JOIN clientprofile s ON s.clientid = m.clientid  -- join later
LEFT   JOIN clientprofile r ON r.clientid = m.receiverid;

Indexes

The perfect indexes for this query would be:

CREATE INDEX message_messagetime_idx ON message (messagetime DESC NULLS LAST);

Usefulness depends on the percentage of rows selected. (You still did not provide cardinalities.) For a small percentage, the index is more likely to be useful.

Your existing index message_messagetime_clientid_receiverid_status_messagetype works as well, less efficiently since it carries some more columns of "dead freight" for the use case.

Joins to clientprofile are covered by its PK. To optimize read performance you might create a multicolumn index on (clientid, description) to allow index-only scans. Since the rows are wide, this should pay.

Finally, extractkeyfield could profit from another multicolumn index on (messageid, keyfield). Again, only useful if you get index-only scans out of it. The rows are not as wide, the benefit is smaller than for clientprofile.

If you get all of this right, the query should be faster by orders of magnitude.

Major points

You still get an arbitrary pick from multiple related extractkeyfield.keyfield. That's according to your definition. Typically, a deterministic pick is more useful.

Since messageid is the PK of table message ~~DISTINCT ON (message.messagetime, message.messageid)~~ never makes sense. DISTINCT ON (message.messageid) would achieve the same, cheaper. Since you need the total count before LIMIT, a DISTINCT step makes the query much more expensive since all involved rows have to processed (not just counted). Luckily, DISTINCT can be avoided completely.
The correlated subquery with LIMIT 1 prevents duplicates a priori. So you don't need the expensive DISTINCT ON for the big table.
```
(SELECT keyfield
 FROM   extractkeyfield 
 WHERE  messageid = m.messageid
 LIMIT  1)
```
In Postgres 9.3+ you can also use a LATERAL join. Detailed explanation:
- Optimize GROUP BY query to retrieve latest record per user
Select, count and limit rows from your main table before you join to additional tables. That's what happens in my subquery m. Else you do a lot of additional work just to throw it away later.
You shouldn't need more work_mem with the fixed query. That was a side effect of the needless DISTINCT step for > 1M rows. Check for lines mentioning "Disk" in connection with "sort" in the EXPLAIN output of the new query.

I modified your predicate for the time range:

WHERE  messagetime >= timestamp '2015-01-01 03:40:50'  -- incl. lower bound
AND    messagetime <  timestamp '2016-04-01 03:35:10'  -- excl. upper bound

Consider the explanation:

PostgreSQL splitting time range into days

Your column messagetime can be NULL. Careful with descending sort order! I made that:
```
ORDER  BY messagetime DESC NULLS LAST
```
Details:
- Select first row in each GROUP BY group?

Aside: your table definitions can probably be improved. Among other things, varchar(255) is an unlikely candidate for a PK. And I suspect some of your many indexes might not be used. But that's beyond the scope of this question.