PostgreSQL – Optimizing Views Dependent on Aggregate Functions

indexpostgresqlview

I have a table of about 15 million records. Whenever information about a specific charge_id is changed, a new row is added with the current timestamp and the changes. This results in multiple rows with the same charge_id, and related hierarchical columns. This isn't controlled by me and can't be changed, plus we like having the history available for querying.

The view below is intended to identify the most recent entry for each charge_id, and create a simple key value pair table for joining off of. The view works fine, but execution time is horrendous. I've tried a couple of stabs at an index to speed things up, but each time it appears that postgres is skipping the index and scanning everything anyways. I should also note that most queries where we're joining to this view are going to be full table aggregates, grouped by 2-5 different dimensions in the charges table.

My question specifically is what can I do to speed up execution time on this particular view?

CREATE VIEW current_charge_ids AS 
(
SELECT
    c2.id,
    c2.charge_id,
    t1.last_post_date
FROM
    charges c2
LEFT JOIN
    (
        SELECT
            c1.client,
            c1.practice,
            c1.account_id,
            c1.encounter_id,
            c1.charge_id,
            max(c1.post_date) AS last_post_date
        FROM
            charges c1
        GROUP BY
            c1.client,
            c1.practice,
            c1.account_id,
            c1.encounter_id,
            c1.charge_id
    ) t1
ON
    c2.client = t1.client AND
    c2.practice = t1.practice AND
    c2.account_id = t1.account_id AND
    c2.encounter_id = t1.encounter_id AND
    c2.charge_id = t1.charge_id AND
    c2.post_date = t1.last_post_date
);

Best Answer

You can hugely simplify the query using DISTINCT ON:

SELECT DISTINCT ON (client, practice, account_id, encounter_id, charge_id)
       id, charge_id, post_date
FROM   charges
ORDER  BY client, practice, account_id, encounter_id, charge_id, post_date DESC, id DESC;

Will be considerably faster in any case. Detailed explanation in this related answer on SO:
Select first row in each GROUP BY group?

The last ORDER BY expression id DESC is optional to break ties if rest should not be unambiguous, yet. May not be needed.

Support this with a matching multicolumn index:

CREATE INDEX charges_latest_idx ON charges
(client, practice, account_id, encounter_id, charge_id, post_date DESC, id DESC);

Whether such an index will be useful depends on undisclosed details.

Note in particular, that sort order has to match the query. In Postgres 9.2 or later this may even function as covering index, depending on undisclosed details.

Depending on undisclosed details, a materialized view might be a candidate, too. The more write operations the smaller the likelihood this would help. The same goes for the covering index.