PostgreSQL – Efficient SELECT for Array of Tuples

Tags: performance, postgresql, postgresql-performance

Assume I have a table records with the following structure:

id (unique int)
updated (timestamp)

I have an input_array with values [[id_1, timestamp_1], [id_4, timestamp_4], ...]. I'll refer to each element as tuple_1, tuple_4, etc.

I'm looking for the most efficient query (in PostgreSQL v11.2+) to select [id_1, id_4, ...] from records, but only where tuple_{n}.updated > row_{n}.updated. Assume input_array may contain thousands of tuples, and records upwards of a million rows.

I don't even know where to begin with this. A lateral join comes to mind, as do unnest and WHERE ... IN, but everything I've tried so far fails miserably.

Update: I'm open to input_array being in any format (tuples, two separate arrays, whatever), and to updated being an int.

Best Answer

If you aren't fixed on the array input, you can use a tuple comparison.

select *
from records
where (id, updated) in ( (1, timestamp '2019-01-01 00:00:00'),
                         (2, timestamp '2019-01-02 00:00:00') )

That can make use of a regular btree index on (id, updated).
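As a minimal sketch of that index, assuming the records table from the question (the index name is made up for illustration), one could create it and then check with EXPLAIN whether the planner picks it up for the tuple comparison:

```sql
-- Stand-in for the table described in the question.
create temp table records (id int primary key, updated timestamp);

-- Hypothetical index name; the columns are the ones from the question.
create index records_id_updated_idx on records (id, updated);

-- Show the plan chosen for the row-value IN list.
explain
select *
from records
where (id, updated) in ( (1, timestamp '2019-01-01 00:00:00'),
                         (2, timestamp '2019-01-02 00:00:00') );
```

On a tiny or empty table the planner may still prefer a sequential scan, so the plan is only meaningful with realistic data volumes.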

Note that this uses = for both values and is equivalent to:

where (id = 1 and updated = timestamp '2019-01-01 00:00:00')
   or (id = 2 and updated = timestamp '2019-01-02 00:00:00')

But you want to compare the timestamps using >. You can do that if you join against a values clause:

select r.*
from records r
  join (
    values 
       (1, timestamp '2019-01-01 00:00:00'),
       (2, timestamp '2019-01-02 00:00:00') 
  ) as t(id,upd) on  r.id = t.id
where r.updated > t.upd;

Online example: https://rextester.com/GGC83046
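Since the question also allows the input to arrive as two separate arrays, the same join can be written against unnest instead of a values list. This is a sketch under that assumption, not part of the original answer; the sample rows and literal arrays are made up, and in practice the arrays would be bound as parameters:

```sql
-- Stand-in for the table described in the question.
create temp table records (id int primary key, updated timestamp);
insert into records values
  (1, timestamp '2019-01-05 00:00:00'),
  (2, timestamp '2019-01-01 00:00:00');

-- unnest with several arrays (allowed in FROM) expands them in
-- parallel, one output row per array position.
-- In an application the arrays would typically be parameters,
-- for example $1::int[] and $2::timestamp[].
select r.*
from records r
  join unnest(
    array[1, 2],
    array[timestamp '2019-01-01 00:00:00',
          timestamp '2019-01-02 00:00:00']
  ) as t(id, upd) on r.id = t.id
where r.updated > t.upd;
```

The comparison keeps the direction used in the answer (r.updated > t.upd). If the goal is rows where the input timestamp is the newer one, as the question phrases it, flip it to t.upd > r.updated.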

Related Solutions

Postgresql – Storing logs on per day basis in PostgreSQL

The best option IMO is to use a cron job to generate the logs for 'yesterday'::date. You could also use triggers on insert/update/delete to keep the other table up to date, including for the current day, but this adds complexity and overhead and gets pretty complicated. Generate your historical logs once the data won't change anymore.

In this case you write a SQL query and run it via psql from cron.
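A rough sketch of what such a query could look like follows. The schema is not given in the source, so the tables orders and daily_order_summary, their columns, and the file and database names in the comment are all hypothetical:

```sql
-- Hypothetical schema, for illustration only.
create table if not exists orders (
  id         int primary key,
  created_at timestamp not null,
  amount     numeric not null
);
create table if not exists daily_order_summary (
  summary_date date primary key,
  order_count  bigint not null,
  total_amount numeric not null
);

-- Summarize yesterday's rows. Could be run once per day, e.g. from
-- cron with something like: psql -d mydb -f summarize_yesterday.sql
insert into daily_order_summary (summary_date, order_count, total_amount)
select 'yesterday'::date,
       count(*),
       coalesce(sum(amount), 0)
from orders
where created_at >= 'yesterday'::date
  and created_at <  'today'::date;
```

The primary key on summary_date makes an accidental second run for the same day fail loudly instead of silently duplicating the summary.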

I would also add a trigger denying update or delete to records covered in your historical data if you can.

This gives you a few benefits:

It is more obvious when it breaks
It is simpler, with simpler failure cases
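The trigger suggested above, denying update or delete on rows already covered by the historical data, might be sketched as follows. The tables orders and daily_order_summary, their columns, and the function and trigger names are hypothetical, since the source does not show a schema:

```sql
-- Hypothetical schema, for illustration only.
create table if not exists orders (
  id         int primary key,
  created_at timestamp not null,
  amount     numeric not null
);
create table if not exists daily_order_summary (
  summary_date date primary key,
  order_count  bigint not null,
  total_amount numeric not null
);

create function deny_summarized_changes() returns trigger
language plpgsql as $$
begin
  -- Reject changes to rows from days that already have a summary.
  -- With an empty summary table the comparison is null and nothing
  -- is rejected.
  if old.created_at < (select max(summary_date) + 1
                       from daily_order_summary) then
    raise exception 'order % belongs to a day that is already summarized',
      old.id;
  end if;
  -- A BEFORE row trigger must return the row to proceed with.
  if tg_op = 'DELETE' then
    return old;
  end if;
  return new;
end;
$$;

create trigger orders_deny_summarized_changes
  before update or delete on orders
  for each row execute function deny_summarized_changes();
```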

Now, as per your concerns:

You say you need rows for every day. This can be handled in a number of relatively easy ways in PostgreSQL (remember that dates support integer math, so you can take a base date and add a series to it to generate a date series). The same approach works if you are generating rows per day of week, etc.
You say you can't guarantee things won't change. The key question here is what your change window is; run your historical reports only after this window has closed. For example, if the window is a month, you can generate reports for all dates in a given month once a further month has passed (i.e. generate all dates in January during early March). You can then rely on a view to handle newer rows vs. older rows on a live basis, and have a trigger which ensures that the date of an inserted row in the orders table is newer than the newest date in the other table.
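The date-series idea from the first point can be sketched like this. The base date and range are arbitrary, and the orders table with its created_at column is a hypothetical stand-in:

```sql
-- Hypothetical table, for illustration only.
create table if not exists orders (
  id         int primary key,
  created_at timestamp not null,
  amount     numeric not null
);

-- date + integer yields a date, so a base date plus an integer
-- series gives one row per day (here: 31 consecutive days).
select d.day, count(o.id) as order_count
from (
  select date '2019-01-01' + s.n as day
  from generate_series(0, 30) as s(n)
) as d
left join orders o
  on o.created_at >= d.day
 and o.created_at <  d.day + 1
group by d.day
order by d.day;
```

Because the series is on the left side of the join, days without any orders still appear, with a count of zero.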

In my experience worrying about keeping this as a live summary usually isn't necessary. Small organizations (with small data sets) tend to close out books at least once a year, and live reporting is an option there. Larger organizations with larger data sets tend to close out receivables and payables (i.e. invoices) once a month or so, and so the only areas that have to be reported live (because they are subject to adjustment or revision) are open orders (which can be revised) and invoices which may need to be reviewed occasionally (and should never be revised but may have adjustments issued against them which might or might not need to be tracked in such a system).

PostgreSQL – ON DELETE Rule Not Working with WHERE Clause

The manual (in Rules on INSERT, UPDATE, and DELETE) describes the INSTEAD mechanism for the context of your rule as:

Qualification given and INSTEAD

the query tree from the rule action with the rule qualification and the original query tree's qualification; and the original query tree with the negated rule qualification added

Overlooking the second part of that description (the original query tree with the negated rule qualification added) would be the cause of the unexpected result. I believe that in your test case, the mentioned DELETE will be transformed by the rule into commands to the same effect as:

UPDATE categories SET deleted_at = NOW()
    WHERE categories.id = categories.id AND deleted_at IS NULL;

DELETE from categories WHERE NOT old.deleted_at IS NULL;

The DELETE removes the row because the UPDATE before it has just set its deleted_at field to non-null.

Note also the categories.id = categories.id condition, which is obviously not needed and comes from a misunderstanding: OLD does not reference a specific row as it does in a trigger; it is meant to be replaced by the table subject to the rule.

The main point to understand is that rules don't themselves look at the rows, so they don't generate a kind of IF-THEN-ELSE construct depending on the row contents. The WHERE clauses of the rules (the qualifications) are not evaluated by the rule system; they are injected into the commands produced by the rule system and are evaluated by those commands.
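For readers who want to try this out, here is a hypothetical reconstruction of the kind of rule being discussed. The question's actual rule is not shown here, so the table definition and the rule name are assumptions pieced together from the UPDATE and DELETE above:

```sql
-- Hypothetical reconstruction, not the original poster's code.
create table categories (
  id         int primary key,
  deleted_at timestamptz
);

-- Conditional INSTEAD rule intended as a "soft delete".
create rule soft_delete as
  on delete to categories
  where old.deleted_at is null
  do instead
    update categories
       set deleted_at = now()
     where categories.id = old.id;

insert into categories (id) values (1);

-- The rule system adds the UPDATE and also keeps the original
-- DELETE with the negated qualification attached.
delete from categories where id = 1;

-- Inspect what is left in the table afterwards.
select * from categories;
```

If the intent is a conditional soft delete, a BEFORE DELETE trigger is usually easier to reason about, because a trigger does see the individual row.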

See also What are PostgreSQL RULEs good for? and links within for some interesting insights on rules and why they're hard to use.
