PostgreSQL – Efficiently Retrieve Users by Age

datepostgresql

I have a PostgreSQL database of users and their birthdays and want to get all users in a specific age range:

SELECT * FROM users
WHERE age(birthday) >= 21 AND age(birthday) <= 30;

How can I achieve that this query stays efficient, even when there are millions of users in the database?
Of course there are more filters, I just don't want to calculate the age for every row.

I do not want to use a materialized view.

Best Answer

You want to move the computation from the column, to the other side of the inequality operators where you have literals rather than a column:

WHERE birthday >= current_date - interval '30 years' AND birthday <= current_date - interval '21 years';

Then a simple index on "birthday" will make it efficient. It might need some fiddling depending on what datatype "birthday" is. I tested it as a timestamp without time zone.

Related Solutions

Postgresql – Efficiently comparing prices in different currencies

Here's a different approach for which recomputing the calculated_price is just an optimization, as opposed to being strictly necessary.

Suppose that in the currencies tables, you add another column, last_rate, which contains the exchange rate at the time the calculated_price was last updated, no matter when this happened.

To quickly retrieve a set of products with a price point between, say, 50 USD and 100 USD that include the desired results, you could do something like that:

  SELECT * FROM products
   WHERE calculated_price > 50.0/(:last_rate*
    (SELECT coalesce(max(value/last_rate),1) FROM currencies
      WHERE value>last_rate))
   AND calculated_price < 100.0/ (:last_rate*
    (SELECT coalesce(min(value/last_rate),1) FROM currencies
      WHERE value<last_rate))

where :last_rate contains the EUR/USD exchange rate at the time of the last update. The idea is to increase the interval to take into account the maximum variation of every currency. The increase factors for both ends of the interval are constant between rates updates, so they could be pre-computed.

Since the rates change only slightly over short periods of time, the above query is likely to give a close approximation of the final result. To get the final result, let's filter out the products for which the prices have slipped out of the bounds due to the changes in rates since the last update of calculated_price:

  WITH p AS (
   SELECT * FROM products
   WHERE calculated_price > 50.0/(:last_rate*
    (SELECT coalesce(max(value/last_rate),1) FROM currencies
      WHERE value>last_rate))
   AND calculated_price < 100.0/ (:last_rate*
    (SELECT coalesce(min(value/last_rate),1) FROM currencies
      WHERE value<last_rate))
  )
  SELECT price,c.value FROM p join currencies c on (p.currency=c.id)
     WHERE price/c.value>50/:current_rate
       AND price/c.value<100/:current_rate;

where :current_rate is the more up-to-date rate with EUR for the money choosen by the user.

The efficiency comes from the fact that the range of rates is supposed to be small, the values being close together.

PostgreSQL – Refresh Materialized View Incrementally

You can always implement your own table serving as "materialized view". That's how we did it before MATERIALIZED VIEW was implemented in Postgres 9.3.

You can create a plain VIEW:

CREATE VIEW graph_avg_view AS 
SELECT xaxis, AVG(value) AS avg_val
FROM   graph
GROUP  BY xaxis;

And materialize the result once or whenever you need to start over:

CREATE TABLE graph_avg AS
SELECT * FROM graph_avg_view;

(Or use the SELECT statement directly, without creating a VIEW.)
Then, depending on undisclosed details of your use case, you can DELETE / UPDATE / INSERT changes manually.

A basic DML statement with data-modifying CTEs for your table as is:

Assuming nobody else tries to write to graph_avg concurrently (reading is no problem):

WITH del AS (
   DELETE FROM graph_avg t
   WHERE  NOT EXISTS (SELECT FROM graph_avg_view WHERE xaxis = t.xaxis)
   )
, upd AS (
   UPDATE graph_avg t
   SET    avg_val = v.avg_val
   FROM   graph_avg_view v
   WHERE  t.xaxis = v.xaxis
   AND    t.avg_val <> v.avg_val
-- AND    t.avg_val IS DISTINCT FROM v.avg_val  -- alt if avg_val can be NULL
   )
INSERT INTO graph_avg t  -- no target list, whole row
SELECT v.*
FROM   graph_avg_view v
WHERE  NOT EXISTS (SELECT FROM graph_avg WHERE xaxis = v.xaxis);

Basic recipe

Add a timestamp column with default now() to your base table. Let's call it ts.
- If you have updates, add a trigger to set the current timestamp with every update that changes either xaxis or value.

Create a tiny table to remember the timestamp of your latest snapshot. Let's call it mv:

CREATE TABLE mv (
   tbl text PRIMARY KEY
 , ts timestamp NOT NULL DEFAULT '-infinity'
); -- possibly more details

Create this partial, multicolumn index:

CREATE INDEX graph_mv_latest ON graph (xaxis, value)
WHERE  ts >= '-infinity';

Use the timestamp of the last snapshot as predicate in your queries to refresh the snapshot with perfect index usage.
At the end of the transaction, drop the index and recreate it with the transaction timestamp replacing the timestamp in the index predicate (initially '-infinity'), which you also save to your table. Everything in one transaction.
Note that the partial index is great to cover INSERT and UPDATE operations, but not DELETE. To cover that, you need to consider the entire table. It all depends on exact requirements.

Best Answer

Related Solutions

Postgresql – Efficiently comparing prices in different currencies

PostgreSQL – Refresh Materialized View Incrementally

Basic recipe

Related Question