PostgreSQL – Sort Table Rows and Save Row Numbers Using UPDATE

Tags: performance, postgresql, query-performance, update, window-functions

I have Postgres 9.5 with a table movimientos that has the following data:

| id | concepto | movimiento | numero | orden |
|  1 | AJUSTE 1 |       2542 |      0 |     2 |
|  2 | APERTURA |      12541 |      0 |     1 |
|  3 | AJUSTE 2 |       2642 |      0 |     2 |
|  4 | CIERRE   |      22642 |      0 |     3 |

I need to number the records by the orden field and store those numbers in the numero field, because my reports sort and search by numero. Expected result:

| id | concepto | movimiento | numero | orden |
|  2 | APERTURA |      12541 |      1 |     1 |
|  1 | AJUSTE 1 |       2542 |      2 |     2 |
|  3 | AJUSTE 2 |       2642 |      3 |     2 |
|  4 | CIERRE   |      22642 |      4 |     3 |

I tried to do it with a function using a FOR loop, but it is very slow with a million rows.

How can I do this with a simple UPDATE?

Best Answer

Join to a subquery that computes numero with the window function row_number():

UPDATE movimientos m
SET    numero = sub.rn
FROM  (SELECT id, row_number() OVER (ORDER BY orden, id) AS rn FROM movimientos) sub
WHERE  m.id = sub.id;
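To try the statement on the sample data from the question, a small self-contained script might look like this. The column types and the temporary table are assumptions, since the question does not show the table definition.

```sql
-- Assumed definition: the question does not show the real column types.
CREATE TEMP TABLE movimientos (
   id         int PRIMARY KEY
 , concepto   text
 , movimiento int
 , numero     int DEFAULT 0
 , orden      int
);

-- Sample rows copied from the question.
INSERT INTO movimientos (id, concepto, movimiento, numero, orden) VALUES
   (1, 'AJUSTE 1',  2542, 0, 2)
 , (2, 'APERTURA', 12541, 0, 1)
 , (3, 'AJUSTE 2',  2642, 0, 2)
 , (4, 'CIERRE',   22642, 0, 3);

-- Number all rows by orden; id breaks ties so the result is deterministic.
UPDATE movimientos m
SET    numero = sub.rn
FROM  (SELECT id, row_number() OVER (ORDER BY orden, id) AS rn FROM movimientos) sub
WHERE  m.id = sub.id;

-- Check the result.
SELECT id, concepto, movimiento, numero, orden
FROM   movimientos
ORDER  BY numero;
```

The final SELECT should list the rows in the same order as the expected table in the question. Without id as a tie-breaker, rows that share the same orden could be numbered in arbitrary order.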

Details for UPDATE syntax in the manual.

If you have concurrent write access you need to lock the table to avoid race conditions.
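A minimal sketch of the locked variant, assuming that readers may continue while writers have to wait. The lock mode and the extra IS DISTINCT FROM condition are my choices for illustration, not part of the original statement.

```sql
BEGIN;

-- SHARE ROW EXCLUSIVE blocks concurrent INSERT / UPDATE / DELETE and any
-- other session requesting the same lock; plain SELECTs are not blocked.
LOCK TABLE movimientos IN SHARE ROW EXCLUSIVE MODE;

UPDATE movimientos m
SET    numero = sub.rn
FROM  (SELECT id, row_number() OVER (ORDER BY orden, id) AS rn FROM movimientos) sub
WHERE  m.id = sub.id
AND    m.numero IS DISTINCT FROM sub.rn;  -- assumption: skip rows that already hold the right number

COMMIT;  -- the lock is released at the end of the transaction
```

The additional condition only pays off when the statement is re-run and many rows already carry the correct number. On the first run every row is rewritten either way.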

Note that updating every row in a table is expensive either way. The table typically grows to twice its size and VACUUM or VACUUM FULL may be in order.

Depending on your complete situation, it may be more efficient to write a new table to begin with.
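Writing a new table could be sketched roughly as follows. The name movimientos_new, the cast to int and the assumption that the table has only these five columns are mine. Indexes, constraints, defaults and privileges are not copied by CREATE TABLE AS and would have to be recreated.

```sql
BEGIN;

-- Keep writers out while the copy is built (lock mode is an assumption).
LOCK TABLE movimientos IN SHARE ROW EXCLUSIVE MODE;

-- Build the numbered copy in a single pass, physically sorted by numero.
CREATE TABLE movimientos_new AS
SELECT id, concepto, movimiento
     , row_number() OVER (ORDER BY orden, id)::int AS numero  -- row_number() returns bigint
     , orden
FROM   movimientos
ORDER  BY orden, id;

-- Recreate indexes, constraints and privileges on movimientos_new here.

-- Swap the tables; keep the old one until the result has been verified.
ALTER TABLE movimientos     RENAME TO movimientos_old;
ALTER TABLE movimientos_new RENAME TO movimientos;

COMMIT;
```

This avoids the dead row versions a full-table UPDATE leaves behind. Views or foreign keys that reference the old table are not covered by this sketch.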

I am not convinced, though, that you need the column numero in your table at all. Maybe you are looking for a MATERIALIZED VIEW. Recent related answer on SO:

Global row numbers in chunked query
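As a rough illustration of the materialized view alternative: numero is computed in the view instead of being stored in the table. The view name movimientos_numerados is hypothetical.

```sql
-- Compute numero on the fly instead of storing it in the base table.
CREATE MATERIALIZED VIEW movimientos_numerados AS
SELECT id, concepto, movimiento, orden
     , row_number() OVER (ORDER BY orden, id) AS numero
FROM   movimientos;

-- Supports sorting and searching by numero in reports.
-- A unique index is also required for REFRESH ... CONCURRENTLY (Postgres 9.4+).
CREATE UNIQUE INDEX ON movimientos_numerados (numero);

-- Re-run whenever the numbers have to reflect new or changed rows.
REFRESH MATERIALIZED VIEW movimientos_numerados;
```

Reports would then query movimientos_numerados. The numbers are only as current as the last refresh.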

Related Solutions

Optimize PostgreSQL Query – Using ORDER BY Date and Text

Postgres 9.2 or later

You can make the index covering by appending fcv_id:

CREATE INDEX factura_venta_orden
ON factura_venta (fcv_fecha_comprobante, fcv_numero_comprobante, fcv_id);

This way, provided the table isn't updated too much, Postgres can retrieve results with an index-only scan.
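One way to check whether an index-only scan is actually used is sketched below. The SELECT is a hypothetical stand-in, since the original query is not shown here. Only the table, column and index names come from the index definition above.

```sql
-- VACUUM updates the visibility map, which index-only scans depend on.
VACUUM (ANALYZE) factura_venta;

-- Hypothetical query that touches only columns stored in the index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT fcv_id
FROM   factura_venta
ORDER  BY fcv_fecha_comprobante, fcv_numero_comprobante
LIMIT  100;
```

In the output, look for a node like "Index Only Scan using factura_venta_orden" and a low "Heap Fetches" count. A high count suggests that many pages have been modified since the last VACUUM, so Postgres still has to visit the table.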

The additional column comes last since it does not contribute to the sort order. Explanation:

Is a composite index also good for queries on the first field?

In Postgres 11 or later you could make that:

CREATE INDEX factura_venta_orden
ON factura_venta (fcv_fecha_comprobante, fcv_numero_comprobante) INCLUDE (fcv_id);

`CLUSTER` / `pg_repack`

I see you already found CLUSTER. Are you aware that it is a one-time operation? It should help your cause, but it needs to be re-run after enough updates.

There is also the community tool pg_repack as a replacement for VACUUM FULL / CLUSTER.
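For reference, a minimal sketch of the CLUSTER call, assuming the index factura_venta_orden from above is the one to cluster on:

```sql
-- Rewrites the table in index order.
-- Takes an ACCESS EXCLUSIVE lock, so reads and writes are blocked while it runs.
CLUSTER factura_venta USING factura_venta_orden;

-- Refresh planner statistics after the rewrite.
ANALYZE factura_venta;

-- Later re-runs can omit the index; Postgres remembers the one used last.
CLUSTER factura_venta;
```

Rows inserted or updated afterwards are not kept in index order, which is why the operation has to be repeated from time to time.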

`work_mem`

This line in your EXPLAIN output:

Sort Method:  external merge  Disk: 2928kB

tells us that the sort is not done in RAM, which is expensive. You could probably improve performance by tuning the corresponding setting, work_mem:

work_mem (integer)

Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. ...

Setting this too high may have adverse effects. Read the manual carefully. Consider increasing the setting only for the transaction with the big query:

BEGIN;
SET LOCAL work_mem  = '50MB';
SELECT ...;
COMMIT;

50 MB is an estimate based on your EXPLAIN ANALYZE output for 73k rows. Test with 1M rows to find the amount you actually need.

PostgreSQL – Efficient query to get greatest value per group from big table

Index

A plain multicolumn B-tree index should work after all:

CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);

Why DESC NULLS LAST?

Unused index in range of dates query

Is it safe to assume you have an equipment table? Then performance won't be a problem:

Correlated subquery

Based on this equipment table, run a lowly correlated subquery to great effect:

SELECT equipment_id
     ,(SELECT created_at
       FROM   geoposition_records
       WHERE  equipment_id = eq.equipment_id
       ORDER  BY created_at DESC NULLS LAST
       LIMIT  1) AS latest
FROM   equipment eq;

For a small number of rows in the equipment table (57 judging from your EXPLAIN ANALYZE output), that's very fast.

`LATERAL` join in Postgres 9.3+

SELECT eq.equipment_id, r.latest
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT created_at
   FROM   geoposition_records
   WHERE  equipment_id = eq.equipment_id
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) r(latest) ON true;

Detailed explanation:

Optimize GROUP BY query to retrieve latest record per user

Performance similar to the correlated subquery.

Function

If you can't talk sense into the query planner (which shouldn't occur), a function looping through the equipment table is certain to do the trick. Looking up one equipment_id at a time uses the index.

CREATE OR REPLACE FUNCTION f_latest_equip()
  RETURNS TABLE (equipment_id int, latest timestamp)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   FOR equipment_id IN
      SELECT e.equipment_id FROM equipment e ORDER BY 1
   LOOP
      SELECT g.created_at
      FROM   geoposition_records g
      WHERE  g.equipment_id = f_latest_equip.equipment_id
                           -- prepend function name to disambiguate
      ORDER  BY g.created_at DESC NULLS LAST
      LIMIT  1
      INTO   latest;

      RETURN NEXT;
   END LOOP;
END  
$func$;

Makes for a nice call, too:

SELECT * FROM f_latest_equip();

Performance comparison:

db<>fiddle here (old sqlfiddle)
