If you have the space, you can CTAS (CREATE TABLE AS SELECT) using minimal undo/redo. If you have any indexes at all, doing it any other way will be very slow and generate logging like crazy.
In the case where you have a single IOT without any secondary indexes, or a single table cluster, you could step through the primary/cluster key updating in chunks without having to rescan the whole table to find the rows that have not yet been updated.
--edit
I am not able to create a secondary table ... There are a couple of
indexes, but none of them involve the column I am updating.
Then I suggest breaking the table up into chunks for processing, using something you are indexing on (even if it is a single column, you can split it into ranges of values). This will do a FTS once instead of once for each chunk as in your code. You will have to live with an awful lot of redo, and it will wipe out your undo space too (so no flashback subsequently).
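A sketch of that chunking approach, with hypothetical names (big_table with an indexed numeric column id, updating column z) and an arbitrary chunk size:

```sql
-- Hypothetical names: big_table with indexed numeric column id, updating column z.
-- Each chunk is an index range scan; the table is never rescanned for unfinished rows.
DECLARE
   v_lo   NUMBER;
   v_max  NUMBER;
   c_step CONSTANT NUMBER := 100000;  -- rows per chunk is a tuning knob
BEGIN
   SELECT MIN(id), MAX(id) INTO v_lo, v_max FROM big_table;
   WHILE v_lo <= v_max LOOP
      UPDATE big_table
         SET z = 'HELLO WORLD'
       WHERE id >= v_lo
         AND id <  v_lo + c_step;
      COMMIT;  -- keeps undo per transaction small; redo still adds up
      v_lo := v_lo + c_step;
   END LOOP;
END;
/
```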
--edit2
If you can add/rename/drop columns, you can do this very efficiently, but only on 11g.
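If the new value is the same constant for every row, the add/rename/drop trick can be sketched as below (table and column names are hypothetical). It relies on 11g storing the DEFAULT of a new NOT NULL column as metadata only, without writing to each row:

```sql
-- 11g: adding a NOT NULL column with a DEFAULT is a metadata-only change
ALTER TABLE big_table ADD (z_new VARCHAR2(20) DEFAULT 'HELLO WORLD' NOT NULL);
-- SET UNUSED is fast; space is reclaimed later (DROP COLUMN works too, but rewrites)
ALTER TABLE big_table SET UNUSED COLUMN z;
ALTER TABLE big_table RENAME COLUMN z_new TO z;
```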
Query
Your query is forced to scan the whole table (or the whole index). Every row could be another distinct unit. The only way to substantially shorten the process would be a separate table with all available units - which would help as long as there are substantially fewer units than entries in all_units.
Since you have ~ 11k units (added in comment) for 25M entries, this should definitely help.
Depending on frequencies of values, there are a couple of query techniques to get your result considerably faster:
- recursive CTE
- JOIN LATERAL
- correlated subquery
Details in this related answer on SO.
Needing only the implicit index of the primary key on (unit_id, unit_timestamp), this query should do the trick, using LATERAL in an implicit CROSS JOIN:
SELECT u.unit_id, a.max_ts
FROM   unit u
     , LATERAL (
       SELECT unit_timestamp AS max_ts
       FROM   all_units
       WHERE  unit_id = u.unit_id
       ORDER  BY unit_timestamp DESC
       LIMIT  1
       ) a;
Excludes units without an entry in all_units, like your original query.
Or a lowly correlated subquery (probably even faster):
SELECT u.unit_id
     , (SELECT unit_timestamp
        FROM   all_units
        WHERE  unit_id = u.unit_id
        ORDER  BY unit_timestamp DESC
        LIMIT  1) AS max_ts
FROM   unit u;
Includes units without an entry in all_units.
Efficiency depends on the number of entries per unit: the more entries per unit, the bigger the payoff from one of these queries.
In a quick local test with similar tables (500 "units", 1M rows in big table), the query with correlated subqueries was ~ 500x faster than your original. Index-only scans on the PK index of the big table vs. sequential scan in your original query.
Since your table tends to grow rapidly, a materialized view is probably not an option.
There is also DISTINCT ON as another possible query technique, but it's hardly going to be faster than your original query, so not the answer you are looking for. Details here.
Index
Your partial_idx:
CREATE INDEX partial_idx ON all_units (unit_id, unit_timestamp DESC);
is not in fact a partial index, and it is also redundant. Postgres can scan indexes backwards at practically the same speed, so the PK serves just as well. Drop this additional index.
Table layout
A couple of points for your table definition.
CREATE TABLE all_units (
   unit_timestamp timestamp,
   unit_id        int4,
   lon            float4,
   lat            float4,
   speed          float4,
   status         varchar(255),  -- might be improved
   PRIMARY KEY (unit_id, unit_timestamp)
);
timestamp(6) doesn't make much sense; it's effectively the same as plain timestamp, which already stores a maximum of 6 fractional digits.
I switched the positions of the first two columns to save 4 bytes of alignment padding per row, which amounts to ~ 100 MB for 25M rows (the exact result depends on status). Smaller tables are typically faster for everything.
If status isn't free text, but some kind of standardized note, you could replace it with something a lot cheaper. More about varchar(255) in Postgres.
Server configuration
You need to configure your server. Most of your settings seem to be conservative defaults. 1 MB for shared_buffers or work_mem seems way too low for an installation with millions of rows. And random_page_cost = 4 is too high for any modern system with plenty of RAM. Start with the manual and the Postgres Wiki.
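As a starting point only - the numbers below assume a dedicated machine with roughly 8 GB of RAM and SSD storage, so adjust for your hardware:

```
shared_buffers = 2GB          # ~25% of RAM is a common starting point
work_mem = 64MB               # per sort/hash node, per query - don't overdo it
effective_cache_size = 6GB    # planner estimate, not an allocation
random_page_cost = 1.1        # SSDs make random reads nearly as cheap as sequential
```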
Best Answer
Let's say your statement is:
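For illustration, assume something like the following, with hypothetical table and column names (big_table, z):

```sql
-- Hypothetical statement standing in for the one under discussion
UPDATE big_table
   SET z = 'HELLO WORLD';
```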
1) This does extra work, because it updates column z to 'HELLO WORLD' even if it was 'HELLO WORLD' already. Rewrite it so that only rows that actually change are touched.

2) If you have 2 columns indexed separately, it is possible, but quite rare, that the database uses both indexes for the lookup. You can try creating 1 index having both columns, but I highly doubt this would be of any help; you would need a big, wide table with exceptionally good clustering for the x and y columns to benefit from this for 3/100 million rows.

3) DML does not run in parallel by default. You can enable parallel execution at the session level. Note that this will not force parallel execution even if the syntax suggests that. It's like adding a parallel 8 hint to your statements in your session.
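A sketch of points 1) and 3), again with hypothetical names (big_table, z):

```sql
-- 1) Touch only rows that actually change
UPDATE big_table
   SET z = 'HELLO WORLD'
 WHERE z <> 'HELLO WORLD'
    OR z IS NULL;   -- needed if z is nullable: NULL <> 'HELLO WORLD' is not TRUE

-- 3) Allow (not force) parallel DML in this session
ALTER SESSION ENABLE PARALLEL DML;
-- Acts like a PARALLEL(8) hint on subsequent DML; parallel execution still isn't guaranteed
ALTER SESSION FORCE PARALLEL DML PARALLEL 8;
```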
4) If column z is indexed, then the UPDATE statement needs to maintain the index(es) as well. Sometimes it is better to make the index(es) unusable, run the DML, then rebuild the index(es).

5) If the new content of column z does not fit in its original place ('HI' becomes 'HELLO WORLD'), and your table blocks are full, you may experience row migration, which makes performance worse not only during the UPDATE, but for later queries as well. You would be surprised how much better CTAS can be.
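A sketch for 4), assuming a hypothetical index z_idx on the updated column:

```sql
ALTER INDEX z_idx UNUSABLE;
-- skip_unusable_indexes is TRUE by default since 10g, so the DML can proceed
UPDATE big_table SET z = 'HELLO WORLD' WHERE z <> 'HELLO WORLD';
COMMIT;
ALTER INDEX z_idx REBUILD;
```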
6) If column z is part of an enabled foreign key constraint, you cannot simply update it to any value. The database needs to check the parent values to make sure you are updating to a valid value. Again, disable the constraint, run the DML, then re-enable it with validation; this may improve performance.

But this is just guesswork without the facts. An SQL Monitoring output, or an execution plan with runtime statistics, could help. If that is not enough, monitoring the wait events + parameters in v$session, or a raw SQL trace, could help even more.
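A sketch for 6), with a hypothetical constraint name fk_z:

```sql
ALTER TABLE big_table DISABLE CONSTRAINT fk_z;
UPDATE big_table SET z = 'HELLO WORLD' WHERE z <> 'HELLO WORLD';
COMMIT;
-- Re-enabling with VALIDATE checks all rows against the parent again
ALTER TABLE big_table ENABLE VALIDATE CONSTRAINT fk_z;
```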