PostgreSQL – UPDATE with Join Condition on Matching Words

full-text-searchpattern matchingpostgresqlpostgresql-9.4update

I have 2 tables that looks like this:

Table A:

CREATE TEMP TABLE table_a (
  Column_1 text,
  ID_number int
);
INSERT INTO table_a VALUES
  ('foo,bar,baz', 123),
  ('qux,quux,quuz',456),
  ('corge,grault,garply',789),
  ('qux,bar,grault', 101);

Table B:

CREATE TEMP TABLE table_b (
  Column_1 text,
  Column_2 text,
  ID_number int
);
INSERT INTO table_b VALUES
  ('foo','baz',null),
  ('qux','quzz',null),
  ('corge','garply',null);

I'm trying to copy across values from the ID_number column in Table A, where the values in Column 1 & 2 of table B can be found in the same row of Column 1 in Table A.

This is the kind of thing I was thinking of:

UPDATE table_b AS B 
SET id_number = A.id_number 
FROM table_a AS A 
WHERE A.column_1 LIKE B.column_1
  AND A.column_1 LIKE B.column_2

.. but obviously this doesn't work.

How can I translate this into a proper query?

Additional info

table_a.Column_1 contains UK addresses, for example:

'47 BOWERS PLACE, GREAT YARMOUTH, NORFOLK, NR20 4AN'

In table_b I have the first line of the address in Column_1 (so, '47 BOWERS PLACE') and the postcode ('NR20 4AN') in Column_2.

I thought it would be best to simplify things, but maybe the actual data has some relevance in this situation.

table_a has about 30 million addresses. table_b has around 60k rows.

Performance is relevant, the faster this runs the better, and it will likely be repeated in the future.

Best Answer

Assuming Postgres 9.6, performance is relevant, big tables, "words" composed of characters, no whitespace or punctuation, no stemming or stop words, no phrases, all columns NOT NULL.

Full Text search backed by an index should be among the fastest solutions:

UPDATE table_b b
SET    id_number = a.id_number 
FROM   table_a a
WHERE  to_tsvector('simple', a.column_1)
    @@ plainto_tsquery('simple', concat_ws(' ', b.column_1, b.column_2))
AND    b.id_number = a.id_number;  -- prevent empty UPDATEs

With a matching expression index on a.column_1:

CREATE INDEX table_a_column_1_idx ON table_a USING GIN (to_tsvector('simple', column_1));

`DELETE / INSERT`

Internally, due to the MVCC model of Postgres, every UPDATE effectively inserts a new row anyway and marks the old one as obsolete. So, behind the curtains there is not much difference between UPDATE and DELETE plus INSERT.
There are some details in favor of the UPDATE route:

HOT UPDATE.
TOAST tables: If you have large columns, the content may be stored."out-of-line" in TOAST tables and the new row version can link to the same row in the TOAST table if toasted columns remain unchanged.
Index maintenance may be cheaper for updates.

Otherwise, locking should be about the same. You need an exclusive lock on affected rows either way. Just make it quick.
If you are dealing with a huge number of rows and you don't need a consistent state (all rows or none), you can split the operation into multiple batches. (Separate transactions!) Increases the total cost, but keeps the lock time per row shorter.

Postgresql – Get TEXT value of a CLOB OID in Postgresql

Thanks goes to a_horse_with_no_name. Solution is:

update answers set answer_value = lo_get(cast(value as bigint))

Note - the lo_get function appears to be present in Postgres 9.4 or higher. For earlier versions, I don't see a way of doing this directly. I'm currently running 9.0, but this just accelerated my upgrade plans.

Best Answer

Related Solutions

PostgreSQL – Update All Columns from Another Table

DELETE / INSERT

Postgresql – Get TEXT value of a CLOB OID in Postgresql

Related Question

`DELETE / INSERT`