PostgreSQL – UPDATE with Join Condition on Matching Words

full-text-searchpattern matchingpostgresqlpostgresql-9.4update

I have 2 tables that looks like this:

Table A:

CREATE TEMP TABLE table_a (
  Column_1 text,
  ID_number int
);
INSERT INTO table_a VALUES
  ('foo,bar,baz', 123),
  ('qux,quux,quuz',456),
  ('corge,grault,garply',789),
  ('qux,bar,grault', 101);

Table B:

CREATE TEMP TABLE table_b (
  Column_1 text,
  Column_2 text,
  ID_number int
);
INSERT INTO table_b VALUES
  ('foo','baz',null),
  ('qux','quzz',null),
  ('corge','garply',null);

I'm trying to copy across values from the ID_number column in Table A, where the values in Column 1 & 2 of table B can be found in the same row of Column 1 in Table A.

This is the kind of thing I was thinking of:

UPDATE table_b AS B 
SET id_number = A.id_number 
FROM table_a AS A 
WHERE A.column_1 LIKE B.column_1
  AND A.column_1 LIKE B.column_2

.. but obviously this doesn't work.

How can I translate this into a proper query?

Additional info

table_a.Column_1 contains UK addresses, for example:

'47 BOWERS PLACE, GREAT YARMOUTH, NORFOLK, NR20 4AN'

In table_b I have the first line of the address in Column_1 (so, '47 BOWERS PLACE') and the postcode ('NR20 4AN') in Column_2.

I thought it would be best to simplify things, but maybe the actual data has some relevance in this situation.

table_a has about 30 million addresses. table_b has around 60k rows.

Performance is relevant, the faster this runs the better, and it will likely be repeated in the future.

Best Answer

Assuming Postgres 9.6, performance is relevant, big tables, "words" composed of characters, no whitespace or punctuation, no stemming or stop words, no phrases, all columns NOT NULL.

Full Text search backed by an index should be among the fastest solutions:

UPDATE table_b b
SET    id_number = a.id_number 
FROM   table_a a
WHERE  to_tsvector('simple', a.column_1)
    @@ plainto_tsquery('simple', concat_ws(' ', b.column_1, b.column_2))
AND    b.id_number = a.id_number;  -- prevent empty UPDATEs

With a matching expression index on a.column_1:

CREATE INDEX table_a_column_1_idx ON table_a USING GIN (to_tsvector('simple', column_1));