Postgresql – How to speed up a string manipulation query where I want to replace characters, extract certain values and update a table with the results

arrayoptimizationpostgresqlstring manipulationupdate

I am trying to extract information from strings that are presented in a key-value format, with the keys and values being separated by commas. I want to extract the values associated with certain keys and add them to dedicated columns in my table.

Some notes on the data:

The keys I am interested in are connected, as in keyB relates to keyA;
in some cases keyB or keyA may not exist
If keyB doesn't exist, but keyA is something specific, then I can set the value for keyB anyway.

I have a solution that does what I want ([db-fiddle]), but it is painfully slow (9.6 hours) and I can't help think that there must be a better way as I've not been in this DB game long.

For info,
The table has ~8.2M rows and is hosted on AWS RDS on a t3.large DB (virtual CPUs = 2, Memory = 8.0GB).

Some pointers on where I can improve this are much appreciated.

Best Answer

You only need a single INSERT and a single UPDATE for the main table. The whole default detection and populating of the "variables" (columns) can be done while inserting the data:

INSERT INTO main_tags (id, tags, variablea, variableb)
select id, tags, var_a, 
       case 
          when var_b is null and 'valA' = any(tags)  then 'valB_default'
          when var_b is null and 'valA_xx' = any(tags) then 'valB_default_two'
          else var_b
       end as var_b
from (
  SELECT id, tags, 
         tags[array_position(tags, 'keyA') + 1] as var_a, 
         tags[array_position(tags, 'keyB') + 1] as var_b
  FROM (
    SELECT id, string_to_array(regexp_replace(tags,'[{}"]','','g'), ',') as tags
    FROM main
  ) as b 
) x  
WHERE tags && array['keyA','keyB'];

The final update can also handle the "unknown" value directly, no need to run three updates:

UPDATE main e
  SET variableA = coalesce(et.variableA, 'unknown'),
      variableB = coalesce(et.variableB, 'unknown')
FROM main_tags et
WHERE e.id = et.id;

One way to speed that up, is to create an index on main_tags (id)

You can actually get rid of the temp table completely and do everything in a single statement. This might or might not be faster, but I think it's worth trying:

UPDATE main e
  SET variableA = coalesce(et.var_a, 'unknown'),
      variableB = coalesce(et.var_b, 'unknown')
FROM (
  select id, tags, 
         var_a, 
         case 
            when var_b is null and 'valA' = any(tags)  then 'valB_default'
            when var_b is null and 'valA_xx' = any(tags) then 'valB_default_two'
            else var_b
         end as var_b
  from (
    SELECT id, 
           tags, 
           tags[array_position(tags, 'keyA') + 1] as var_a, 
           tags[array_position(tags, 'keyB') + 1] as var_b
    FROM (
      SELECT id, string_to_array(regexp_replace(tags,'[{}"]','','g'), ',') as tags
      FROM main
    ) as b 
  ) x  
  WHERE tags && array['keyA','keyB']
) as et
WHERE e.id = et.id;

Related Solutions

MySQL – SELECT and UPDATE in a Single Query for Duplicate Records

Since there is a unique constraint/index on number, you can use ON DUPLICATE KEY UPDATE syntax - which can be combined with INSERT ... VALUES so all the rows are inserted/updated in one statement. The additional requirement that new null values do not update existing ones can be satisfied using COALESCE() function. Example:

INSERT INTO Contacts
    (number, name, city)
VALUES
    (1234567892, 'Earl Grey', 'Salem'),
    (3333333333, 'Rich Malfoy', 'Sacramento'),
    (7656453248, 'Bill Gates', 'New York'),
    (5555555555, 'James McGill', 'Boston'),
    (6666666666, NULL, 'Baton Rouge')
ON DUPLICATE KEY UPDATE
    name = COALESCE(VALUES(name), name),
    city = COALESCE(VALUES(city), city) ;

Test at SQLfiddle

Postgresql – Query works locally but not on AWS RDS

Our final solution was to split the query so we did pivoting and denormalization in separate steps. This is probably a cleaner solution anyway since it follows the "just do one thing" principle so will probably be easier to maintain.

Thanks for all of the help!

Best Answer

Related Solutions

MySQL – SELECT and UPDATE in a Single Query for Duplicate Records

Postgresql – Query works locally but not on AWS RDS

Related Question