Postgresql – How to speed up a string manipulation query where I want to replace characters, extract certain values and update a table with the results

arrayoptimizationpostgresqlstring manipulationupdate

I am trying to extract information from strings that are presented in a key-value format, with the keys and values being separated by commas. I want to extract the values associated with certain keys and add them to dedicated columns in my table.

Some notes on the data:

  1. The keys I am interested in are connected, as in keyB relates to keyA;
  2. in some cases keyB or keyA may not exist
  3. If keyB doesn't exist, but keyA is something specific, then I can set the value for keyB anyway.

I have a solution that does what I want ([db-fiddle]), but it is painfully slow (9.6 hours) and I can't help think that there must be a better way as I've not been in this DB game long.

For info,
The table has ~8.2M rows and is hosted on AWS RDS on a t3.large DB (virtual CPUs = 2, Memory = 8.0GB).

Some pointers on where I can improve this are much appreciated.

Best Answer

You only need a single INSERT and a single UPDATE for the main table. The whole default detection and populating of the "variables" (columns) can be done while inserting the data:

INSERT INTO main_tags (id, tags, variablea, variableb)
select id, tags, var_a, 
       case 
          when var_b is null and 'valA' = any(tags)  then 'valB_default'
          when var_b is null and 'valA_xx' = any(tags) then 'valB_default_two'
          else var_b
       end as var_b
from (
  SELECT id, tags, 
         tags[array_position(tags, 'keyA') + 1] as var_a, 
         tags[array_position(tags, 'keyB') + 1] as var_b
  FROM (
    SELECT id, string_to_array(regexp_replace(tags,'[{}"]','','g'), ',') as tags
    FROM main
  ) as b 
) x  
WHERE tags && array['keyA','keyB'];

The final update can also handle the "unknown" value directly, no need to run three updates:

UPDATE main e
  SET variableA = coalesce(et.variableA, 'unknown'),
      variableB = coalesce(et.variableB, 'unknown')
FROM main_tags et
WHERE e.id = et.id;

One way to speed that up, is to create an index on main_tags (id)


You can actually get rid of the temp table completely and do everything in a single statement. This might or might not be faster, but I think it's worth trying:

UPDATE main e
  SET variableA = coalesce(et.var_a, 'unknown'),
      variableB = coalesce(et.var_b, 'unknown')
FROM (
  select id, tags, 
         var_a, 
         case 
            when var_b is null and 'valA' = any(tags)  then 'valB_default'
            when var_b is null and 'valA_xx' = any(tags) then 'valB_default_two'
            else var_b
         end as var_b
  from (
    SELECT id, 
           tags, 
           tags[array_position(tags, 'keyA') + 1] as var_a, 
           tags[array_position(tags, 'keyB') + 1] as var_b
    FROM (
      SELECT id, string_to_array(regexp_replace(tags,'[{}"]','','g'), ',') as tags
      FROM main
    ) as b 
  ) x  
  WHERE tags && array['keyA','keyB']
) as et
WHERE e.id = et.id;