Postgresql – UPDATE Query replacing string primary key taking days to resolve


I'm working with a DB (not designed by me) whose primary key is a 47-character string, which is causing problems because the table has over 20 million rows, and that number increases daily. I'm currently working from my development laptop, which is definitely not as powerful as the server, but I only downloaded part of the DB to test changes and execution speed.

The pk has 4 parts. Two are static, so I'm hoping to cut them; one is a semi-unique identifier (i.e. it should be unique but is not guaranteed to be) that could be stored as a bigint; and the other is a UUID-like value that is a pk in another table (with ~30,000 rows) and could be stored as bytea and rendered as a hex string when needed for output. So I added serials to both tables, but I need to keep the semi-unique identifier and the reference to the other table, as they will be necessary for output purposes, although after this migration they should be ignorable for joins etc.
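For reference, the added columns look roughly like this (a sketch only; the column names come from the queries below, but the exact types here are approximate):

ALTER TABLE foreigntable ADD COLUMN new_id bigserial;      -- new surrogate key
ALTER TABLE bigtable ADD COLUMN new_id bigserial;          -- new surrogate key
ALTER TABLE bigtable ADD COLUMN foreign_key bigint;        -- will point at foreigntable.new_id
ALTER TABLE bigtable ADD COLUMN semi_unique_id bigint;     -- numeric part of the old pk
ALTER TABLE bigtable ADD COLUMN old_foreign_pk bytea;      -- hex part of the old pk, stored as bytes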

So I wrote the following query (tables etc. renamed):

UPDATE bigtable SET foreign_key = foreigntable.new_id,
semi_unique_id = (split_part(bigtable.old_id,'.',3))::bigint
FROM foreigntable
WHERE foreigntable.old_id = split_part(bigtable.old_id,'.',2);

I tested this query on 50 rows (adding AND bigtable.new_id IN (1,2,3....) to the WHERE clause), and extrapolating from that, it looked like the full update would take several days to run.
I thought maybe it was too much work to do the lookups this way, so perhaps I should do it in parts: a first query that simply splits out the two pieces of information I need, and a second that actually pulls the information I want from the other table.

So it would look something like this:

UPDATE bigtable SET old_foreign_pk = decode(split_part(old_id,'.',2),'hex'), 
semi_unique_id = (split_part(old_id,'.',3))::bigint;

followed by

UPDATE bigtable SET foreign_key = foreigntable.new_id
FROM foreigntable
WHERE foreigntable.old_id = split_part(bigtable.old_id,'.',2);

However, running the query on 50-100 rows (using the method above) suggests it will take 1-2 days to run. That kind of downtime is not really acceptable. Can these queries be optimized, or is this simply the cost of having had a bad pk for so long?

Edit:

The old pk is stored as character varying and has this form:

log.501ad896c4960b089dd8638f.1344902965419.txt
log.4fe5242a785d0b080de707fa.1344903775142.txt

The first of these numbers uniquely identifies a row in another table, and the second is theoretically unique but not guaranteed to be. So the first is essentially a 96-bit unique identifier stored as a hex string (which is why it's only UUID-like), whereas the second is bigger than an integer but smaller than a bigint. The two, when combined, reference files, so they can't be discarded, but the current representation is not very efficient in terms of storage.
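To make that concrete, here is what the pieces look like for the first sample value (just an illustrative SELECT using the same split_part/decode calls as the queries above):

SELECT split_part('log.501ad896c4960b089dd8638f.1344902965419.txt', '.', 2) AS hex_part,
       decode(split_part('log.501ad896c4960b089dd8638f.1344902965419.txt', '.', 2), 'hex') AS as_bytea,
       (split_part('log.501ad896c4960b089dd8638f.1344902965419.txt', '.', 3))::bigint AS semi_unique_id;

This yields '501ad896c4960b089dd8638f' (24 hex digits = 96 bits, so 12 bytes as bytea) and 1344902965419 (too big for an integer, but fine as a bigint).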

Best Answer

The first query can run fairly fast, but you'd need some temporary functional indexes for the join:

create index bigtable_old_id_2 on bigtable ( split_part(old_id,'.',2) );
create index foreigntable_old_id on foreigntable (old_id);

This will allow Postgres to quickly find the matching rows when joining the two tables.
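Before running the full update, you can double-check that the planner actually uses them, e.g.:

EXPLAIN
UPDATE bigtable SET foreign_key = foreigntable.new_id
FROM foreigntable
WHERE foreigntable.old_id = split_part(bigtable.old_id,'.',2);

and look for an index scan on foreigntable_old_id (a hash join over both tables is also a reasonable plan when updating every row).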

You can drop these indexes afterwards if you don't need them anymore.
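For example, once the migration is done:

drop index bigtable_old_id_2;
drop index foreigntable_old_id;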