PostgreSQL – Dynamically Convert hstore Keys to Columns

dynamic-sqlhstorepivotpostgresql

I have a database that stores a bunch of custom fields using hstore. In order to merge it into another database that doesn't support hstore, I'd like to split the keys into extra columns.

Users can add new custom fields and so I can't rely on knowledge of the keys ahead of time. Which makes the answer at "Attributes from an hstore-column as separate columns in a view?" not applicable for my problem.

Where a record doesn't have a key present in other records, it should get the same column with a null value.

How do I do this?

Best Answer

This can be done, very efficiently, too. Not in a single statement, though, since SQL demands to know the return type at call time. So you need two steps. The solution involves a number of advanced techniques ...

Assuming the same table as @Denver in his answer:

CREATE TABLE hstore_test (
  id serial PRIMARY KEY
, hstore_col hstore
);

Solution 1: Simple `SELECT`

After I wrote the crosstab solution below it struck me that a simple "brute force" solution is probably faster. Basically, the query @Denver already posted, built dynamically:

Step 1a: Generate query

SELECT format(
     'SELECT id, h->%s
      FROM  (SELECT id, hstore_col AS h FROM hstore_test) t;'
    , string_agg(quote_literal(key) || ' AS ' || quote_ident(key), ', h->')
   ) AS sql 
FROM  (
   SELECT DISTINCT key
   FROM   hstore_test, skeys(hstore_col) key
   ORDER  BY 1
   ) sub;

The subquery (SELECT id, hstore_col AS h FROM hstore_test) is just to get in the column alias h for your hstore column.

Step 1b: Execute query

This generates a query of the form:

SELECT id, h->'key1' AS key1, h->'key2' AS key2, h->'key3' AS key3
FROM  (SELECT id, hstore_col AS h FROM hstore_test) t;

Result:

 id | key1  | key2  | key3
----+-------+-------+-------
  1 | val11 | val12 | val13
  2 | val21 | val22 |
  3 |       |       |        -- for a row where hstore_col IS NULL

Solution 2: `crosstab()`

For lots of keys this may perform better. Probably not. You'll have to test. Result is the same as for solution 1.

You need the additional extension tablefunc which provides the crosstab() function. Read this first if you are not familiar:

PostgreSQL Crosstab Query

Step 2a: Generate query

SELECT format(
   $s$SELECT * FROM crosstab(
     $$SELECT h.id, kv.*
       FROM   hstore_test h, each(hstore_col) kv
       ORDER  BY 1, 2$$
   , $$SELECT unnest(%L::text[])$$
   ) AS t(id int, %s text);
   $s$
 , array_agg(key)  -- escapes strings automatically
 , string_agg(quote_ident(key), ' text, ')  -- needs escaping!
   ) AS sql 
FROM  (
   SELECT DISTINCT key
   FROM   hstore_test, skeys(hstore_col) key
   ORDER  BY 1
   ) sub;

Note the nested levels of dollar-quoting.

I use this explicit form in the main query instead of the short CROSS JOIN in the auxiliary query to preserve rows with empty or NULL hstore values:

LEFT   JOIN LATERAL each(hstore_col) kv ON TRUE

What is the difference between LATERAL and a subquery in PostgreSQL?

Dynamic alternative to pivot with CASE and GROUP BY

Step 2b: Execute query

This generates a query of the form:

SELECT * FROM crosstab(
     $$SELECT h.id, kv.*
       FROM   hstore_test h
       LEFT   JOIN LATERAL each(hstore_col) kv ON TRUE
       ORDER  BY 1, 2$$
   , $$SELECT unnest('{key1,key2,key3}'::text[])$$
   ) AS t(id int, key1 text, key2 text, key3 text);

You may want to inspect it for plausibility before running the first time. This should deliver optimized performance.

Notes

Both solutions work for any number of keys up to the physical limit of ~ 1600 columns in a Postgres table.
Both also work for keys of any shape or form up to the maximum length for identifiers, which is 63 bytes per default.
Besides the hstore function each() that was already mentioned by s.m., I also use the related function skeys() to identify keys.
Be sure to quote column names correctly to avoid possible SQL injection attacks by way of maliciously formed key names. I take care of that with quote_literal() and quote_ident().

Related Solutions

Postgresql – Use case for hstore

At the scope at which you're working. I think JSONB is ideal. It handles deeply nested structures and structures with array-keys. It's also standardized and in the spec for sql2016.

In addition, as I answered here, there is an extension that should help you with space consumption called ZSON,

ZSON is a PostgreSQL extension for transparent JSONB compression. Compression is based on a shared dictionary of strings most frequently used in specific JSONB documents (not only keys, but also values, array elements, etc).

In some cases ZSON can save half of your disk space and give you about 10% more TPS. Memory is saved as well. See docs/benchmark.md. Everything depends on your data and workload, though. Don't believe any benchmarks, re-check everything on your data, configuration, hardware, workload and PostgreSQL version.

You may want to look into ZSON.

PostgreSQL – Updating an hstore with Multiple New Columns

you can concatenate them:

set hstore_col = hstore('new_key_1',foo.new_value_1)||hstore('new_key_2',foo.new_value_2)

Alternatively use the constructor function that accepts two arrays:

set hstore_col = hstore(array['new_key_1', 'new_key_2'],array[foo.new_value_1, foo.new_value_2])

The second solution will be easier to extend when you have more keys then just two

If you don't want to overwrite the current value, then append that to the original one:

set hstore_col = hstore_col || hstore('new_key_1',foo.new_value_1)||hstore('new_key_2',foo.new_value_2)

set hstore_col = hstore_col || hstore(array['new_key_1', 'new_key_2'],array[foo.new_value_1, foo.new_value_2])