PostgreSQL View – Convert HSTORE to EAV Using Each or Lateral

Tags: postgresql, view

I have a table with ~100M rows, all of which contain an HSTORE field. I am creating a view that expands (converts) the hstore column into key | value columns and also adds a few fields for filtering.

id | hstore                          id | user_key | user_value
--------------------------           --------------------------------
11 | "a" => "2", "b" => "z"    =>    11 | a        | 2
                                     11 | b        | z
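For context, each() by itself performs exactly this expansion on a single hstore value. A minimal standalone illustration (just a literal, not my real table):

SELECT *
FROM each('"a" => "2", "b" => "z"'::hstore);

--  key | value
-- -----+-------
--  a   | 2
--  b   | z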

This view will be used by an export tool which knows SQL, but doesn't grok hstore or JSON.

There are two reasonably nice solutions I have found, both using each(), but I am curious whether there are any drawbacks to one approach over the other, especially considering the desire to remove empty rows.

1) Using each()

To filter rows without user data I would need to compare the underlying hstore against empty (user_data != ''::hstore), as sketched after the view definition below.

CREATE VIEW v1 AS
SELECT 
    t.id, 
    t.filterable_column,
    s.owner, 
    (each(t.user_data)).key, 
    (each(t.user_data)).value
FROM 
    public.hugetable AS t
        LEFT JOIN public.smalltable s ON s.id = t.s_id
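If I needed to drop the empty rows with this version, I believe it would look something like this (a sketch; the view name is just for illustration):

CREATE VIEW v1_nonempty AS
SELECT 
    t.id, 
    t.filterable_column,
    s.owner, 
    (each(t.user_data)).key, 
    (each(t.user_data)).value
FROM 
    public.hugetable AS t
        LEFT JOIN public.smalltable s ON s.id = t.s_id
WHERE 
    t.user_data != ''::hstore;  -- compare against an empty hstore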

2) Using a LATERAL left join

In this case I can use an INNER join on the lateral to remove empty rows (rows with no user data), as sketched below the view. Seems cleaner.

CREATE VIEW v2 AS
SELECT 
    t.id, 
    t.filterable_column,
    s.owner, 
    u.key,
    u.value
FROM
    public.hugetable AS t
        LEFT JOIN LATERAL each(t.user_data) AS u ON TRUE
        LEFT JOIN public.smalltable s ON s.id = t.s_id
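The inner-join variant I have in mind would be something like this (again a sketch; INNER JOIN LATERAL ... ON TRUE is equivalent to CROSS JOIN LATERAL, and the view name is just for illustration):

CREATE VIEW v2_nonempty AS
SELECT 
    t.id, 
    t.filterable_column,
    s.owner, 
    u.key,
    u.value
FROM
    public.hugetable AS t
        INNER JOIN LATERAL each(t.user_data) AS u ON TRUE  -- rows with empty user_data produce no output
        LEFT JOIN public.smalltable s ON s.id = t.s_id;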

I can run EXPLAIN on both, but I am not sure how to interpret the output, since the difference seems small. Perhaps they are essentially the same, or one is fundamentally worse than the other (more work, loops, or inefficiency). I have no familiarity with each() or lateral joins, but am reading up as I can.
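For reference, this is roughly how I am generating the plans (the BUFFERS option is just something I am assuming is useful for comparing I/O):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM v1;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM v2;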

Best Answer

Experimental approach

Tested on PostgreSQL 9.6.2, with an out-of-the-box configuration, on Mac OS X 10.12, with an SSD as storage.

We create the small and huge tables (I call them s and t, for simplicity).
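(If the hstore extension is not already installed in the database, it needs to be created first; I'm assuming a role with the required privileges here.)

CREATE EXTENSION IF NOT EXISTS hstore ;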

CREATE TABLE s
(
    s_id integer primary key,
    owner text
) ;

CREATE TABLE t
(
   t_id integer primary key,
   s_id integer REFERENCES s(s_id) ON UPDATE CASCADE ON DELETE RESTRICT,
   filterable_column integer,
   user_data hstore
) ;

I've assumed that t_id and s_id were the primary keys of the respective tables (t and s) and that t.s_id references s(s_id). This setup creates the proper indices.
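Note that the FOREIGN KEY does not create an index on t.s_id by itself; if that join ever became a bottleneck, one could add one (optional, not needed for the tests below; the index name is arbitrary):

CREATE INDEX t_s_id_idx ON t (s_id) ;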

We populate the tables with some data (2 M rows in t, to make timings a bit more reasonable; I've also tested with 1 M rows, and the comparisons continue to hold).

-- 19999 rows into s    
INSERT INTO
    s
    (s_id, owner)
SELECT
    s_id, 'owner' || s_id AS owner
FROM
    generate_series(1, 19999) s0(s_id) ;

-- 2 M rows into t
INSERT INTO
    t
    (t_id, s_id, filterable_column, user_data)
SELECT
    t_id, 
    (random()*19998+1)::integer AS s_id,
    (random()*19998+1)::integer AS filterable_column,
    ( 'a => ' || (random()*99+1)::integer || 
     ',b => ' || (random()*99+1)::integer)::hstore AS user_data
FROM
    generate_series(1, 2e6) AS s0(t_id) ;
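After the bulk load, it may also be worth running ANALYZE so the planner works with fresh statistics (autovacuum would get there eventually; doing it explicitly just makes the timings more reproducible):

ANALYZE s ;
ANALYZE t ;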

We create now the two views:

CREATE VIEW v1 AS
SELECT 
    t.t_id, 
    t.filterable_column,
    s.owner, 
    (each(t.user_data)).key, 
    (each(t.user_data)).value
FROM 
    t
    LEFT JOIN s ON s.s_id = t.s_id ;

CREATE VIEW v2 AS
SELECT 
    t.t_id, 
    t.filterable_column,
    s.owner, 
    u.key,
    u.value
FROM
    t
    LEFT JOIN LATERAL each(t.user_data) AS u ON TRUE
    LEFT JOIN  s ON s.s_id = t.s_id ;

We perform a sanity check to make sure v1 and v2 give actually the same results:

-- Sanity check: do both views return the same rows?
SELECT
    (SELECT count(*) FROM
      (
        SELECT * FROM v1
        EXCEPT
        SELECT * FROM v2
      ) AS s0
    ) = 0 
    AND
    (SELECT count(*) FROM
      (
        SELECT * FROM v2
        EXCEPT
        SELECT * FROM v1
      ) AS s0
    ) = 0   AS v1_equals_v2 ;
-- Response: yes, they are
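Since EXCEPT eliminates duplicate rows, a plain row-count comparison can serve as an extra check (an additional query on top of the original test):

SELECT
    (SELECT count(*) FROM v1) = (SELECT count(*) FROM v2) AS same_row_count ;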

At this point, we compare how long it takes to read both views in full:

EXPLAIN ANALYZE
SELECT * FROM v1 ;

EXPLAIN ANALYZE
SELECT * FROM v2 ;

Comparing the execution plans and timings of the two: v1 takes 18.857 s and v2 takes 13.697 s, a ratio of approximately 10:7. v2 performs better when you need to read all the data.

When you need to check for a range of values (for t_id):

EXPLAIN ANALYZE
SELECT * FROM v1
WHERE t_id BETWEEN 200000 and 210000 ;
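and the equivalent query against v2:

EXPLAIN ANALYZE
SELECT * FROM v2
WHERE t_id BETWEEN 200000 and 210000 ;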

The resulting plans show timings of 125 ms for v1 and 89 ms for v2. Again, about a 10:7 ratio.

Conclusion

So, from a purely experimental point of view, I would use v2; it is about 30% faster.

You can check all the tests (with less data) at rextester. (The difference in timings is much smaller when the amount of data is reduced.)