I have a table with ~100M rows, all of which contain `hstore` fields. I am creating a view that will convert the `hstore` column to `key | value` columns and also add a few fields for filtering.
```
 id | hstore                        id | user_key | user_value
----+------------------------      ----+----------+-----------
 11 | "a" => "2", "b" => "z"  =>    11 | a        | 2
                                    11 | b        | z
```
This view will be used by an export tool which knows SQL, but doesn't grok hstore or JSON.
There are two reasonably nice solutions I have found, both using `each()`, but I am curious whether there are any drawbacks to one approach over the other, especially considering the desire to remove empty rows.
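For reference, `each()` expands an hstore value into a set of key/value rows:

```sql
SELECT * FROM each('a => 2, b => z'::hstore);
-- key | value
-- ----+------
--  a  | 2
--  b  | z
```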
1) Using `each()`

To filter out rows without user data I would need to compare the underlying hstore to empty (`user_data != ''::hstore`).
```sql
CREATE VIEW v1 AS
SELECT
    t.id,
    t.filterable_column,
    s.owner,
    (each(t.user_data)).key,
    (each(t.user_data)).value
FROM
    public.hugetable AS t
    LEFT JOIN public.smalltable AS s ON s.id = t.s_id;
```
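As a sketch of the filtering mentioned above (the view name `v1_nonempty` is made up for illustration), the empty-hstore comparison could be folded into the view like this:

```sql
-- Hypothetical variant of v1 that excludes rows whose hstore
-- is NULL or empty, using the comparison described above
CREATE VIEW v1_nonempty AS
SELECT
    t.id,
    t.filterable_column,
    s.owner,
    (each(t.user_data)).key,
    (each(t.user_data)).value
FROM
    public.hugetable AS t
    LEFT JOIN public.smalltable AS s ON s.id = t.s_id
WHERE
    t.user_data IS NOT NULL
    AND t.user_data != ''::hstore;
```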
2) Using a `LATERAL` left join

In this case I can use an inner join on the lateral to remove empty rows (rows with no user data), which seems cleaner.
```sql
CREATE VIEW v2 AS
SELECT
    t.id,
    t.filterable_column,
    s.owner,
    u.key,
    u.value
FROM
    public.hugetable AS t
    LEFT JOIN LATERAL each(t.user_data) AS u ON TRUE
    LEFT JOIN public.smalltable AS s ON s.id = t.s_id;
```
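The inner-join variant alluded to above could be sketched as follows (the view name `v2_nonempty` is made up; `CROSS JOIN LATERAL` is equivalent to an inner lateral join with `ON TRUE`):

```sql
-- Hypothetical variant of v2: the inner lateral join drops any row
-- for which each() returns no key/value pairs (empty or NULL hstore)
CREATE VIEW v2_nonempty AS
SELECT
    t.id,
    t.filterable_column,
    s.owner,
    u.key,
    u.value
FROM
    public.hugetable AS t
    CROSS JOIN LATERAL each(t.user_data) AS u
    LEFT JOIN public.smalltable AS s ON s.id = t.s_id;
```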
I can run `EXPLAIN` on them, but I am not sure how to interpret the output, since the difference seems small. Perhaps they are essentially the same, or one is fundamentally worse than the other (more work, loops, or inefficiency). I have no familiarity with `each()` or lateral joins, but am reading up as I can.
Best Answer
Experimental approach
Tested on PostgreSQL 9.6.2, with an out-of-the-box configuration, on Mac OS X 10.12, with an SSD as storage.
We create the small and huge tables (I call them `s` and `t`, for simplicity). I've assumed that `t_id` and `s_id` were the primary keys of the respective tables (`t` and `s`), and that `s_id` references `s(s_id)`. This setup creates the proper indices.

We populate the tables with some data (2 M rows in `t`, to make the timings a bit more reasonable; I've also tested with 1 M rows, and the comparisons continue to hold).

We now create the two views.

We perform a sanity check to make sure that `v1` and `v2` actually give the same results.

At this point, we compare how long it takes to read both views.
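The answer's setup code is not reproduced here; a minimal sketch along the lines described might look like the following (all column types, data distributions, and the synthetic-data generators are my assumptions, and the views from the question would be adapted to the `t`/`s` and `t_id`/`s_id` naming):

```sql
-- Hypothetical reconstruction of the test setup described above
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE s (
    s_id  integer PRIMARY KEY,
    owner text
);

CREATE TABLE t (
    t_id              integer PRIMARY KEY,
    filterable_column integer,
    s_id              integer REFERENCES s (s_id),
    user_data         hstore
);

-- Populate with synthetic data: a small s, 2 M rows in t
INSERT INTO s
SELECT i, 'owner_' || i
FROM generate_series(1, 100) AS i;

INSERT INTO t
SELECT i,
       i % 1000,
       1 + i % 100,
       hstore('a', (i % 10)::text) || hstore('b', (i % 7)::text)
FROM generate_series(1, 2000000) AS i;

ANALYZE s;
ANALYZE t;

-- Sanity check: zero rows from each EXCEPT means the views agree
SELECT * FROM v1 EXCEPT SELECT * FROM v2;
SELECT * FROM v2 EXCEPT SELECT * FROM v1;

-- Compare the cost of reading both views in full
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM v1;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM v2;
```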
The execution plans and timings are this for `v1`, and this for `v2`. The timing for `v1` is 18.857 s, and for `v2` it is 13.697 s. The ratio is approximately 10:7. `v2` performs better when you need to read all the data.

When you need to check for a range of values (of `t_id`):
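The range query itself is not shown; a sketch of it (the bounds are made up, and I assume the views expose the key as `id`, as in the question) would be:

```sql
-- Hypothetical range query, exercising the primary-key index
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM v1 WHERE id BETWEEN 100000 AND 101000;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM v2 WHERE id BETWEEN 100000 AND 101000;
```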
You get this plan for `v1`, and this plan for `v2`. Timings are 125 ms and 89 ms. Again, about a 10:7 ratio.
Conclusion
So, from a purely experimental point of view, I would use `v2`; it is about 30% faster. You can check all the tests (with less data) at rextester. (The difference in timings is much smaller when the number of data points is reduced.)