Postgresql – Optimizing query using view on EAV structure

An application is writing into a database that follows an EAV structure, similar to this:

CREATE TABLE item (
    id INTEGER PRIMARY KEY,
    description TEXT
);

CREATE TABLE item_attr (
    item INTEGER REFERENCES item(id),
    name TEXT,
    value INTEGER,
    PRIMARY KEY (item, name)
);

INSERT INTO item VALUES (1, 'Item 1');
INSERT INTO item_attr VALUES (1, 'height', 20);
INSERT INTO item_attr VALUES (1, 'width', 30);
INSERT INTO item_attr VALUES (1, 'weight', 40);
INSERT INTO item VALUES (2, 'Item 2');
INSERT INTO item_attr VALUES (2, 'height', 10);
INSERT INTO item_attr VALUES (2, 'weight', 35);

(I gather that EAV is a bit controversial, but this question isn't about EAV: this legacy application can't be changed anyway.)

There can be a number of attributes, but usually, up to 200 attributes per items (often similar). Out of these 200 attributes, there's a group of about 25 that are more common than the others and that are used more often in queries.

To make it easier to write new queries based on some of those 25 attributes (the requirements tend to change and I need to be flexible), I have written a view that joins the attribute table for these 25 attributes. Following the example above, this looks like this:

CREATE VIEW exp_item AS SELECT
   i.id AS id,
   i.description AS description,
   ia_height.value AS height,
   ia_width.value AS width,
   ia_weight.value AS weight,
   ia_depth.value AS depth
FROM item i
  LEFT JOIN item_attr ia_height ON i.id=ia_height.item AND ia_height.name='height'
  LEFT JOIN item_attr ia_width ON i.id=ia_width.item AND ia_width.name='width'
  LEFT JOIN item_attr ia_weight ON i.id=ia_weight.item AND ia_weight.name='weight'
  LEFT JOIN item_attr ia_depth ON i.id=ia_depth.item AND ia_depth.name='depth';

A typical report would only make use of a few of those 25 attributes, for example:

SELECT id, description, height, width FROM exp_item;

Some of these queries are not quite as fast as I wish they were. Using EXPLAIN, I have noticed that the joins on the unused columns were still made, which, on about 25 joins when only 3 or 4 attributes are used, is causing an unnecessary degradation in performance.

Of course, performing all the LEFT JOINs in the view is normal, but I'm wondering if there would be a way to keep this view (or something similar: I'm mainly interested in using a view to simplify the way I refer to attributes, more or less as if they were columns) and to avoid (automatically) to use joins on the unused attributes for a particular query.

The only workaround I've found so far is to define a specific view for each of these queries, that only joins based on the attributes that are used. (This does improve the speed, as expected, but requires more programming of views every time, thus a bit less flexibility.)

Is there a better way to do this? (It there a better way to "pretend" the EAV structure is a single well-structured table, from the point of view of writing the queries, and not to have to make these unnecessary left joins?)

I'm using PostgreSQL 8.4. There are about 10K rows in item and about 500K rows in item_attr. I wouldn't expect more than 80K rows in item and a 4M rows in item_attr, which I believe a modern system can handle without too much problem. (Comments regarding other RDBMS/versions are welcome too.)

EDIT: Just to expand on the usage of indices in this example.

The PRIMARY KEY (item, name) implicitly creates an index on (item, name), as documented in the CREATE TABLE documentation. Considering that both item and name are used with an equality constraint in the JOIN, this index seems suitable according to the documentation on multi-column indexes.

The following example shows this index seems to be used, as expected, without any explicit additional index:

EXPLAIN SELECT id, description, height, width FROM exp_item WHERE width < 100;

                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=28.50..203.28 rows=10 width=20)
   ->  Nested Loop Left Join  (cost=28.50..196.73 rows=10 width=16)
         ->  Nested Loop Left Join  (cost=28.50..190.18 rows=10 width=16)
               ->  Hash Join  (cost=28.50..183.64 rows=10 width=16)
                     Hash Cond: (ia_width.item = i.id)
                     ->  Seq Scan on item_attr ia_width  (cost=0.00..155.00 rows=10 width=8)
                           Filter: ((value < 100) AND (name = 'width'::text))
                     ->  Hash  (cost=16.00..16.00 rows=1000 width=12)
                           ->  Seq Scan on item i  (cost=0.00..16.00 rows=1000 width=12)
               ->  Index Scan using item_attr_pkey on item_attr ia_depth  (cost=0.00..0.64 rows=1 width=4)
                     Index Cond: ((i.id = ia_depth.item) AND (ia_depth.name = 'depth'::text))
         ->  Index Scan using item_attr_pkey on item_attr ia_weight  (cost=0.00..0.64 rows=1 width=4)
               Index Cond: ((i.id = ia_weight.item) AND (ia_weight.name = 'weight'::text))
   ->  Index Scan using item_attr_pkey on item_attr ia_height  (cost=0.00..0.64 rows=1 width=8)
         Index Cond: ((i.id = ia_height.item) AND (ia_height.name = 'height'::text))

Best Answer

This is one (of many) downsides of EAV designs.

You can't really improve the JOIN: because of the necessary complexity, a cost based optimiser won't get to the perfect plan. It finds "good enough"

Suggestions:

don't use a view: use aggregate type queries (eg COUNT(*) = 2 if I match both height and weight)
use a trigger to maintain a real (or sparse) table and query that

The first option scales better becauses a few indexes on the main EAV fact table can cover all queries nicely.

Best Answer

Related Solutions

Postgresql – Optimizing ORDER BY in a full text search query

Postgresql – Running a UPDATE on 3 million records

Related Question