PostgreSQL – Optimising query on view that merges similar tables with a clear discriminator

optimization, performance, postgresql, query-performance

Using PostgreSQL 8.4, I have a number of tables that have a very similar structure, but that belong to different categories:

CREATE TABLE table_a (
    id SERIAL PRIMARY KEY,
    event_time TIMESTAMP WITH TIME ZONE NOT NULL,
    value_a REAL
);

CREATE TABLE table_b (
    id SERIAL PRIMARY KEY,
    event_time TIMESTAMP WITH TIME ZONE NOT NULL,
    value_b REAL
);

CREATE TABLE table_c (
    id SERIAL PRIMARY KEY,
    event_time TIMESTAMP WITH TIME ZONE NOT NULL,
    value_c REAL
);

I need to link these values to a central table (using joins or sub-selects depending on the query):

CREATE TABLE periods_table (
    id SERIAL PRIMARY KEY,
    start_time TIMESTAMP WITH TIME ZONE NOT NULL,
    end_time TIMESTAMP WITH TIME ZONE NOT NULL,
    category TEXT NOT NULL
);

Here, category is one of 'Category A', 'Category B' or 'Category C'.

In order to abstract the similarities between the A, B and C tables, I've created a view:

CREATE VIEW table_values AS
    SELECT 'Category A' AS category, event_time, value_a AS value
        FROM table_a
    UNION
    SELECT 'Category B' AS category, event_time, value_b AS value
        FROM table_b
    UNION
    SELECT 'Category C' AS category, event_time, value_c AS value
        FROM table_c;

A typical query would be something like:

SELECT p.start_time, p.end_time, p.category,
       (SELECT SUM(v.value) FROM table_values v
           WHERE v.category=p.category
             AND v.event_time >= p.start_time AND v.event_time < p.end_time)
    FROM periods_table p;

The problem is that the category column, which could be used to discriminate between the underlying tables in the view, is only applied as a filter after the whole view has been evaluated.

Even an EXPLAIN ANALYZE on SELECT * FROM table_values WHERE category='Category A' shows that all 3 tables are scanned, even though rows matching this criterion can only ever come from table_a:

                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan table_values  (cost=176.02..295.50 rows=27 width=44) (actual time=0.135..0.135 rows=0 loops=1)
   Filter: (table_values.category = 'Category A'::text)
   ->  HashAggregate  (cost=176.02..229.12 rows=5310 width=12) (actual time=0.119..0.119 rows=0 loops=1)
         ->  Append  (cost=0.00..136.20 rows=5310 width=12) (actual time=0.089..0.089 rows=0 loops=1)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..45.40 rows=1770 width=12) (actual time=0.025..0.025 rows=0 loops=1)
                     ->  Seq Scan on table_a  (cost=0.00..27.70 rows=1770 width=12) (actual time=0.010..0.010 rows=0 loops=1)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..45.40 rows=1770 width=12) (actual time=0.020..0.020 rows=0 loops=1)
                     ->  Seq Scan on table_b  (cost=0.00..27.70 rows=1770 width=12) (actual time=0.006..0.006 rows=0 loops=1)
               ->  Subquery Scan "*SELECT* 3"  (cost=0.00..45.40 rows=1770 width=12) (actual time=0.020..0.020 rows=0 loops=1)
                     ->  Seq Scan on table_c  (cost=0.00..27.70 rows=1770 width=12) (actual time=0.006..0.006 rows=0 loops=1)
 Total runtime: 0.437 ms
(11 rows)

(There's no data in these dummy tables, but the actual tables have about 300000 rows. There, SELECT * FROM table_values WHERE category='Category A' takes about 15 times longer than SELECT * FROM table_a, even though they return more or less the same rows.)

There are indexes on event_time in each table, but since there can't be an index on the view's category column, this doesn't help. I've also tried replacing the view with a CTE (since that sometimes leads to a different query plan), but it didn't help either.
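For reference, the CTE variant was roughly of this shape (a sketch using the three-table example from above, with an illustrative CTE name; presumably it didn't help because the CTE is evaluated as a whole before the category filter is applied):

WITH all_values AS (
    SELECT 'Category A' AS category, event_time, value_a AS value
        FROM table_a
    UNION
    SELECT 'Category B' AS category, event_time, value_b AS value
        FROM table_b
    UNION
    SELECT 'Category C' AS category, event_time, value_c AS value
        FROM table_c
)
SELECT * FROM all_values WHERE category = 'Category A';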

Considering that I can't really change the existing tables, is there a way to "merge" a few tables like these that would lead to faster queries?

EDIT: Similar query on actual data. (There are in fact 5 similar tables here.)

Interestingly, although I'm querying on "Category E" here, the sort key refers to "Category A", which doesn't come from anywhere specific in the query (I guess it must come from the first SELECT in the view, or perhaps the planner just uses the value from the first SELECT to label the column).

EXPLAIN ANALYZE SELECT * FROM table_values WHERE category='Category E':

Subquery Scan table_values  (cost=1573543.53..1755714.30 rows=40482 width=44) (actual time=221030.235..221234.162 rows=317676 loops=1)
  Filter: (table_values.category = 'Category E'::text)
  ->  Unique  (cost=1573543.53..1654508.32 rows=8096479 width=12) (actual time=212999.276..220240.297 rows=8097555 loops=1)
        ->  Sort  (cost=1573543.53..1593784.72 rows=8096479 width=12) (actual time=212999.275..218561.085 rows=8097555 loops=1)
              Sort Key: ('Category A'::text), "*SELECT* 1".event_time, "*SELECT* 1".value
              Sort Method:  external merge  Disk: 300792kB
              ->  Append  (cost=0.00..229411.58 rows=8096479 width=12) (actual time=0.014..4683.734 rows=8097555 loops=1)
                    ->  Subquery Scan "*SELECT* 1"  (cost=0.00..80689.62 rows=2847831 width=12) (actual time=0.014..954.326 rows=2847951 loops=1)
                          ->  Seq Scan on table_a  (cost=0.00..52211.31 rows=2847831 width=12) (actual time=0.010..607.528 rows=2847951 loops=1)
                    ->  Subquery Scan "*SELECT* 2"  (cost=0.00..29304.52 rows=1033976 width=12) (actual time=9.738..576.803 rows=1034928 loops=1)
                          ->  Seq Scan on table_b  (cost=0.00..18964.76 rows=1033976 width=12) (actual time=9.737..450.619 rows=1034928 loops=1)
                    ->  Subquery Scan "*SELECT* 3"  (cost=0.00..30463.22 rows=1075161 width=12) (actual time=15.100..720.983 rows=1075157 loops=1)
                          ->  Seq Scan on table_c  (cost=0.00..19711.61 rows=1075161 width=12) (actual time=15.099..592.070 rows=1075157 loops=1)
                    ->  Subquery Scan "*SELECT* 4"  (cost=0.00..79952.70 rows=2821835 width=12) (actual time=20.098..1794.739 rows=2821843 loops=1)
                          ->  Seq Scan on table_d  (cost=0.00..51734.35 rows=2821835 width=12) (actual time=20.097..1441.719 rows=2821843 loops=1)
                    ->  Subquery Scan "*SELECT* 5"  (cost=0.00..9001.52 rows=317676 width=12) (actual time=0.016..108.768 rows=317676 loops=1)
                          ->  Seq Scan on table_e  (cost=0.00..5824.76 rows=317676 width=12) (actual time=0.016..69.732 rows=317676 loops=1)
Total runtime: 221299.573 ms

EXPLAIN ANALYZE SELECT * FROM table_e:

Seq Scan on table_e  (cost=0.00..5824.76 rows=317676 width=12) (actual time=0.025..54.143 rows=317676 loops=1)
Total runtime: 67.624 ms

Best Answer

There are a few problems with your query. It is quite clear that your 'wrapper' view, while it looks like an elegant solution at first, kills performance by pulling in roughly 7.7 million completely unnecessary rows. This is because UNION requires all of this data to be sorted in order to remove duplicates, and since it doesn't fit into memory (you can see this from Sort Method: external merge Disk: 300792kB), it is spilled to disk and sorted there, which is a very slow process.

As a first attempt, try recreating the 'wrapper' view with UNION ALL instead of a plain UNION (you can find the difference here; note that in order to make the rows distinct, Postgres has to sort them first); this way you may avoid the sort entirely. If the result is still not good enough, try joining each of the five tables to the 'main' query one by one and UNION ALL-ing the results.
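The recreated view would look roughly like this (a sketch based on the three-table example in the question; the real view would list all five tables):

CREATE OR REPLACE VIEW table_values AS
    SELECT 'Category A' AS category, event_time, value_a AS value
        FROM table_a
    UNION ALL
    SELECT 'Category B' AS category, event_time, value_b AS value
        FROM table_b
    UNION ALL
    SELECT 'Category C' AS category, event_time, value_c AS value
        FROM table_c;

And the per-table variant of the example query could look something like this (again just a sketch against the three-table schema, extended to all five tables in practice; the primary key is included in the GROUP BY so that identical periods are not merged):

SELECT p.start_time, p.end_time, p.category, SUM(a.value_a) AS value
    FROM periods_table p
    LEFT JOIN table_a a
        ON a.event_time >= p.start_time AND a.event_time < p.end_time
    WHERE p.category = 'Category A'
    GROUP BY p.id, p.start_time, p.end_time, p.category
UNION ALL
SELECT p.start_time, p.end_time, p.category, SUM(b.value_b)
    FROM periods_table p
    LEFT JOIN table_b b
        ON b.event_time >= p.start_time AND b.event_time < p.end_time
    WHERE p.category = 'Category B'
    GROUP BY p.id, p.start_time, p.end_time, p.category
UNION ALL
SELECT p.start_time, p.end_time, p.category, SUM(c.value_c)
    FROM periods_table p
    LEFT JOIN table_c c
        ON c.event_time >= p.start_time AND c.event_time < p.end_time
    WHERE p.category = 'Category C'
    GROUP BY p.id, p.start_time, p.end_time, p.category;

Each branch touches only one table, so the planner is free to use that table's event_time index instead of evaluating the whole view first.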