Postgresql – Postgres: “Pivot” unioned table based on one column

pivotpostgresqlunion

I have a query that does this sort of thing:

(SELECT rel_id, timestmap, y FROM table_1 AS full_
WHERE full_.timestamp BETWEEN %s AND %s
ORDER BY full_.timestamp)

UNION ALL

(SELECT rel_id, timestamp, y FROM table_2 AS full_
WHERE full_.timestamp BETWEEN %s AND %s
ORDER BY full_.timestamp)

UNION ALL

...

Which results in this form:

rel_id   timestamp    y
1        2013-01-01   a
1        2013-01-02   b
1        2013-01-03   c
1        2013-01-04   d
2        2013-01-01   e
2        2013-01-04   f
2        2013-01-06   g

What I'm doing here is directly selecting from partition tables, which are partitioned based on the ID of a related table (so, the first part selects from table_1, and that is all the rows that have a relationship to another table's row with ID 1, and so on). This works very well, and is fast (and avoids having to get the query planner to figure out which partition it should be looking at).

The problem is, to process it (using python pandas library), I need it in this form:

timestamp   1    2
2013-01-01  a    e
2013-01-02  b    null
2013-01-03  c    null
2013-01-04  d    f
2013-01-06  null g

I know it is possible to select the y column of each sub query as the rel_id (so as to make the name of the column the same as that of the relationship ID) – e.g. SELECT timestamp, y AS 1... – but I don't know how to 'merge' the resulting series of timestamp/values together in this way so that there are null values for a series that is missing a value at a given timestamp index.

The Pandas library can do this really easily, but it is pretty slow (slower than fetching the data in the first place), so I want to see if it is faster to make use of postgres's power…

Is this possible?
Is it likely to be fast?
Is it likely to be faster than processing in Python (my feeling is it will be)?

Best Answer

If it is faster you have to measure for yourself. However, doing it on the db side sends less data across, so I would assume it to be faster.

The pivot itself is fairly simple. I put your query's result in a table to make the example simpler.

SQL Fiddle

PostgreSQL 9.1.9 Schema Setup:

CREATE TABLE your_query
    ("rel_id" int, "timestamp" timestamp, "y" varchar(1))
;

INSERT INTO your_query
    ("rel_id", "timestamp", "y")
VALUES
    (1, '2013-01-01 00:00:00', 'a'),
    (1, '2013-01-02 00:00:00', 'b'),
    (1, '2013-01-03 00:00:00', 'c'),
    (1, '2013-01-04 00:00:00', 'd'),
    (2, '2013-01-01 00:00:00', 'e'),
    (2, '2013-01-04 00:00:00', 'f'),
    (2, '2013-01-06 00:00:00', 'g')
;

First step is to only return one row per date. That is simply done with a group by:

Query 1:

SELECT timestamp
  FROM your_query
 GROUP BY timestamp
 ORDER BY timestamp

Results:

|                      TIMESTAMP |
|--------------------------------|
| January, 01 2013 00:00:00+0000 |
| January, 02 2013 00:00:00+0000 |
| January, 03 2013 00:00:00+0000 |
| January, 04 2013 00:00:00+0000 |
| January, 06 2013 00:00:00+0000 |

Now wee need to pull the "correct" value into each column. For that we combine an aggregate with a case. The case returns null for all rows for which the condition is not met. the aggregate ignores nulls. That leaves the one value we are looking for:

Query 2:

SELECT timestamp,
       MAX(CASE WHEN rel_id = 1 THEN y END ) AS "1",
       MAX(CASE WHEN rel_id = 2 THEN y END ) AS "2"
  FROM your_query
 GROUP BY timestamp
 ORDER BY timestamp

Results:

|                      TIMESTAMP |      1 |      2 |
|--------------------------------|--------|--------|
| January, 01 2013 00:00:00+0000 |      a |      e |
| January, 02 2013 00:00:00+0000 |      b | (null) |
| January, 03 2013 00:00:00+0000 |      c | (null) |
| January, 04 2013 00:00:00+0000 |      d |      f |
| January, 06 2013 00:00:00+0000 | (null) |      g |

To make this work with your original query just replace your_query in my example with

(
(SELECT rel_id, timestmap, y FROM table_1 AS full_
WHERE full_.timestamp BETWEEN %s AND %s
ORDER BY full_.timestamp)

UNION ALL

(SELECT rel_id, timestamp, y FROM table_2 AS full_
WHERE full_.timestamp BETWEEN %s AND %s
ORDER BY full_.timestamp)

UNION ALL

...
) AS your_query

Related Solutions

Mysql – How to Perform Two Simple Queries using Select Value from First Query

Why not actually perform this as

SELECT ID INTO @ReturnedID FROM table WHERE column_b = 'a value';
select * from table as a where id = @ReturnedID
union all  
select * from table as a where id < @ReturnedID order by id desc limit 5
union all  
select * from table as a where id >= @ReturnedID order by id asc limit 15
;

PostgreSQL insert into table (not origin) based on a condition on fields on different tables

You seem to be under the impression that some kind of automatic "row number" would exist. That is not the case. Unlike rows in a spreadsheed, tables in a relational database have no natural order.

This query should do the job, but it relies on the the contents of name_1 and name_2 to make the connection. If you rely on a row number, you have to add an actual column for that.

INSERT INTO table2 (id1, id2, score1, score2, comment_string)
SELECT t1.id1, t1.id2, t1.score1, t1.score2
      ,CASE WHEN t3.name_1 IS NULL
            AND  t4.name_2 IS NULL THEN 'removed_because ...'
       ELSE END AS comment_string
FROM   table_1 t1
LEFT   JOIN table_3 t3 USING (name_1)
LEFT   JOIN table_4 t4 USING (name_2)
ORDER  BY id1; -- undeclared in Q

Based on the assumption that table_3.name_1 and table_4.name_2 are unique. Else, the query could create a "proxy cross join", possibly multiplying rows, if there are several matches.
More about this caveat in this related answer on SO.

Best Answer

Related Solutions

Mysql – How to Perform Two Simple Queries using Select Value from First Query

PostgreSQL insert into table (not origin) based on a condition on fields on different tables

Related Question