PostgreSQL Pivot – Subtracting Sets Within a Grouped Column

pivot postgresql

I'm not the best at explaining, but I have a table with the following format:

CREATE TABLE foo
AS
  SELECT type,date::date,tp,price
  FROM ( VALUES
    ( 'A', '2010-10-01', 1, 0.05 ),
    ( 'A', '2010-10-01', 2, 1.04 ),
    ( 'B', '2010-10-01', 1, 0.53 ),
    ( 'B', '2010-10-01', 2, 1.04 ),
    ( 'C', '2010-10-01', 1, 0.05 ),
    ( 'C', '2010-10-01', 2, 1.02 ),
    ( 'D', '2010-10-01', 1, 0.05 ),
    ( 'D', '2010-10-01', 2, 1.08 )
  ) AS t(type,date,tp,price);

What I want to do is subtract the prices of different types where the date and tp are the same. That would be A-B, A-C, A-D, B-A, B-C, B-D, C-A, C-B, C-D, D-A, D-B, D-C.

To me, this seems like I would want a wide-format table with the columns date, tp, A, B, C, D, and then do a column-wise subtraction for each possible combination. If the result of a subtraction is less than 0, the value should be 0.

The desired output looks something like this:

combo | date       | tp | price
------+------------+----+-------
A_B   | 2010-10-01 | 1  | 0
A_B   | 2010-10-01 | 2  | 0
A_C   | 2010-10-01 | 1  | 0
A_C   | 2010-10-01 | 2  | 0.02
A_D   | 2010-10-01 | 1  | 0
A_D   | 2010-10-01 | 2  | 0
and so on for all the combinations.

Should I be looking at using crosstab? Or is there a simpler/more elegant solution? My current solution is a view that uses CTE(s) to create all the possible dates and type combos, and then I have a function that goes through every typeA and typeB, date, tp combo. It is very slow.

The table in question has 2,857,658 rows.

Best Answer

Simple self-join

What you want is something like this...

SELECT
  ARRAY[f1.type,f2.type] AS type,
  date,
  tp,
  greatest(f1.price-f2.price, '0.00') AS price
FROM foo AS f1
INNER JOIN foo AS f2
  USING (date, tp)
WHERE f1.type <> f2.type
ORDER BY f1.type, f2.type, tp;

I deviated a bit from what you asked for. Generally, you don't want string concatenation like that: a label such as A_B has to be split apart again before you can filter or join on either type, whereas an array keeps both values separately addressable. If you do want string concatenation, just use f1.type || '_' || f2.type.

 type  |    date    | tp | price 
-------+------------+----+-------
 {A,B} | 2010-10-01 |  1 |  0.00
 {A,B} | 2010-10-01 |  2 |  0.00
 {A,C} | 2010-10-01 |  1 |  0.00
 {A,C} | 2010-10-01 |  2 |  0.02
 {A,D} | 2010-10-01 |  1 |  0.00
 {A,D} | 2010-10-01 |  2 |  0.00
 ...
 (24 rows)
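
If you would rather have the A_B-style label from the question, the same self-join works with the concatenation swapped in. This is a minimal sketch, assuming the foo table defined in the question; the combo column name is taken from the desired output, and the ORDER BY is only there to make the result easy to read.

```sql
-- Same self-join as above, but with a text label instead of an array.
SELECT
  f1.type || '_' || f2.type AS combo,          -- e.g. 'A_B'
  date,
  tp,
  greatest(f1.price - f2.price, 0) AS price    -- negative differences become 0
FROM foo AS f1
INNER JOIN foo AS f2
  USING (date, tp)                             -- pair rows with the same date and tp
WHERE f1.type <> f2.type                       -- skip A_A, B_B, ...
ORDER BY combo, date, tp;
```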

Crosstab

Crosstab does something totally different. It does not compute anything new; it only ever changes the display, by pivoting the tp values into columns.

SELECT *
FROM crosstab($$
  SELECT
    ARRAY[f1.type,f2.type] AS type,
    date,
    tp,
    greatest(f1.price-f2.price, '0.00') AS price
  FROM foo AS f1
  INNER JOIN foo AS f2
    USING (date, tp)
  WHERE f1.type <> f2.type
  ORDER BY f1.type, f2.type, tp;
$$, $$VALUES (1),(2)$$
) AS t(type text[],"date" date,tp1 numeric,tp2 numeric);

Produces...

 type  |    date    | tp1  | tp2  
-------+------------+------+------
 {A,B} | 2010-10-01 | 0.00 | 0.00
 {A,C} | 2010-10-01 | 0.00 | 0.02
 {A,D} | 2010-10-01 | 0.00 | 0.00
 {B,A} | 2010-10-01 | 0.48 | 0.00
 {B,C} | 2010-10-01 | 0.48 | 0.02
 {B,D} | 2010-10-01 | 0.48 | 0.00
 {C,A} | 2010-10-01 | 0.00 | 0.00
 {C,B} | 2010-10-01 | 0.00 | 0.00
 {C,D} | 2010-10-01 | 0.00 | 0.00
 {D,A} | 2010-10-01 | 0.00 | 0.04
 {D,B} | 2010-10-01 | 0.00 | 0.04
 {D,C} | 2010-10-01 | 0.00 | 0.06
(12 rows)

That gives 12 rows rather than 24, because the pivot puts the tp = 1 and tp = 2 prices side by side in the tp1 and tp2 columns. Pick your poison: what kind of display do you want? I always prefer not using crosstab unless someone requires it.
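
Note that crosstab is not built in; it ships with the tablefunc extension from PostgreSQL's contrib modules. Assuming the contrib package is installed on the server and you have sufficient privileges, enabling it once per database looks like this:

```sql
-- crosstab() lives in the tablefunc contrib extension; enable it once per database.
CREATE EXTENSION IF NOT EXISTS tablefunc;
```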

Performance questions

My current solution is a view that uses CTE(s) to create all the possible dates and type combos, and then I have a function that goes through every typeA and typeB, date, tp combo. It is very slow.

We would need to actually see the views, and the output of EXPLAIN ANALYZE, to know why it runs slowly. Try the query I suggested at the top and see if it speeds things up.
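
To collect that output, prefix the statement with EXPLAIN. Here is a sketch using the self-join from the top of this answer; the BUFFERS option is my addition and can be dropped. Keep in mind that ANALYZE actually executes the query, so on a table with millions of rows it may take a while.

```sql
-- Runs the query and reports the real plan, row counts, and timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  ARRAY[f1.type, f2.type] AS type,
  date,
  tp,
  greatest(f1.price - f2.price, 0) AS price
FROM foo AS f1
INNER JOIN foo AS f2
  USING (date, tp)
WHERE f1.type <> f2.type;
```

Running the same command against a SELECT from your existing view gives a plan to compare against.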

Related Solutions

Postgresql – How to change schema so that account_id reference is unique among 3 tables

Start with:

- Create a ChartOfAccounts table with the Account code as Primary Key.
- Add a Foreign Key constraint to ChartOfAccounts on all tables with an AccountCode field.
- Use an IsDebit field, not the numeric sign, to distinguish Debits from Credits and reserve negative signs for transaction reversals (if used at all). This is necessary in order to generate T-Balances and Trial Balances properly from your Journal and Ledger.
- Create a Journal table with Primary Key: TransactionType, PostingDate, Account, SubledgerCode, IsDebit and minimum attributes of: Amount, CreatedDate, CreatedBy, DocumentReference.
- Design and spec a stored procedure (or type of routine suitable for where your business logic is located) for each type of transaction to be handled by the system. For your system these might be:
  - Ticket Purchase for Cash
  - Ticket Purchase on Account
  - Prize Payout in Cash
  - etc.
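
The first four steps could be sketched in PostgreSQL roughly as follows. The key and attribute columns come from the list above, but the snake_case names, all column types, and the account_name column are my assumptions and should be adapted to your schema.

```sql
-- Chart of accounts: one row per account code.
CREATE TABLE chart_of_accounts (
    account_code text PRIMARY KEY,
    account_name text NOT NULL            -- hypothetical descriptive column
);

-- Journal: chronological transaction details, keyed as described above.
CREATE TABLE journal (
    transaction_type   text          NOT NULL,
    posting_date       date          NOT NULL,
    account            text          NOT NULL
        REFERENCES chart_of_accounts (account_code),
    subledger_code     text          NOT NULL,
    is_debit           boolean       NOT NULL,  -- debit/credit flag, not the sign
    amount             numeric(14,2) NOT NULL,  -- negative only for reversals
    created_date       timestamptz   NOT NULL DEFAULT now(),
    created_by         text          NOT NULL,
    document_reference text          NOT NULL,
    PRIMARY KEY (transaction_type, posting_date, account,
                 subledger_code, is_debit)
);
```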

Please note that I am a CGA, CPA in addition to being primarily a professional developer.

Update - Terminology:

A Journal is a chronological list of the details of all transactions of a given type, such as Cash Receipts, Cash Disbursements, Sales, etc.
A Ledger is a listing By Account of the aggregates of all transactions in a given time period.

It is occasionally necessary or expedient, when a wide variety of transaction types will be supported by the system (or to increase parallelism, as when many clerks need to be working at once, Bob Cratchit style), to have multiple Journal files with different structures.

In a modern SQL Server system with only one Journal, the Ledger could be defined as an Indexed View on the Journal. This would eliminate the need for either a trigger on the Journal to update the Ledger, or a batch-processing design.

Also, it is acceptable to have separate DrAmount and CrAmount columns in place of an IsDebit flag and single Amount column.

PostgreSQL – How to Delete Duplicate Records Efficiently

The core feature is the window function lag().
Also, pay special attention to avoiding deadlocks and race conditions with concurrent deletes and inserts (which can affect which rows to delete!):

CREATE OR REPLACE FUNCTION remove_vendor_price_dupes(_vendor int)
  RETURNS integer AS
$func$
DECLARE
   del_ct int;
BEGIN
   -- this may or may not be necessary:
   -- lock rows to avoid race conditions with concurrent deletes
   PERFORM 1
   FROM   vendor_prices
   WHERE  vendor = _vendor
   ORDER  BY sku, effective_date, id  -- guarantee row locks in consistent order
   FOR    UPDATE;

   -- delete redundant prices
   DELETE FROM vendor_prices v
   USING (
      SELECT id
           , price = lag(price) OVER w  -- same as last row
             AND (lead(id) OVER w) IS NOT NULL AS del  -- not last row
      FROM   vendor_prices
      WHERE  vendor = _vendor
      WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
      ) d
   WHERE v.id = d.id
   AND   d.del;

   GET DIAGNOSTICS del_ct = ROW_COUNT;  -- optional:
   RETURN del_ct;  -- return number of deleted rows
END
$func$  LANGUAGE plpgsql;

Call:

SELECT remove_vendor_price_dupes(1);
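
Before running the function, it can help to preview which rows it would remove. The sketch below runs the function's inner subquery on its own for vendor 1 (the same example id as in the call above). It assumes the vendor_prices columns used in the function and only reads data.

```sql
-- Preview only: list the rows the function would delete for vendor 1.
SELECT *
FROM (
   SELECT id, sku, effective_date, price
        , price = lag(price) OVER w                     -- same price as previous row
          AND (lead(id) OVER w) IS NOT NULL AS del      -- and not the last row per sku
   FROM   vendor_prices
   WHERE  vendor = 1
   WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
   ) d
WHERE  d.del
ORDER  BY sku, effective_date, id;
```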

Notes

The current version of the 9.3 major release is 9.3.6. The project recommends that ...

all users run the latest available minor release for whatever major version is in use.
A multicolumn index on (vendor, sku, effective_date, id) would be perfect for this - in this particular order. But Postgres can combine indexes rather efficiently, too.
It might pay to add the otherwise irrelevant price as the last item of the index to get index-only scans out of this. You'll have to test.
Since you have concurrent deletes it may be a good idea to run a separate delete per vendor to reduce the potential for race conditions and deadlocks. Since there are only a few vendors, this seems like a reasonable partitioning. (Many tiny calls would be comparatively slow.)
I am running a separate SELECT (PERFORM in plpgsql, since we do not use the result) because the row locking clause FOR UPDATE cannot be used together with window functions. Don't let the keyword mislead you, this is not just for updates. I am locking all rows for the given vendor, since the result depends on all rows. Concurrent reads are not impaired, only concurrent writes have to wait until we are done. That's another reason why deleting rows for one vendor at a time in a separate transaction should be best.
sku is unique per product, so we can PARTITION BY it.
ORDER BY effective_date, id: your first version of the question included code for duplicate rows, so I added id to ORDER BY as additional tie breaker. This way it works for duplicates on (sku, effective_date) as well.
To preserve the last row for each set: AND (lead(id) OVER w) IS NOT NULL. Reusing the same window for lead() is cheap - independent of the added explicit WINDOW clause - that's just syntax shorthand for convenience.
I am locking rows in the same order: ORDER BY sku, effective_date, id. Make sure that concurrent DELETEs operate in the same order to avoid deadlocks. If all other transactions delete no more than a single row within the same transaction, there cannot be deadlocks and you don't need the row locking at all.
If concurrent INSERTs could lead to a different result (make different rows obsolete), you have to lock the whole table in EXCLUSIVE mode instead to avoid race conditions:
```
LOCK TABLE vendor_prices IN EXCLUSIVE MODE;
```
Do that only if it's necessary. It blocks all concurrent write access.
I am returning the number of rows deleted, but that's totally optional. You might as well return nothing and declare the function as RETURNS void.
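
The multicolumn index suggested in the notes could be created as follows. The index names are hypothetical, and whether the variant with price actually yields index-only scans is something to test on your data, as noted above. You would normally create only one of the two.

```sql
-- Index matching the function's access pattern, in this column order.
CREATE INDEX vendor_prices_vendor_sku_date_id_idx
    ON vendor_prices (vendor, sku, effective_date, id);

-- Variant with price appended, to make index-only scans possible.
CREATE INDEX vendor_prices_vendor_sku_date_id_price_idx
    ON vendor_prices (vendor, sku, effective_date, id, price);
```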