Postgresql – A costly approach to filtering on a calculated date

postgresql

This builds off my previously wonderfully answered question Running a complex query for every date in a range I've got an expanded problem and I'm certainly not solving it in a performant way.

The order table setup is the same. A table of repeating orders that can be overridden. There's another table of order items, each order item has a quantity, a product, lead time in days and relates to the order table.

CREATE TABLE orders (
    id integer NOT NULL,
    client_id integer NOT NULL,
    start_date date NOT NULL,
    end_date date,
    order_type character varying NOT NULL
);

CREATE TABLE order_items (
    id integer NOT NULL,
    order_id integer NOT NULL,
    product_id integer NOT NULL,
    quantity integer NOT NULL,
    lead_days integer NOT NULL  
);

The problem was to figure out which orders were "active" on a range of ship dates allowing "temporary" orders to override "standing orders" on each specific date. We solved that with a query like this.

INSERT INTO orders 
  (id, client_id, start_date, end_date, order_type) VALUES 
  (1, 1, '2014-02-05', NULL, 'standing'), 
  (2, 2, '2014-07-16', '2015-07-19', 'standing'),
  (3, 3, '2015-04-01', NULL, 'standing'),
  (4, 3, '2015-07-18', '2015-07-18', 'temporary'),
  (5, 4, '2015-04-01', NULL, 'standing'),
  (6, 4, '2015-07-18', '2015-07-18', 'temporary');

SELECT DISTINCT
    ON (client_id) *
FROM
    orders
WHERE
    start_date <= DATE '2015-07-18'
AND (
    end_date IS NULL
    OR end_date >= DATE '2015-07-18'
)
ORDER BY
    client_id,
    order_type DESC

| id | client_id |                 start_date |               end_date | order_type |
|----|-----------|----------------------------|------------------------|------------|
|  1 |         1 | February, 05 2014 00:00:00 |                 (null) |   standing |
|  2 |         2 |     July, 16 2014 00:00:00 | July, 19 2015 00:00:00 |   standing |
|  4 |         3 |     July, 18 2015 00:00:00 | July, 18 2015 00:00:00 |  temporary |
|  6 |         4 |     July, 18 2015 00:00:00 | July, 18 2015 00:00:00 |  temporary |

This answers the shipping side of the equation, what needs to go out when? What products and what quantities are shipping on any given date?

Now from the production side of the company I need to know what products should get started an a given date. This is affected by the active orders on the date each product is shipping.

I'm solving it like this.

SELECT *
FROM (
  SELECT
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    orders.order_type order_type,
    (DATE '2015-07-16' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
    -- find all order items that have active orders on the production_date + lead time
    DATE '2015-07-16' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2015-07-16' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work
WHERE
  -- reduce to order items that are active on their ship_date
  -- this step is why this query is so slow
  order_id in (
    SELECT DISTINCT ON (client_id) id
    FROM orders
    WHERE
      start_date <= ship_date and (end_date is NULL OR end_date >= ship_date)
      AND client_id = items_to_work.client_id
    ORDER BY client_id, order_type DESC
  )

| item_id | product_id | quantity | lead_days | order_id | client_id | order_type |              ship_date |
|---------|------------|----------|-----------|----------|-----------|------------|------------------------|
|       1 |          1 |        4 |         1 |        1 |         1 |   standing | July, 17 2015 00:00:00 |
|       2 |          2 |        5 |         2 |        2 |         2 |   standing | July, 18 2015 00:00:00 |
|       4 |          2 |        7 |         2 |        4 |         3 |  temporary | July, 18 2015 00:00:00 |
|       5 |          1 |        8 |         1 |        5 |         4 |   standing | July, 17 2015 00:00:00 |
|       6 |          2 |        9 |         2 |        6 |         4 |  temporary | July, 18 2015 00:00:00 |

Now I know that on '2015-07-16' I need to make 12 of product 1 and 21 of product 2. The problem is that this is super costly and unpredictable!

I tried to use the same technique from the orders query to determine active orders on the set of order items but I don't always get "temporary order" order items if their lead days are different. I also was able to reduce the dataset of order_items by filtering on the max(lead_days) but I left that out for simplicity. The real cost is in the subquery.

You can play around with the fiddle with data here.

Update
I tried a different approach which is a lot less SQL and more straight forward but actually more expensive. This cross joins the largest possible date range of active orders with the order items and filters for order items with active orders on their ship_date.

http://sqlfiddle.com/#!15/1e104/2

  SELECT
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    active_orders_by_date.order_id order_id,
    active_orders_by_date.client_id client_id,
    active_orders_by_date.order_type order_type,
    (DATE '2015-07-16' + order_items.lead_days) ship_date
  FROM (
    SELECT DISTINCT ON (ship_date, client_id)
      id order_id,
      ship_date,
      client_id,
      order_type
    FROM orders
    INNER JOIN generate_series(
      DATE '2015-07-16',
      DATE '2015-07-16' + (SELECT max(lead_days) FROM order_items),
      '1 day'
    ) ship_date
    ON start_date <= ship_date and (end_date IS NULL OR end_date >= ship_date)
    ORDER BY ship_date, client_id, order_type DESC
  ) active_orders_by_date
  INNER JOIN order_items
  ON active_orders_by_date.order_id = order_items.order_id
    AND DATE '2015-07-16' = (ship_date::date - lead_days)

| item_id | product_id | quantity | lead_days | order_id | client_id | order_type |              ship_date |
|---------|------------|----------|-----------|----------|-----------|------------|------------------------|
|       1 |          1 |        4 |         1 |        1 |         1 |   standing | July, 17 2015 00:00:00 |
|       2 |          2 |        5 |         2 |        2 |         2 |   standing | July, 18 2015 00:00:00 |
|       4 |          2 |        7 |         2 |        4 |         3 |  temporary | July, 18 2015 00:00:00 |
|       5 |          1 |        8 |         1 |        5 |         4 |   standing | July, 17 2015 00:00:00 |
|       6 |          2 |        9 |         2 |        6 |         4 |  temporary | July, 18 2015 00:00:00 |

Update
Changed the fiddeles to have dates that show off the standing order temporary order interactions with lead time.

Best Answer

WARNING! Depending on authors response, my answer could be way off. Awaiting his reponse.

reconbot, we need more info, because it seems like your entire subquery

SELECT DISTINCT ON (client_id) id
FROM orders
WHERE
  start_date <= ship_date and (end_date is NULL OR end_date >= ship_date)
  AND client_id = items_to_work.client_id
ORDER BY client_id, order_type DESC

is completely redundant. You appear to be re-checking for already checked conditions.

For example, try just running

SELECT *
FROM (
  SELECT
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
    -- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work;

and you get the exact same results. See the SQL Fiddle here. If you're still concerned about distinct client_ids, you could also use

SELECT *
FROM (
  SELECT DISTINCT ON (client_id)
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
    -- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work;

as in this SQL Fiddle.

So, I'm either misunderstanding, or your originally given data set doesn't reflect the significance of your sub-SELECT in the original question.

IGNORE EVERYTHING BELOW HERE...

Until we get clarification from the author...

I think I've got your solution. At the very least, I've made a SQLFiddle with the results, and it appears that it will be much less costly.

LATERAL sub-queries in PostgreSQL

I love chances when a LATERAL sub-query can be used to save some time in your query. Unfortunately, I think I'm pretty bad at explaining when and where it should be used, and I'm only OK at recognizing instances of when to use it. It just doesn't come up too often in my particular query designs.

Take a look at the Postgres documentation on LATERAL keyword for some ideas, and also I really like this SlideShare presentation by Markus Winand for helping to explain LATERAL a bit better. In essence, it has a flavor of a "for each" statement in typical pseudo-coding vernacular.

The reason I looked into it for your case was: you were building the items_to_work table, and then using the client_id attribute of items_to_work in your sub-SELECT, where you checked if items_to_work.order_id was IN the distinct returned set of orders.id values. Using items_to_work attributes in the WHERE clause of the sub-SELECT was the red flag for me.

OK, so I realize the explanation of my motivation isn't so hot, sorry! :P On to the results...

New Query

Without further ado, here it is:

SELECT *
FROM (
  SELECT
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
-- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work,
LATERAL(
  SELECT id FROM orders
  WHERE orders.client_id = items_to_work.client_id
  AND start_date <= ship_date and (end_date is NULL OR end_date >= ship_date)
  AND items_to_work.order_id = orders.id) lat;

Also, you can note that we get back identical results as those you are expecting, except for a single added column on the right side, as part of the LATERAL selection, which is a replication of the orders.id. I'm not super experienced with LATERAL (it doesn't come up too often in my work), so if that's a problem for you, we can sort out a way to drop it. :P

EXPLAIN results

So, we don't have the big data set which you have, so that we can really test out the results of the new query versus the old. I'm relying on the EXPLAIN estimates here, but...

Using the Old SQLFiddle, we can see that the overall estimated cost is approximately 283,000 Postgres units. :P

Using the New SQLFiddle, we can see the much better estimated cost of only 204!!