Postgresql: compare record against sibling, but limit search only to active records

postgresqlwindow functions

I'm trying to find a way to compare records against it's sibling record while limiting the comparison only to the most recent entry for each set.

I know it does not make to much sense so I'll go with an example.

Table 1: A list of category price-sheet info.

WITH category_price_sheets(category_id, reg_price_sheet_num, bulk_price_sheet_num) AS (
      VALUES
      (1, 1600, 1601),
      (2, 1602, 1603)
),

Table 2: Item cost entries for both regular cost and bulk cost.

prices(id, price_sheet_num, item_id, cost, created_at) AS (
  VALUES
  (1, 1600, 88, 2.00, '2016-05-06 00:00:00'::timestamp),
  (2, 1601, 88, 2.10, '2016-05-06 00:00:00'),
  (3, 1600, 88, 2.05, '2016-05-07 00:00:00'),
  (4, 1600, 88, 2.00, '2016-05-08 00:00:00'),
  (5, 1601, 88, 1.88, '2016-05-08 00:00:00'),
  (6, 1600, 88, 2.05, '2016-05-09 00:00:00'),

  (7, 1602, 88, 2.00, '2016-05-06 00:00:00'),
  (8, 1603, 88, 2.10, '2016-05-06 00:00:00'),
  (9, 1602, 88, 2.05, '2016-05-07 00:00:00'),
  (10, 1602, 88, 2.00, '2016-05-08 00:00:00'),
  (11, 1603, 88, 2.07, '2016-05-08 00:00:00'),
  (12, 1602, 88, 2.05, '2016-05-09 00:00:00')
),

How can I identify bulk pricing that is more expensive than it's regular counterpart, but only in case they're both the latest entry for same item.

So in the example above the bulk cost for item_id 88 in price-sheet 1601 is indeed a better price than it's counterpart in price-sheet 1600 but the same is not true when comparing 1603 against 1602.

I tried the following on a small sample set, and the results were correct. But when I tried it on the whole table, I pulled the plug after two hours with no results. (I'm dealing with more than 50,000,000 records)

bulk AS (
  SELECT v.* FROM (SELECT price.*, sheet.reg_price_sheet_num, sheet.bulk_price_sheet_num,        
      lead(price.id) OVER (PARTITION BY price_sheet_num, item_id) IS NULL AS latesetrecord   
    FROM prices price 
    JOIN category_price_sheets sheet ON sheet.bulk_price_sheet_num = price.price_sheet_num
  ) AS v
WHERE v.latesetrecord
)

SELECT v.* FROM (SELECT price.*, sheet.reg_price_sheet_num, sheet.bulk_price_sheet_num,        
      lead(price.id) OVER (PARTITION BY price_sheet_num, item_id) IS NULL AS latesetrecord   
    FROM prices price 
    JOIN category_price_sheets sheet ON sheet.reg_price_sheet_num = price.price_sheet_num
      WHERE price.cost < (
        SELECT cost FROM bulk blk WHERE blk.price_sheet_num = sheet.bulk_price_sheet_num 
          AND blk.item_id = price.item_id
      )
  ) AS v
WHERE v.latesetrecord

I think this can be done with some WINDOW functions but I couldn't get it to go.

Thank you all!

Best Answer

Your best (or simplest) choice may be using DISTINCT ON:

WITH latest_prices AS (
  SELECT DISTINCT ON (price_sheet_num, item_id) *
  FROM prices
  ORDER BY price_sheet_num, item_id, created_at DESC
  )
SELECT lpr.*, s.reg_price_sheet_num, s.bulk_price_sheet_num
FROM category_price_sheets s
JOIN latest_prices lpr ON (lpr.price_sheet_num = s.reg_price_sheet_num)
JOIN latest_prices lpb ON (lpb.price_sheet_num = s.bulk_price_sheet_num)
WHERE lpb.item_id = lpr.item_id AND lpb.cost > lpr.cost;

Ensure first there's an index for prices ON (price_sheet_num, item_id, created_at), or better still, prices ON (price_sheet_num, item_id, created_at DESC);

If this doesn't preform well enough, you may benefit from creating a temporary table:

CREATE TEMP TABLE latest_prices (LIKE prices) ON COMMIT DROP;
INSERT INTO latest_prices
SELECT DISTINCT ON (price_sheet_num, item_id) *
FROM prices
ORDER BY price_sheet_num, item_id, created_at DESC;
CREATE UNIQUE INDEX ON latest_prices (price_sheet_num, item_id);

SELECT lpr.*, s.reg_price_sheet_num, s.bulk_price_sheet_num
FROM category_price_sheets s
JOIN latest_prices lpr ON (lpr.price_sheet_num = s.reg_price_sheet_num)
JOIN latest_prices lpb ON (lpb.price_sheet_num = s.bulk_price_sheet_num)
WHERE lpb.item_id = lpr.item_id AND lpb.cost > lpr.cost;