Postgresql – Left join without duplicate rows

greatest-n-per-grouppostgresql

I have two tables called record and record_history. For each record, there could be more than one history. They can be joined by id and record_id. I want to get all the record's entries with recent record_history data. I have created the query like,

SELECT rec.id, rec.name, rech1.data AS last_history_data
FROM record rec
LEFT OUTER JOIN record_history rech1 ON (rec.id = rech1.record_id)
LEFT OUTER JOIN record_history rech2 ON (rec.id = rech2.record_id AND rech2.ts > rech1.ts)
WHERE rech2.id IS NULL
ORDER BY rec.id DESC

Here, I am getting the latest one by ts. This works as long as there are no duplicate ts entries. If the recent timestamp is repeated in record_history, this query returns more than one row for a record. How can we apply the limit here on the left join to restrict duplicate rows?

Best Answer

Unless you are in a very old version of Postgres, you don't need the double join. You can get the same result by using a LATERAL join.

The duplicate results can be avoided in your method by adding a second condition besides the rec.id = rech2.record_id. With the LATERAL join method, the use of LIMIT is avoiding it anyway. There can be only 1 row returned from the lateral subquery. We can add a second condition so the choice is deterministic (from the two or more rows with same timestamp):

SELECT rec.id, rec.name, rech.data AS last_history_data
FROM record AS rec
     LEFT OUTER JOIN LATERAL
     ( SELECT rech.data
       FROM record_history AS rech
       WHERE rec.id = rech.record_id
       ORDER BY rech.ts DESC
                -- ,rech.id DESC               -- optional
       LIMIT 1 
     ) AS rech
     ON TRUE
ORDER BY rec.id DESC ;

Regarding how to do this with the original method (2 joins and IS NULL check), you could change the ON condition - assuming there is an id column in history table so that (id) or at least (ts, id) is unique:

LEFT OUTER JOIN record_history rech2 
ON rec.id = rech2.record_id 
   AND (rech2.ts > rech1.ts OR rech2.ts = rech1.ts AND rech2.id > rech1.id)

By the way, you could replace that second LEFT join and IS NULL check with a NOT EXISTS subquery with same results and possibly similar efficiency (or even with a NOT IN subquery although that needs extra care for nullable columns, not recommended).