Sql-server – The fastest way to get the most recent records

greatest-n-per-groupperformancequery-performancesql servert-sql

I am looking for the fastest way to return the most recent record for each reference number.

I quite liked the solution from BrentOzar.com, but it does not seem to work when I add a third condition (SequenceId). It appears to work only when I specify the Id and the creation date.

To understand my problem, You will need to create the modified sample table, which is essentially a copy of the table on the above referenced website but with a little twist.

CREATE TABLE [dbo].[TestTable](
      [Id] [int] NOT NULL,
      [EffectiveDate] [date] NOT NULL,
      [SequenceId] [bigint] IDENTITY(1,1) NOT NULL,
      [CustomerId] [varchar](50) NOT NULL,
      [AccountNo] [varchar](50) NOT NULL,
CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED 
(
      [Id] ASC,
      [EffectiveDate] ASC,
      [SequenceId] ASC
) ON [PRIMARY]
) ON [PRIMARY]
GO

SET IDENTITY_INSERT [dbo].[TestTable] ON
INSERT [dbo].[TestTable] ([Id], [EffectiveDate], [SequenceId], [CustomerId], [AccountNo]) 
VALUES (1, CAST(0xDF300B00 AS Date), 1, N'Blue', N'Green')
INSERT [dbo].[TestTable] ([Id], [EffectiveDate], [SequenceId], [CustomerId], [AccountNo]) 
VALUES (2, CAST(0xDF300B00 AS Date), 2, N'Yellow', N'Blue')
INSERT [dbo].[TestTable] ([Id], [EffectiveDate], [SequenceId], [CustomerId], [AccountNo]) 
VALUES (1, CAST(0xE0300B00 AS Date), 3, N'Red', N'Yellow')
INSERT [dbo].[TestTable] ([Id], [EffectiveDate], [SequenceId], [CustomerId], [AccountNo]) 
VALUES (3, CAST(0xE0300B00 AS Date), 4, N'Green', N'Purple')
INSERT [dbo].[TestTable] ([Id], [EffectiveDate], [SequenceId], [CustomerId], [AccountNo]) 
VALUES (1, CAST(0xE1300B00 AS Date), 5, N'Orange', N'Purple')
INSERT [dbo].[TestTable] ([Id], [EffectiveDate], [SequenceId], [CustomerId], [AccountNo]) 
VALUES (2, CAST(0xE3300B00 AS Date), 6, N'Blue', N'Orange')
INSERT [dbo].[TestTable] ([Id], [EffectiveDate], [SequenceId], [CustomerId], [AccountNo]) 
VALUES (3, CAST(0xE6300B00 AS Date), 7, N'Red', N'Blue')
SET IDENTITY_INSERT [dbo].[TestTable] OFF
GO

If I run a similar query as on the website, I get exactly the same result.

SELECT tt.*
FROM dbo.TestTable tt
LEFT OUTER JOIN dbo.TestTable ttNewer
ON tt.id = ttNewer.id AND tt.EffectiveDate < ttNewer.EffectiveDate
WHERE ttNewer.id IS NULL

However, the little twist is that I added a SequenceId column to the table as you may have noticed. The purpose of this column is because the client may want to do a post dated entry for a date in the past. This entry must supercede the other entries made on the same date which is in the past. If I run the query before I add the the post dated entries, I get the same result as previously.

SELECT tt.*
FROM dbo.TestTable tt
LEFT OUTER JOIN dbo.TestTable ttNewer
ON    ( 
            tt.id = ttNewer.id AND tt.EffectiveDate < ttNewer.EffectiveDate
            AND tt.SequenceId < ttNewer.SequenceId
      )
WHERE ttNewer.Id IS NULL

If I add two post dated entries as shown below, then I start to get interesting results.

INSERT INTO TestTable(Id,EffectiveDate,CustomerId,AccountNo) values
(
      2,'20090103','Blue','Orange'
);

INSERT INTO TestTable(Id,EffectiveDate,CustomerId,AccountNo) values
(
      2,'20090105','Blue','Orange'
);

What you should notice is that the two queries below no longer return the last record whether I use the one which is similar to what is on your website or I use the one which adds another condition (SequenceId)

SELECT tt.*
FROM dbo.TestTable tt
LEFT OUTER JOIN dbo.TestTable ttNewer
ON tt.id = ttNewer.id AND tt.EffectiveDate < ttNewer.EffectiveDate
WHERE ttNewer.id IS NULL

SELECT tt.*
FROM dbo.TestTable tt
LEFT OUTER JOIN dbo.TestTable ttNewer
ON    ( 
            tt.id = ttNewer.id AND tt.EffectiveDate < ttNewer.EffectiveDate
            AND tt.SequenceId < ttNewer.SequenceId
      )
WHERE ttNewer.Id IS NULL

What I would like the query to do is to return the last record for a reference number (Id) based on the last sequence number for any given day. In other words, the record with the last sequence number on the most recent EffectiveDate.

Best Answer

Self-joins seem cheap at low row counts, but I/O is exponential as the row count increases. I would prefer to solve this the CTE way, unless you are on SQL Server 2000 (please always specify the version you need to support, using a version-specific tag):

;WITH cte AS 
(
  SELECT Id, EffectiveDate, SequenceId, CustomerId, AccountNo, 
    rn = ROW_NUMBER() OVER (PARTITION BY Id 
      ORDER BY EffectiveDate DESC, SequenceId DESC)
  FROM dbo.TestTable
)
SELECT Id, EffectiveDate, SequenceId, CustomerId, AccountNo
  FROM cte
  WHERE rn = 1
  ORDER BY Id; -- because you can't rely on sorting without ORDER BY

This still has to scan, but it only has to scan once, compared to all the self-join variants, which will always have two scans (or potentially a scan and a seek executed multiple times, with better indexes).

If you want a more efficient query (eliminating an expensive sort, at the potential cost of writes, and perhaps other queries that don't need to support this sorting), change the primary key to match the query pattern:

PRIMARY KEY CLUSTERED 
(
  [Id] ASC,
  [EffectiveDate] DESC,
  [SequenceId] DESC
)

The direction of the secondary columns has no effect on uniqueness, and should change writes minimally as long as the table isn't terribly wide and there aren't an extremely large number of rows per id.