SQL Server – GROUP BY with MAX versus just MAX

Tags: group-by, performance, query-performance, sql-server

I am a programmer dealing with a big table that has the following schema:

UpdateTime, PK, datetime, not null
Name, PK, char(14), not null
TheData, float

There is a clustered index on (Name, UpdateTime).

I was wondering which should be faster:

SELECT MAX(UpdateTime)
FROM [MyTable]

or

SELECT MAX([UpdateTime]) AS value
FROM
   (
    SELECT [UpdateTime]
    FROM [MyTable]
    GROUP BY [UpdateTime]
   ) AS t

The inserts to this table arrive in chunks of 50,000 rows that all share the same date, so I thought grouping might ease the MAX calculation.

Instead of finding the MAX over 150,000 rows, wouldn't grouping down to 3 rows and then taking the MAX of those be faster? Is my assumption correct, or is GROUP BY also costly?
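
For reference, I believe the GROUP BY subquery is equivalent to a DISTINCT, which I would expect the optimizer to treat the same way:

SELECT MAX([UpdateTime]) AS value
FROM
   (
    SELECT DISTINCT [UpdateTime]
    FROM [MyTable]
   ) AS t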

Best Answer

I created the table big_table according to your schema:

create table big_table
(
    updatetime datetime not null,
    name char(14) not null,
    TheData float,
    primary key(Name,updatetime)
)
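
A side note on the schema: because the table has no other clustered index, this primary key is created as the clustered index, which matches the clustered index on (Name, UpdateTime) from the question. Spelled out explicitly (the constraint name pk_big_table is just an illustration), the same table would be:

create table big_table
(
    updatetime datetime not null,
    name char(14) not null,
    TheData float,
    constraint pk_big_table primary key clustered (name, updatetime)
)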

I then filled the table with 50,000 rows with this code:

DECLARE @rownum bigint = 1;
WHILE (1 = 1)
BEGIN
    SET @rownum = @rownum + 1;
    -- 'name' + the number padded to 10 characters fills the char(14) column exactly
    INSERT INTO big_table VALUES (GETDATE(), 'name' + CAST(@rownum AS char(10)), CAST(@rownum AS float));
    IF @rownum > 50000
        BREAK;
END
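
The question actually describes 150,000 rows arriving in three chunks that each share one date. To reproduce that distribution, a variant of the fill loop could look like this (the base date and the day-per-chunk layout are assumptions on my part):

DECLARE @chunk int = 0;
DECLARE @rownum bigint = 0;
DECLARE @i int;
DECLARE @chunkdate datetime;

WHILE @chunk < 3
BEGIN
    SET @chunkdate = DATEADD(DAY, @chunk, '20130101'); -- assumed base date
    SET @i = 0;
    WHILE @i < 50000
    BEGIN
        SET @rownum = @rownum + 1;
        SET @i = @i + 1;
        INSERT INTO big_table VALUES (@chunkdate, 'name' + CAST(@rownum AS char(10)), CAST(@rownum AS float));
    END
    SET @chunk = @chunk + 1;
END

With this data, the GROUP BY subquery really does collapse 150,000 rows down to 3 before the MAX is taken.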

Using SSMS, I then tested both queries against this table, making sure both computed the MAX of updatetime so the comparison was like-for-like.

set statistics time on -- execution time
set statistics io on -- io stats (how many pages read, temp tables)

-- query 1
SELECT MAX([UpdateTime])
FROM big_table

-- query 2
SELECT MAX([UpdateTime]) AS value
FROM
   (
    SELECT [UpdateTime]
    FROM big_table
    GROUP BY [UpdateTime]
   ) AS t


set statistics time off
set statistics io off

Using STATISTICS TIME, I get back the number of milliseconds required to parse, compile, and execute each statement.

Using STATISTICS IO, I get back information about disk activity.

STATISTICS TIME and STATISTICS IO provide useful information, such as whether temporary tables were used (indicated by a Worktable entry) and how many logical reads occurred, which is the number of database pages read from cache.

I then turned on the actual execution plan with Ctrl+M (Include Actual Execution Plan) and executed the batch with F5.

This will provide a comparison of both queries.

Here is the output of the Messages tab:

-- Query 1

Table 'big_table'. Scan count 1, logical reads 543, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 16 ms, elapsed time = 6 ms.

-- Query 2

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'big_table'. Scan count 1, logical reads 543, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 35 ms.

Both queries result in 543 logical reads, but the second query has an elapsed time of 35 ms whereas the first has only 6 ms. You will also notice that the second query uses a temporary table in tempdb, indicated by the word Worktable. Even though all the values for Worktable are 0, work was still done in tempdb.

Then there is the output from the actual execution plan tab, next to the Messages tab:

[Image: actual execution plans for both queries, showing relative batch cost]

According to the execution plans provided by MSSQL, the second query has a total batch cost of 64% whereas the first costs only 36% of the batch, so the first query requires less work.

Using SSMS, you can test and compare your queries and find out exactly how MSSQL parses them and which objects (tables, indexes, and/or statistics), if any, are used to satisfy them.
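
A side note beyond the original comparison: because the clustered index leads on Name, neither query can seek straight to the newest UpdateTime. If MAX(UpdateTime) is a frequent query, an index leading on updatetime should let the optimizer answer it with a single-row backward scan instead of reading all 543 pages. A minimal sketch (the index name is illustrative):

CREATE NONCLUSTERED INDEX ix_big_table_updatetime
    ON big_table (updatetime);

-- The plan for this should become a TOP (1) backward scan on the new index
SELECT MAX(updatetime)
FROM big_table;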

One additional side note to keep in mind when testing is to clean out the cache beforehand, if possible. This helps ensure that the comparisons are accurate, which is important when thinking about disk activity. I start off with DBCC DROPCLEANBUFFERS and DBCC FREEPROCCACHE to clear out all the cache (a combined sketch follows the documentation below). Be careful, though, not to use these commands on a production server that is actually in use, as you will effectively force the server to reread everything from disk into memory.

Here is the relevant documentation.

  1. Clear the plan cache with DBCC FREEPROCCACHE
  2. Clear out everything from the buffer pool with DBCC DROPCLEANBUFFERS

Using these commands may not be possible depending on how your environment is used.
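
Put together, a cold-cache test run could start like this (CHECKPOINT first, so that dirty pages are written out and DBCC DROPCLEANBUFFERS can empty the buffer pool completely; again, test servers only):

-- Test servers only: start the comparison from a cold cache
CHECKPOINT;              -- flush dirty pages to disk
DBCC DROPCLEANBUFFERS;   -- empty the buffer pool (data cache)
DBCC FREEPROCCACHE;      -- empty the plan cache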

Updated 10/28 12:46pm

Made corrections to the execution plan image and statistics output.