Sql-server – Correct way to group by month and year

performancequery-performancesql serversql-server-2012

I need to group by month (and year) and I was thinking about:

GROUP BY CAST(YEAR(tDate) AS NVARCHAR(4)) + '-' + CAST(MONTH(tDate) AS NVARCHAR(2))

But I found in the web something like:

GROUP BY YEAR(tDate), Month(tDate)

Both are equivalent? It is better to use the second?

Best Answer

You should go with the second approach. All of the string concatenation and type conversion just adds unnecessary CPU overheard to the query.

I can try this out on the Stack Overflow schema. I'm going to query the user's table, and create a supporting nonclustered index on the date field:

CREATE NONCLUSTERED INDEX IX_CreationDate ON dbo.Users (CreationDate);

Here's an example of the first approach:

SELECT 
    CAST(YEAR(u.CreationDate) AS NVARCHAR(4)) 
        + '-' 
        + CAST(MONTH(u.CreationDate) AS NVARCHAR(2)), 
    COUNT(*)
FROM dbo.Users u
GROUP BY 
    CAST(YEAR(u.CreationDate) AS NVARCHAR(4)) 
        + '-' 
        + CAST(MONTH(u.CreationDate) AS NVARCHAR(2));

And the plan and stats:

plan for query 1

Table 'Worktable'. Scan count 0, logical reads 0
Table 'Workfile'. Scan count 0, logical reads 0
Table 'Users'. Scan count 1, logical reads 675

 SQL Server Execution Times:
   CPU time = 203 ms,  elapsed time = 258 ms.

Compare that to the second approach:

SELECT 
    YEAR(u.CreationDate), Month(u.CreationDate), 
    COUNT(*)
FROM dbo.Users u
GROUP BY YEAR(u.CreationDate), Month(u.CreationDate);

Here are the plan and stats for that one:

plan for query 2

Table 'Worktable'. Scan count 0, logical reads 0
Table 'Workfile'. Scan count 0, logical reads 0
Table 'Users'. Scan count 1, logical reads 675

 SQL Server Execution Times:
   CPU time = 125 ms,  elapsed time = 229 ms.

As you can see, the string concat / cast approach uses slightly more CPU. Everything else is the same.


By the way, the dbo.Users table only has about 300,000 rows in it. Going through this same exercise on the dbo.Posts table, which has about 3,700,000 rows in it, the query goes parallel, and the CPU savings for the second approach become much more significant.

So not only is the 2nd approach better in terms of CPU in the general case, it also appears to scale better as data size increases.