How about an indexed view?
CREATE VIEW dbo.LoginsByDate
WITH SCHEMABINDING
AS
    SELECT
        UserID,
        LoginDate  = CONVERT(DATE, LoginDate),
        LoginCount = COUNT_BIG(*)
    FROM dbo.LoginTable
    GROUP BY UserID, CONVERT(DATE, LoginDate);
GO
CREATE UNIQUE CLUSTERED INDEX IX_LoginsByDate
    ON dbo.LoginsByDate(UserID, LoginDate);
GO
What this does is pre-aggregate the login data for you (you pay the cost at write time) and flatten it into one row per UserID per LoginDate. Depending on how many times a user logs in on a given day, and how often you run these reports, this can be substantially more efficient than calculating those counts at query time (and much less hassle than running your own nightly job that aggregates the data and stores it somewhere).
Now, to get the total login events by day:
SELECT LoginDate, SUM(LoginCount)
FROM dbo.LoginsByDate
GROUP BY LoginDate;
To get the distinct logins by day:
SELECT LoginDate, COUNT(UserID)
FROM dbo.LoginsByDate
GROUP BY LoginDate;
Of course if you're looking for individual user activity, you can do that too, and you can easily aggregate the data to get whatever "total unique" really means to you. If UserID = 1 logs in today 4 times and tomorrow twice, is that 1 total unique (unique per entire time frame) or 2 total uniques (unique per day)?
If the former:
SELECT COUNT(DISTINCT UserID)
FROM dbo.LoginsByDate
WHERE LoginDate >= 'yyyymmdd'
AND LoginDate < 'yyyymmdd';
If the latter:
SELECT SUM(c)
FROM
(
    SELECT LoginDate, c = COUNT(UserID)
    FROM dbo.LoginsByDate
    WHERE LoginDate >= 'yyyymmdd'
      AND LoginDate < 'yyyymmdd'
    GROUP BY LoginDate
) AS x;
Best Answer
Here are a few methods you can compare. First let's set up a table with some dummy data. I'm populating this with a bunch of random data from sys.all_columns. Well, it's kind of random - I'm ensuring that the dates are contiguous (which is really only important for one of the answers).
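The author's actual setup script isn't shown here, so the following is a minimal, hypothetical stand-in (the names dbo.LoginHistory and #Daily are mine, not the author's); the method sketches further down all read from the #Daily table it produces:
CREATE TABLE dbo.LoginHistory
(
    UserID    INT  NOT NULL,
    LoginDate DATE NOT NULL
);

-- 5000 rows spread evenly across 501 contiguous days (no gaps),
-- seeded from sys.all_columns as described above:
INSERT dbo.LoginHistory (UserID, LoginDate)
SELECT TOP (5000)
    UserID    = c1.[object_id],
    LoginDate = DATEADD(DAY,
        CONVERT(INT, (ROW_NUMBER() OVER (ORDER BY c1.[object_id]) - 1) % 501),
        '20120101')
FROM sys.all_columns AS c1
CROSS JOIN sys.all_columns AS c2;
GO
-- Pre-aggregate to one row per day; the running-total sketches below
-- all operate on this 501-row set:
SELECT LoginDate, DailyCount = COUNT(*)
INTO #Daily
FROM dbo.LoginHistory
GROUP BY LoginDate;

CREATE UNIQUE CLUSTERED INDEX IX_Daily ON #Daily(LoginDate);
GO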
The data (5000 rows) will look slightly different on your system, depending on version and build number, and the running-totals output should contain 501 rows (one per date).
So the methods I am going to compare are:
self-join
This is the way people will tell you to do it when they're warning you to stay away from cursors, because "set-based is always faster." In some recent experiments I've found that the cursor outpaces this solution.
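A sketch against the hypothetical #Daily table from the setup above; every day joins to itself and all earlier days before aggregating, which is why the cost grows roughly quadratically with the number of days:
SELECT d.LoginDate,
       RunningTotal = SUM(d2.DailyCount)
FROM #Daily AS d
INNER JOIN #Daily AS d2          -- triangular join: one row per earlier day
    ON d2.LoginDate <= d.LoginDate
GROUP BY d.LoginDate
ORDER BY d.LoginDate;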
recursive cte with dates
Reminder - this relies on contiguous dates (no gaps), supports up to 10000 levels of recursion, and requires that you know the start date of the range you're interested in (to set the anchor). You could set the anchor dynamically using a subquery, of course, but I wanted to keep things simple.
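A sketch, again using the hypothetical #Daily table; the anchor is hard-coded to the known start date, and each recursive step fetches the next calendar day, which is exactly why a gap in the dates would break the chain:
;WITH x AS
(
    SELECT LoginDate, RunningTotal = DailyCount
    FROM #Daily
    WHERE LoginDate = '20120101'        -- the known start date (the anchor)
    UNION ALL
    SELECT d.LoginDate, x.RunningTotal + d.DailyCount
    FROM x
    INNER JOIN #Daily AS d              -- relies on contiguous dates
        ON d.LoginDate = DATEADD(DAY, 1, x.LoginDate)
)
SELECT LoginDate, RunningTotal
FROM x
OPTION (MAXRECURSION 10000);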
recursive cte with row_number
The ROW_NUMBER() calculation is slightly expensive here. Again, this supports a maximum recursion level of 10000, but you don't need to assign the anchor.
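A sketch; numbering the rows first removes the dependency on both contiguous dates and a hard-coded anchor:
;WITH n AS
(
    SELECT LoginDate, DailyCount,
           rn = ROW_NUMBER() OVER (ORDER BY LoginDate)
    FROM #Daily
),
x AS
(
    SELECT LoginDate, rn, RunningTotal = DailyCount
    FROM n
    WHERE rn = 1                        -- first row, whatever its date
    UNION ALL
    SELECT n.LoginDate, n.rn, x.RunningTotal + n.DailyCount
    FROM x
    INNER JOIN n ON n.rn = x.rn + 1
)
SELECT LoginDate, RunningTotal
FROM x
OPTION (MAXRECURSION 10000);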
recursive cte with temp table
Stealing from Mikael's answer, as suggested, to include this in the tests.
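Mikael's actual code is in his answer; the sketch below just captures the general idea: materialize the ROW_NUMBER() output into an indexed temp table first, so each recursive step becomes a cheap clustered index seek instead of re-evaluating the numbering:
SELECT LoginDate, DailyCount,
       rn = ROW_NUMBER() OVER (ORDER BY LoginDate)
INTO #numbered
FROM #Daily;

CREATE UNIQUE CLUSTERED INDEX IX_numbered ON #numbered(rn);

;WITH x AS
(
    SELECT LoginDate, rn, RunningTotal = DailyCount
    FROM #numbered
    WHERE rn = 1
    UNION ALL
    SELECT n.LoginDate, n.rn, x.RunningTotal + n.DailyCount
    FROM x
    INNER JOIN #numbered AS n ON n.rn = x.rn + 1   -- seek on the clustered index
)
SELECT LoginDate, RunningTotal
FROM x
OPTION (MAXRECURSION 10000);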
quirky update
Again, I am only including this for completeness; I personally wouldn't rely on this solution since, as I mentioned in another answer, this method is not guaranteed to work at all and may completely break in a future version of SQL Server. (I'm doing my best to coerce SQL Server into obeying the order I want, using a hint for the index choice.)
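For completeness only, a sketch of the pattern (note that the index hint merely asks for an ordered scan; nothing actually guarantees the UPDATE touches the rows in that order):
ALTER TABLE #Daily ADD RunningTotal INT;
GO
DECLARE @rt INT = 0;

-- Undocumented "variable + column" assignment in a single UPDATE:
UPDATE d
SET @rt = RunningTotal = @rt + DailyCount
FROM #Daily AS d WITH (INDEX(IX_Daily), TABLOCKX);

SELECT LoginDate, RunningTotal
FROM #Daily
ORDER BY LoginDate;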
cursor
"Beware, there be cursors here! Cursors are evil! You should avoid cursors at all costs!" No, that's not me talking, it's just stuff I hear a lot. Contrary to popular opinion, there are some cases where cursors are appropriate.
SQL Server 2012
If you are on the most recent version of SQL Server, enhancements to windowing functions allow us to easily calculate running totals without the exponential cost of the self-join (the SUM is calculated in one pass), the complexity of the CTEs (including the contiguous-rows requirement for the better-performing CTE), the unsupported quirky update, and the forbidden cursor. Just be wary of the difference between using RANGE and ROWS, or not specifying either at all: only ROWS avoids an on-disk spool, which will otherwise hamper performance significantly.
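A sketch of the 2012 syntax against the hypothetical #Daily table, with the frame spelled out explicitly:
SELECT LoginDate,
       RunningTotal = SUM(DailyCount) OVER
       (
           ORDER BY LoginDate
           ROWS UNBOUNDED PRECEDING    -- ROWS, not RANGE or the default
       )
FROM #Daily
ORDER BY LoginDate;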
performance comparisons
I took each approach and wrapped it in a batch using the following:
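The author's exact batch isn't shown; a plausible shape, consistent with the description (DBCC commands on every iteration, multiple loop iterations per method), would be something like the following, with the iteration count and the specific DBCC commands being my assumptions:
SET NOCOUNT ON;
-- start cold on every iteration (the "DBCC commands" mentioned below):
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;

-- <one of the running-total queries from above goes here>
GO 10   -- repeat the whole batch; 10 is a placeholder iteration count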
Here are the results of the total duration, in milliseconds (remember this includes the DBCC commands each time as well):
And I did it again without the DBCC commands:
Removing both the DBCC and loops, just measuring one raw iteration:
Finally, I multiplied the row count in the source table by 10 (changing the TOP to 50000 and adding another table to the cross join). The results of this, one single iteration with no DBCC commands (simply in the interests of time):
I only measured duration; I'll leave it as an exercise for the reader to compare these approaches on their own data, using other metrics that may be important (or may vary with their schema/data). Before drawing any conclusions from this answer, test against your own data and schema; these results will almost certainly change as the row counts get higher.
demo
I've added a sqlfiddle.
conclusion
In my tests, the choice would be:
But again, you should test these against your schema and data. Since this was a contrived test with relatively low row counts, it may as well be a fart in the wind. I've done other tests with different schema and row counts, and the performance heuristics were quite different... which is why I asked so many follow-up questions to your original question.
UPDATE
I've blogged more about this here:
Best approaches for running totals – updated for SQL Server 2012