SQL Server – How to Get First Login-Hit on Huge Table

sql server

I have table like this in my database :

ID (pk,int,not null)

UserID (int,null)

HitDate (datetime,null)

Every time a user has any transacation in my application one hit inserted in this table.

Now my client want to find out how many new users I have per day from the beginning of application.

As I find out I have to group user by "UserID" and get the first HitDate, then remove other hit dates for that user. The first one will be the first login into system. Then I count using this method of all user per day.

My problem is this table has a huge row number, near 1,800,000,000 rows !!!!

The only solution I find at the moment is to break it into few day and insert it into a temporary table, then remove duplicates from it.

This is my code : (ID is Autoincrement)

INSERT INTO  temptable
(UserID , HitDate ) 
Select vs.UserID ,
        vs.HitDate 
        From UserHitTbl vs where
 LEN(vs.UserID ) > 1 and 
vs.HitDate  > '2013-06-12 14:32:59.783' and 
vs.HitDate < '2013-07-12 14:32:59.783' 
  DELETE
FROM temptable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM t6
GROUP BY UserID )

The above gives the following error:

Could not allocate space for object '' in database 'tempdb' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup

Best Answer

That error is simply telling you that the volume(s) you have tempdb on is/are full. Unless you have explicitly altered the file layout of tempdb it will all be on C:. You must be pretty low on space where tempdb is unless the amount of rows for that 24 hour period is massive.

Anyway, spooling large amounts of data into a temporary table onyl to delete most of them immediately is pretty much always the wrong way to go: with decent indexes on that table you should be able to extract the relevant data with a SELECT query like so:

SELECT user_id, MIN(action_date)
FROM   your_table
WHERE  action_date BETWEEN <start> AND <end>
GROUP BY user_id

The above will list every user who touched the system (so has entries in that table) between the dates/times specified along with the first time they touch the system in that time period.

It would be useful to list in your question what keys and indexes you currently have on that table. An index with action_date is going to be pretty much essential. A compound index that covers action_date and other columns involved may be even better, though will take more space to store, depending on the balance of data. How any unique users do you have and over how many days has this data been recorded? (again: add such details to your question so everyone sees them easily, not just in a comment added to this answer)

If you just want to list users who have never logged an action before this time, then you need something like:

SELECT user_id, action_date
FROM   (
       SELECT user_id, action_date=MIN(action_date)
       FROM   your_table
       WHERE  action_date BETWEEN <start> AND <end>
       GROUP BY user_id
       ) derived_table
WHERE NOT EXISTS (SELECT user_id FROM your_table WHERE action_date < <start> 
                  AND derived_table.user_id = your_table.user_id)

I'm using the original query as a derived table here to avoid the chance the the query planner will try to apply the NOT EXISTS check to every one of those 1,800,000,000 rows, instead this way is should definitely find the users who were active today and run the sub-query once for each of them once only (so the sub-query will run at most once per user). For this to work you are going to need an index covering user_id (one covering user_id, action_date would most likely be better still).

Of course if you don't have sufficient indexes already adding them is going to be a very I/O intensive operation as for each SQL Server will need to read all those 1.8 billion rows and write all the index rows to go with them.