Sql-server – SQL Calculation based on ‘similar’ rows within table

sql serverssasssis

I'm using SQL Server 2008 R2, building a fact table for SSAS.

I can therefore accept solutions to this via either SQL query to SSIS task.

Let's say I have a fact table that looks a little like this, its a record of when people have eaten cakes, and how many (for illustrative purposes).

Surrogate Key | CakeType | PersonKey | CakesEaten | Date     | HadSeconds
1               20         1234        4            20120412   0
2               41         2345        3            20120401   0 
3               25         3456        4            20120408   0
4               39         2345        2            20120402   0
5               20         4567        3            20120409   0
6               89         9876        2            20120405   0
7               31         1579        4            20120408   0
8               29         2709        3            20120402   0
9               27         3456        3            20120410   0
10              26         2345        2            20120408   0

CakeType and PersonKey and Date are all foreign keys to dimensions.

CakesEaten and HadSeconds are measures.

Techie type question: What I want to do, is to mark a fact as 'HadSeconds' where a Person has a record in the table within 2 days of the previous fact.

Business type question: Show me the instances of people going to eat cake when they've already had some cake in the previous 2 days.

For example, Person '2345' has 3 records in this fact table, IDs 2, 4, 10.

I want to mark 'HadSeconds' for ID '4' as a '1', because the Date for ID 4 is within 2 days of ID2, therefore the person ate cake in quick succession, and this instance needs to be marked as 'HadSeconds'. ID '10' has a datestamp which is 6 days after ID '4', therefore the 'HadSeconds' column for ID 10 does not need to be marked.

Note that only records were a Person has multiple entries in the fact need to be considered.

The best (lol) thing? This fact table has approx. 60 million records in it.

What's the most efficient way to achieve this? Breaking stuff down into temporary tables? Sorting? Cursors (ew…), Aggregration tasks in SSIS…

Best Answer

Just an UPDATE using EXISTS will be your best bet:

UPDATE c
SET HadSeconds = 1
FROM Caketable C
WHERE EXISTS (SELECT 1
              FROM CakeTable C2
              WHERE c2.personid = c.personid
              AND c2.date > DATEADD(day, -2, c.date))

If you think this will run too long, you can always break it down by date range, cake type, etc to do smaller batches.

If you have a composite index on personid, date this should be pretty fast.

You could also do a self-join but I'm pretty confident the EXISTS will run a lot faster:

UPDATE c
SET HadSeconds = 1
FROM Caketable C
INNER JOIN CakeTable C2
  ON c2.personid = c.personid
  AND c2.date > DATEADD(day, -2, c.date)