I have a table describing my application users with some details such as name, surname, birth date, nationality, email, etc…
I would like to known for each property, and for each user category, the most frequent value and the percentage of occurence.
For instance :
create table test ( userId int identity(1,1),
categoryId int,
name varchar(50),
surname varchar(50))
insert into test(categoryId, name, surname)
values (1, 'John', 'Locke'),
(1, 'John', 'Millar'),
(1, 'James', 'Mill'),
(1, 'John Stuart', 'Mill'),
(2, 'Thomas', 'Bayes'),
(2, 'Laurent', 'Schwartz'),
(2, 'Herrmann Amandus', 'Schwartz'),
(2, 'Thomas', 'Simpson'),
(2, 'Leonhard', 'Euler')
Result should be:
+------------+-------+--------+---------+----------+------------+
| categoryId | total | name | namePct | surname | surnamePct |
+------------+-------+--------+---------+----------+------------+
| 1 | 4 | John | 0.50 | Mill | 0.50 |
| 2 | 5 | Thomas | 0.40 | Schwartz | 0.40 |
+------------+-------+--------+---------+----------+------------+
For this simple example, I can figure how I can achieve that, with a query like:
select t.categoryId,
t.total,
n.name,
1. * n.total / t.total as namePct,
sn.surname,
1. * sn.total / t.total as surnamePct
from (
select categoryId, count(*) as total
from test
group by categoryId
) t
join (
select categoryId, name, total
from (
select categoryId, name, total, row_number() over(partition by categoryId order by total desc) as rn
from (
select categoryId, name, count(*) as total
from test
group by categoryId, name
) t
) t
where rn = 1
) n on t.categoryId = n.categoryId
join (
select categoryId, surname, total
from (
select categoryId, surname, total, row_number() over(partition by categoryId order by total desc) as rn
from (
select categoryId, surname, count(*) as total
from test
group by categoryId, surname
) t
) t
where rn = 1
) sn on t.categoryId = sn.categoryId
But, in my real use case my table has millions of rows, hundreds of categories and a dozen of attributes.
Is there a way to make the query simpler and more efficient (ie. without a bunch of sub select for each attribute) ?
I'm currently using SQL Server 2008, but answers using more recent versions are welcome.
Best Answer
You can use the windowed version of function
COUNT()
, split by each category withPARTITION BY
to get counts and totals without a subquery (note the lack ofGROUP BY
):Result:
Then you can use this result to get each percentage simply dividing the occurencies by each total. You can also calculate the best (most frequent) name and surname using a
ROW_NUMBER()
on this step:Result:
Finally, for each available category...
we can get the most frequent name and surname with their percentages with a few joins:
Result:
It might be a little big, but you can edit this query with as many new columns you want without adding new
SELECT
, just repeating the same logic with each new column you want to display and an additional join at the end.If you have millions of records and query takes long, you might want split each CTE into a temporary table with
SELECT ... INTO
+CREATE INDEX
bycategoryId
to speed up the process (if you are willing to spend a few resources by creating these tables)