Let there be two tables:
Users
id [pk] | name
--------+---------
1 | Alice
2 | Bob
3 | Charlie
4 | Dan
Emails
id | user_id | email
----+---------+-------
1 | 1 | a.1
2 | 1 | a.2
3 | 2 | a.3
4 | 2 | b.1
5 | 2 | a.4
6 | 2 | a.5
7 | 3 | b.2
8 | 3 | a.6
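For reference, the tables above can be reproduced with this minimal setup (PostgreSQL; names and data taken from the question):

```sql
-- schema and sample data matching the tables shown above
CREATE TABLE users  (id int PRIMARY KEY, name text);
CREATE TABLE emails (id int PRIMARY KEY, user_id int REFERENCES users, email text);

INSERT INTO users VALUES
  (1, 'Alice'), (2, 'Bob'), (3, 'Charlie'), (4, 'Dan');
INSERT INTO emails VALUES
  (1, 1, 'a.1'), (2, 1, 'a.2'), (3, 2, 'a.3'), (4, 2, 'b.1'),
  (5, 2, 'a.4'), (6, 2, 'a.5'), (7, 3, 'b.2'), (8, 3, 'a.6');
```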
With a single query I want to retrieve:
- the user's id and name
- the count of the user's emails
- each email and its id
I'd like the output ordered descending by number of emails and filtered to emails starting with 'a'. Users without emails shall be included, too; treat their email count as 0.
Here is my query:
SELECT users.id AS user_id, users.name AS name,
emails.id AS email_id, emails.email AS email,
count(emails.id) OVER (PARTITION BY users.id) as n_emails
FROM users
LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a%'
ORDER BY n_emails DESC;
And the result, which looks as expected:
user_id | name | email_id | email | n_emails
---------+---------+----------+-------+----------
2 | Bob | 6 | a.5 | 3
2 | Bob | 5 | a.4 | 3
2 | Bob | 3 | a.3 | 3
1 | Alice | 2 | a.2 | 2
1 | Alice | 1 | a.1 | 2
3 | Charlie | 8 | a.6 | 1
Obviously this is a small, simple example, while the actual dataset can be large, so I'd like to use LIMIT/OFFSET for paging. For example, I'd like to fetch the first pair of users (not just rows):
-- previous query ...
LIMIT 2 OFFSET 0;
And… fail. I got incomplete information, about Bob only:
user_id | name | email_id | email | n_emails
---------+------+----------+-------+----------
2 | Bob | 6 | a.5 | 3
2 | Bob | 5 | a.4 | 3
Hence the question: how can I apply limit/offset to objects, in this case, users (logical entities, not rows)?
I've found a solution: add dense_rank() over users.id and then filter by rank:
SELECT * FROM (
SELECT users.id AS user_id, users.name AS name,
emails.id AS email_id, emails.email AS email,
count(emails.id) OVER (PARTITION BY users.id) as n_emails,
dense_rank() OVER (ORDER BY users.id) as n_user
FROM users
LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a%'
ORDER BY n_emails DESC
) AS sq
WHERE sq.n_user <= 2; -- here it is
The output looks good:
user_id | name | email_id | email | n_emails | n_user
---------+-------+----------+-------+----------+--------
2 | Bob | 6 | a.5 | 3 | 2
2 | Bob | 5 | a.4 | 3 | 2
2 | Bob | 3 | a.3 | 3 | 2
1 | Alice | 2 | a.2 | 2 | 1
1 | Alice | 1 | a.1 | 2 | 1
But if you look at the query plan, you'll see that the most expensive steps are the subquery scan and sorting. AFAIK it is impossible to build an index on a subquery or CTE, so it will always be a sequential scan/filter over n_user, and the query will run for a long time on a big dataset.
Another solution I see is to do the work in two steps:
- retrieve only user ids and the number of emails for the filtered and sorted dataset, using a subquery;
- join that subquery with users and emails.
The query is:
SELECT users.id AS user_id, users.name,
emails.id AS email_id, emails.email,
sq.n_emails
FROM
(SELECT users.id, count(emails.id) AS n_emails
FROM users
LEFT JOIN emails ON users.id = emails.user_id
WHERE emails.email LIKE 'a%'
GROUP BY users.id
ORDER BY n_emails DESC
LIMIT 2 OFFSET 0 -- here it is
) AS sq
JOIN users ON users.id = sq.id
LEFT JOIN emails ON emails.user_id = users.id
WHERE emails.email LIKE 'a%'
ORDER BY sq.n_emails DESC;
This seems to be much faster. But it doesn't look like a good solution, because I have to duplicate exactly the same query (except for the SELECT ... FROM part); in effect, one query runs twice. Is there any better solution?
Best Answer
Exclude users without emails
Assuming we only want users that actually have emails; users without emails are ignored. The reason I went with this assumption at first is that all your queries already do that: by adding a WHERE condition on emails.email you effectively convert your LEFT JOIN to a plain [INNER] JOIN and exclude users without emails.
2nd query rewritten
Your 2nd query does not work as advertised; results are not "descending by number of emails". You have to nest the result of count() in another CTE or subquery and run dense_rank() on that, because you cannot nest window functions on the same query level. This should be fastest if the predicate is selective enough (selects only a small fraction of all emails). Two window functions with rows sorted differently have their price, too.
The inner query can read from emails only, which is possible if the preliminary assumption holds.
3rd query improved
If, on the other hand, the predicate WHERE e.email LIKE 'a%' is not very selective, your 3rd query is probably faster, even if it reads from the table twice; the second time it reads only the desired rows. Also improved:
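One possible shape of that improvement, under the same assumption that users without emails can be ignored: the inner query aggregates from emails alone (no join to users needed there), and a user_id tiebreaker (my addition) makes LIMIT/OFFSET paging deterministic:

```sql
SELECT u.id AS user_id, u.name, e.id AS email_id, e.email, sq.n_emails
FROM  (
   -- pick the page of users from emails alone
   SELECT user_id, count(*) AS n_emails
   FROM   emails
   WHERE  email LIKE 'a%'
   GROUP  BY user_id
   ORDER  BY n_emails DESC, user_id      -- deterministic order for paging
   LIMIT  2 OFFSET 0
   ) sq
JOIN   users  u ON u.id = sq.user_id
JOIN   emails e ON e.user_id = sq.user_id AND e.email LIKE 'a%'
ORDER  BY sq.n_emails DESC, u.id, e.id;
```

The second read of emails touches only rows belonging to the two selected users.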
Include users without emails
You can either include the users table in the inner query again, similar to what you had before; but then you have to pull the filter on email into the join condition. That will be a bit more expensive.
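A sketch of that variant, assuming the names from the question. Moving the email filter into the LEFT JOIN condition preserves users without (matching) emails, counting them as 0:

```sql
SELECT u.id AS user_id, u.name, e.id AS email_id, e.email, sq.n_emails
FROM  (
   SELECT u.id, count(e.id) AS n_emails
   FROM   users u
   LEFT   JOIN emails e ON e.user_id = u.id
                       AND e.email LIKE 'a%'   -- filter in join condition,
   GROUP  BY u.id                              -- not in WHERE: keeps all users
   ORDER  BY n_emails DESC, u.id
   LIMIT  2 OFFSET 0
   ) sq
JOIN   users  u ON u.id = sq.id
LEFT   JOIN emails e ON e.user_id = u.id AND e.email LIKE 'a%'
ORDER  BY sq.n_emails DESC, u.id, e.id;
```

With a filter in the WHERE clause instead, NULL emails from the LEFT JOIN would be eliminated and users without emails would disappear again.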
Since you retrieve users with the most emails first, users without emails will rarely make it into the result. To optimize performance, you could use a UNION ALL with an added LIMIT.
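A sketch of the idea, with my assumptions spelled out: the first leg fetches users that have matching emails; the second leg appends users without any, counted as 0; the outer LIMIT caps the total, so when the first leg already fills the page, Postgres can stop before reading the second leg at all:

```sql
SELECT u.id AS user_id, u.name, e.id AS email_id, e.email, sq.n_emails
FROM  (
   (  -- leg 1: users with matching emails, most emails first
   SELECT user_id, count(*) AS n_emails
   FROM   emails
   WHERE  email LIKE 'a%'
   GROUP  BY user_id
   ORDER  BY n_emails DESC, user_id
   LIMIT  2
   )
   UNION ALL
   (  -- leg 2: users without matching emails, count 0
   SELECT u.id, 0
   FROM   users u
   WHERE  NOT EXISTS (
      SELECT 1 FROM emails e
      WHERE  e.user_id = u.id AND e.email LIKE 'a%')
   ORDER  BY u.id
   LIMIT  2
   )
   LIMIT 2   -- page size; leg 2 only runs if leg 1 falls short
   ) sq
JOIN   users u ON u.id = sq.user_id
LEFT   JOIN emails e ON e.user_id = u.id AND e.email LIKE 'a%'
ORDER  BY sq.n_emails DESC, u.id, e.id;
```

Note that this deliberately omits an ORDER BY across the UNION ALL: adding one would force both legs to execute and defeat the short-circuit.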
MATERIALIZED VIEW
I would consider materializing the result.
Build an MV from the 2nd query without the LIMIT, refresh it with REFRESH MATERIALIZED VIEW, then return the first page, and so on. When to refresh the MV again is a matter of policy.
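One possible shape, as a sketch: the view name user_email_counts and the choice to store only per-user counts (rather than every email row) are my assumptions; paging then sorts the small precomputed view and joins emails only for the selected page:

```sql
-- precompute the expensive aggregation once
CREATE MATERIALIZED VIEW user_email_counts AS
SELECT u.id AS user_id, u.name, count(e.id) AS n_emails
FROM   users u
LEFT   JOIN emails e ON e.user_id = u.id AND e.email LIKE 'a%'
GROUP  BY u.id;

-- index supports the paging sort order
CREATE INDEX ON user_email_counts (n_emails DESC, user_id);

-- re-run whenever the data is allowed to go stale
REFRESH MATERIALIZED VIEW user_email_counts;

-- paging reads the MV and joins emails only for the chosen users
SELECT mv.user_id, mv.name, e.id AS email_id, e.email, mv.n_emails
FROM  (
   SELECT * FROM user_email_counts
   ORDER  BY n_emails DESC, user_id
   LIMIT  2 OFFSET 0
   ) mv
LEFT   JOIN emails e ON e.user_id = mv.user_id AND e.email LIKE 'a%'
ORDER  BY mv.n_emails DESC, mv.user_id, e.id;
```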