Postgresql – sorting groups of related rows by average values while keeping the groups together

postgresql, sorting, window functions

I'm using PostgreSQL 8.4, but would like a standard SQL solution if possible.
Consider the following table.

corrmodel=# SELECT * from data limit 1;
   id    | datasubgroup_id | datafile_id |           sequence           | index | seqindex | margstat | pvalue 
---------+-----------------+-------------+------------------------------+-------+----------+----------+--------
 1033473 |               3 |          10 | GGTGACCCCAAGCTCAGGGCTGACCTGC | 19042 |          |  70.7634 |      0

I want a query whose result has the following properties.

All rows with the same datafile_id and index are grouped together.
The groups are sorted first by average pvalue descending, then by average margstat, where the averages are taken across each group.

The two tables I am running this query on have 2.2 million and 3.1 million rows, so I'd like something reasonably efficient. Each group consists of 5 rows. This solution by @Lamak works, but I had some trouble wrapping my head around it, and I think that a solution using window functions might be something I could actually understand. The following is close, but not correct, since the groups are not kept together in this case.

SELECT datafile_id, 
       index, 
       pvalue, 
       margstat, 
       Avg(pvalue) 
         OVER ( 
           partition BY datafile_id, index) AS avg_pval, 
       Avg(margstat) 
         OVER ( 
           partition BY datafile_id, index) AS avg_margstat 
FROM   data 
ORDER  BY avg_pval DESC, 
          avg_margstat;

Here are the first 10 rows of the query result for one of my data sets. I'd like something like this, but correct.

 datafile_id | index | pvalue | margstat | avg_pval | avg_margstat 
-------------+-------+--------+----------+----------+--------------
          30 |   781 |      1 |  13.1568 |    0.998 |     12.52546
          30 |   781 |      1 |  12.3585 |    0.998 |     12.52546
          30 |   781 |      1 |  12.3495 |    0.998 |     12.52546
          30 |   781 |   0.99 |  11.9554 |    0.998 |     12.52546
          30 |   781 |      1 |  12.8071 |    0.998 |     12.52546
          23 |  1428 |   0.99 |  12.1711 |    0.998 |      12.6777
          23 |  1428 |      1 |  12.6451 |    0.998 |      12.6777
          23 |  1428 |      1 |  12.8814 |    0.998 |      12.6777
          23 |  1428 |      1 |  12.8969 |    0.998 |      12.6777
          23 |  1428 |      1 |   12.794 |    0.998 |      12.6777

Best Answer

As @ypercube pointed out in the comments, my query is quite close to the correct answer. Sorting by avg_pval DESC, avg_margstat is nearly right; it goes wrong only when two different groups happen to tie on the (avg_pval, avg_margstat) tuple, in which case their rows can interleave. Sorting additionally on datafile_id, index within each such tie brings the groups back together. Finally, one can optionally sort within the groups by pvalue DESC, margstat. Putting that all together, one gets:

SELECT datafile_id, 
       index, 
       pvalue, 
       margstat, 
       Avg(pvalue) 
         OVER ( 
           partition BY datafile_id, index) AS avg_pval, 
       Avg(margstat) 
         OVER ( 
           partition BY datafile_id, index) AS avg_margstat 
FROM   data 
ORDER  BY avg_pval DESC, 
          avg_margstat, 
          datafile_id,
          index,
          pvalue DESC, 
          margstat;
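
For readers who cannot use window functions, the same ordering can probably be expressed with a plain aggregate subquery joined back to the table. This is only a sketch using the column names from the question. It is not necessarily the approach @Lamak proposed, and it has not been timed against the window-function version on the real tables.

```sql
-- Sketch: same ordering without window functions.
-- Aggregate once per (datafile_id, index) group, then join the averages back.
SELECT d.datafile_id,
       d.index,
       d.pvalue,
       d.margstat,
       g.avg_pval,
       g.avg_margstat
FROM   data d
JOIN  (
   SELECT datafile_id,
          index,
          avg(pvalue)   AS avg_pval,
          avg(margstat) AS avg_margstat
   FROM   data
   GROUP  BY datafile_id, index
   ) g ON  g.datafile_id = d.datafile_id
       AND g.index       = d.index
ORDER  BY g.avg_pval DESC,
          g.avg_margstat,
          d.datafile_id,   -- tie-breaker: keeps each group contiguous
          d.index,
          d.pvalue DESC,   -- optional: order rows within a group
          d.margstat;
```

Either way, the datafile_id, index tie-breaker in ORDER BY is what guarantees that groups with identical averages do not interleave.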

Postgresql – Sort by Number of Related Rows in Referencing Table

Exclude users without emails

This assumes we only want users that actually have emails; users without emails are ignored. The reason I went with this assumption at first is that all your queries do that already:

LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'

By adding a WHERE condition on emails.email you effectively convert your LEFT JOIN to a plain [INNER] JOIN and exclude users without emails. Detailed explanation:

Query with LEFT JOIN not returning rows for count of 0
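
The effect can be reproduced on a toy data set. The two inline tables below are made-up stand-ins for users and emails (two users, one email), so treat this as an illustration under those assumptions rather than your actual schema.

```sql
-- Filter in WHERE: bob has no email row, so e.email is NULL for him,
-- the LIKE evaluates to NULL, and his row is discarded.
WITH users(id, name) AS (
   VALUES (1, 'alice'::text), (2, 'bob')
), emails(id, user_id, email) AS (
   VALUES (10, 1, 'a@example.com'::text)
)
SELECT u.name, e.email
FROM   users u
LEFT   JOIN emails e ON e.user_id = u.id
WHERE  e.email LIKE 'a%'
ORDER  BY u.id;
-- returns one row: alice | a@example.com

-- Filter in the join condition: the LEFT JOIN keeps bob with a NULL email.
WITH users(id, name) AS (
   VALUES (1, 'alice'::text), (2, 'bob')
), emails(id, user_id, email) AS (
   VALUES (10, 1, 'a@example.com'::text)
)
SELECT u.name, e.email
FROM   users u
LEFT   JOIN emails e ON e.user_id = u.id
                    AND e.email LIKE 'a%'
ORDER  BY u.id;
-- returns two rows: alice | a@example.com and bob | NULL
```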

2nd query rewritten

Your 2nd query does not work as advertised: results are not "descending by number of emails". You have to compute count() in a subquery or CTE and run dense_rank() on its result one level up. You cannot nest window functions in the same query level.

SELECT u.name, e2.*
FROM  (
   -- rank users by email count, most emails first; user_id breaks ties
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, email AS e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails
      FROM   emails
      WHERE  email LIKE 'a' || '%'  -- one % is enough
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  rnk < 3
ORDER  BY rnk;

This should be fastest if the predicate is selective enough (selects only a small fraction of all emails). Two window functions with rows sorted differently have their price, too.

A major point is to run the subquery on emails only - which is possible if the preliminary assumption holds.

3rd query improved

If, on the other hand, the predicate WHERE e.email LIKE 'a' || '%' is not very selective, your 3rd query is probably faster, even if it reads from the table twice - but the second time only desired rows. Also improved:

SELECT e.user_id, u.name,
       e.id AS e_id, e.email AS e_mail, sq.n_emails
FROM  (
   SELECT user_id, count(*) AS n_emails
   FROM   emails
   WHERE  email LIKE 'a' || '%'
   GROUP  BY user_id
   ORDER  BY count(*) DESC, user_id  -- break ties
   LIMIT  2  OFFSET 0
   ) sq
JOIN   emails e USING (user_id)
JOIN   users  u ON u.id = e.user_id
WHERE  e.email LIKE 'a' || '%'
ORDER  BY sq.n_emails DESC;

Include users without emails

You can include the users table in the inner query again, similar to what you had before. But you have to pull the filter on email into the join condition!

SELECT *  -- name already comes from the inner query
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT u.id AS user_id, u.name, e.id AS e_id
           , count(e.user_id) OVER (PARTITION BY u.id) AS n_emails
      FROM   users u
      LEFT   JOIN emails e ON e.user_id = u.id
                          AND e.email LIKE 'a' || '%'  -- !!!
      ) e1
   ) e2
WHERE  rnk < 3
ORDER  BY rnk;

This will be a bit more expensive.

Since you retrieve users with the most emails first, users without emails will rarely be in the result. To optimize performance, you could use a UNION ALL with LIMIT:

(  -- parentheses required
SELECT u.name, e2.user_id, e2.e_id, e2.e_mail, e2.n_emails
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, email AS e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails
      FROM   emails
      WHERE  email LIKE 'a' || '%'  -- one % is enough
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  rnk < 3      -- adapt to paging!
ORDER  BY rnk
)
UNION ALL
(
SELECT u.name, u.id, NULL AS e_id, NULL AS e_mail, 0 AS n_emails
FROM   users       u
LEFT   JOIN emails e ON e.user_id = u.id
                    AND e.email LIKE 'a' || '%'
WHERE  e.user_id IS NULL  -- users with no matching email
)
OFFSET 0      -- adapt to paging!
LIMIT  2;     -- adapt to paging!

Detailed explanation:

Optimize a query on two big tables

`MATERIALIZED VIEW`

I would consider materializing the result for two reasons:

Subsequent queries are much faster.
You don't have to operate on a moving target. You speak of paging, and if users get new emails between pages, your whole sort order may be moot.

Build an MV from the 2nd query without LIMIT, then return the first page and all later pages from it. When to run REFRESH MATERIALIZED VIEW again is a matter of policy.
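
As a rough sketch of that workflow (PostgreSQL 9.3 or later): the view name user_email_rank is made up for illustration, the tables are assumed to be users(id, name) and emails(id, user_id, email), and the rnk filter that plays the role of the limit is left out so that the view holds every page.

```sql
-- Hypothetical view name; stores the full ranked result once.
CREATE MATERIALIZED VIEW user_email_rank AS
SELECT u.name, e2.*
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, email AS e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails
      FROM   emails
      WHERE  email LIKE 'a%'
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id;

-- Serve a page from the stored snapshot (here: the two top-ranked users).
SELECT *
FROM   user_email_rank
WHERE  rnk BETWEEN 1 AND 2
ORDER  BY rnk, e_id;

-- Rebuild the snapshot whenever your refresh policy calls for it.
REFRESH MATERIALIZED VIEW user_email_rank;
```

Because every page is read from the same snapshot, new emails arriving between page requests do not reshuffle the order until the next refresh.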

Postgresql – Postgres partition tables 1 million rows

Your database is small and doesn't require partitioning.

A quick guesstimate gives storage requirements: two 2,000 byte columns times 1M rows is 4GB; times 3-4M rows is 12-16GB. A proper calculation would include a fudge factor for the other columns, indices, and other overheads, but it's still obviously an amount that fits in RAM on anything but the most crusty of servers.
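
Once the table exists, the estimate can be checked against what PostgreSQL actually reports. In this sketch my_table is a placeholder for the real table name.

```sql
-- my_table is a placeholder; substitute the real table name.
SELECT pg_size_pretty(pg_relation_size('my_table'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('my_table')) AS with_indexes_and_toast;
```

The second figure includes indexes and TOAST storage, so it is the one to compare against available RAM.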

So to answer your questions:

1. You can partition a 1M row table, but it's not worth the effort with PostgreSQL. (PostgreSQL is not MySQL.)
2. This is not a question!
3. PostgreSQL's "about" page states there is no limit on the number of rows, but a table cannot exceed 32TB.

Related Solutions

PostgreSQL – Sort by Number of Related Rows in Referencing Table