Postgresql – sorting groups of related rows by average values while keeping the groups together

postgresqlsortingwindow functions

I'm using PostgreSQL 8.4, but would like a standard SQL solution if possible.
Consider the following table.

corrmodel=# SELECT * from data limit 1;
   id    | datasubgroup_id | datafile_id |           sequence           | index | seqindex | margstat | pvalue 
---------+-----------------+-------------+------------------------------+-------+----------+----------+--------
 1033473 |               3 |          10 | GGTGACCCCAAGCTCAGGGCTGACCTGC | 19042 |          |  70.7634 |      0

I want to return the query that has the following properties.

  1. All rows with the same datafile_id and index are grouped
    together.
  2. The groups are sorted first by average pvalue descending, then by average
    margstat, where the averages are across each group.

The two tables I am doing this query on have 2.2 million and 3.1 million rows, so I'd like something reasonably efficient. Each group consists of 5 rows. This solution by @Lamak
works, but I had some trouble wrapping my head around it, and I think that a solution using window functions might be something I could actually understand. The following is close, but not correct, since the group is not preserved in this case.

SELECT datafile_id, 
       index, 
       pvalue, 
       margstat, 
       Avg(pvalue) 
         OVER ( 
           partition BY datafile_id, index) AS avg_pval, 
       Avg(margstat) 
         OVER ( 
           partition BY datafile_id, index) AS avg_margstat 
FROM   data 
ORDER  BY avg_pval DESC, 
          avg_margstat; 

Here is the first 10 rows of the query result for one of my data sets. I'd like something like this, but correct.

datafile_id | index | pvalue | margstat | avg_pval | avg_margstat 
-------------+-------+--------+----------+----------+--------------
          30 |   781 |      1 |  13.1568 |    0.998 |     12.52546
          30 |   781 |      1 |  12.3585 |    0.998 |     12.52546
          30 |   781 |      1 |  12.3495 |    0.998 |     12.52546
          30 |   781 |   0.99 |  11.9554 |    0.998 |     12.52546
          30 |   781 |      1 |  12.8071 |    0.998 |     12.52546
          23 |  1428 |   0.99 |  12.1711 |    0.998 |      12.6777
          23 |  1428 |      1 |  12.6451 |    0.998 |      12.6777
          23 |  1428 |      1 |  12.8814 |    0.998 |      12.6777
          23 |  1428 |      1 |  12.8969 |    0.998 |      12.6777
          23 |  1428 |      1 |   12.794 |    0.998 |      12.6777

Best Answer

As @ypercube pointed out in the comments, my query is quite close to the correct answer. The sort by avg_pval DESC, avg_margstat is actually close to the correct sort, only incorrect if the (avg_pval, margstat) tuple happens to have ties. So one can sort again, for a fixed (avg_pval, margstat), on datafile_id, index which will bring the groups back together. Finally, one can optionally sort within the groups, by pvalue DESC, margstat, Putting that all together, one gets

SELECT datafile_id, 
       index, 
       pvalue, 
       margstat, 
       Avg(pvalue) 
         OVER ( 
           partition BY datafile_id, index) AS avg_pval, 
       Avg(margstat) 
         OVER ( 
           partition BY datafile_id, index) AS avg_margstat 
FROM   data 
ORDER  BY avg_pval DESC, 
          avg_margstat, 
          datafile_id,
          index,
          pvalue DESC, 
          margstat;