I'm using PostgreSQL 8.4, but would like a standard SQL solution if possible.
Consider the following table.
corrmodel=# SELECT * from data limit 1;
id | datasubgroup_id | datafile_id | sequence | index | seqindex | margstat | pvalue
---------+-----------------+-------------+------------------------------+-------+----------+----------+--------
1033473 | 3 | 10 | GGTGACCCCAAGCTCAGGGCTGACCTGC | 19042 | | 70.7634 | 0
I want to return the query that has the following properties.
- All rows with the same
datafile_id
andindex
are grouped
together. - The groups are sorted first by average
pvalue
descending, then by average
margstat
, where the averages are across each group.
The two tables I am doing this query on have 2.2 million and 3.1 million rows, so I'd like something reasonably efficient. Each group consists of 5 rows. This solution by @Lamak
works, but I had some trouble wrapping my head around it, and I think that a solution using window functions might be something I could actually understand. The following is close, but not correct, since the group is not preserved in this case.
SELECT datafile_id,
index,
pvalue,
margstat,
Avg(pvalue)
OVER (
partition BY datafile_id, index) AS avg_pval,
Avg(margstat)
OVER (
partition BY datafile_id, index) AS avg_margstat
FROM data
ORDER BY avg_pval DESC,
avg_margstat;
Here is the first 10 rows of the query result for one of my data sets. I'd like something like this, but correct.
datafile_id | index | pvalue | margstat | avg_pval | avg_margstat
-------------+-------+--------+----------+----------+--------------
30 | 781 | 1 | 13.1568 | 0.998 | 12.52546
30 | 781 | 1 | 12.3585 | 0.998 | 12.52546
30 | 781 | 1 | 12.3495 | 0.998 | 12.52546
30 | 781 | 0.99 | 11.9554 | 0.998 | 12.52546
30 | 781 | 1 | 12.8071 | 0.998 | 12.52546
23 | 1428 | 0.99 | 12.1711 | 0.998 | 12.6777
23 | 1428 | 1 | 12.6451 | 0.998 | 12.6777
23 | 1428 | 1 | 12.8814 | 0.998 | 12.6777
23 | 1428 | 1 | 12.8969 | 0.998 | 12.6777
23 | 1428 | 1 | 12.794 | 0.998 | 12.6777
Best Answer
As @ypercube pointed out in the comments, my query is quite close to the correct answer. The sort by
avg_pval DESC, avg_margstat
is actually close to the correct sort, only incorrect if the(avg_pval, margstat)
tuple happens to have ties. So one can sort again, for a fixed(avg_pval, margstat)
, ondatafile_id, index
which will bring the groups back together. Finally, one can optionally sort within the groups, bypvalue DESC, margstat
, Putting that all together, one gets