Postgresql – GROUP BY and ORDER BY problem

aggregatefull-text-searchgreatest-n-per-groupgroup bypostgresql

I have two tables like this:

CREATE TABLE cmap5 (
   name     varchar(2000),
   lexemes  tsquery
);

and

CREATE TABLE IF NOT EXISTS synonyms_all_gin_tsvcolumn (
   cid       int NOT NULL,  -- REFERENCES pubchem_compounds_index(cid)
   name      varchar(2000) NOT NULL,  
   synonym   varchar(2000) NOT NULL,
   tsv_syns  tsvector,
   PRIMARY KEY (cid, name, synonym)
);

My current query is:

SELECT s.cid, s.synonym, c.name, ts_rank(s.tsv_syns,c.lexemes,16) 
FROM synonyms_all_gin_tsvcolumn s, cmap5 c
WHERE c.lexemes @@ s.tsv_syns

And the output is:

cid     |  synonym                              | name (query)              | rank
5474706 | 10-Methoxyharmalan                    | 10-methoxyharmalan        | 0.0901673
1416    | (+/-)12,13-EODE                       | 12,13-EODE                | 0.211562
5356421 | LEUKOTOXIN B (12,13-EODE)             | 12,13-EODE                | 0.211562
 180933 | 1,4-Chrysenequinone                   | 1,4-chrysenequinone       | 0.211562
5283035 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-delta 12 14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-Delta(12,14)-prostaglandin J2| 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta 12, 14-Prostaglandin J2| 15-delta prostaglandin J2 | 0.304975

I would like to return the name matches of all rows in cmap5 in my main table ranked by the ts_rank() function but for each row in cmap5 I want to:

select only the best X cids to each query (group by cid)
or ORDER BY my results as 1+ts_rank/count(cid)

To get the best match I tried to add select distinct on c.name, but when the rank is the same I want to get the cid with more matches to the query. I have tried adding a simple group by at the end of the query but I get an error, how could I do this?

Added comments:

On one hand for those results whose rank is the same, eg. above 5283035 and 5311211, get 5311211 as top result because that cid has more hits than 5283035, so I sort of want to take into account the number of hits / cid in the rank, like final_rank = 1+ts_rank(cid)/no. of hits(cid).

On the other hand I want to get the first X cids per query name. If I use LIMIT X it returns the first X results of the entire query table, not the first X per name (row) of the query table as I want.

Best Answer

First off, your PRIMARY KEY spanning two varchar(2000) columns seems extremely expensive. If you use your PK for anything else I suggest a surrogate PK (use a serial column) and add a UNIQUE constraint to enforce uniqueness on (cid, name, synonym).

If one of your varchar columns actually uses the maximum length you would exceed the maximum size for an index entry. See:

Character varying index overhead & length limit

I guess what you want is this, because it would make sense:

SELECT DISTINCT ON (c.name)
       c.name, min(s.synonym) AS min_synonym, s.cid
     , ts_rank(s.tsv_syns, c.lexemes, 16) AS rnk
     , count(*) AS ct
FROM   synonyms_all_gin_tsvcolumn s
JOIN   cmap5                      c ON c.lexemes @@ s.tsv_syns
GROUP  BY c.name, rnk, s.cid
ORDER  BY c.name, rnk DESC, ct DESC;

I use explicit [INNER] JOIN with attached join condition replacing your CROSS JOIN plus WHERE clause. It's generally considered superior (easier to read and debug). I also use rnk as column name to avoid the basic function name rank as identifier.
Group the results per c.name that have the same rnk and s.cid, take min(s.synonym) (for lack of definition in the question), and count(*) the peers per group.
Reduce to one row per c.name with DISTINCT ON (Postgres specific extension of SQL standard DISTINCT), taking the highest rank first and, within same rank, the highest peer count. See:
Select first row in each GROUP BY group?

Combining GROUP BY and DISTINCT ON this way in one query level is possible since DISTINCT or DISTINCT ON are applied after GROUP BY.

Related Solutions

Postgresql – Optimizing ORDER BY in a full text search query

What I still don't understand, is why this is slower.

That sorting the rows will cost something is obvious. But why so much?
Without ORDER BY rank0... Postgres can just

pick the first 5 rows it finds and stop fetching rows as soon as it has 5.

Bitmap Heap Scan on entities ... rows=5 ...
then compute ts_rank() for just 5 rows.

In the second case, Postgres has to

fetch all (1495 according to your query plan) rows that qualify.

Bitmap Heap Scan on entities ... rows=1495 ...
compute ts_rank() for all of them.
sort all of them to find the first 5 according to the calculated value.

Try ORDER BY name just to see the cost of computing to_tsquery('english', 'hockey'::text)) for the superfluous rows and how much remains for fetching more rows and sorting.

Mysql – GROUP BY two columns

Looks like you want something like:

SELECT country
, sum(case when type = 'first' then 1 else 0 end) as type_first
, sum(case when type = 'second' then 1 else 0 end) as type_second
, sum(case when type = 'third' then 1 else 0 end) as type_third
FROM table1
GROUP BY country

Best Answer

Related Solutions

Postgresql – Optimizing ORDER BY in a full text search query

Mysql – GROUP BY two columns

Related Question