Postgresql – GROUP BY and ORDER BY problem

aggregatefull-text-searchgreatest-n-per-groupgroup bypostgresql

I have two tables like this:

CREATE TABLE cmap5 (
   name     varchar(2000),
   lexemes  tsquery
);

and

CREATE TABLE IF NOT EXISTS synonyms_all_gin_tsvcolumn (
   cid       int NOT NULL,  -- REFERENCES pubchem_compounds_index(cid)
   name      varchar(2000) NOT NULL,  
   synonym   varchar(2000) NOT NULL,
   tsv_syns  tsvector,
   PRIMARY KEY (cid, name, synonym)
);

My current query is:

SELECT s.cid, s.synonym, c.name, ts_rank(s.tsv_syns,c.lexemes,16) 
FROM synonyms_all_gin_tsvcolumn s, cmap5 c
WHERE c.lexemes @@ s.tsv_syns

And the output is:

cid     |  synonym                              | name (query)              | rank
5474706 | 10-Methoxyharmalan                    | 10-methoxyharmalan        | 0.0901673
1416    | (+/-)12,13-EODE                       | 12,13-EODE                | 0.211562
5356421 | LEUKOTOXIN B (12,13-EODE)             | 12,13-EODE                | 0.211562
 180933 | 1,4-Chrysenequinone                   | 1,4-chrysenequinone       | 0.211562
5283035 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-delta 12 14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-Delta(12,14)-prostaglandin J2| 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta 12, 14-Prostaglandin J2| 15-delta prostaglandin J2 | 0.304975

I would like to return the name matches of all rows in cmap5 in my main table ranked by the ts_rank() function but for each row in cmap5 I want to:

  • select only the best X cids to each query (group by cid)
  • or ORDER BY my results as 1+ts_rank/count(cid)

To get the best match I tried to add select distinct on c.name, but when the rank is the same I want to get the cid with more matches to the query. I have tried adding a simple group by at the end of the query but I get an error, how could I do this?

Added comments:

On one hand for those results whose rank is the same, eg. above 5283035 and 5311211, get 5311211 as top result because that cid has more hits than 5283035, so I sort of want to take into account the number of hits / cid in the rank, like final_rank = 1+ts_rank(cid)/no. of hits(cid).

On the other hand I want to get the first X cids per query name. If I use LIMIT X it returns the first X results of the entire query table, not the first X per name (row) of the query table as I want.

Best Answer

First off, your PRIMARY KEY spanning two varchar(2000) columns seems extremely expensive. If you use your PK for anything else I suggest a surrogate PK (use a serial column) and add a UNIQUE constraint to enforce uniqueness on (cid, name, synonym).

If one of your varchar columns actually uses the maximum length you would exceed the maximum size for an index entry. See:

I guess what you want is this, because it would make sense:

SELECT DISTINCT ON (c.name)
       c.name, min(s.synonym) AS min_synonym, s.cid
     , ts_rank(s.tsv_syns, c.lexemes, 16) AS rnk
     , count(*) AS ct
FROM   synonyms_all_gin_tsvcolumn s
JOIN   cmap5                      c ON c.lexemes @@ s.tsv_syns
GROUP  BY c.name, rnk, s.cid
ORDER  BY c.name, rnk DESC, ct DESC;
  • I use explicit [INNER] JOIN with attached join condition replacing your CROSS JOIN plus WHERE clause. It's generally considered superior (easier to read and debug). I also use rnk as column name to avoid the basic function name rank as identifier.

  • Group the results per c.name that have the same rnk and s.cid, take min(s.synonym) (for lack of definition in the question), and count(*) the peers per group.

  • Reduce to one row per c.name with DISTINCT ON (Postgres specific extension of SQL standard DISTINCT), taking the highest rank first and, within same rank, the highest peer count. See:

  • Select first row in each GROUP BY group?

Combining GROUP BY and DISTINCT ON this way in one query level is possible since DISTINCT or DISTINCT ON are applied after GROUP BY.