I have two tables like this:
CREATE TABLE cmap5 (
name varchar(2000),
lexemes tsquery
);
and
CREATE TABLE IF NOT EXISTS synonyms_all_gin_tsvcolumn (
cid int NOT NULL, -- REFERENCES pubchem_compounds_index(cid)
name varchar(2000) NOT NULL,
synonym varchar(2000) NOT NULL,
tsv_syns tsvector,
PRIMARY KEY (cid, name, synonym)
);
My current query is:
SELECT s.cid, s.synonym, c.name, ts_rank(s.tsv_syns,c.lexemes,16)
FROM synonyms_all_gin_tsvcolumn s, cmap5 c
WHERE c.lexemes @@ s.tsv_syns
And the output is:
cid | synonym | name (query) | rank
5474706 | 10-Methoxyharmalan | 10-methoxyharmalan | 0.0901673
1416 | (+/-)12,13-EODE | 12,13-EODE | 0.211562
5356421 | LEUKOTOXIN B (12,13-EODE) | 12,13-EODE | 0.211562
180933 | 1,4-Chrysenequinone | 1,4-chrysenequinone | 0.211562
5283035 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-delta 12 14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-Delta(12,14)-prostaglandin J2| 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta 12, 14-Prostaglandin J2| 15-delta prostaglandin J2 | 0.304975
I would like to return the name matches of all rows in cmap5
in my main table ranked by the ts_rank()
function but for each row in cmap5
I want to:
- select only the best X
cid
s to each query (group by cid
) - or
ORDER BY my results as 1+ts_rank/count(cid)
To get the best match I tried to add select distinct on c.name
, but when the rank is the same I want to get the cid
with more matches to the query. I have tried adding a simple group by at the end of the query but I get an error, how could I do this?
Added comments:
On one hand for those results whose rank is the same, eg. above 5283035
and 5311211
, get 5311211
as top result because that cid
has more hits than 5283035
, so I sort of want to take into account the number of hits / cid
in the rank, like final_rank = 1+ts_rank(cid)/no. of hits(cid).
On the other hand I want to get the first X
cids per query name. If I use LIMIT X
it returns the first X
results of the entire query table, not the first X
per name (row) of the query table as I want.
Best Answer
First off, your
PRIMARY KEY
spanning twovarchar(2000)
columns seems extremely expensive. If you use your PK for anything else I suggest a surrogate PK (use aserial
column) and add aUNIQUE
constraint to enforce uniqueness on(cid, name, synonym)
.If one of your
varchar
columns actually uses the maximum length you would exceed the maximum size for an index entry. See:I guess what you want is this, because it would make sense:
I use explicit
[INNER] JOIN
with attached join condition replacing yourCROSS JOIN
plusWHERE
clause. It's generally considered superior (easier to read and debug). I also usernk
as column name to avoid the basic function namerank
as identifier.Group the results per
c.name
that have the samernk
ands.cid
, takemin(s.synonym)
(for lack of definition in the question), andcount(*)
the peers per group.Reduce to one row per
c.name
withDISTINCT ON
(Postgres specific extension of SQL standardDISTINCT
), taking the highest rank first and, within same rank, the highest peer count. See:Select first row in each GROUP BY group?
Combining
GROUP BY
andDISTINCT ON
this way in one query level is possible sinceDISTINCT
orDISTINCT ON
are applied afterGROUP BY
.