The tables have not been created yet, but to simplify, assume a groups table and an items table. Each row of groups holds a group id and an item id, and several groups may contain the same item id.
For instance:
CREATE TABLE groups (
    gid int,
    iid int,
    primary key (gid, iid)
);
CREATE TABLE items (
    iid int primary key,
    label char(5)
);
and let's put some data in them, to help visualize the problem:
insert into items values (1,'pen'),(2,'gum'),(3,'cat'),(4,'dog'),
    (5,'hug'),(6,'art'),(7,'fun');
insert into groups values (1,1),(1,2),(1,3),(2,4),(2,5),(2,6),(3,1),
    (3,4),(4,2),(4,3),(4,5),(4,6),(4,7);
Retrieving the items of multiple groups (e.g. groups 1, 2, 3, 4) uses DISTINCT to remove duplicates:
(1) SELECT DISTINCT i.iid,i.label FROM groups g JOIN items i ON g.iid=i.iid
WHERE g.gid in (1,2,3,4) ORDER BY i.iid;
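To see why DISTINCT is needed here, a quick count of how often each item id occurs in the selected groups can help. This is only an illustrative check against the sample data above; the n_groups alias is made up for the example, and the backticks around groups are there because GROUPS is a reserved word from MySQL 8.0.2 on.

```sql
-- Count how many of the selected groups contain each item.
-- Any item with n_groups > 1 would show up more than once in the
-- join of query (1) if DISTINCT were dropped.
SELECT g.iid, COUNT(*) AS n_groups
FROM `groups` AS g
WHERE g.gid IN (1,2,3,4)
GROUP BY g.iid
ORDER BY g.iid;
```

With the sample rows, items 1 to 6 each belong to two of the four groups and item 7 to one, so the join produces 13 rows for 7 distinct items.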
But since the tables will eventually contain more data, would the following query, which uses a subquery, be more efficient,
(2) SELECT i.iid,i.label FROM items i
JOIN (SELECT DISTINCT iid FROM groups WHERE gid IN (1,2,3,4)) AS s
ON i.iid=s.iid ORDER BY i.iid;
since DISTINCT applies only to the item ids, despite the overhead of using a subquery?
Or this one suggested by @hypercube (see comments):
(3) SELECT i.iid,i.label FROM items AS i
WHERE EXISTS
(SELECT * FROM groups AS g WHERE g.gid IN (1,2,3,4) AND i.iid=g.iid)
ORDER BY i.iid;
For information, eventually:
- the items table will have between 50~100k rows
- a group will likely have 5~10k items
- the number of group ids selected in a single query will be in the 5~20 range
- probability that an item is part of 2 selected groups: 50%
- probability that an item is part of 3 selected groups: 30%
- probability that an item is part of 4 selected groups: 10%
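One way to compare the three candidates, once the tables hold data of roughly the sizes listed above, is to look at the plan MySQL chooses for each. The sketch below simply prefixes the queries with EXPLAIN; it assumes the sample schema from this question (with groups in backticks, since GROUPS is reserved from MySQL 8.0.2 on). Plans obtained on the seven-row sample are unlikely to say much about the real workload.

```sql
-- Query (1): join, then DISTINCT over (iid, label).
EXPLAIN
SELECT DISTINCT i.iid, i.label
FROM `groups` AS g JOIN items AS i ON g.iid = i.iid
WHERE g.gid IN (1,2,3,4)
ORDER BY i.iid;

-- Query (2): DISTINCT over iid only, in a derived table, then join.
EXPLAIN
SELECT i.iid, i.label
FROM items AS i
JOIN (SELECT DISTINCT iid FROM `groups` WHERE gid IN (1,2,3,4)) AS s
  ON i.iid = s.iid
ORDER BY i.iid;

-- Query (3): EXISTS with a correlated subquery.
EXPLAIN
SELECT i.iid, i.label
FROM items AS i
WHERE EXISTS (SELECT * FROM `groups` AS g
              WHERE g.gid IN (1,2,3,4) AND i.iid = g.iid)
ORDER BY i.iid;
```

Timing the actual queries on representative data remains the more reliable comparison, since EXPLAIN only shows estimates.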
Best Answer
(1) is simple and straightforward. It may be the most efficient.
(2) is a common speed-up. But it shines when the non-indexed part (just i.label in your case) is bulky and/or the key into the table (i.iid) is not the PRIMARY KEY. So, if your example is watered down from the real code, (2) may outperform (1).
(3) is unlikely to be efficient, since it needs to reach into items 50~100K times.
Side issues:
- Don't use CHAR unless the column really is fixed length; instead, use VARCHAR.
- I assume you are using InnoDB. (MyISAM, because of different handling of PRIMARY KEY, will be less efficient.)
- If groups is a many:many mapping table, see my tips. It will probably help (3) some.
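The side issues above could be applied to the question's schema roughly as follows. This is a sketch, not a tested recommendation: the tips referred to are not reproduced here, so the extra index on (iid, gid) is an assumption based on common practice for many:many mapping tables, and its name iid_gid is made up. VARCHAR(5) just keeps the length from the question.

```sql
-- Sketch only: same columns as in the question, with the side issues applied.
CREATE TABLE items (
    iid   INT PRIMARY KEY,
    label VARCHAR(5)            -- VARCHAR instead of CHAR for variable-length text
) ENGINE=InnoDB;

CREATE TABLE `groups` (         -- backticks: GROUPS is reserved in MySQL 8.0.2+
    gid INT NOT NULL,
    iid INT NOT NULL,
    PRIMARY KEY (gid, iid),     -- serves lookups by group id
    INDEX iid_gid (iid, gid)    -- assumed extra index: serves lookups by item id
) ENGINE=InnoDB;
```

The second index is what lets the correlated subquery in (3) look up groups by item id without scanning, which is presumably why it would help that query in particular.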