MySQL – Implementing a query, two types of strategies

Tags: distinct, MySQL, performance, query-performance, subquery

The tables have not been created yet, but to simplify, there are two of them: groups and items. Each row of groups holds a group id and an item id, and several groups may contain the same item id.

For instance

-- `groups` is backtick-quoted because GROUPS is a reserved word in MySQL 8.0.2+
CREATE TABLE `groups` (
    gid INT,
    iid INT,
    PRIMARY KEY (gid, iid)
);
CREATE TABLE items (
    iid INT PRIMARY KEY,
    label CHAR(5)
);

and let's put some data in them, to help visualize the problem:

insert into items values (1,"pen"),(2,"gum"),(3,"cat"),(4,"dog"),
                         (5,"hug"),(6,"art"),(7,"fun");

insert into `groups` values (1,1),(1,2),(1,3),(2,4),(2,5),(2,6),(3,1),
                            (3,4),(4,2),(4,3),(4,5),(4,6),(4,7);

Retrieving the items of multiple groups (e.g. groups 1,2,3,4) needs a DISTINCT to remove duplicates:

(1) SELECT DISTINCT i.iid, i.label FROM `groups` g JOIN items i ON g.iid = i.iid
    WHERE g.gid IN (1,2,3,4) ORDER BY i.iid;

But since the tables will eventually contain more data, would this query, which uses a subquery, be more efficient

(2) SELECT i.iid, i.label FROM items i
    JOIN (SELECT DISTINCT iid FROM `groups` WHERE gid IN (1,2,3,4)) AS s
    ON i.iid = s.iid ORDER BY i.iid;

since DISTINCT then applies only to the item ids, despite the overhead of using a subquery?

Or this one suggested by @hypercube (see comments):

(3) SELECT i.iid, i.label FROM items AS i
    WHERE EXISTS
       (SELECT * FROM `groups` AS g WHERE g.gid IN (1,2,3,4) AND i.iid = g.iid)
    ORDER BY i.iid;

For reference, the expected figures are, eventually:

  • the items table will have between 50~100k rows
  • a group will likely have 5~10k items
  • the number of group ids selected in a single query will be in the 5~20 range
  • probability that an item is part of 2 selected groups: 50%
  • probability that an item is part of 3 selected groups: 30%
  • probability that an item is part of 4 selected groups: 10%

Best Answer

(1) is simple and straightforward. It may be the most efficient.

(2) is a common speed-up. But it shines when the non-indexed part (just i.label in your case) is bulky and/or the key into the table (i.iid) is not the PRIMARY KEY. So, if your example is watered down from the real code, (2) may outperform (1).

(3) is unlikely to be efficient, since it walks through items and has to check groups once per row, that is, 50~100K times.
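
Which of the three actually wins depends on the data distribution and the MySQL version, so it is worth checking the execution plans once the tables hold realistic data. A minimal sketch, assuming the schema and group ids from the question (the backticks around groups are only needed because GROUPS is a reserved word from MySQL 8.0.2 on):

```sql
-- Compare the plans of the three candidates.
-- On MySQL 8.0.18+, EXPLAIN ANALYZE can be used instead to get measured timings.
EXPLAIN
SELECT DISTINCT i.iid, i.label
FROM `groups` g JOIN items i ON g.iid = i.iid
WHERE g.gid IN (1,2,3,4)
ORDER BY i.iid;

EXPLAIN
SELECT i.iid, i.label
FROM items i
JOIN (SELECT DISTINCT iid FROM `groups` WHERE gid IN (1,2,3,4)) AS s
  ON i.iid = s.iid
ORDER BY i.iid;

EXPLAIN
SELECT i.iid, i.label
FROM items AS i
WHERE EXISTS
  (SELECT * FROM `groups` AS g WHERE g.gid IN (1,2,3,4) AND i.iid = g.iid)
ORDER BY i.iid;
```

In the output, the rows estimate per table and notes such as "Using temporary" in the Extra column are the things to compare between the three plans.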

Side issues:

Don't use CHAR unless the column really is fixed length; instead, use VARCHAR.
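
For the items table from the question, that could look like the following sketch. The length of 5 is simply carried over from the original definition; the real maximum label length is not known here.

```sql
-- Same table as in the question, with a variable-length label column.
CREATE TABLE items (
    iid   INT PRIMARY KEY,
    label VARCHAR(5)   -- length taken from the question; adjust to the real data
);
```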

I assume you are using InnoDB. (MyISAM, because of different handling of PRIMARY KEY, will be less efficient.)
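
If in doubt, the engine can be checked once the tables exist. A small sketch, assuming the table names from the question and that the tables live in the currently selected database:

```sql
-- Show which storage engine each table uses.
SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME IN ('groups', 'items');

-- Convert a table if it turns out not to be InnoDB.
ALTER TABLE items ENGINE = InnoDB;
```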

If groups is a many:many mapping table, see my tips. It will probably help (3) some.
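
The tips themselves are not reproduced here, so what follows is an assumption about the kind of change meant: a common recommendation for many:many mapping tables is a second index with the columns in the opposite order from the PRIMARY KEY. The index name iid_gid is made up for illustration.

```sql
-- The PRIMARY KEY (gid, iid) serves lookups by group.
-- This secondary index serves lookups by item, which is what the
-- correlated subquery in (3) does (g.iid = i.iid, then a check on g.gid).
ALTER TABLE `groups` ADD INDEX iid_gid (iid, gid);
```

With both columns in the index, the check in (3) can be answered from the index alone, without touching the table rows.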