Mysql – Optimizing Queries HAVING COUNT and INSERT…ON DUPLICATE KEY UPDATE

duplicationgroup byinnodbMySQLprimary-key

There are two main queries I'm tying to improve, one question is much more conceptual and learning about databases and another is much more straightforward.

Both cases deal with the same table
It has 2 relevant columns as far as the question is concerned, they make up the primary key of the table. Each of the elements of the primary key is hypothetically the key to another table n their own right. The first row is an int (this is an item) an the second row is a VARCHAR (this is a user) limited to 200 characters.

The first issue comes in trying to find an intersection of users. I am looking to find an intersection of users that share a number of items, 5 in the example

SELECT userid 
FROM   (SELECT b.userid, 
               b.showid 
        FROM   ratings AS A 
               JOIN ratings AS B 
                 ON A.userid LIKE :user 
                    AND B.userid NOT LIKE :user 
                    AND A.showid LIKE B.showid) AS C 
GROUP  BY userid 
HAVING Count(*) > 4  

On a fast run this takes up about 2/3 of the runtime of my program (appearing twice, one nested in another query taking little longer than this one). I've tried looking at it but I don't see any obvious way to restructure the query to save time, but queries operating over many more rows are coming up with runtimes of less than 1 second while this takes 2 seconds.

The second issue is more conceptual. As mentioned before I have a composite primary key. I've found elsewhere that with InnoDB tables this can make inserts take up a lot more time. This is exactly whats happening with an INSERT ... UPDATE ON DUPLICATE KEY query that I'm running. What is the best way to go about reducing this slowdown.

This is what the insert looks like, it only runs if the show has been rated

INSERT INTO shows 
            ( 
                        showid, 
                        showname, 
                        showimage 
            ) 
            VALUES 
            ( 
                        :id, 
                        :name, 
                        :image 
            ) 
on duplicate KEY 
UPDATE showimage=:image, 
       showname= :name;
INSERT INTO ratings 
            ( 
                        userid, 
                        showid, 
                        rating 
            ) 
            VALUES 
            ( 
                        :user, 
                        :show, 
                        :rating 
            ) 
on duplicate KEY 
UPDATE rating=:rating

Shows

CREATE TABLE "shows" (
 "ShowId" int(11) NOT NULL,
 "ShowName" varchar(400) NOT NULL,
 "ShowImage" varchar(400) NOT NULL,
 "ShowAgreement" int(11) NOT NULL DEFAULT "0",
 PRIMARY KEY ("ShowId")
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Ratings

CREATE TABLE "ratings" (
 "UserId" varchar(200) NOT NULL,
 "ShowId" int(11) NOT NULL,
 "Rating" int(11) NOT NULL,
 PRIMARY KEY ("UserId","ShowId"),
 KEY "ShowId" ("ShowId"),
 CONSTRAINT "ratings_ibfk_1" FOREIGN KEY ("ShowId") REFERENCES "shows" ("ShowId") ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1

For a single test case the INSERT takes an additional 8 seconds over the UPDATE for 160 entries.

  • Should I simply run the conditional part of the query codeside and removing the primary key.
  • Should I drop the primary key and replace it with a unique constraint?
  • Should I create a new table for the UserId that creates a VARCHAR -> INT identity for userId's and give all existing tables the INT version of the id rather than the CHAR?
  • Should I do something else entirely that hadn't occurred to me?

Best Answer

I'm guessing a bit on what your query is supposed to do. I assume you are looking for all users that share 5 or more items with a certain user:

SELECT a.userid 
FROM ratings AS a
WHERE EXISTS (
      SELECT 1
      FROM ratings AS b
      WHERE b.userid = :user
        AND a.userid <> b.userid
        AND a.showid = b.showid
)
GROUP BY a.userid
HAVING COUNT(1) > 4

Is that about right?

For the second part of the question, as I understand it the tables are defined as:

create table shows
(   showid  int not null primary key
,   showname    varchar(10) not null
,   showimage   varchar(10)
,   x int default 0 not null );

create table ratings
(   userid  varchar(200) not null
,   showid  int not null
        references shows(showid)
,   rating  int not null
,       primary key (showid, userid) );

If that's the case I see no particular reason why the insert in ratings is slow, btw what is slow?

Some ideas for trying to pinpoint the problem, make a copy of the ratings table, say ratings2, drop the foreign key and see how that affects the insert performance