MySQL Select Items Where Single Column Contains Multiple Values

MySQLselect

I have a MySQL database table that references different words and their locations in documents. I want to return the IDs of documents that contain all of the words.

Here is an example table.

docid     wordid
1         4
2         4
1         2
1         5

Ok, now say that someone queried the database for the words that had WORDIDs 4, 2, and 5.

My erroneous SQL SELECT statement would be something like:

Select docid from table where wordid = 4 and wordid = 2 and wordid = 5

This is giving me 0 results.

I have seen in other places where the where in clause has been suggested:

If I understand correctly, this is another way to write an OR clause. I have tried this:

select docid from table where wordid in (4,2,5)

But, this is giving me all the results. It should exclude docid 2 as that does not contain the other words. I'm expecting to just get docid 1.

However, I could be using the where in clause incorrectly as I have very little db experience.

How can I return docids that contain all of the words?

Please note as well, my where clause will be dynamically generated in a FOR loop. The query could be as simple as one or two words, or it could be 10 or 12 words. I'm looking for a query structure that takes speed into consideration. Please let me know if you need anymore information.

For reference, I am trying to convert this code into PHP / MYSQL, but I don't understand the sql statement here or its equivalent in MYSQL:

http://my.safaribooksonline.com/book/web-development/9780596529321/4dot-searching-and-ranking/querying

Best Answer

This is the relational division problem and there is a question about it at SO, with a lot of ways to write this query, plus performance analysis for PostgreSQL: How to filter SQL results in a has-many-through relation

Shamelessly copying code form there and removing/changing code for answers that have features lacking from MySQL, like CTEs, EXCEPT, INTERSECT, etc, here are a few ways to do this.

Assumptions:

  • the table is called factors
  • there is a UNIQUE constraint on (wordid, docid)
  • there is a documents and a words table:

Easy to write, medium efficiency:

-- Query 1 -- by Martin
SELECT d.docid, d.docname
FROM   document d
JOIN   factors f USING (docid)
WHERE  f.wordid IN (2, 4, 5)
GROUP  BY d.docid
HAVING COUNT(*) = 3 ;           -- number of words

Easy to write, medium efficiency:

-- Query 2 -- by Erwin
SELECT d.docid, d.docname
FROM   documents d
JOIN   (
   SELECT docid
   FROM   factors
   WHERE  wordid IN (2, 4, 5)
   GROUP  BY docid
   HAVING COUNT(*) = 3
   ) f USING (docid) ;

More complex to write, very good efficiency in Postgres - probably lousy in MySQL:

-- Query 4 -- by Derek
SELECT d.docid, d.docname
FROM   documents d
WHERE  d.docid IN (SELECT docid FROM factors WHERE wordid = 2)
AND    d.docid IN (SELECT docid FROM factors WHERE wordid = 4);
AND    d.docid IN (SELECT docid FROM factors WHERE wordid = 5);

More complex to write, very good efficiency in Postgres - and probably the same in MySQL:

-- Query 5 -- by Erwin
SELECT d.docid, d.docname
FROM   documents d
WHERE  EXISTS (SELECT * FROM factors 
               WHERE  docid = d.docid AND wordid = 2)
AND    EXISTS (SELECT * FROM factors 
               WHERE  docid = d.docid AND wordid = 4)
AND    EXISTS (SELECT * FROM factors 
               WHERE  docid = d.docid AND wordid = 5) ;

More complex to write, very good efficiency in Postgres - and probably the same in MySQL:

-- Query 6 -- by Sean
SELECT d.docid, d.docname
FROM   documents d
JOIN   factors x ON d.docid = x.docid
JOIN   factors y ON d.docid = y.docid
JOIN   factors z ON d.docid = z.docid
WHERE  x.wordid = 2
AND    y.wordid = 4
AND    z.wordid = 5 ;

Easy to write and extend to an arbitrary set of words but not as efficient as the JOIN and EXISTS solutions:

-- Query 7 -- by ypercube
SELECT d.docid, d.docname
FROM   documents d
WHERE  NOT EXISTS (
   SELECT *
   FROM   words AS w 
   WHERE  w.wordid IN (2, 4, 5)
   AND    NOT EXISTS (
      SELECT *
      FROM   factors AS f 
      WHERE  f.docid = d.docid 
      AND    f.wordid = w.wordid 
      )
   );

Easy to write, not good efficiency:

-- Query 8 -- by ypercube
SELECT d.docid, d.docname
FROM   documents d
WHERE  NOT EXISTS (
   SELECT *
   FROM  (
      SELECT 2 AS wordid UNION  ALL
      SELECT 4 UNION ALL
      SELECT 5
      ) AS w
   WHERE NOT EXISTS (
      SELECT *
      FROM   factors AS f 
      WHERE  f.docid = d.docid 
      AND    f.wordid = w.wordid 
      )
   );

Enjoy testing them :)