This is the relational division problem. There is a question about it on SO with a lot of ways to write this query, plus performance analysis for PostgreSQL: How to filter SQL results in a has-many-through relation

Shamelessly copying code from there, and removing/changing the answers that use features missing from MySQL (like CTEs, `EXCEPT`, `INTERSECT`, etc.), here are a few ways to do this.
Assumptions:

- the table is called `factors`
- there is a `UNIQUE` constraint on `(wordid, docid)`
- there is a `documents` and a `words` table
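For reference, a minimal schema matching these assumptions could look like this (the column types and the `word`/`docname` payload columns are my guesses, not from the original question):

```sql
CREATE TABLE documents (
   docid   int PRIMARY KEY,
   docname text NOT NULL
);

CREATE TABLE words (
   wordid int PRIMARY KEY,
   word   text NOT NULL
);

CREATE TABLE factors (
   docid  int NOT NULL REFERENCES documents (docid),
   wordid int NOT NULL REFERENCES words (wordid),
   UNIQUE (wordid, docid)
);
```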
Easy to write, medium efficiency:
-- Query 1 -- by Martin
SELECT d.docid, d.docname
FROM documents d
JOIN factors f USING (docid)
WHERE f.wordid IN (2, 4, 5)
GROUP BY d.docid
HAVING COUNT(*) = 3 ; -- number of words
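The `HAVING COUNT(*) = 3` test relies on the `UNIQUE (wordid, docid)` constraint; if duplicates were possible, a document could pass with fewer than 3 distinct words. A common workaround (a sketch, not one of the linked answers) is to count distinct words instead:

```sql
SELECT d.docid, d.docname
FROM documents d
JOIN factors f USING (docid)
WHERE f.wordid IN (2, 4, 5)
GROUP BY d.docid
HAVING COUNT(DISTINCT f.wordid) = 3 ; -- number of words
```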
Easy to write, medium efficiency:
-- Query 2 -- by Erwin
SELECT d.docid, d.docname
FROM documents d
JOIN (
SELECT docid
FROM factors
WHERE wordid IN (2, 4, 5)
GROUP BY docid
HAVING COUNT(*) = 3
) f USING (docid) ;
More complex to write, very good efficiency in Postgres - probably lousy in MySQL:
-- Query 4 -- by Derek
SELECT d.docid, d.docname
FROM documents d
WHERE d.docid IN (SELECT docid FROM factors WHERE wordid = 2)
AND d.docid IN (SELECT docid FROM factors WHERE wordid = 4)
AND d.docid IN (SELECT docid FROM factors WHERE wordid = 5);
More complex to write, very good efficiency in Postgres - and probably the same in MySQL:
-- Query 5 -- by Erwin
SELECT d.docid, d.docname
FROM documents d
WHERE EXISTS (SELECT * FROM factors
WHERE docid = d.docid AND wordid = 2)
AND EXISTS (SELECT * FROM factors
WHERE docid = d.docid AND wordid = 4)
AND EXISTS (SELECT * FROM factors
WHERE docid = d.docid AND wordid = 5) ;
More complex to write, very good efficiency in Postgres - and probably the same in MySQL:
-- Query 6 -- by Sean
SELECT d.docid, d.docname
FROM documents d
JOIN factors x ON d.docid = x.docid
JOIN factors y ON d.docid = y.docid
JOIN factors z ON d.docid = z.docid
WHERE x.wordid = 2
AND y.wordid = 4
AND z.wordid = 5 ;
Easy to write and to extend to an arbitrary set of words, but not as efficient as the `JOIN` and `EXISTS` solutions:
-- Query 7 -- by ypercube
SELECT d.docid, d.docname
FROM documents d
WHERE NOT EXISTS (
   SELECT *
   FROM   words AS w
   WHERE  w.wordid IN (2, 4, 5)
     AND  NOT EXISTS (
        SELECT *
        FROM   factors AS f
        WHERE  f.docid = d.docid
          AND  f.wordid = w.wordid
        )
   );
Easy to write, not good efficiency:
-- Query 8 -- by ypercube
SELECT d.docid, d.docname
FROM documents d
WHERE NOT EXISTS (
   SELECT *
   FROM  (
      SELECT 2 AS wordid UNION ALL
      SELECT 4 UNION ALL
      SELECT 5
      ) AS w
   WHERE NOT EXISTS (
      SELECT *
      FROM   factors AS f
      WHERE  f.docid = d.docid
        AND  f.wordid = w.wordid
      )
   );
Enjoy testing them :)
In Postgres, this is simpler with `DISTINCT ON`:
SELECT *
FROM (
SELECT DISTINCT ON (sec)
id, sec
FROM tbl
ORDER BY sec, id DESC
) sub
ORDER BY id DESC
LIMIT 4;
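`DISTINCT ON` is Postgres-specific. For readers on other databases, a roughly equivalent standard-SQL formulation with a window function (a sketch, assuming the same `tbl (id, sec)`) is:

```sql
SELECT id, sec
FROM  (
   SELECT id, sec
        , row_number() OVER (PARTITION BY sec ORDER BY id DESC) AS rn
   FROM   tbl
   ) sub
WHERE  rn = 1      -- keep the row with the highest id per sec
ORDER  BY id DESC
LIMIT  4;
```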
Detailed explanation in this related answer on SO:
For a big table and a small `LIMIT`, neither this nor @a_horse's solution is very efficient. The subquery will plough through the whole table, wasting a lot of time ...
Recursive CTE
I have tried and failed to solve similar problems with a recursive CTE in the past and resorted to a procedural solution with PL/pgSQL. Example:
Finally, here is a working rCTE:
WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, '{}'::int[] AS last_arr, ARRAY[sec] AS arr
   FROM   tbl
   ORDER  BY id DESC
   LIMIT  1
   )
   UNION ALL
   (
   SELECT b.id, c.arr
        , CASE WHEN b.sec = ANY (c.arr) THEN c.arr ELSE b.sec || c.arr END
   FROM   cte  c
   JOIN   tbl  b ON b.id < c.id
   WHERE  array_length(c.arr, 1) < 4
   ORDER  BY id DESC
   LIMIT  1
   )
   )
SELECT id, arr[1] AS sec
FROM   cte
WHERE  last_arr <> arr;
It's not as fast or elegant as I had hoped for and not nearly as fast as the function below, but faster than the query above in my tests.
PL/pgSQL function
Fastest by far:
CREATE OR REPLACE FUNCTION f_first_uniq(_rows int)
  RETURNS TABLE (id int, sec int) AS
$func$
DECLARE
   _arr int[];
BEGIN
   FOR id, sec IN
      SELECT t.id, t.sec FROM tbl t ORDER BY t.id DESC
   LOOP
      IF sec = ANY (_arr) THEN
         -- do nothing
      ELSE
         RETURN NEXT;
         _arr := _arr || sec;
         EXIT WHEN array_length(_arr, 1) >= _rows;
      END IF;
   END LOOP;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_first_uniq(4);
SQL Fiddle demonstrating all three.
Could be made to work for any table, with table and column names as parameters and dynamic SQL with `EXECUTE` ...
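A hedged sketch of what that dynamic variant might look like (the function name `f_first_uniq_any` and its parameter list are my invention; `format()` with `%I` quotes the identifiers, and the `regclass` parameter is safe to interpolate with `%s`). It still assumes both columns are `int`:

```sql
CREATE OR REPLACE FUNCTION f_first_uniq_any(_tbl     regclass
                                          , _id_col  text
                                          , _sec_col text
                                          , _rows    int)
  RETURNS TABLE (id int, sec int) AS
$func$
DECLARE
   _arr int[];
BEGIN
   FOR id, sec IN EXECUTE
      format('SELECT %I, %I FROM %s ORDER BY %I DESC'
           , _id_col, _sec_col, _tbl, _id_col)
   LOOP
      IF sec = ANY (_arr) THEN
         -- sec already seen: skip
      ELSE
         RETURN NEXT;
         _arr := _arr || sec;
         EXIT WHEN array_length(_arr, 1) >= _rows;
      END IF;
   END LOOP;
END
$func$ LANGUAGE plpgsql;
```

Call:

```sql
SELECT * FROM f_first_uniq_any('tbl', 'id', 'sec', 4);
```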
Why bother?
In a test table with only 30k rows, the function ran 2000x faster than the above query (which already ran ~30% faster than a_horse's version). This difference grows with the size of the table: the function's performance is about constant, while the query's performance gets progressively worse, since it tries to find distinct values in all of the table first. Try this in a table with a million rows ...
Best Answer
Edit
To get only the rows where the 'done' status is found twice and no other statuses are found:
DB-Fiddle
Or
DB-Fiddle
The reasoning for adding it to the `GROUP BY` or applying an aggregate function: if `tb` or `tbid` can be different for the same 'done' status and the same `ttid`, which one do you choose? Do you choose the smallest, the biggest, ...? An example of what happens if this is true and the column is not added to the `GROUP BY` or wrapped in an aggregate function: DB-Fiddle that fails
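Since the fiddle code is not reproduced above, here is a hedged sketch of the kind of query meant, assuming a table `tbl (ttid, tbid, status)` (the column names are taken from the surrounding text; everything else is guessed). It keeps only groups with exactly two 'done' rows and nothing else:

```sql
SELECT ttid
FROM   tbl
GROUP  BY ttid
HAVING SUM(CASE WHEN status =  'done' THEN 1 ELSE 0 END) = 2
   AND SUM(CASE WHEN status <> 'done' THEN 1 ELSE 0 END) = 0;
```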