Ok, so my answer here is conditioned on the following assumptions:
- This categorization needs to be relatively stable and
- It may be re-used in various ways, and
- It needs to perform well.
The obvious, simple answer is to use table methods in PostgreSQL: create a SQL-language function that takes the table's row type, does not hit the table itself, and returns the value you want. Make it IMMUTABLE so you can index the output if you want to query against it, etc. The specific rules aren't entirely clear from your post, but this should get you started.
CREATE OR REPLACE FUNCTION subcategory(task) -- task is the table name
RETURNS text LANGUAGE sql IMMUTABLE AS
$$
SELECT CASE
         WHEN $1.type = 3 THEN
           CASE WHEN $1.reference LIKE 'IX#%'
                THEN 'Company IX Chargeable'
                ELSE 'Other Chargeable'
           END
         WHEN $1.reference LIKE 'IX#%' THEN
           CASE WHEN $1.type = 5 THEN 'Company IX Rebool'
                ELSE 'Company IX Other'
           END
         WHEN $1.reference LIKE 'THR#%' AND $1.type IN (1, 2) THEN
           CASE WHEN $1.type = 2 THEN 'THR Return to Base'
                WHEN $1.type = 1 THEN 'THR On Site'
           END
         WHEN $1.type = 2 THEN 'Return to Base'
         ELSE 'Other Nonchargeable'
       END;
$$;
You can then query this using:
SELECT t.subcategory FROM task t;
Note that qualifying with the table name (or alias) is not optional here. The parser converts this to:
SELECT subcategory(t) FROM task t;
You can, however, use this in any part of the SELECT statement, including the WHERE clause. If it proves slow, you can add cost estimates (though I think this should be fast), and you can even index the output using PostgreSQL's functional indexes.
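PostgreSQL's attribute-style function call can't be demonstrated outside Postgres, so here is a portable sketch of the same CASE logic as a view, run in SQLite purely for illustration; the table and column names (task, type, reference) and sample rows are assumptions for the test.

```python
# A portable sketch of the CASE logic above, shown as a view and run in
# SQLite purely for illustration (PostgreSQL's attribute-style function
# call has no SQLite equivalent).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (id INTEGER PRIMARY KEY, type INTEGER, reference TEXT);

CREATE VIEW task_subcategory AS
SELECT id,
       CASE WHEN type = 3 THEN
              CASE WHEN reference LIKE 'IX#%'
                   THEN 'Company IX Chargeable'
                   ELSE 'Other Chargeable'
              END
            WHEN reference LIKE 'IX#%' THEN
              CASE WHEN type = 5 THEN 'Company IX Rebool'
                   ELSE 'Company IX Other'
              END
            WHEN reference LIKE 'THR#%' AND type IN (1, 2) THEN
              CASE WHEN type = 2 THEN 'THR Return to Base'
                   WHEN type = 1 THEN 'THR On Site'
              END
            WHEN type = 2 THEN 'Return to Base'
            ELSE 'Other Nonchargeable'
       END AS subcategory
FROM task;
""")
conn.executemany("INSERT INTO task VALUES (?, ?, ?)", [
    (1, 3, 'IX#100'),  # type 3 with an IX reference
    (2, 5, 'IX#101'),  # IX reference, type 5
    (3, 1, 'THR#7'),   # THR reference, type 1
    (4, 2, 'ABC#9'),   # plain type 2
])
rows = dict(conn.execute("SELECT id, subcategory FROM task_subcategory"))
print(rows)
```

In Postgres itself, the function-based form has the advantage that the expression lives in one place and its output can be indexed.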
On the performance side, LIKE and substring() don't seem to be significantly different:
postgres=# select count(*) from generate_series(1, 10000000);
count
----------
10000000
(1 row)
Time: 9007.618 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE s::text like '1%';
count
---------
1111112
(1 row)
Time: 13653.000 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE substring(s::text from 1 for 1) = '1';
count
---------
1111112
(1 row)
Time: 16681.860 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE s::text like '1%';
count
---------
1111112
(1 row)
Time: 17163.470 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE substring(s::text from 1 for 1) = '1';
count
---------
1111112
(1 row)
Time: 17052.004 ms
So sometimes one is faster than the other, but they don't seem out of line with each other.
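The timings above are machine-dependent, but the equivalence of the two predicates can be cross-checked with a quick runnable test; SQLite's substr(s, 1, 1) stands in for PostgreSQL's substring(s from 1 for 1) here, and the 100000-row series is an arbitrary choice.

```python
# A correctness cross-check (not a timing benchmark) of the two
# predicates on a smaller series, using SQLite for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
series = ("WITH RECURSIVE s(n) AS "
          "(SELECT 1 UNION ALL SELECT n + 1 FROM s WHERE n < 100000) ")
like_count = conn.execute(
    series + "SELECT count(*) FROM s WHERE CAST(n AS TEXT) LIKE '1%'"
).fetchone()[0]
substr_count = conn.execute(
    series + "SELECT count(*) FROM s WHERE substr(CAST(n AS TEXT), 1, 1) = '1'"
).fetchone()[0]
print(like_count, substr_count)  # the two predicates agree
```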
This is the relational division problem, and there is a question about it on SO with many ways to write this query, plus performance analysis for PostgreSQL: How to filter SQL results in a has-many-through relation
Shamelessly copying code from there, and removing/changing code for answers that use features lacking in MySQL, like CTEs, EXCEPT, INTERSECT, etc., here are a few ways to do this.
Assumptions:
- the table is called factors
- there is a UNIQUE constraint on (wordid, docid)
- there is a documents and a words table
Easy to write, medium efficiency:
-- Query 1 -- by Martin
SELECT d.docid, d.docname
FROM documents d
JOIN factors f USING (docid)
WHERE f.wordid IN (2, 4, 5)
GROUP BY d.docid
HAVING COUNT(*) = 3 ; -- number of words
Easy to write, medium efficiency:
-- Query 2 -- by Erwin
SELECT d.docid, d.docname
FROM documents d
JOIN (
SELECT docid
FROM factors
WHERE wordid IN (2, 4, 5)
GROUP BY docid
HAVING COUNT(*) = 3
) f USING (docid) ;
More complex to write, very good efficiency in Postgres - probably lousy in MySQL:
-- Query 4 -- by Derek
SELECT d.docid, d.docname
FROM documents d
WHERE d.docid IN (SELECT docid FROM factors WHERE wordid = 2)
  AND d.docid IN (SELECT docid FROM factors WHERE wordid = 4)
  AND d.docid IN (SELECT docid FROM factors WHERE wordid = 5) ;
More complex to write, very good efficiency in Postgres - and probably the same in MySQL:
-- Query 5 -- by Erwin
SELECT d.docid, d.docname
FROM documents d
WHERE EXISTS (SELECT * FROM factors
WHERE docid = d.docid AND wordid = 2)
AND EXISTS (SELECT * FROM factors
WHERE docid = d.docid AND wordid = 4)
AND EXISTS (SELECT * FROM factors
WHERE docid = d.docid AND wordid = 5) ;
More complex to write, very good efficiency in Postgres - and probably the same in MySQL:
-- Query 6 -- by Sean
SELECT d.docid, d.docname
FROM documents d
JOIN factors x ON d.docid = x.docid
JOIN factors y ON d.docid = y.docid
JOIN factors z ON d.docid = z.docid
WHERE x.wordid = 2
AND y.wordid = 4
AND z.wordid = 5 ;
Easy to write and extend to an arbitrary set of words, but not as efficient as the JOIN and EXISTS solutions:
-- Query 7 -- by ypercube
SELECT d.docid, d.docname
FROM documents d
WHERE NOT EXISTS (
SELECT *
FROM words AS w
WHERE w.wordid IN (2, 4, 5)
AND NOT EXISTS (
SELECT *
FROM factors AS f
WHERE f.docid = d.docid
AND f.wordid = w.wordid
)
);
Easy to write, not good efficiency:
-- Query 8 -- by ypercube
SELECT d.docid, d.docname
FROM documents d
WHERE NOT EXISTS (
SELECT *
FROM (
SELECT 2 AS wordid UNION ALL
SELECT 4 UNION ALL
SELECT 5
) AS w
WHERE NOT EXISTS (
SELECT *
FROM factors AS f
WHERE f.docid = d.docid
AND f.wordid = w.wordid
)
);
Enjoy testing them :)
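Most of the variants above translate directly to SQLite, which makes for a quick runnable cross-check; the sample data below (which documents contain which of the words 2, 4, 5) is made up for the test, with the table names assumed in the answer.

```python
# Runnable check that the GROUP BY/HAVING variant (Query 1/2) and the
# triple-EXISTS variant (Query 5) agree, using SQLite for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (docid INTEGER PRIMARY KEY, docname TEXT);
CREATE TABLE words     (wordid INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE factors   (wordid INTEGER, docid INTEGER, UNIQUE (wordid, docid));
""")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(1, 'doc1'), (2, 'doc2'), (3, 'doc3')])
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [(2, 'w2'), (4, 'w4'), (5, 'w5'), (9, 'w9')])
# doc1 has all three words, doc2 only two, doc3 has all three plus extras
conn.executemany("INSERT INTO factors VALUES (?, ?)",
                 [(2, 1), (4, 1), (5, 1),
                  (2, 2), (4, 2),
                  (2, 3), (4, 3), (5, 3), (9, 3)])

q_group = """SELECT d.docid, d.docname
             FROM documents d JOIN factors f USING (docid)
             WHERE f.wordid IN (2, 4, 5)
             GROUP BY d.docid, d.docname
             HAVING COUNT(*) = 3"""
q_exists = """SELECT d.docid, d.docname FROM documents d
              WHERE EXISTS (SELECT * FROM factors
                            WHERE docid = d.docid AND wordid = 2)
                AND EXISTS (SELECT * FROM factors
                            WHERE docid = d.docid AND wordid = 4)
                AND EXISTS (SELECT * FROM factors
                            WHERE docid = d.docid AND wordid = 5)"""

r_group = sorted(conn.execute(q_group).fetchall())
r_exists = sorted(conn.execute(q_exists).fetchall())
print(r_group)  # both variants return doc1 and doc3
```

Note that the HAVING COUNT(*) = 3 trick relies on the UNIQUE (wordid, docid) constraint; without it, duplicate rows would inflate the count.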
Best Answer
An EXISTS semi-join should be the fastest way to arrive at DISTINCT rows, because you don't get duplicates to begin with. Also, in the event of multiple matching rows (many matching tasks within 30 days), evaluation can stop after the first match is found. I would expect this query to beat anything you have so far. Test with EXPLAIN ANALYZE.

In a WHERE clause or JOIN condition, avoid an expression that computes a value from columns of both tables if you can. Since such a left-hand expression is derived from values of two tables, the only option left to the query planner is to form a limited CROSS JOIN (after applying other conditions) and compute the value for every possible combination. This is a well-known anti-pattern for good performance!

As long as there are only a few rows per reference, it won't matter much. But the cost grows with O(N²) as the number of rows per reference increases, so I rewrote the condition. In my tests, the first form quickly went from 5x slower to 500x slower as I narrowed down the id-space for reference (-> more matching rows).

The second form can also use indexes more easily. If you want to select a small sample of the table using selective conditions backed by a matching index, the performance gap becomes even more overwhelming.
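The offending expression from the original question is not reproduced above, so the following is a hypothetical sketch of the general principle, with made-up tables a and b and a text timestamp column ts: a value computed from columns of both tables defeats any index, while an equivalent condition that isolates one table's column on its own side is sargable.

```python
# Illustration in SQLite: compare the query plans of a cross-table
# computed condition vs. an equivalent sargable rewrite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, ts TEXT);
CREATE TABLE b (id INTEGER PRIMARY KEY, ts TEXT);
CREATE INDEX b_ts_idx ON b (ts);
""")

# Anti-pattern: the value must be computed for every (a, b) pair.
bad = ("SELECT a.id, b.id FROM a JOIN b "
       "ON julianday(b.ts) - julianday(a.ts) BETWEEN 0 AND 30")
# Rewritten: b.ts is compared to a range derived from a alone.
good = ("SELECT a.id, b.id FROM a JOIN b "
        "ON b.ts >= a.ts AND b.ts < datetime(a.ts, '+30 days')")

plan_bad = " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + bad))
plan_good = " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + good))
print(plan_bad)   # expect full scans of both tables
print(plan_good)  # expect an index range search on b_ts_idx
```

The same reasoning applies in PostgreSQL: once one side of the comparison depends only on the inner table's column, the planner can turn it into an index range scan instead of evaluating the expression per combination.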